Meeting the technical challenges of far-field voice control

Living through lockdown, we have fallen in love with voice technology all over again. A report in Voicebot.ai states that over half of voice assistant owners have increased their usage through lockdown and 40% plan to up their usage even after lockdown has lifted.

Last year (2019), Futuresource Consulting reported that “shipments of virtual assistants rose 25% year on year to 1.1 billion units”. The analyst house also forecast the market to exceed 2.5 billion shipments by 2023.

The question for the market is exactly how consumer interactions with these virtual assistants will evolve over time. In just a few decades, the way we interact with technology has moved from the clunky keypad experience to the sophistication of the touchscreen. Now we’re heading into the landscape of voice, biometrics, haptics and other sensing technology.

With the COVID-19 pandemic heightening our awareness of touch and hygiene, a truly natural voice interface is not just an improvement in human–machine interaction; it’s arguably a necessity. To open the door to that natural conversation, you need far-field voice (FFV).

Far-field voice around the home

FFV technology can be integrated with a huge array of devices. In the short term, there is an obvious benefit to using it to enhance the capabilities of smart speakers, but the real prize is in embedding FFV within the other devices around the home. This includes TVs, sound bars, set-top boxes and other smart devices.

Let’s take TV as an example. While near-field push-to-talk (PTT) has been a gateway to a voice-enabled TV experience, it still requires the use of a physical remote. By embedding far-field voice control into the device, the user is free to enjoy a hands-free experience, calling up content on the TV from anywhere in the room – no more hunting for the remote or frustrating keypad entry and navigation.

Although huge progress has been made in FFV technology in the last 18 months, consumer adoption is still taking off and the ecosystem is still growing. This makes it challenging for product and solution architects to realise the full potential that the technology can offer.

FFV is a complex technical challenge. Voice interfaces need intelligent algorithms, purpose-built for modern living spaces, that are capable of analysing the acoustic landscape to identify and isolate a command from every other sound in the room. Add to that the need to deliver these interfaces at exceptionally aggressive eBOM costs, and the integration of FFV can seem daunting for designers.

How to make far-field voice work?

Capturing a clear voice command from a distance isn’t easy. It requires some complex digital signal processing (or DSP). Accuracy of capture and clarity of command are critical. Our ears automatically tune out background noise to focus on and amplify the sound we want to hear. But a microphone captures the whole soundscape – including the unwanted noise such as conversation, traffic noise, appliances, air-conditioning, birdsong and dogs barking.

Fundamentally, for the success of FFV interfaces, purpose-designed algorithms are required to provide clarity in challenging acoustic environments, ‘cleaning up’ the voice signal for transmission to an Automatic Speech Recognition (ASR) engine. With the XVF3510, our latest FFV interface solution, we address the three dominant noise sources in the environment to ensure the highest capture and transmission quality.

The first noise source is the noise generated by the device itself, for example if you’re talking to a smart speaker playing music or a smart TV streaming a film. Our acoustic echo canceller (AEC) removes this audio stream by modelling the echo response and creating an estimate of the playback audio that is picked up by the microphone, which is then subtracted from the microphone signal. This enables you to barge in (cut across) the music or audio that’s playing.
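To make the idea concrete, here is a minimal sketch of echo cancellation using a normalised least-mean-squares (NLMS) adaptive filter, a common textbook approach. It is not the XVF3510 implementation: the function name, tap count, step size and the toy "room" response are all assumptions chosen for illustration, and the microphone here hears only the echo, with no talker.

```python
import random


def nlms_echo_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Return (residual, weights) after subtracting an adaptive echo estimate."""
    weights = [0.0] * taps   # running model of the echo path
    history = [0.0] * taps   # latest reference samples, newest first
    residual = []
    for ref_sample, mic_sample in zip(reference, mic):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate         # what is left after cancelling
        power = sum(x * x for x in history) + eps  # normalises the step size
        weights = [w + mu * error * x / power for w, x in zip(weights, history)]
        residual.append(error)
    return residual, weights


# Toy scenario (hypothetical values): the mic picks up a delayed,
# attenuated copy of whatever the loudspeaker is playing.
random.seed(0)
playback = [random.uniform(-1.0, 1.0) for _ in range(4000)]
echo_path = [0.0, 0.6, 0.0, -0.3]
mic = [
    sum(h * playback[n - k] for k, h in enumerate(echo_path) if n >= k)
    for n in range(len(playback))
]
residual, weights = nlms_echo_cancel(playback, mic)
```

In this toy case the learned weights approach the echo path and the residual falls towards zero. In a real device the residual would be the talker's voice, and adaptation has to be handled carefully while someone is speaking over the playback.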

The second source of noise is point noise: noise coming from a fixed point in the room, for example appliances or a boiling kettle. Our interference canceller ‘scans’ the soundscape of the room and suppresses static point noise sources in the surrounding space.
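One simple way to picture point-noise cancellation with two microphones is null steering: if the interfering sound reaches one microphone a known number of samples after the other, delaying and subtracting the two signals cancels it. The sketch below is a deliberately simplified stand-in, not the interference canceller described above. The delay is assumed to be known and fixed, and all names and values are hypothetical.

```python
import random


def steer_null(mic_a, mic_b, delay):
    """Cancel a source that reaches mic_b `delay` samples after mic_a."""
    output = []
    for n in range(len(mic_b)):
        delayed_a = mic_a[n - delay] if n >= delay else 0.0
        output.append(mic_b[n] - delayed_a)
    return output


# Toy scenario: a kettle off to one side, a talker straight ahead.
random.seed(1)
length, kettle_delay = 1000, 3
kettle = [random.uniform(-1.0, 1.0) for _ in range(length)]
speech = [random.uniform(-0.2, 0.2) for _ in range(length)]  # stand-in for a voice

# The talker reaches both mics at the same time; the kettle reaches mic_b late.
mic_a = [kettle[n] + speech[n] for n in range(length)]
mic_b = [
    (kettle[n - kettle_delay] if n >= kettle_delay else 0.0) + speech[n]
    for n in range(length)
]
cleaned = steer_null(mic_a, mic_b, kettle_delay)
```

The kettle disappears from `cleaned`, while the talker survives as a filtered version of the original speech. A practical canceller has to find and track the interferer's direction adaptively instead of being told the delay.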

Finally, the XVF3510 accounts for background ambient noise, like an air conditioning unit or general chatter in a room. Here, our noise suppression algorithm reduces general background noise in the microphone input, creating a clear audio stream to pass to the speech recognition engine.
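As a rough illustration of noise suppression, the sketch below estimates the noise floor from the first few frames and then scales each frame by a Wiener-style gain. Frames that look like noise are turned down and frames well above the noise floor pass almost unchanged. This is a single-band simplification under stated assumptions: the opening frames are assumed to contain noise only, and the frame size, gain floor and test signal are hypothetical. Practical suppressors typically work per frequency band and update the noise estimate continuously.

```python
import math
import random


def suppress_noise(samples, frame=160, noise_frames=5, floor=0.1):
    """Scale each frame by a gain derived from an estimated noise floor."""
    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    powers = [sum(x * x for x in f) / len(f) for f in frames]
    # Assumption: the first `noise_frames` frames contain background noise only.
    noise_power = sum(powers[:noise_frames]) / noise_frames
    output = []
    for f, power in zip(frames, powers):
        gain = max(1.0 - noise_power / power, floor) if power > 0 else floor
        output.extend(gain * x for x in f)
    return output


# Toy signal: 50 ms of hiss, then a 440 Hz tone (a stand-in for speech) over hiss.
random.seed(2)
rate = 16000
noisy = [random.uniform(-0.05, 0.05) for _ in range(1600)]
for n in range(800, 1600):
    noisy[n] += 0.5 * math.sin(2 * math.pi * 440 * n / rate)
cleaned = suppress_noise(noisy)
```

The `floor` parameter keeps a little of the background in place, which avoids the unnatural pumping that comes from muting noise-only frames completely.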

These three algorithmic blocks are tuned to work together. The output is then fed into an automatic gain control (AGC) which normalises and optimises the volume for the speech recognition engine.
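The gain control stage can be sketched as a running level estimate followed by a gain that pulls that level towards a target. The code below is a minimal illustration, not the AGC used in the product. The target level, smoothing factor and gain cap are assumed values.

```python
import math


def automatic_gain_control(samples, target_rms=0.1, smoothing=0.01, max_gain=30.0):
    """Scale samples so their smoothed RMS level approaches `target_rms`."""
    mean_square = 0.0   # running estimate of signal power
    output = []
    for x in samples:
        mean_square += smoothing * (x * x - mean_square)
        if mean_square > 0.0:
            gain = min(target_rms / math.sqrt(mean_square), max_gain)
        else:
            gain = max_gain   # nothing heard yet: hold the capped gain
        output.append(gain * x)
    return output


def rms(samples):
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def tone(amplitude, count=4000, freq=440.0, rate=16000):
    return [amplitude * math.sin(2 * math.pi * freq * n / rate) for n in range(count)]


# A distant (quiet) talker and a nearby (loud) one end up at a similar level.
quiet_out = automatic_gain_control(tone(0.01))
loud_out = automatic_gain_control(tone(0.5))
```

Capping the gain matters in practice: without `max_gain`, silence between commands would be amplified until residual noise reached the target level.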

In these complex audio systems, delays between the audio reference and the audio output can degrade performance. Our automatic delay estimator algorithm compensates for any delay in audio coming out of the system and ensures echo cancellation is optimised for reliable barge-in.
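A common way to estimate such a delay is to cross-correlate the reference signal with what the microphone captures and pick the lag with the strongest match. The sketch below shows that general technique under simple assumptions (a single clean delay, no other sound in the room). It is not a description of the delay estimator mentioned above, and the names and values are hypothetical.

```python
import random


def estimate_delay(reference, captured, max_lag):
    """Return the lag (in samples) at which `captured` best matches `reference`."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Correlate the capture with the reference shifted by `lag` samples.
        score = sum(
            reference[n - lag] * captured[n] for n in range(lag, len(captured))
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy scenario: the playback path adds 37 samples of latency and halves the level.
random.seed(3)
true_delay = 37
reference = [random.uniform(-1.0, 1.0) for _ in range(2000)]
captured = [
    0.5 * reference[n - true_delay] if n >= true_delay else 0.0
    for n in range(len(reference))
]
estimated = estimate_delay(reference, captured, max_lag=100)
```

Once the lag is known, the reference can be delayed by the same amount before it reaches the echo canceller, so the adaptive filter only has to model the room and not the system latency as well.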

The future — far-field voice and artificial intelligence

As you can see, this is not an insignificant technical challenge. However, at XMOS not only have we delivered all of these capabilities in our XVF3510 platform, we have also designed a system that can deliver class-leading performance with only two microphones, which is critical to delivering FFV in an eBOM-efficient package.

And this is just the beginning. Recognising both the need for and potential of the multi-modal interactions of the future, we are already exploring ways to harness edge AI, voice and other sensors to transform the end-user experience of FFV and virtual assistants with presence and context awareness.

Although this remains a young market, the voice performance of interfaces is already simply table stakes. OEMs need ever-more capabilities in ever-smaller packages at ever-lower eBOM costs. The focus for tech vendors has to be on value-add experiences, not purely on voice in isolation.

To find out more, watch our session from VOICE Global here.
