Voice recognition is everywhere, but accurate capture is the key

Posted: 20 May 2016

Voice user interfaces - you can scarcely avoid the current media hype as giants like Amazon and Google jostle to exploit the explosion of possibilities that advances in natural language technologies are providing. Today’s neural networks use algorithms to process language through ever-deeper layers of complexity, and machines can now understand the meaning and intent of spoken words with unprecedented levels of accuracy. This has sparked a revolution in the power of voice. Google has used its current I/O 2016 conference to announce its digital assistant, Google Assistant, which will directly challenge the bots being hyped by Microsoft, Facebook and Apple - automated programmes you can interact with via natural language. It also announced Google Home, which we predicted earlier this year when Google announced it would open up its cloud speech API with the intent of developing an ecosystem to rival the Amazon Voice Services. The conference also has a dedicated session that will talk about "how to build a smart RasPi Bot with Cloud Vision and Speech API."

What else is going on?

So is it perhaps coincidence that, although voice is just a small part of Google’s conference, Amazon has also just revealed new voice remote capabilities for its Fire TV devices? Amazon has a head start in the market with its Amazon Voice Services and the proliferation of Amazon Echo and its virtual digital assistant, Alexa. The new capabilities announced this week will expand the devices’ voice recognition features so that, using voice commands alone, users can search for content, launch apps and play videos, or even have books from the Kindle library read aloud where text-to-speech is enabled. Hot on the heels of the big players are companies like SoundHound, which has integrated hands-free voice commands using its own natural language understanding software, Houndify.

Voice capture requires high quality hardware

Software algorithms have certainly played a huge part in unleashing this explosion of possibility in voice user interfaces, but what about the hardware? The success of voice recognition technology depends fundamentally on the quality of the voice capture and of the response to the command or question. Until this is 95% accurate, voice won’t be widely adopted as an interface in everyday life. One way to provide high quality voice capture is to use a smart microphone comprising a steerable array of several MEMS microphones. These can be used to create a “beam” that listens in different directions. The smart microphone uses signal processing to remove noise, echo and reverberation before the voice signal is passed to the speech recognition engine. It can listen out for instructions and then focus in on the location of the voice. It does this by adjusting the phase of the incoming sounds on each audio channel so that the parts arriving from the chosen direction line up. Because other sounds are then less likely to interfere, the accuracy of the voice input increases, and different voices can be separated not only by how they sound but also by their location in the room.
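
To make the idea of steering a “beam” by adjusting phase more concrete, here is a minimal sketch of delay-and-sum beamforming, the simplest textbook way of doing it. It is an illustration under stated assumptions rather than a description of what any particular smart microphone runs: the linear four-microphone layout, the 4 cm spacing, the 16 kHz sample rate, the far-field source and the function names arrival_delays and delay_and_sum are all hypothetical choices made for the example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def arrival_delays(mic_positions, angle_deg, c=SPEED_OF_SOUND):
    """Far-field arrival delay (seconds) at each mic of a linear array.

    mic_positions are distances along the array in metres; angle_deg is
    measured from broadside. Mics nearer the source get smaller delays.
    """
    x = np.asarray(mic_positions, dtype=float)
    return -x * np.sin(np.deg2rad(angle_deg)) / c


def delay_and_sum(signals, mic_positions, angle_deg, fs):
    """Steer the array towards angle_deg and return one output channel.

    signals has shape (n_mics, n_samples). Each channel is phase-shifted in
    the frequency domain to undo its arrival delay, then the channels are
    averaged so sound from the steered direction adds up coherently.
    """
    signals = np.asarray(signals, dtype=float)
    n_samples = signals.shape[1]
    delays = arrival_delays(mic_positions, angle_deg)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    # A delay of tau multiplies the spectrum by exp(-2j*pi*f*tau); undo it.
    aligned = spectra * np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft(aligned.mean(axis=0), n=n_samples)


def tone(t, tau, freq=2000.0):
    """A pure tone delayed by tau seconds (stand-in for a talker)."""
    return np.sin(2 * np.pi * freq * (t - tau))


# Hypothetical scene: four mics 4 cm apart, a 2 kHz tone arriving from 30 degrees.
fs = 16000
t = np.arange(1600) / fs
mics = [0.0, 0.04, 0.08, 0.12]
captured = np.stack([tone(t, tau) for tau in arrival_delays(mics, 30.0)])

on_target = delay_and_sum(captured, mics, 30.0, fs)
off_target = delay_and_sum(captured, mics, -30.0, fs)
print("power when steered at the talker:", np.mean(on_target ** 2))
print("power when steered away:        ", np.mean(off_target ** 2))
```

Steering towards the talker lines the channels up so they reinforce each other, while steering elsewhere leaves them out of phase so they largely cancel. A real product would add the noise, echo and reverberation removal mentioned above, and would typically work on short frames of audio rather than one long block.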

Of course, there is slightly more to it than just the voice capture: the data also has to travel through the rest of the system and over a broadband backhaul link to the cloud as quickly as possible. But the idea is that the signal captured at the front end is clean. This concept allows the creation of a whole new class of highly effective, interactive IoT devices.

The voice interface revolution has started

We are seeing just the beginning of the voice interface revolution with the frenzied media coverage of the big companies. As an agnostic supplier of embedded solutions in the voice terminal space, we know that our development boards are being used to prototype all kinds of applications, and voice recognition could be just the tip of the iceberg as designers bring us even closer to the smart home.


Read more about real-time DSP on the xCORE-200 architecture
