Smart Microphones Revealed

Posted: 22 November 2016

With the explosive growth in IoT products, especially in smart homes, we've reached an inflection point in the way that we communicate with these embedded systems. We're all used to having "an app for that", but with tens or hundreds of smart devices predicted for homes and offices, that paradigm simply doesn't scale; we need to embrace a more "universal" interface – voice.

One of the key components of voice interfaces is the smart microphone, which combines audio captured from one or more microphones within the system into a single stream. This stream can be processed to create a highly directional microphone using direction-of-arrival and beamforming DSP algorithms, while echo cancellation, noise suppression, gain control and a range of other audio DSP capabilities are used to isolate the target speaker from unwanted noise. The captured, processed voice samples are then securely transferred to an embedded applications processor for further processing, or to an automatic speech recognition (ASR) system in the Cloud.

What is a beamforming microphone?

The term "beamforming" refers to the process of extracting the sound from a specific region of space. This is normally a planar or cone-shaped segment, but could be more tightly defined. The quality of the beam is described by how much of the signal outside the beam is attenuated (the directivity pattern), and by how well it can be steered, i.e. how easily we can change the region of space we listen to. Beamforming microphones focus on a speaker's voice by steering the directivity pattern towards the sound source.
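As a sketch of the idea, the simplest beamformer is delay-and-sum: delay each microphone's signal so that sound arriving from the steering direction lines up across channels, then average. The function and parameter names below are illustrative (not an XMOS API), and it rounds to integer-sample delays for simplicity.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, angle, fs, c=343.0):
    """Delay-and-sum beamformer sketch for a linear microphone array.

    signals:       (num_mics, num_samples) array, one row per microphone
    mic_positions: mic positions along the array axis, in metres
    angle:         steering angle in radians (0 = broadside)
    fs:            sample rate in Hz; c is the speed of sound in m/s
    """
    delays = mic_positions * np.sin(angle) / c   # per-mic delay, seconds
    delays -= delays.min()                       # make all delays >= 0
    sample_delays = np.round(delays * fs).astype(int)

    num_samples = signals.shape[1]
    out = np.zeros(num_samples)
    for sig, d in zip(signals, sample_delays):
        out[d:] += sig[:num_samples - d]         # shift each channel, then sum
    return out / len(signals)                    # average over microphones
```

Signals arriving from the steered direction add coherently; signals from other directions are misaligned and partially cancel, which is what shapes the directivity pattern.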

How is voice-activity detection achieved?

Voice-Activity Detection (VAD) is typically a three-stage process: first, any noise present in the system may be removed, for example using spectral subtraction. Feature extraction (typically proprietary) is then used to drive a classification function. This can be as simple as a presence/absence threshold, but is often a complex feedback system. For instance, VAD can also provide feedback on the nature of the speech, i.e. voiced, unvoiced or sustained.
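To make the threshold case concrete, here is a minimal sketch of the simplest classification stage: frame energy compared against a noise-floor estimate. The feature (plain energy) and the 6 dB threshold are illustrative stand-ins for the proprietary feature extraction described above.

```python
import numpy as np

def simple_vad(frame, noise_floor, threshold_db=6.0):
    """Classify one audio frame as speech (True) or non-speech (False)
    by comparing its energy against an estimated noise floor."""
    energy = np.mean(frame ** 2) + 1e-12          # avoid log of zero
    ratio_db = 10.0 * np.log10(energy / (noise_floor + 1e-12))
    return bool(ratio_db > threshold_db)          # presence/absence decision
```

A production VAD would smooth this decision over time (hangover) and adapt the noise floor continuously, but the presence/absence threshold is the core of the simplest case.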

What DSP algorithms do smart microphones use?

In addition to beamforming and VAD, smart microphones can use a number of other DSP techniques to improve the voice quality.

Noise suppression can be used to identify and compensate for stationary or periodically varying noise signals, usually via spectral subtraction.
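A minimal single-frame sketch of spectral subtraction, assuming the noise magnitude spectrum has already been estimated (e.g. from a speech-free segment): subtract it from the noisy magnitude spectrum, keep the noisy phase, and clamp to a spectral floor to limit "musical noise" artefacts. All names here are illustrative.

```python
import numpy as np

def spectral_subtract(noisy_frame, noise_mag, floor=0.05):
    """Subtract an estimated noise magnitude spectrum from one frame.

    noisy_frame: time-domain samples of the noisy frame
    noise_mag:   estimated noise magnitude spectrum (len(frame)//2 + 1 bins)
    floor:       fraction of the noisy magnitude kept as a minimum
    """
    spec = np.fft.rfft(noisy_frame)
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)   # spectral floor
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy_frame))
```

Real implementations apply this per overlapping window with overlap-add, and update the noise estimate whenever the VAD reports no speech.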

De-reverb is the process of eliminating room reflections and their associated echoes. In smart microphones it can effectively bring the speaker's voice 'closer' to the microphone.

Acoustic echo cancellation (AEC) is a signal-processing technique that removes the echo generated by local audio loopback within a device, where the microphone picks up audio from an integrated loudspeaker. It can be implemented as full duplex, where the microphone simultaneously picks up the voice and the loudspeaker's audio and the latter is removed to leave just the voice signal, or as half duplex, where only the voice or the loudspeaker signal is present at any one time.
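The standard building block for AEC is an adaptive filter that models the loudspeaker-to-microphone echo path and subtracts its estimate from the microphone signal. The sketch below uses normalised LMS (NLMS), a common choice though not necessarily what any particular product uses; names and parameters are illustrative.

```python
import numpy as np

def nlms_aec(far_end, mic, order=64, mu=0.5, eps=1e-6):
    """Normalised-LMS echo canceller sketch.

    far_end: signal sent to the loudspeaker (the echo reference)
    mic:     microphone signal = near-end voice + echo of far_end
    Returns the error signal: mic with the estimated echo removed.
    """
    w = np.zeros(order)          # adaptive estimate of the echo path
    x = np.zeros(order)          # sliding window of recent far-end samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x = np.roll(x, 1)
        x[0] = far_end[n]
        echo_est = w @ x                       # predicted echo
        e = mic[n] - echo_est                  # residual = voice + mismatch
        w += mu * e * x / (x @ x + eps)        # normalised LMS update
        out[n] = e
    return out
```

With only echo present (no near-end voice), the residual converges towards zero as the filter learns the echo path; during double-talk, real systems freeze or slow adaptation to avoid cancelling the speaker.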

How is latency minimized in XMOS devices?

Voice controllers are time-critical, with system latency directly affecting the user experience – voice interfaces must be fast, smooth and responsive. XMOS systems minimize latency by removing the architectural elements that introduce non-determinism into an embedded system: no caches, no interrupts, no complex bus structures and, most importantly, no embedded operating system. The result is very low latency, which allows voice controllers to operate at a much finer time granularity than conventional embedded systems.

Are smart microphones always listening?

Always-on performance has driven the development of thin-client systems, where most of the heavyweight processing is done in the Cloud. The smart microphone itself may be based on a highly integrated solution, such as an XMOS processor, or on multiple discrete low-power microphone interfaces, DSPs and microcontrollers or application processors that achieve the same effect.

Always-on smart microphones use technologies such as wake-on-keyword to stay in a deep sleep mode until triggered. Once awake the microphone passes the captured voice signal to the ASR services for processing.

How will smart microphones evolve over the next five years?

As smart microphone use takes off, we'll see greater levels of integration as vendors attempt to reduce system cost and power consumption. The XMOS architecture is ideal for voice controller implementations, but there will also be a new drive towards ultra-low-power controllers for smart microphones as the market matures. The battle between thin- and thick-client models is unlikely to be resolved, which means additional performance will be required in the voice controller to support increased local ASR processing.

Mark Zuckerberg has been very forthright in his desire to create a digital assistant – "Jarvis" – akin to the omnipresent assistant in the "Iron Man" films; the ubiquity offered by voice interfaces will accelerate our progress towards such a vision. The ability to add context to voice queries – offered by the most sophisticated ASR systems – will enable us to rapidly search for goods and services, far outstripping the capabilities offered by current browser-based search. Similarly, the ability to control or program home automation or entertainment systems with a voice interface will usher in the era of "no-interface" devices.


