XVF3000/3100 VocalFusion Explained
XMOS has recently launched the XVF3000 family of far-field voice capture and processing devices. Let's take a quick look at the devices, what they do and how you could use them to enable a voice interface in your next product development.
There are two members in the family today: XVF3000 and XVF3100. Both devices provide the same voice capture and acoustic processing, with the XVF3100 also integrating keyword recognition – more on that in a moment.
The devices are based on XMOS' powerful and flexible xCORE microcontroller architecture, running firmware that implements the voice capture, extraction and processing DSP algorithms.
Testing, testing, 1 2 3
In operation, the device continually captures audio from an array of four omnidirectional digital microphones, via four pulse density modulation (PDM) inputs. Two microphone topologies are supported: circular and linear. With the microphones arranged in a circle, voice can be captured and isolated from any direction in the full 360° space around the microphone array; ideal, for example, for a 'smart speaker' located in the centre of a room. Alternatively, when arranged linearly, voice can be captured and isolated from a 180° arc; perfect, for example, in a wall mounted panel application.
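To see why the two topologies cover different angles, it helps to look at the arrival-time differences a far-field talker produces across the array. The sketch below is illustrative only and is not the XVF3000 firmware: the 40 mm radius, 30 mm spacing and all function names are assumptions chosen for the example. It shows that a linear array produces identical delays for a talker in front of and behind the line (hence the 180° arc), whereas a circular array produces distinct delays for every direction.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature


def circular_array(n_mics=4, radius=0.04):
    """Mic (x, y) positions evenly spaced on a circle (radius is an assumed value)."""
    return [(radius * math.cos(2 * math.pi * k / n_mics),
             radius * math.sin(2 * math.pi * k / n_mics)) for k in range(n_mics)]


def linear_array(n_mics=4, spacing=0.03):
    """Mic (x, y) positions along the x axis, centred on the origin (spacing is assumed)."""
    offset = (n_mics - 1) * spacing / 2
    return [(k * spacing - offset, 0.0) for k in range(n_mics)]


def arrival_delays(mics, angle_deg):
    """Relative arrival time in seconds at each mic for a far-field source at angle_deg."""
    a = math.radians(angle_deg)
    ux, uy = math.cos(a), math.sin(a)  # unit vector pointing towards the source
    # A mic further along the source direction hears the wavefront earlier.
    return [-(x * ux + y * uy) / SPEED_OF_SOUND for x, y in mics]


def same_delays(d1, d2, tol=1e-9):
    """True when two delay patterns are indistinguishable."""
    return all(abs(a - b) < tol for a, b in zip(d1, d2))


linear = linear_array()
circular = circular_array()

# A talker at +60 degrees and one mirrored at -60 degrees look identical to a linear array...
front_back_ambiguous = same_delays(arrival_delays(linear, 60), arrival_delays(linear, -60))
# ...but produce different delay patterns on a circular array.
circular_resolves = not same_delays(arrival_delays(circular, 60), arrival_delays(circular, -60))
```

A beamformer relies on exactly these delay patterns to steer towards the talker, which is why the physical layout of the four microphones determines the usable capture angle.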
Cleaning up with voice DSP
The devices run a series of DSP functions on the captured microphone signals to identify and isolate voice content:
- Acoustic echo cancellation (AEC) is applied to each microphone signal to remove the playback signal. (In this context 'echo' refers to the playback signal as heard by the microphones.)
- An adaptive beamformer identifies and isolates any voice content in the listening space by focussing the microphones on to the person speaking, so cutting out other noise and keeping their voice clear.
- Dynamic de-reverberation removes room echoes, and noise suppression removes other general background noise, cleaning up the voice signal to make it easier to understand.
- Finally, automatic gain control (AGC) can be used to maintain the extracted voice signal at a useful level, so that the person listening at the other end of the communications channel can always hear the person speaking.
In use, these features enable full-duplex operation; you can talk over the playback music in a speaker application, or 'barge-in' and talk over other parties in a conferencing application.
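As a rough illustration of the first stage in that chain, the sketch below implements a textbook normalised LMS (NLMS) echo canceller in plain Python. This is a generic technique used here for explanation; the source does not state which algorithm the XVF3000 firmware uses. The filter length, step size and three-tap "room" response are assumptions for the example. The canceller learns how the playback (reference) signal appears at the microphone and subtracts that estimate, leaving whatever the playback does not explain, which in a real system would be the talker's voice.

```python
import random


def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the playback echo from the mic signal."""
    w = [0.0] * taps        # estimated echo path (FIR coefficients)
    history = [0.0] * taps  # most recent reference samples, newest first
    out = []
    for m, r in zip(mic, ref):
        history = [r] + history[:-1]
        echo_est = sum(wi * xi for wi, xi in zip(w, history))
        e = m - echo_est    # residual: near-end voice plus any uncancelled echo
        norm = sum(x * x for x in history) + eps
        # Normalised update: step size scales with the reference signal power.
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, history)]
        out.append(e)
    return out


rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(2000)]  # stand-in for playback audio
echo_path = [0.6, 0.3, -0.1]                          # hypothetical speaker-to-mic response
mic = [sum(h * ref[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
       for n in range(len(ref))]

residual = nlms_echo_cancel(mic, ref)
echo_power = sum(m * m for m in mic[-200:])
residual_power = sum(e * e for e in residual[-200:])
```

With no near-end talker in this toy signal, the residual power falls far below the echo power once the filter has adapted. That removal of the playback signal is what makes it possible to talk over music or barge in during a call.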
Keeping connected to the rest of the system
Uniquely, the VocalFusion devices can provide two voice output channels: a 'communications' output and an 'automatic speech recognition (ASR)' output. The communications output is optimised for the human ear, whereas the ASR output has less processing applied (specifically, fewer non-linear effects) and so is optimised for streaming to speech recognition engines. The devices can be configured to provide one output, both outputs simultaneously, or to dynamically switch between them. XVF3000 devices provide flexible connection options, with audio connectivity via High-Speed USB 2.0 (USB Audio Class 1 device) or an I2S interface, and control via the same USB or I2C.
The XVF3100 also includes Sensory's TrulyHandsfree keyword recognition - the industry's leading keyword recognition solution. Keyword recognition allows you to trigger an activity when a specific word/phrase is 'heard'. Often this activity is to stream the voice signal to a third-party speech recognition engine for subsequent processing and action. Here, knowing when to stop streaming the voice signal is also important. The XVF3100 therefore includes a voice activity detector (VAD) which can be used to indicate when the person has finished talking and so, in this example, stop streaming the voice signal.
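The keyword-then-VAD flow described above can be sketched as a small host-side state machine. Everything here is hypothetical: the per-frame `(keyword, voice)` flags, the function name and the `hangover` parameter are assumptions for illustration and do not represent the XVF3100 control interface. The idea is simply to start streaming once the keyword fires and stop after the VAD has reported no voice for a few consecutive frames.

```python
def frames_to_stream(events, hangover=2):
    """Return the indices of audio frames that would be sent to the ASR engine.

    events   -- iterable of (keyword_detected, voice_active) booleans, one per frame
    hangover -- consecutive non-voice frames that end the utterance (assumed value)
    """
    streaming = False
    silent = 0
    streamed = []
    for i, (keyword, voice) in enumerate(events):
        if not streaming:
            if keyword:
                # Keyword heard: begin streaming from the next frame.
                streaming = True
                silent = 0
            continue
        silent = 0 if voice else silent + 1
        if silent >= hangover:
            # VAD says the talker has finished: stop streaming.
            streaming = False
            continue
        streamed.append(i)
    return streamed


events = [
    (False, False),  # 0: background, nothing happening
    (True, True),    # 1: keyword recognised
    (False, True),   # 2: command being spoken
    (False, True),   # 3
    (False, False),  # 4: short pause
    (False, True),   # 5: command continues
    (False, False),  # 6: silence begins
    (False, False),  # 7: second silent frame, utterance ends
    (False, True),   # 8: later speech without keyword is ignored
]
streamed = frames_to_stream(events)  # [2, 3, 4, 5, 6]
```

The short hangover keeps a brief pause mid-sentence from cutting the stream off early, while still ending it promptly once the person has stopped talking.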
Tell me more!
UPDATED: This article has been updated with a new pipeline image and information about AGC support, 25 June 2018.