Voice Processing Pipeline

Overview and Key Features

The XVF3800 integrates a set of advanced Digital Signal Processing (DSP) algorithms: Acoustic Echo Cancellation (AEC), beamforming, dereverberation, noise suppression and automatic gain control. Together these algorithms deliver a high speech-to-noise ratio and natural-sounding speech, and eliminate acoustic echo while maintaining a transparent, low-latency communication link.

The key features of the XVF3800 solution are:

High levels of Acoustic Echo Cancellation and Suppression in conferencing and living room conditions.
State-of-the-art, robust, and natural double-talk / full-duplex performance.
High speech clarity even when users are several meters away, without requiring directional microphones.
Fast adaptive beamforming for tracking multiple near-end users.
Stationary / diffuse noise suppression.
Automatic gain control.

Main Functional Blocks

A high-level diagram of the solution is shown below.

Fig. 4 Voice processing pipeline (image: 03_XVF38xx_voice_processing_pipeline_diagram.png)

Microphone Inputs

The XVF3800 captures voice signals through four digital microphones and converts them from Pulse Density Modulation (PDM) to Pulse Code Modulation (PCM). It passes the converted signals to the voice pipeline together with the far-end signal, which is also played on the loudspeaker after passing through a Digital-to-Analog Converter (DAC) and amplifier.
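
The PDM-to-PCM step can be illustrated with a minimal sketch. It assumes a stream of 0/1 bits and a hypothetical decimation factor of 64, and it averages blocks of bits with a plain boxcar filter. Production converters typically use multi-stage decimation filters, so this shows the principle only and is not the XVF3800 implementation.

```python
def pdm_to_pcm(bits, decimation=64):
    """Convert a 1-bit PDM stream to PCM by averaging blocks of bits.

    `decimation` is a hypothetical ratio chosen for illustration.
    """
    pcm = []
    for start in range(0, len(bits) - decimation + 1, decimation):
        block = bits[start:start + decimation]
        # Map 0/1 to -1/+1 and average: pulse density becomes amplitude.
        pcm.append(sum(2 * b - 1 for b in block) / decimation)
    return pcm
```

A block made entirely of ones maps to full positive scale, and a block with equal numbers of ones and zeros maps to zero.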

Acoustic Echo Canceller

The first stage of the processing pipeline is the Acoustic Echo Canceller (AEC), which uses an adaptive filter to remove the echoes of the far-end signal from the microphone signals. Each microphone signal is processed independently, and the output of the AEC is fed into the beamformer.

At startup the AEC calibrates the adaptive filters to match the acoustic path between the loudspeaker and the microphones. This requires some far-end audio content to provide a signal to the device. If the AEC detects a significant change to the acoustic path during operation, e.g. if the device is moved, it will initiate a re-convergence operation.
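
To illustrate how an adaptive filter can converge on the loudspeaker-to-microphone path, the sketch below uses a normalized least-mean-squares (NLMS) update. NLMS is a common textbook choice and is an assumption here; the source does not state which adaptation algorithm the XVF3800 uses. The echo path `room`, the tap count and the step size `mu` are hypothetical values.

```python
import random

def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the echo of `ref` from `mic`."""
    w = [0.0] * taps          # current estimate of the echo path
    history = [0.0] * taps    # recent reference samples, newest first
    out = []
    for d, x in zip(mic, ref):
        history = [x] + history[:-1]
        echo_est = sum(wi * hi for wi, hi in zip(w, history))
        e = d - echo_est      # residual after echo removal
        norm = sum(h * h for h in history) + eps
        # NLMS update: step in the direction that reduces the residual.
        w = [wi + mu * e * hi / norm for wi, hi in zip(w, history)]
        out.append(e)
    return out, w

# Far-end audio is needed for the filter to adapt, as noted above.
rng = random.Random(0)
far_end = [rng.uniform(-1, 1) for _ in range(2000)]
room = [0.6, 0.3, -0.1]       # hypothetical echo path
mic = [sum(room[k] * far_end[n - k] for k in range(len(room)) if n - k >= 0)
       for n in range(len(far_end))]
residual, weights = nlms_echo_cancel(mic, far_end)
```

With no near-end speech or noise in this toy setup, the leading filter weights approach the values in `room` and the residual decays towards zero.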

Beamformer

The beamformer block processes the AEC output signals to select the desired speaker. The beamformer contains a set of adaptive filters that coherently add signals from the four microphones to select sounds from a specific direction. This operation improves the speech-to-noise ratio in that direction while reducing the effects of point noise sources and reverberation.
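
The idea of coherently adding microphone signals can be sketched with a fixed delay-and-sum beamformer. This is a deliberately simpler technique than the adaptive filters described above and is used here only to show why alignment favours one direction. The integer arrival delays in `arrival` are hypothetical.

```python
import random

def delay_and_sum(channels, delays):
    """Align each channel by its integer delay (in samples) and average."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
            for n in range(length)]

def energy(signal):
    return sum(v * v for v in signal)

rng = random.Random(1)
source = [rng.uniform(-1, 1) for _ in range(500)]
arrival = [0, 1, 2, 3]  # hypothetical per-microphone delays for one direction
channels = [[0.0] * d + source for d in arrival]

steered = delay_and_sum(channels, arrival)         # beam pointed at the source
unsteered = delay_and_sum(channels, [0, 0, 0, 0])  # beam pointed elsewhere
```

When the steering delays match the arrival delays, the four copies add in phase and the source is recovered. With mismatched delays the copies partially cancel and the output energy drops.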

The XVF3800 implements three beams: one free-running beam that scans the environment for new speakers, and two focused beams that can track individual speakers. An alternative operating mode is also supported in which both focused beams can be fixed to a user-specified azimuth angle. The final stage of the pipeline automatically selects which beam to use as the output from the device.

Fig. 5 Beamformer Operation (image: 03_beamformer.png)

It is possible to access information on the selected beams from the XVF3800 control interface. The device provides a Direction of Arrival (DoA) measurement indicating the direction of the selected beam.

Post Processor

Outputs from the beamformer are fed to the post-processing stage, which further reduces reverberation and suppresses diffuse and point noise sources. This is followed by a programmable equalization filter to adjust the frequency response of the output signal, and a gain control block which ensures a consistent output level regardless of the distance of the speaker from the microphone. The final output is passed through a limiter to ensure that very loud signals do not overload the output.
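
The last two stages, gain control followed by a limiter, can be sketched as below. This is a minimal illustration using an envelope follower and a hard clamp. The target level, smoothing rate, maximum gain and ceiling are hypothetical values, and the equalization filter is omitted.

```python
def agc_and_limit(samples, target=0.25, rate=0.01, max_gain=100.0, ceiling=0.9):
    """Level a signal towards `target`, then clamp it to +/- `ceiling`."""
    envelope = target  # start at unity gain
    out = []
    for x in samples:
        # Smoothed estimate of the current signal magnitude.
        envelope += rate * (abs(x) - envelope)
        gain = min(target / max(envelope, 1e-9), max_gain)
        y = x * gain
        # Limiter: catch peaks that the slower gain control has not yet tamed.
        out.append(max(-ceiling, min(ceiling, y)))
    return out

quiet = [0.01 * (-1) ** n for n in range(2000)]  # distant talker
loud = [1.0 * (-1) ** n for n in range(2000)]    # close talker
levelled = agc_and_limit(quiet + loud)
```

Both the quiet and the loud segment settle at the same output level, and the limiter holds the output at the ceiling during the sudden jump between them.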

The output from this pipeline is an enhanced near-end speech signal with echo and reverberation removed.

Automatic Speech Recognition (ASR) output

The output of the beamformer can be used as an input to an Automatic Speech Recognition (ASR) engine. In this mode the XVF3800 provides a configurable fixed gain to adapt the input level to the ASR engine.

Input and Output

The XVF3800 supports two types of audio interface to transport audio to and from the host system: I²S and USB. These are mutually exclusive, and the choice is made when the firmware image is built.

I²S Audio Interface

The XVF3800 supports I²S sample rates of 16 kHz or 48 kHz. Both input and output must use the same rate.

The audio pipeline processes data with a sample rate of 16 kHz so, if 48 kHz inputs are used, a Sample Rate Converter block is introduced into the signal path to adapt the rates. The sample rates are set in the firmware and cannot be changed during operation of the device. The bit depth of the samples is fixed at 32 bits.
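
Adapting 48 kHz input to the 16 kHz pipeline rate means low-pass filtering and then keeping every third sample. The sketch below assumes a Hamming-windowed sinc filter with a hypothetical length of 63 taps and a 7 kHz cutoff. The actual Sample Rate Converter design is not described in the source.

```python
import math

def lowpass_taps(num_taps=63, cutoff_hz=7000, rate_hz=48000):
    """Hamming-windowed sinc low-pass filter, normalized to unity DC gain."""
    fc = cutoff_hz / rate_hz
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [h / total for h in taps]

def decimate_by_3(samples, taps):
    """Filter at 48 kHz and keep every third output sample (16 kHz)."""
    return [sum(h * samples[n - k] for k, h in enumerate(taps))
            for n in range(len(taps) - 1, len(samples), 3)]

def tone(freq_hz, count=480, rate_hz=48000):
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(count)]

taps = lowpass_taps()
in_band = decimate_by_3(tone(1000), taps)       # within the 16 kHz Nyquist band
out_of_band = decimate_by_3(tone(20000), taps)  # would alias if not filtered
```

The 1 kHz tone passes through almost unchanged, while the 20 kHz tone is strongly attenuated before decimation so that it does not fold back into the speech band.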

When used in UA mode (host audio over USB), the XVF3800 has an active I²S master output which provides the far-end signal to the DAC.

USB Audio Interface

The XVF3800 implements a standard USB Audio Class 2.0 (UAC 2.0) device in Adaptive mode, which is compatible with USB hosts supporting USB 2.0 and 3.0 interfaces. All major operating systems now support UAC 2.0 devices natively, with no need to install additional audio drivers.

In the UA configuration the XVF3800 audio sample rate can be either 16 kHz or 48 kHz (fixed at build time) and must be the same as the output rate used for the DAC attached to the I²S output. The bit depth of the USB samples can be 16, 24, or 32 bits (fixed at build time).

Reference Signal for AEC

The XVF3800 supports a monophonic audio output and uses a single channel to provide the reference signal for the acoustic echo canceller (AEC). A far-end AEC reference signal must be provided on the left (0) channel of the I²S or USB input signal. Data on the right channel is ignored. To ensure that the far-end signal played into the room matches the far-end signal that the AEC is expecting, the DAC is configured to play the left input channel on both the right and left outputs.
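
The channel routing described above can be sketched from the host's point of view. The helper names `make_reference_frames` and `dac_route` are hypothetical and exist only for this illustration. They model stereo frames as (left, right) tuples.

```python
def make_reference_frames(mono):
    """Place the mono far-end signal on the left (0) channel of each frame."""
    # The right channel is ignored by the device, so it is filled with zeros.
    return [(sample, 0) for sample in mono]

def dac_route(frames):
    """Model the DAC configuration: left input drives both outputs."""
    return [(left, left) for left, _right in frames]

far_end = [100, -200, 300]
frames = make_reference_frames(far_end)
speaker = dac_route(frames)
```

Because both loudspeaker outputs carry the left channel, the sound in the room corresponds to the single reference channel that the AEC sees.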

Key parameters

The key operating parameters of the audio processing pipeline are shown in the table below.

Table 2 Pipeline parameters
Parameter	Value	Notes

Microphones	4 off PDM	e.g. Infineon IM69D130
Microphone alignment	+/- 2 dB
Geometry	Linear or Square
Frequency range	80 Hz to 8 kHz
Sampling rate	16 kHz
AEC tail length	192 ms
AEC reference channels	1 mono	Output to DAC
Double talk detection	Continuous
Reference delay	0 to 500 ms (fixed)	Align microphone & reference signal
Number of beams	3	2 focused + 1 scanning
Beamformer angle	360 degrees
Noise suppression	up to 25 dB	depending on input SNR
Operating distance	0.3 m to 5 m
Beamformer update time	16 ms
Input delay	min 58 ms	Microphone In to I²S out
Output delay	typ 50 ms	If far end processing on device is implemented
I²S or USB rate	16 kHz or 48 kHz	Firmware options
I²S sample bit depth	32 bits
Input USB sample bit depth	16, 24 or 32 bits	Firmware options
Output USB sample bit depth	16, 24 or 32 bits	Firmware options
Internal PLL range	+/- 1000 ppm	Meets USB Adaptive audio tolerance
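
Several of the values in the table are easier to relate to each other when converted into samples at the 16 kHz pipeline rate. The short calculation below does this. The helper names are hypothetical, and the sample counts describe the time spans only, not how the algorithms are implemented internally.

```python
PIPELINE_RATE_HZ = 16_000

def ms_to_samples(ms, rate_hz=PIPELINE_RATE_HZ):
    """Number of samples spanned by `ms` milliseconds at `rate_hz`."""
    return ms * rate_hz // 1000

def ppm_to_hz(ppm, rate_hz):
    """Frequency offset in Hz corresponding to `ppm` parts per million."""
    return rate_hz * ppm / 1_000_000

aec_tail_samples = ms_to_samples(192)       # AEC tail length
beam_update_samples = ms_to_samples(16)     # beamformer update time
max_ref_delay_samples = ms_to_samples(500)  # upper end of reference delay
pll_offset_16k = ppm_to_hz(1000, 16_000)    # PLL range at 16 kHz
pll_offset_48k = ppm_to_hz(1000, 48_000)    # PLL range at 48 kHz
```

The 192 ms AEC tail spans 3072 samples, the 16 ms beamformer update spans 256 samples, and a +/- 1000 ppm PLL range corresponds to +/- 16 Hz at 16 kHz or +/- 48 Hz at 48 kHz.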

Additional information on the operation of the audio pipeline and the control interface can be found in the XVF3800 User Guide.