Voice Processing Pipeline

Overview and Key Features

The XVF3800 integrates a set of advanced Digital Signal Processing (DSP) algorithms: Acoustic Echo Cancellation (AEC), beamforming, dereverberation, noise suppression and automatic gain control. Together they deliver a high speech-to-noise ratio and natural-sounding speech, and eliminate acoustic echo while maintaining a transparent, low-latency communication link.

The key features of the XVF3800 solution are:

  • High levels of Acoustic Echo Cancellation and Suppression in conferencing and living room conditions.

  • State-of-the-art, robust and natural double-talk / full-duplex performance.

  • High speech clarity, even when users are several meters away, without requiring directional microphones.

  • Fast adaptive beamforming for tracking multiple near-end users.

  • Stationary / diffuse noise suppression.

  • Automatic gain control.

Main Functional Blocks

A high level diagram of the solution is shown below.

Fig. 4 Voice processing pipeline [image: XVF3800_voice_processing_pipeline_diagram.png]

Microphone Inputs

The XVF3800 captures voice signals through four digital microphones and converts them from Pulse Density Modulation (PDM) to Pulse Code Modulation (PCM). It passes the converted signals to the voice pipeline together with the far-end signal, which is also played on the loudspeaker after passing through a Digital-to-Analog Converter (DAC) and amplifier.
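The PDM-to-PCM step can be pictured as low-pass filtering a 1-bit stream and reducing its rate. The sketch below is a deliberately simplified illustration that uses a plain block average. The function name and the decimation factor of 64 are assumptions made for illustration; the decimation filters actually used by the XVF3800 are not described in this section.

```python
def pdm_to_pcm(pdm_bits, decimation=64):
    """Convert a 1-bit PDM stream (values 0 or 1) to PCM samples in [-1.0, 1.0].

    Each output sample is the density of ones in one block of input bits,
    rescaled so that all zeros maps to -1.0 and all ones maps to +1.0.
    """
    pcm = []
    for start in range(0, len(pdm_bits) - decimation + 1, decimation):
        block = pdm_bits[start:start + decimation]
        density = sum(block) / decimation   # fraction of ones in the block
        pcm.append(2.0 * density - 1.0)
    return pcm


# A stream that alternates 1, 0, 1, 0 ... has a density of one half,
# which corresponds to a PCM value at the middle of the range.
print(pdm_to_pcm([1, 0] * 64))
```

A practical converter would use a multi-stage decimation filter with much better stop-band rejection than a block average; the sketch only shows the relationship between bit density and sample amplitude.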

Acoustic Echo Canceller

The first stage of the processing pipeline is the Acoustic Echo Canceller (AEC), which uses an adaptive filter to remove the echoes of the far-end signal from the microphone signals. Each microphone signal is processed independently and the output of the AEC is fed into the beamformer.

At startup, the AEC calibrates the adaptive filters to match the acoustic path between the loudspeaker and the microphones. This requires some far-end audio content, so that the device has a signal to adapt to. If the AEC detects a significant change to the acoustic path during operation, e.g. if the device is moved, it initiates a re-convergence operation.
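The idea of an adaptive filter converging on the loudspeaker-to-microphone path can be illustrated with a normalised least-mean-squares (NLMS) filter. This is a generic textbook sketch, not the XVF3800 algorithm: NLMS, the 16-tap filter length, the step size and the simulated echo path are all assumptions chosen to keep the example small. (For scale, the 192 ms tail length listed in the parameter table corresponds to 3072 samples at 16 kHz.)

```python
import random


def nlms_echo_cancel(mic, ref, taps=16, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `ref` from `mic` (one channel)."""
    weights = [0.0] * taps   # current estimate of the echo path
    history = [0.0] * taps   # latest reference samples, newest first
    residual = []
    for mic_sample, ref_sample in zip(mic, ref):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate            # echo-cancelled output
        energy = sum(x * x for x in history) + eps    # normalisation term
        step = mu * error / energy
        weights = [w + step * x for w, x in zip(weights, history)]
        residual.append(error)
    return residual


def simulate_echo(ref, path):
    """Hypothetical room: the microphone hears `ref` filtered by `path`."""
    return [sum(h * ref[n - k] for k, h in enumerate(path) if n - k >= 0)
            for n in range(len(ref))]


rng = random.Random(0)
far_end = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
mic = simulate_echo(far_end, [0.0, 0.5, 0.0, -0.3, 0.1])
cleaned = nlms_echo_cancel(mic, far_end)

echo_energy = sum(x * x for x in mic[-500:])
residual_energy = sum(x * x for x in cleaned[-500:])
print(f"echo energy {echo_energy:.3f}, residual energy {residual_energy:.3e}")
```

The filter can only adapt while the reference signal carries energy, which is why far-end audio is needed before the canceller converges. A change in `path` part-way through the run would cause the residual to rise until the weights adapt again, which is the behaviour described above as re-convergence.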

Beamformer

The beamformer block processes the AEC output signals to select the desired speaker. The beamformer contains a set of adaptive filters that coherently add signals from the four microphones to select sounds from a specific direction. This operation enhances the speech-to-noise ratio in that direction and simultaneously reduces the effects of point noise sources and reverberation.

The XVF3800 implements three beams: one free-running beam that scans the environment for new speakers, and two focused beams that can track individual speakers. The final stage of the pipeline automatically selects which beam to use as the output from the device.

Fig. 5 Beamformer Operation [image: Beamformer.png]

Information on the selected beams is available through the XVF3800 control interface. The device provides a Direction of Arrival (DoA) measurement indicating the direction of the selected beam.
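Coherent addition and direction finding can be illustrated with a fixed, narrowband delay-and-sum beamformer. This is a simplified stand-in for the adaptive filters described above, not the device's implementation. The linear four-microphone layout with 40 mm spacing, the 2 kHz test frequency and all function names are assumptions for illustration.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def steering_vector(mic_positions_m, angle_deg, freq_hz):
    """Phase of a plane wave from `angle_deg` at each microphone of a linear array."""
    theta = math.radians(angle_deg)
    return [cmath.exp(-2j * math.pi * freq_hz * x * math.cos(theta) / SPEED_OF_SOUND)
            for x in mic_positions_m]


def beam_response(mic_positions_m, look_deg, source_deg, freq_hz):
    """Gain (0 to 1) of a beam steered to `look_deg` for a source at `source_deg`."""
    weights = steering_vector(mic_positions_m, look_deg, freq_hz)
    arrival = steering_vector(mic_positions_m, source_deg, freq_hz)
    total = sum(w.conjugate() * a for w, a in zip(weights, arrival))
    return abs(total) / len(mic_positions_m)


def estimate_doa(mic_positions_m, source_deg, freq_hz):
    """Scan look angles from 0 to 180 degrees and return the strongest one."""
    return max(range(0, 181),
               key=lambda look: beam_response(mic_positions_m, look, source_deg, freq_hz))


mics = [0.00, 0.04, 0.08, 0.12]   # hypothetical linear array, 40 mm spacing
print("on-beam gain :", beam_response(mics, 60, 60, 2000.0))
print("off-beam gain:", beam_response(mics, 60, 150, 2000.0))
print("estimated DoA:", estimate_doa(mics, 60, 2000.0))
```

Signals arriving from the look direction add in phase, while signals from other directions add with mismatched phases and are attenuated. A linear array cannot distinguish front from back, so the scan covers only 0 to 180 degrees; covering the full circle would need a two-dimensional geometry.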

Post Processor

Outputs from the beamformer are fed to the post processing stage which further reduces reverberation and suppresses diffuse and point noise sources. This is followed by a gain control block which ensures a consistent output level regardless of the distance of the speaker from the microphone. The final output is passed through a limiter to ensure that any very loud signals do not overload the output.

The output of this pipeline is an enhanced signal containing the desired near-end speech, without echo or reverberation.
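The gain control and limiter stages can be sketched as a level tracker that scales the signal towards a target, followed by a ceiling on the output. This is a minimal illustration under stated assumptions: the target level, smoothing constant, maximum gain and ceiling are made-up values, and a hard clip stands in for the limiter, which in a real design would reduce gain more gently.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def agc_and_limit(samples, target_rms=0.1, max_gain=30.0, smoothing=0.01, ceiling=0.9):
    """Scale `samples` towards `target_rms`, then clip to +/- `ceiling`."""
    power = target_rms ** 2            # start at the target so the initial gain is 1
    out = []
    for x in samples:
        power += smoothing * (x * x - power)          # running power estimate
        gain = min(max_gain, target_rms / math.sqrt(power + 1e-12))
        y = gain * x
        out.append(max(-ceiling, min(ceiling, y)))    # hard clip as a stand-in limiter
    return out


def sine(amplitude, count=8000, freq_hz=440.0, sample_rate_hz=16000.0):
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(count)]


# A quiet (distant) talker and a loud (close) talker end up at a similar level.
print("quiet in/out:", rms(sine(0.01)), rms(agc_and_limit(sine(0.01))[-2000:]))
print("loud  in/out:", rms(sine(0.5)), rms(agc_and_limit(sine(0.5))[-2000:]))
```

The `max_gain` cap prevents the stage from amplifying silence or residual noise without bound, and the ceiling guarantees that a sudden loud input cannot exceed the output range before the gain has had time to adapt.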

Input and Output

The XVF3800 supports two types of audio interface to transport audio to and from the host system: I2S or USB. These are mutually exclusive and are selected when the firmware image is built.

I2S Audio Interface

The XVF3800 supports I2S sample rates of 16 kHz or 48 kHz. Both input and output must use the same rate.

The audio pipeline processes data with a sample rate of 16 kHz so, if 48 kHz inputs are used, a Sample Rate Converter block is introduced into the signal path to adapt the rates. The sample rates are set in the firmware and cannot be changed during operation of the device. The bit depth of the samples is fixed at 32 bits.
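Converting 48 kHz input to the 16 kHz pipeline rate is a 3:1 rate reduction, which needs a low-pass filter ahead of the downsampling so that content above 8 kHz does not alias into the speech band. The sketch below illustrates that principle only. The 63-tap Hamming-windowed sinc filter, the 7 kHz cutoff and the function names are assumptions; the design of the device's Sample Rate Converter is not described here.

```python
import math


def lowpass_taps(num_taps=63, cutoff_hz=7000.0, sample_rate_hz=48000.0):
    """Hamming-windowed sinc low-pass filter, normalised to unity gain at DC."""
    fc = cutoff_hz / sample_rate_hz
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [h / total for h in taps]


def decimate_by_3(samples_48k, taps):
    """Filter at 48 kHz and keep every third output sample (16 kHz result)."""
    out = []
    for m in range(len(samples_48k) // 3):
        n = 3 * m
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples_48k[n - k]
        out.append(acc)
    return out


def tone(freq_hz, count, sample_rate_hz=48000.0):
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz) for n in range(count)]


taps = lowpass_taps()
in_band = decimate_by_3(tone(1000.0, 4800), taps)       # should pass through
out_of_band = decimate_by_3(tone(20000.0, 4800), taps)  # should be suppressed
print("1 kHz peak :", max(abs(v) for v in in_band[50:]))
print("20 kHz peak:", max(abs(v) for v in out_of_band[50:]))
```

Without the filter, the 20 kHz tone would fold down to 4 kHz after downsampling and appear as a spurious component inside the speech band.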

When used in UA mode (host audio over USB), the XVF3800 has an active I2S master output which provides the far-end signal to the DAC.

USB Audio Interface

The XVF3800 implements a standard USB Audio Class 2.0 (UAC 2.0) device in Adaptive mode, which is compatible with hosts supporting USB 2.0 and USB 3.0 interfaces. All major operating systems now support UAC 2.0 devices natively, without requiring additional audio drivers.

In the UA configuration, the XVF3800 audio sample rate can be either 16 kHz or 48 kHz (fixed at build time) and must be the same as the output rate used for the DAC attached to the I2S output. The bit depth of the USB samples can be 16, 24 or 32 bits (fixed at build time).

Reference Signal for AEC

The XVF3800 supports a monophonic audio output and uses a single channel to provide the reference signal for the acoustic echo canceller (AEC). The far-end AEC reference signal must be provided on the left (0) channel of the I2S or USB input signal. Data on the right channel is ignored. To ensure that the far-end signal played into the room matches the reference that the AEC expects, the DAC is configured to play the left input channel on both the left and right outputs.
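The channel handling described above amounts to a simple routing rule, sketched here on a list of (left, right) sample pairs. The function name and the tuple representation are hypothetical and chosen only for illustration; on the device this routing is done by the firmware and the DAC configuration.

```python
def route_reference(stereo_frames):
    """Split host audio into the AEC reference and the frames sent to the DAC.

    `stereo_frames` is a list of (left, right) samples. Only the left
    channel is used: it becomes the AEC reference and is also copied to
    both DAC outputs, so the room hears exactly what the AEC expects.
    """
    aec_reference = [left for left, _right in stereo_frames]
    dac_frames = [(left, left) for left, _right in stereo_frames]
    return aec_reference, dac_frames


host_audio = [(0.1, 0.9), (0.2, -0.5), (-0.3, 0.0)]
reference, dac = route_reference(host_audio)
print(reference)  # left channel only
print(dac)        # left channel duplicated on both outputs
```

A host that sends different content on the right channel will not hear it played back, so stereo sources should be mixed down to mono on the left channel before being sent to the device.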

Key parameters

The key operating parameters of the audio processing pipeline are shown in the table below.

Table 2 Pipeline parameters

| Parameter | Value | Notes |
|---|---|---|
| Microphones | 4 off PDM | e.g. Infineon IM69D130 |
| Microphone alignment | +/- 2 dB | |
| Geometry | Linear or Square | |
| Frequency range | 80 Hz to 8 kHz | |
| Sampling rate | 16 kHz | |
| AEC tail length | 192 ms | |
| AEC reference channels | 1 mono | Output to DAC |
| Double talk detection | Continuous | |
| Reference delay | 0 to 500 ms (fixed) | Align microphone & reference signal |
| Number of beams | 3 | 2 focused + 1 scanning |
| Beamformer angle | 360 degrees | |
| Noise suppression | up to 25 dB | depending on input SNR |
| Operating distance | 0.3 m to 5 m | |
| Beamformer update time | 16 ms | |
| Input delay | min 58 ms | Microphone In to I2S out |
| Output delay | typ 50 ms | If far end processing on device is implemented |
| I2S or USB rate | 16 kHz or 48 kHz | Firmware options |
| I2S sample bit depth | 32 bits | |
| Input USB sample bit depth | 16, 24 or 32 bits | Firmware options |
| Output USB sample bit depth | 16, 24 or 32 bits | Firmware options |
| Internal PLL range | +/- 1000 ppm | Meets USB Adaptive audio tolerance |
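Several of the time-based parameters are easier to reason about as sample counts at the 16 kHz pipeline rate. The short calculation below converts them. It uses only values from the table; the helper names are illustrative, and applying the ppm figure to the 48 kHz audio rate is an assumption about how the tolerance would be interpreted.

```python
PIPELINE_RATE_HZ = 16_000


def ms_to_samples(duration_ms, sample_rate_hz=PIPELINE_RATE_HZ):
    """Number of samples spanned by `duration_ms` at the given rate."""
    return round(duration_ms * sample_rate_hz / 1000)


def ppm_to_hz(nominal_hz, ppm):
    """Absolute frequency deviation for a relative tolerance in parts per million."""
    return nominal_hz * ppm / 1_000_000


aec_tail_samples = ms_to_samples(192)        # 3072 samples of echo tail
beam_update_samples = ms_to_samples(16)      # 256 samples per beamformer update
max_ref_delay_samples = ms_to_samples(500)   # 8000 samples of reference delay
min_input_delay_samples = ms_to_samples(58)  # 928 samples, microphone in to I2S out
nyquist_hz = PIPELINE_RATE_HZ / 2            # 8000.0 Hz, the top of the frequency range
pll_pull_hz = ppm_to_hz(48_000, 1000)        # +/- 48.0 Hz at a 48 kHz audio rate

print(aec_tail_samples, beam_update_samples, max_ref_delay_samples,
      min_input_delay_samples, nyquist_hz, pll_pull_hz)
```

The upper limit of the 80 Hz to 8 kHz frequency range coincides with the Nyquist frequency of the 16 kHz pipeline, so content above 8 kHz in a 48 kHz input cannot be represented after rate conversion.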

Additional information on the operation of the audio pipeline and the control interface can be found in the XVF3800 User Guide.