Voice Capturing Basics

General Setup - Challenges of a Voice Frontend

Fig. 1: Voice capturing challenges overview (../../_images/image2.svg)

An audio frontend (AFE), also called a voice frontend, needs to address several challenges when capturing voice:

Reverb is added by room reflections (depending on room acoustics)
Noise makes the voice less audible
Sounds played from the capturing device’s loudspeaker are reflected by the walls and fed back into the internal microphone
Talkers can be close to or farther away from the microphone

These challenges are addressed by individual building blocks, which are explained in the following chapters.

The following terminology is used throughout this document:

In communications, the near end refers to the user’s own side of the conversation, including their microphone, speech and local acoustic environment. The far end refers to the remote talker whose voice is received through the system, along with any echoes or noise originating from the remote side.

The recipient on the far-end side can be either a human or a machine, such as an automatic speech recognition (ASR) or chatbot system.

Differences between Humans and Machines (from an AFE perspective)

Voice frontends designed for humans and those designed for machines (such as ASR systems or audio classifiers) differ significantly in their preferred input characteristics.

Humans generally prefer a signal-to-noise ratio (SNR) of around +6 to +10 dB and reasonably stable listening levels: variations of less than about 6 dB are comfortable, and a fast automatic gain control (AGC) is often appreciated. Some distortion, a small amount of musical tones and even quite a lot of reverberation are tolerated. However, humans strongly dislike non-stationary noise and hearing their own echo.

Machines, by contrast, tend to perform well with SNRs from 0 to +6 dB and accept a wide input level range as long as the voice-activity-detection (VAD) threshold is reached. They may even benefit from reverb if the system has been trained on similar conditions. Machine-learning models can be trained to handle reverberation and distortion, but they are sensitive to certain artefacts.

Machine-learning systems generally perform poorly with:

Echo (which may trigger their own hotword or command, or jam the ASR pipeline)
Musical tones (which disrupt Mel-spectrogram features)
AGC level changes (which interfere with adaptive model weights)

Levels

Fig. 2: Sound pressure level reference (../../_images/image4.svg)

A moderate human voice level is typically defined as 60 dB at a distance of one metre (unweighted SPL, measured in the direction of the talker’s mouth). When issuing a voice command, people usually speak slightly louder, at around 65-68 dB at one metre. As a rule of thumb, doubling the distance reduces the sound level by about 6 dB, while halving the distance increases it by about 6 dB (more precisely, the level changes by -20*log10(x/x_ref) dB when moving from distance x_ref to distance x).
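The distance rule can be sketched in a few lines of Python. This is only an illustration: the function name `level_at_distance` and its parameters are made up for this example, and the formula assumes a point source in a free field, so it becomes less accurate in reverberant rooms at larger distances.

```python
import math

def level_at_distance(level_ref_db, distance_m, ref_distance_m=1.0):
    """Estimate the level at distance_m from the level at ref_distance_m.

    Assumes free-field, point-source propagation (about -6 dB per doubling).
    """
    if distance_m <= 0 or ref_distance_m <= 0:
        raise ValueError("distances must be positive")
    return level_ref_db - 20.0 * math.log10(distance_m / ref_distance_m)

# Moderate voice (60 dB at 1 m) heard at 2 m and at 0.5 m
print(round(level_at_distance(60.0, 2.0), 1))   # 54.0
print(round(level_at_distance(60.0, 0.5), 1))   # 66.0
```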

In terms of human listening preferences, ambient sound - such as background music - should be approximately 3 dB above the background noise level. Foreground audio or speech should be roughly 6-10 dB above the ambient (noise) level.

The level difference between the signal of interest and the noise is referred to as the signal-to-noise ratio (SNR). A positive SNR describes a signal that is louder than the noise; a negative SNR indicates that the target signal is buried in the noise.
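Because levels are expressed in dB, the SNR is a plain subtraction. The sketch below also shows how several interfering sources might be combined before that subtraction; the power summation is an assumption that holds for uncorrelated sources and is not taken from this document. The names `snr_db` and `combine_levels_db` are hypothetical.

```python
import math

def snr_db(signal_level_db, noise_level_db):
    """SNR in dB as the difference between signal level and noise level."""
    return signal_level_db - noise_level_db

def combine_levels_db(*levels_db):
    """Total level of several sources, summed as powers.

    Assumption: the sources are uncorrelated (e.g. aircon noise and music).
    """
    return 10.0 * math.log10(sum(10.0 ** (lv / 10.0) for lv in levels_db))

print(snr_db(59.0, 53.0))                           # 6.0
print(round(combine_levels_db(53.0, 53.0), 1))      # 56.0
```

Two equally loud uncorrelated sources therefore raise the total level by about 3 dB, which lowers the SNR by the same amount.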

Fig. 3: Signal-to-Noise Ratio (SNR) example (../../_images/image6.svg)

As a rule of thumb, the target SNR after all optimization and filtering by the audio frontend should be +10 dB or better for human-to-human conversation with good intelligibility, or +6 dB or better for an automatic speech recognition system.

Example

In an assumed smart TV scenario:

Ambient noise is 53 dB at 1 m (e.g. aircon) in a reverberant (diffuse) sound field
The TV is playing at 65 dB at 1 m
The microphone is 12.5 cm away from the loudspeaker
The user talks in an elevated voice (65 dB at 1 m) at a distance of 2 m from the TV

Fig. 4: Target SNR requirements (../../_images/image8.svg)

To achieve +6 dB SNR, the audio frontend needs to:

Reduce the signal fed back from the internal loudspeakers by at least 30 dB (a task for the acoustic echo canceller, AEC)
Gain some SNR against the aircon noise with spatial filtering (beamforming) or noise reduction schemes
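The 30 dB figure can be reproduced with a short calculation. The sketch below rests on several assumptions that are not stated explicitly in the scenario: the user's 65 dB is referenced to 1 m, the microphone sits in the TV (2 m from the user), the loudspeaker and the talker behave like free-field point sources, the diffuse aircon noise is the same 53 dB at the microphone, and residual echo and noise add as uncorrelated powers. All function and variable names are hypothetical.

```python
import math

def level_at_distance(level_db, distance_m, ref_distance_m=1.0):
    """Free-field level at distance_m, given the level at ref_distance_m."""
    return level_db - 20.0 * math.log10(distance_m / ref_distance_m)

def power_sum_db(*levels_db):
    """Total level of uncorrelated sources (assumption: powers add)."""
    return 10.0 * math.log10(sum(10.0 ** (lv / 10.0) for lv in levels_db))

TARGET_SNR_DB = 6.0

# Levels at the microphone
noise_at_mic = 53.0                              # diffuse field: position independent
echo_at_mic = level_at_distance(65.0, 0.125)     # TV loudspeaker, about 83 dB
speech_at_mic = level_at_distance(65.0, 2.0)     # user at 2 m, about 59 dB

# Echo: how much must the AEC remove to reach the target SNR?
snr_vs_echo = speech_at_mic - echo_at_mic                  # about -24 dB
required_echo_reduction = TARGET_SNR_DB - snr_vs_echo      # about 30 dB

# Noise: SNR against the aircon alone, and against noise plus residual echo
snr_vs_noise = speech_at_mic - noise_at_mic                # about +6 dB
residual_echo = echo_at_mic - required_echo_reduction
overall_snr = speech_at_mic - power_sum_db(residual_echo, noise_at_mic)  # about +3 dB

print(f"echo at mic:             {echo_at_mic:.1f} dB")
print(f"speech at mic:           {speech_at_mic:.1f} dB")
print(f"required echo reduction: {required_echo_reduction:.1f} dB")
print(f"SNR vs. aircon noise:    {snr_vs_noise:.1f} dB")
print(f"overall SNR after AEC:   {overall_snr:.1f} dB")
```

Under these assumptions, removing 30 dB of echo brings the residual echo down to about the level of the aircon noise. The two together leave an overall SNR of roughly +3 dB, which is why additional gain from beamforming or noise reduction is still needed to reach +6 dB.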