Voice Capturing Basics

General Setup - Challenges of a Voice Frontend

../../_images/image2.svg

Fig. 1 Voice capturing challenges overview

An audio frontend (AFE) - also called a voice frontend - needs to address several challenges when capturing voice:

  • Reverb is added by room reflections (depending on room acoustics)

  • Noise makes the voice less audible

  • Sounds played from the capturing device’s loudspeaker will echo from walls and will be fed back to the internal microphone

  • Talkers can be close to the microphone or far away from it

These challenges are addressed by individual building blocks, which are explained in the following chapters.

The following terminology is used throughout this document:

In communications, the near end refers to the user’s own side of the conversation, including their microphone, speech and local acoustic environment. The far end refers to the remote talker whose voice is received through the system, along with any echoes or noise originating from the remote side.

The recipient on the far-end side can be either a human or a machine, such as an automatic speech recognition (ASR) or chatbot system.

Differences between Humans and Machines (from an AFE perspective)

Voice frontends designed for humans and those designed for machines (such as ASR systems or audio classifiers) differ significantly in their preferred input characteristics.

Humans generally prefer a signal-to-noise ratio (SNR) of around +6 to +10 dB with reasonably stable listening levels (variations of less than about 6 dB are comfortable), and a fast automatic gain control (AGC) is often appreciated. Some distortion is tolerated, as are a small amount of musical tones and even quite a lot of reverberation. However, humans strongly dislike non-stationary noise and hearing their own echo.

Machines, by contrast, tend to perform well with SNRs from 0 to +6 dB and accept a wide input level range as long as the voice-activity-detection (VAD) threshold is reached. They may even benefit from reverb if the system has been trained on similar conditions. Machine-learning models can be trained to handle reverberation and distortion, but they are sensitive to certain artefacts.

Machine-learning systems generally perform poorly with:

  • Echo (which may trigger their own hotword or command, or jam the ASR pipeline)

  • Musical tones (which disrupt Mel-spectrogram features)

  • AGC level changes (which interfere with adaptive model weights)

Levels

../../_images/image4.svg

Fig. 2 Sound pressure level reference

Moderate human voice level is typically defined as 60 dB at a distance of one metre (unweighted SPL, measured in the direction of the talker's mouth). When issuing a voice command, people usually speak slightly louder, at around 65-68 dB at one metre. As a rule of thumb, doubling the distance reduces the sound level by approximately 6 dB, while halving the distance increases it by approximately 6 dB (more precisely, the level drops by 20*log10(x/x_ref) dB, where x is the new distance and x_ref the reference distance).
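The distance rule above can be sketched in a few lines of Python. This is a minimal illustration that assumes an idealised point source in a free field (no room reflections); the function name `level_at_distance` and its parameters are hypothetical and not part of any library.

```python
import math


def level_at_distance(level_ref_db, distance_m, ref_distance_m=1.0):
    """Estimate the sound level at distance_m from a level known at ref_distance_m.

    Assumption: inverse-distance law of an idealised point source in a free
    field, i.e. the level drops by 20*log10(x / x_ref) dB.
    """
    if distance_m <= 0 or ref_distance_m <= 0:
        raise ValueError("distances must be positive")
    return level_ref_db - 20.0 * math.log10(distance_m / ref_distance_m)


# Moderate voice (60 dB at 1 m) heard from twice and from half the distance.
print(round(level_at_distance(60.0, 2.0), 1))
print(round(level_at_distance(60.0, 0.5), 1))
```

In a real room the level falls off more slowly once reflections dominate, so results from this sketch are best read as a first estimate.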

In terms of human listening preferences, ambient sound - such as background music - should be approximately 3 dB above the background noise level. Foreground audio or speech should be roughly 6-10 dB above the ambient (noise) level.

The level difference between the signal of interest and the noise is referred to as the signal-to-noise ratio (SNR). A positive SNR means that the signal is louder than the noise; a negative SNR means that the target signal is buried in the noise.

../../_images/image6.svg

Fig. 3 Signal-to-Noise Ratio (SNR) example

As a rule of thumb, the target SNR at the output of the audio frontend, after all optimization and filtering, should be +10 dB or better for human-to-human conversation with good intelligibility, or +6 dB for an automatic speech recognition system.
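As an illustration of how an SNR figure can be obtained, the sketch below estimates it from two sample sequences, one containing only the signal of interest and one containing only noise. It assumes both sequences were recorded with the same gain and are available separately, which is usually only the case in test setups; the helper names `mean_power` and `snr_db` are hypothetical.

```python
import math


def mean_power(samples):
    """Mean power (mean of squared samples) of a non-empty sequence."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(signal, noise):
    """SNR in dB as the ratio of mean signal power to mean noise power."""
    p_signal = mean_power(signal)
    p_noise = mean_power(noise)
    if p_signal <= 0 or p_noise <= 0:
        raise ValueError("signal and noise must have non-zero power")
    return 10.0 * math.log10(p_signal / p_noise)


# A signal with twice the amplitude of the noise has four times its power.
signal = [2.0, -2.0, 2.0, -2.0]
noise = [1.0, -1.0, 1.0, -1.0]
print(round(snr_db(signal, noise), 2))
```

When the levels are already given in dB, as in the examples of this document, the SNR is simply the signal level minus the noise level.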

Example

In an assumed smart TV scenario:

  • Ambient noise is 53 dB at 1 m (e.g. aircon) in a reverberant (diffuse) sound field

  • TV is playing at 65 dB at 1 m

  • The microphone is 12.5 cm away from the loudspeaker

  • The user talks in a raised voice of 65 dB (referenced to 1 m) at a distance of 2 m from the TV

../../_images/image8.svg

Fig. 4 Target SNR requirements

To achieve +6 dB SNR, the audio frontend needs to:

  • Reduce the signal fed back from the internal loudspeakers by at least 30 dB (a task for the acoustic echo canceller, AEC)

  • Gain some SNR against aircon noise with spatial filtering (beamforming) or noise reduction schemes
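The numbers in this example can be reproduced with a short calculation. The sketch below is a rough estimate under stated assumptions: the loudspeaker and the talker are treated as point sources in a free field, the microphone sits at the TV, the talker's 65 dB is referenced to 1 m, and the diffuse aircon noise is taken as 53 dB regardless of position. All names are hypothetical.

```python
import math


def level_at_distance(level_ref_db, distance_m, ref_distance_m=1.0):
    """Inverse-distance law: level drops by 20*log10(x / x_ref) dB."""
    return level_ref_db - 20.0 * math.log10(distance_m / ref_distance_m)


def power_sum_db(levels_db):
    """Combine uncorrelated sources by adding their powers."""
    return 10.0 * math.log10(sum(10.0 ** (lvl / 10.0) for lvl in levels_db))


TARGET_SNR_DB = 6.0

voice_at_mic = level_at_distance(65.0, 2.0)    # talker 2 m from the microphone
echo_at_mic = level_at_distance(65.0, 0.125)   # loudspeaker 12.5 cm from the microphone
noise_at_mic = 53.0                            # diffuse field, assumed position-independent

# Echo reduction needed so that the voice is TARGET_SNR_DB above the echo alone.
required_echo_reduction = echo_at_mic - (voice_at_mic - TARGET_SNR_DB)

# SNR against residual echo and aircon noise together, after that reduction.
residual_echo = echo_at_mic - required_echo_reduction
snr_total = voice_at_mic - power_sum_db([residual_echo, noise_at_mic])

print(f"voice at mic:            {voice_at_mic:.1f} dB")
print(f"echo at mic:             {echo_at_mic:.1f} dB")
print(f"required echo reduction: {required_echo_reduction:.1f} dB")
print(f"SNR vs echo + noise:     {snr_total:.1f} dB")
```

Under these assumptions the required echo reduction comes out at about 30 dB. With the echo reduced by exactly that amount, the residual echo and the aircon noise together still leave the combined SNR at roughly +3 dB, which is why additional gain from beamforming or noise reduction is needed.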