Building Blocks of an Audio-Frontend¶
Building Blocks Overview¶
Fig. 5 Audio frontend building blocks overview¶
Acoustic Echo Cancellation (AEC)¶
An acoustic echo canceller (AEC) removes the signal that is played through a device’s loudspeaker into the room and picked up again by its microphones. It models the acoustic path between loudspeaker and microphones, predicts the echo, and subtracts it from the captured signal in real time. By eliminating this feedback, the AEC ensures clear communication and prevents the far-end listener from hearing their own voice echoed back.
Fig. 6 Acoustic Echo Cancellation (AEC)¶
A non-working AEC causes:
Humans: When a user hears their own voice with a delay of more than 50 ms, the effect works like a speech jamming device.
Machines: ASR and Hotword-spotters will react to the echo from the same device.
Note: Any playback from the internal loudspeaker will be louder than a far-field talker, as the loudspeaker is closer to the microphone. Without AEC, barge-in (issuing a voice command while the system is playing sound) is not possible.
Fig. 7 Basic components of an AEC and direct / reflected echo paths¶
The amount of sound that is (re-)captured depends on the echo paths. It must be removed by an adaptive filter.
Direct coupling: The loudest echo path is the direct sound from the loudspeaker to the microphone via air and enclosure. This path changes little during use.
Room reflections (from tables, walls, windows) add further echo that is fed back. These vary heavily with the room and the people in it, and the echo path needs to be re-estimated several times per second as people and objects move.
The adaptive filter can only be tuned when there is a far-end signal but no near-end voice. As soon as a near-end voice is present, adaptation must be paused (double-talk detection).
AEC filters have a finite length in time, which limits how much reverberation time they can cancel. Any remaining echo residuals must therefore be cleaned up by a post-filter or echo suppressor, typically based on noise suppression or spectral subtraction techniques.
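The adaptive filter and the pause during double talk can be illustrated with a normalised LMS (NLMS) filter. This is a minimal sketch, not the algorithm of any particular product: NLMS is one common choice, and the function name `nlms_echo_canceller`, the `adapt` flag list, the step size and the three-tap `echo_path` are all assumptions made for illustration.

```python
import random

def nlms_echo_canceller(far_end, mic, taps=8, mu=0.5, eps=1e-8, adapt=None):
    """Return (residual, weights). adapt[n] == False freezes adaptation,
    e.g. while a (hypothetical) double-talk detector reports near-end speech."""
    weights = [0.0] * taps
    history = [0.0] * taps              # most recent far-end samples, newest first
    residual = []
    for n, (x, d) in enumerate(zip(far_end, mic)):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate           # microphone signal minus predicted echo
        residual.append(e)
        if adapt is None or adapt[n]:
            power = sum(h * h for h in history) + eps
            step = mu * e / power       # normalise by far-end signal power
            weights = [w + step * h for w, h in zip(weights, history)]
    return residual, weights

# Demo with an assumed, very short echo path and no near-end talker.
random.seed(0)
far = [random.uniform(-1.0, 1.0) for _ in range(2000)]
echo_path = [0.6, 0.3, -0.1]
mic = [sum(echo_path[k] * far[n - k] for k in range(len(echo_path)) if n - k >= 0)
       for n in range(len(far))]
residual, weights = nlms_echo_canceller(far, mic)
```

The filter length (`taps`) bounds how much of the echo tail can be modelled, which is the finite-length limitation described above; real rooms need far longer filters than this toy example.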
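Do not confuse echo cancellation with dereverberation: echo cancellation removes the device’s own loudspeaker signal, whereas dereverberation removes room reflections of the talker’s voice.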
Dereverberation¶
Fig. 8 In reverberant rooms, several overlapping, delayed signals are captured.¶
The job of a dereverberation (dereverb) system is to remove room reflections from the incoming audio signal (e.g. the talker’s voice). The total amount of reverberation depends on the room acoustics.
Reverberation reduces speech intelligibility for humans. ASR systems can be trained with reverberant material and thereby made more robust against it, but this requires larger and more complex models. The basic dereverberation strategy is to estimate the amplitude and timing of possible reverb tails and subtract them from the microphone input.
Dereverberation can be implemented using statistical models or modern DNN/AI approaches. Many AI-based signal-denoising algorithms now integrate dereverb and noise reduction into a single model. Ultimately, dereverb performance depends entirely on the acoustic reflections present in the user’s environment: no reflective room means no reverberation to remove.
Key points:
Strategy: estimate likely reverb tails and subtract them
Possible implementations: statistical models or AI-based models
AI denoising can include dereverb as part of one model
Dereverb depends solely on room reflections: no room, no reverb
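The "estimate the reverb tail and subtract it" strategy can be sketched for a single frequency band. The example assumes a simple statistical model in which reverberant energy decays exponentially by 60 dB over the reverberation time; the function name `late_reverb_gains`, the parameters and the gain floor are illustrative assumptions, not a description of a specific dereverb product.

```python
def late_reverb_gains(band_power, t60_s, frame_s, delay_frames, floor=0.1):
    """Per-frame power gains for one frequency band.

    Assumes the late reverberation in frame n is the power observed
    delay_frames earlier, attenuated according to a 60 dB decay over t60_s.
    """
    decay = 10.0 ** (-6.0 * delay_frames * frame_s / t60_s)
    gains = []
    for n, power in enumerate(band_power):
        past = band_power[n - delay_frames] if n >= delay_frames else 0.0
        reverb_estimate = decay * past
        if power <= 0.0:
            gains.append(floor)
            continue
        gain = 1.0 - reverb_estimate / power   # subtract estimated tail power
        gains.append(max(gain, floor))         # floor limits audible artefacts
    return gains

# A band whose power only decays (pure tail) is pushed down to the floor.
tail_gains = late_reverb_gains([1.0, 0.5, 0.25, 0.125],
                               t60_s=0.5, frame_s=0.025, delay_frames=2)
```

With no reflections there is nothing to estimate and the gains stay at 1, which matches the "no room, no reverb" point above.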
Active Noise Control/Cancellation (ANC)¶
Fig. 9 ANC aims to cancel out noise with a matching anti-noise signal.¶
Active noise cancelling (ANC) is used to reduce unwanted ambient noise. The external noise is captured with microphones, an “anti-noise” signal with the same amplitude but opposite phase is generated, and this signal is added to the playback so that noise and anti-noise cancel each other out.
Two basic requirements must be met to enable ANC:
The noise signal must be capturable or otherwise known.
The exact acoustic path from the point where the noise signal is captured to the listener’s ear must be predictable or known.
ANC is often used in applications where the loudspeaker is close to the listener’s ear and the noise can be captured by an ‘outside’ microphone, such as headphones, vehicles or appliances.
The main caveat is latency: the phase-inverted sound must be calculated fast enough to cancel out the noise, so any added latency limits effectiveness at high frequencies.
ANC is also sensitive to fast-changing or highly irregular noise and depends on good acoustic design. Poor sealing or microphone placement can reduce cancellation quality or even cause artefacts such as added hiss or instability.
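Why latency hurts most at high frequencies can be shown with a short calculation. For a pure tone and a perfectly matched anti-noise amplitude, an anti-noise signal that arrives late by τ leaves a residual of relative amplitude 2·|sin(π·f·τ)|. The sketch below evaluates this idealised model; the function name and the 50 µs example latency are assumptions chosen for illustration.

```python
import math

def anc_attenuation_db(freq_hz, latency_s):
    """Attenuation of a pure tone when the anti-noise arrives latency_s late.

    Positive values mean the tone is reduced, negative values mean the
    "anti-noise" actually makes it louder.
    """
    residual = 2.0 * abs(math.sin(math.pi * freq_hz * latency_s))
    if residual == 0.0:
        return math.inf
    return -20.0 * math.log10(residual)

latency = 50e-6  # assumed excess latency of 50 microseconds
for freq in (100, 1000, 5000):
    print(freq, "Hz:", round(anc_attenuation_db(freq, latency), 1), "dB")
```

In this model cancellation drops to 0 dB at f = 1/(6τ) and turns into amplification above that frequency, so even tens of microseconds matter in the kilohertz range.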
Beamforming and (Adaptive) Directivity¶
Fig. 10 Spatial separation directional beam patterns¶
Beamforming, also called spatial separation, is a technique that uses multiple microphones to focus on sound coming from a specific direction. Its job is to amplify the speaker’s voice while suppressing sounds arriving from other angles. By steering a directional “beam” toward the talker, it improves clarity for both humans and speech-recognition systems.
Fig. 11 Microphone array geometry¶
The geometric (physical) location of the microphones within the sound field defines the possible directivity patterns and the usable frequency range.
The most common microphone array geometries are:
Circular (or square) for 360° coverage, e.g. Smart Speaker
Linear (line), broadside (180°) coverage, e.g. TV or white-goods
Linear (line), endfire (very narrow angles possible), e.g. for security cameras.
Number of Microphones vs. Achievable Directivity¶
1: No Beamforming - omnidirectional
2: minimum for broadside or endfire - best achievable: cardioid directivity
3: poor compromise (not recommended)
4: minimum for true 360° coverage
More than 4: extends usable frequency range (smaller/wider distance) or sensitivity (better SNR)
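The relationship between microphone spacing and usable frequency range can be made concrete with the common half-wavelength rule of thumb: above f = c / (2·d), a uniformly spaced array starts to produce spatial aliasing (grating lobes). The snippet is a sketch under that assumption, with the speed of sound fixed at 343 m/s; the function name is illustrative.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed (air at roughly 20 °C)

def max_unaliased_frequency(spacing_m):
    """Highest frequency without spatial aliasing for a given mic spacing."""
    return SPEED_OF_SOUND / (2.0 * spacing_m)

for spacing_mm in (60, 30, 15):
    limit = max_unaliased_frequency(spacing_mm / 1000.0)
    print(spacing_mm, "mm ->", round(limit), "Hz")
```

For a 60 mm spacing this gives roughly 2.9 kHz, which is why adding microphones at a smaller spacing extends the usable range upwards.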
Fixed Beamforming¶
Fixed beamforming uses a predetermined, static microphone weighting pattern (a “beam”) that always points in one or more predefined directions. It does not react to changing talker positions or noise sources. The same microphone array can be used to calculate different fixed beams.
Pros
Simple and robust: Works reliably in predictable environments.
Low computational cost: Requires minimal processing power.
Stable behaviour: No risk of the beam drifting, e.g. toward noise sources.
Good for controlled setups: Ideal for intercoms, appliances, or devices where talker position is known.
Cons
No adaptation: Performs poorly when the user moves or when noise sources change.
Limited flexibility: Not suitable for dynamic environments like meeting rooms.
Lower SNR improvement: Can’t optimise itself based on real-time acoustic conditions.
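A fixed beam as described in this section can be sketched as a delay-and-sum beamformer: each channel is delayed so that a plane wave from the chosen direction lines up across all microphones, then the channels are averaged. The example assumes a linear array, whole-sample delays, an angle measured from broadside and a speed of sound of 343 m/s; all function names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def steering_delays(mic_positions_m, angle_deg, sample_rate):
    """Whole-sample delays that align a plane wave from angle_deg.

    Positions lie on one axis; 0 degrees is broadside, +90 degrees is
    endfire toward increasing position.
    """
    arrival = [-p * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
               for p in mic_positions_m]
    latest = max(arrival)
    return [round((latest - t) * sample_rate) for t in arrival]

def delay_and_sum(channels, delays):
    """Delay every channel by its static delay and average the result."""
    length = len(channels[0])
    out = [0.0] * length
    for channel, delay in zip(channels, delays):
        for n in range(delay, length):
            out[n] += channel[n - delay] / len(channels)
    return out

# Two microphones, source at endfire: mic 1 hears the click 5 samples early.
delays = steering_delays([0.0, 0.0357292], 90.0, 48000)
mic0 = [0.0] * 32
mic1 = [0.0] * 32
mic0[15] = 1.0
mic1[10] = 1.0
steered = delay_and_sum([mic0, mic1], delays)
unsteered = delay_and_sum([mic0, mic1], [0, 0])
```

Because the delays are computed once and never change, the same array can run several such beams in parallel at low cost, but none of them follows a moving talker.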
Adaptive Beamforming¶
Adaptive beamforming continuously adjusts its microphone weights to track the dominant speech source or to suppress noise sources. It uses different algorithms to determine the direction of arrival and derive coefficients for the beam control. Some adaptive beamforming algorithms can also output the estimated direction of arrival (DOA vectors).
Pros
Tracks moving voices: Excellent in dynamic environments.
Better noise suppression: Can null out new or unexpected noise sources.
Higher SNR: Typically outperforms fixed beamforming when conditions change.
Cons
Higher computational load: Needs more processing resources and memory.
Risk of wrong adaptation: May “lock onto” the wrong source (e.g. a loud fan).
Requires VAD: Needs reliable voice-activity detection to avoid adapting on noise.
More sensitive to microphone mismatch and calibration errors.
Conclusion¶
Fixed beamforming is simple, stable and low-power, making it ideal for situations where the talker’s position is known and does not change. Adaptive beamforming, by contrast, is more flexible, powerful and dynamic, as it can adjust to changing acoustic conditions and moving talkers. However, it is also more complex and may fail if its adaptation is misled by strong noise sources or incorrect voice cues.
Note: With 4 microphones in usual spacing (e.g. 60 mm), only a beam of 45-50° can be formed in the voice range. Super-directive beams cannot be formed with this configuration. The achievable SNR improvement in real-world setups and conditions may be considered to be 3-6 dB, as noise in reverberant rooms comes from all directions.
A similar technology, which aims to suppress noise instead of amplifying the target, is interference/side-lobe cancelling.
Interference Canceller¶
An interference canceller is designed to remove unwanted sounds - such as background noise or competing talkers - by exploiting differences between multiple microphone signals.
It works by analysing the phase and amplitude relationships across two or more microphones, identifying which parts of the sound field are consistent with interference rather than the desired speech. Using these spatial differences, it constructs a (multistage) filter that subtracts or suppresses the unwanted components while allowing the true speech signal to pass. To determine the difference between target signal and noise, a good and robust voice activity detection is needed.
A key strength of an interference canceller is that it adapts very quickly, making it effective in dynamic or noisy environments where traditional beamforming might initially “point the wrong way.” Because it separates sound based on spatial and statistical cues rather than fixed directionality, it can capture speech from any direction while rapidly rejecting noise and competing sources.
Especially during the adaptation phase, the output of the interference canceller may sound ‘robotic’ and less natural compared with a beamformer in adaptation. This is generally acceptable for ASR or intercom tasks, but it may fall short of the quality expected in wideband communication setups.
Comparison between Interference Canceller and Adaptive Beamformer
| Aspect | Interference Canceller | Adaptive Beamformer |
|---|---|---|
| Speech Direction | Any direction | Best with known or tracked direction (DOA) |
| Adaptation Speed | Very fast | Medium to slow |
| Computational Cost | Low/Medium | Medium/High |
| Dynamic/Moving Environments | Excellent | Can struggle |
| Spatial Selectivity | Moderate | High |
| SNR Improvement | Moderate | High |
| VAD Dependency | High | Medium/High |
| Sensitivity to Reverberation | Low | Medium/High |
Differential Pressure Microphone Array (DMA)¶
Fig. 12 Differential Pressure Microphone Array (1st order DMA). d is in the range of a few millimetres. Depending on a1,1 and t1, any first-order sensitivity pattern from dipole to hyper-cardioid can be created.¶
A Differential Pressure Microphone Array (DMA) is a very compact microphone configuration designed to achieve strong directionality using only a small spacing between two microphones, typically just a few millimetres. Unlike conventional arrays that rely on larger distances to create spatial filters, a DMA measures the pressure difference between its two closely spaced elements. This makes it behave like a first-order differential microphone, producing cardioid or super-cardioid directivity even in extremely small devices.
Because the microphones are so close together, sounds arriving from the front reach both microphones almost simultaneously, producing a small pressure difference. Sounds coming from other directions create a different pressure relationship. By subtracting or combining the signals from the two capsules, the system forms a directional pattern that emphasises sound from the desired direction and suppresses sound from the sides or rear. This creates a super compact fixed beamformer.
The key properties of a DMA are:
Very small aperture (< 1.5 cm): ideal for wearables, small cameras, doorbells or smart devices
Cardioid / super-cardioid patterns: directionality is created from pressure differences
Low-frequency behaviour: DMA microphones maintain directionality down to lower frequencies than conventional arrays of the same size. However, as the pressure differences for small arrays decrease at low frequencies, the required make-up gain also amplifies noise at low frequencies (white noise gain).
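The directional pattern and the low-frequency behaviour listed above can both be reproduced with a small model of a first-order DMA: the rear microphone signal is delayed internally and subtracted from the front microphone signal. This is a sketch for an ideal plane wave and perfectly matched capsules; `internal_delay_s` is an assumed parameter name that is only presumed to correspond to the delay shown in the figure, and the 10 mm spacing is an example value.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def dma_response(angle_deg, freq_hz, spacing_m, internal_delay_s):
    """Magnitude response of (front mic) minus (internally delayed rear mic).

    angle_deg = 0 is the look direction along the microphone axis.
    """
    acoustic_delay = spacing_m * math.cos(math.radians(angle_deg)) / SPEED_OF_SOUND
    phase = 2.0 * math.pi * freq_hz * (internal_delay_s + acoustic_delay)
    return abs(1.0 - cmath.exp(-1j * phase))

spacing = 0.01                          # 10 mm between the two capsules
cardioid_delay = spacing / SPEED_OF_SOUND
rear_null = dma_response(180.0, 1000.0, spacing, cardioid_delay)   # cardioid
side_null = dma_response(90.0, 1000.0, spacing, 0.0)               # dipole
front_2k = dma_response(0.0, 2000.0, spacing, cardioid_delay)
front_200 = dma_response(0.0, 200.0, spacing, cardioid_delay)
```

An internal delay equal to the acoustic travel time between the capsules places the null at the rear (cardioid), while zero delay places it at the sides (dipole). The on-axis output at 200 Hz is roughly ten times (about 20 dB) weaker than at 2 kHz, which is the loss that make-up gain has to compensate at the cost of amplified self-noise.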
DMAs have the following advantages:
Extremely compact
Good directionality from a very small form factor
Works well in constrained industrial or consumer designs
Cost-effective and easy to integrate.
However, DMAs also have some limitations:
Sensitive to manufacturing tolerances (matching of capsules). As MEMS microphones normally have much tighter tolerances than electret capsules (at least within one production batch), this is not too critical when using digital MEMS microphones.
Not as flexible as multi-element beamforming arrays, although multi-microphone DMAs can be realised.
White-noise gain limits performance at low frequencies.
Denoising¶
Fig. 13 Noise cancellation working principle: Identify noise and remove it from the spectrum.¶
A noise canceller’s job is to reduce unwanted background sounds in an audio signal so that the desired speech becomes clearer and easier to process. It analyses the incoming audio, estimates the noise components and subtracts or suppresses them while preserving the speaker’s voice as much as possible. Effective noise cancellation improves intelligibility for humans and increases accuracy for automatic speech-recognition systems.
Most noise-cancelling schemes follow similar principles:
Split the audio into frames (slices, with some overlap)
Transform each frame into spectral bins (frequency domain: FFT or Mel bins)
Apply an algorithm that decides which bins are noise (and can be attenuated or set to zero) and which should be preserved (speech mask)
Transform the result back from the frequency domain to the time domain
Append the audio frames to each other (with some overlap).
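The five steps above can be sketched end to end in a few lines. To stay dependency-free, the example uses a naive DFT instead of an FFT, a very short frame of 32 samples, a square-root Hann window with 50 % overlap, and a crude magnitude threshold as the "mask" decision. All names (`spectral_gate`, `threshold_mask`) and parameter values are assumptions for illustration; a real denoiser replaces the threshold with a proper noise estimate or a neural network.

```python
import cmath
import math
import random

def dft(frame):
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]

def spectral_gate(signal, mask_fn, frame_len=32):
    """Frame, transform, mask, inverse-transform and overlap-add a signal."""
    hop = frame_len // 2                                  # 50 % overlap
    window = [math.sqrt(0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len))
              for n in range(frame_len)]                  # sqrt-Hann window
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = dft(frame)                             # step 2: to spectrum
        gains = mask_fn([abs(b) for b in spectrum])       # step 3: decide mask
        restored = idft([g * b for g, b in zip(gains, spectrum)])  # step 4
        for n in range(frame_len):                        # step 5: overlap-add
            out[start + n] += restored[n] * window[n]
    return out

def threshold_mask(threshold):
    """Keep bins whose magnitude exceeds threshold, zero the rest."""
    return lambda magnitudes: [1.0 if m > threshold else 0.0 for m in magnitudes]

random.seed(0)
noise_only = [random.uniform(-0.01, 0.01) for _ in range(128)]
tone = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]
silenced = spectral_gate(noise_only, threshold_mask(0.5))
untouched = spectral_gate(tone, lambda mags: [1.0] * len(mags))
```

Applying the window before and after processing makes the overlapping frames add up to the original signal when the mask passes everything, so any audible change comes from the mask decision alone.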
The most challenging task is to decide which information is noise and which should be preserved. Knowing what type of noise to expect can help.
As most denoising algorithms work in the frequency domain, DSPs with fast FFT support and multiple cores (like XMOS) have an advantage here.
Noise Types¶
Fig. 14 Noise types overview¶
Depending on the noise type, different noise filtering strategies can be used. Classic denoising algorithms work well on stationary noises by adapting an (averaged) noise mask during times when no speech signal is present. They can be implemented very efficiently on recent DSP hardware, offering a ‘basic’ noise reduction at low computational cost.
More challenging are impulsive noises - that cannot be averaged - or signals that are similar to the target voice signal - other talkers in the background (often referred to as babble- or pub noise) or sounds with a speech-like characteristic (e.g. saxophone players). Here, AI based algorithms can make the difference.
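The averaged noise mask used by classic algorithms can be sketched as a recursive average per frequency bin that is only updated while no speech is detected. The function name, the smoothing constant and the externally supplied `speech_present` flag are assumptions; in practice the flag would come from a voice activity detector.

```python
def update_noise_estimate(noise_mag, frame_mag, speech_present, smoothing=0.9):
    """Recursive per-bin average of the noise magnitude spectrum.

    The estimate is frozen while speech is present so that the talker's
    voice does not leak into the noise mask.
    """
    if speech_present:
        return list(noise_mag)
    return [smoothing * old + (1.0 - smoothing) * new
            for old, new in zip(noise_mag, frame_mag)]

# Stationary noise with magnitudes 1.0 and 2.0 in two bins, no speech.
estimate = [0.0, 0.0]
for _ in range(200):
    estimate = update_noise_estimate(estimate, [1.0, 2.0], speech_present=False)

# A loud speech frame leaves the estimate untouched.
during_speech = update_noise_estimate(estimate, [9.0, 9.0], speech_present=True)
```

The averaging is what makes this approach work for stationary noise and fail for impulsive noise: a single click barely moves the estimate, so it is never learned as noise.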
AI Denoising¶
Fig. 15 AI-based denoising architecture¶
AI-based denoising aims to estimate the best possible speech-noise mask so that the system can reliably separate voice from background noise. It uses trained single-channel, real-time deep neural networks (DNNs) in the form of recurrent (RNN) or convolutional (CNN) neural networks, or combinations of both. The network continuously computes masks that distinguish speech from noise. These models can be optimised for specific acoustic environments, and the DNN block size represents a trade-off between delay and output quality.
The aggressiveness of the algorithm can also be adjusted to balance noise suppression with natural-sounding voice quality.
Pros
Excellent noise-reduction performance: DNNs can outperform classical DSP methods, especially with complex, non-stationary noises (fans, traffic, speech babble).
Learns real-world noise patterns: Because models are trained on large datasets, they can generalise well to many acoustic environments.
Can combine multiple tasks: Modern architectures can perform denoising + dereverb + echo residual cleanup in a single model.
Adaptive behaviour: Unlike fixed DSP filters, DNNs can adjust dynamically to changing acoustic conditions.
Good voice preservation: Well-trained models suppress noise aggressively while keeping speech natural and intelligible.
Cons
Higher computational cost: DNN denoisers require more processing power, memory, and often hardware acceleration (XCore AI).
Additional delay (latency): Model block sizes and buffering introduce latency that may be unsuitable for real-time communication - real-time requires optimised DNN models.
Training data dependency: Performance strongly depends on training quality; unseen noise types may degrade results.
Risk of artefacts: Over-aggressive denoising can introduce musical tones, speech distortion, or “underwater” artefacts.
Less predictable than classical DSP: Models behave non-linearly and can be harder to tune, debug, or certify.
Comparison Table: AI Denoising vs. Spectral Subtraction vs. Wiener Filtering¶
| Aspect | AI Denoising | Spectral Subtraction | Wiener Filtering |
|---|---|---|---|
| Noise Types Handled | Excellent for stationary & non-stationary noise (speech babble, traffic, fans) | Good for stationary noise, poor for non-stationary | Good for stationary noise, moderate for slow-varying noise |
| Speech Quality | Very good when trained well; preserves naturalness | Can cause musical artefacts, speech distortion | Usually smoother than spectral subtraction, but still may distort speech |
| Computational Cost | High (requires CPU/DSP/AI accelerator) | Low | Moderate |
| Latency | Medium to high (depends on model size and block size) | Very low | Low to moderate |
| Adaptability | Learns from data; can adapt to many environments | No adaptation; fixed algorithm | Limited adaptation via noise tracking |
| Handling Non-Stationary Noise | Excellent | Poor | Moderate |
| Artefacts | Possible artefacts if model poorly trained (underwater sound) | Musical noise common | Less musical noise than spectral subtraction but can sound muffled |
| Tunability | Requires retraining or model tuning; less predictable | Easy to tune thresholds and noise estimates | Requires tuning of noise estimators, more stable than spectral subtraction |
| Implementation Complexity | High | Very low | Low to moderate |
| Real-Time Suitability | Yes, but needs powerful hardware | Very suitable | Suitable |
| Overall Performance | Best overall, especially in complex noise | Worst overall, but simplest | Balanced compromise (classic DSP solution) |
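For the two classical methods in the table, the per-bin gain rules can be written down in their textbook form. The sketch below assumes power spectral subtraction and a Wiener gain driven by a crude instantaneous SNR estimate; practical implementations smooth the SNR estimate over time, which this example does not attempt to show. Function names and the `floor` parameter are illustrative.

```python
import math

def spectral_subtraction_gain(noisy_power, noise_power, floor=0.0):
    """Amplitude gain from power spectral subtraction for one bin."""
    ratio = 1.0 - noise_power / noisy_power
    return max(math.sqrt(max(ratio, 0.0)), floor)

def wiener_gain(noisy_power, noise_power, floor=0.0):
    """Wiener gain SNR / (1 + SNR), using a crude instantaneous SNR estimate."""
    snr = max(noisy_power / noise_power - 1.0, 0.0)
    return max(snr / (1.0 + snr), floor)

for noisy in (1.5, 4.0, 100.0):          # noisy power relative to noise power 1
    print(noisy,
          round(spectral_subtraction_gain(noisy, 1.0), 3),
          round(wiener_gain(noisy, 1.0), 3))
```

With these simplified estimates the Wiener gain is the square of the subtraction gain, so it attenuates low-SNR bins more strongly while both rules converge to unity gain at high SNR.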
Speech/Voice-Activity Detection¶
Fig. 16 Voice Activity Detection (VAD)¶
A Speech Activity Detector (SAD), Voice Activity Detector (VAD), or Voice-to-Noise Ratio estimator (VNR) is responsible for determining whether the incoming audio signal contains human speech. In the case of a VNR, it estimates the speech to noise ratio.
Simple VAD approaches rely on straightforward metrics such as counting zero-crossings in the waveform, while more advanced methods analyse energy levels in specific voice-relevant frequency bands and compare them - often adaptively - to the rest of the spectrum. Modern DNN-based VAD or VNR systems use neural networks, which are particularly effective at identifying speech even under challenging acoustic conditions, essentially “finding the needle in the haystack.”
VAD and VNR are key enabling components for many audio-processing blocks, including acoustic echo cancellation (AEC), automatic gain control (AGC), denoisers, interference cancellers and beamforming, as they tell these algorithms when to adapt, react, or hold their state.
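The simple metrics mentioned in this section can be sketched as follows: a zero-crossing rate, a frame energy, and an energy detector that compares each frame against a slowly tracked noise floor. The class name `EnergyVad`, the threshold ratio and the smoothing constant are assumptions for illustration and would need tuning on real recordings.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose sign differs."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0))
    return crossings / (len(frame) - 1)

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

class EnergyVad:
    """Flags a frame as speech if it is well above the tracked noise floor."""

    def __init__(self, ratio=4.0, smoothing=0.95, initial_floor=1e-4):
        self.ratio = ratio
        self.smoothing = smoothing
        self.noise_floor = initial_floor

    def is_speech(self, frame):
        energy = frame_energy(frame)
        speech = energy > self.ratio * self.noise_floor
        if not speech:
            # Only track the floor during pauses, so speech does not raise it.
            self.noise_floor = (self.smoothing * self.noise_floor
                                + (1.0 - self.smoothing) * energy)
        return speech

vad = EnergyVad()
quiet_flag = vad.is_speech([0.001] * 160)
loud_flag = vad.is_speech([0.5] * 160)
```

The boolean result is the kind of signal the other blocks consume, for example to freeze a noise estimate or an AGC gain while nobody is talking.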
Automatic Gain Control (AGC)¶
Fig. 17 Automatic Gain Control (AGC)¶
An Automatic Gain Control (AGC) system adjusts the microphone signal level so that speech remains clear, consistent and easy to process, regardless of whether the talker is quiet, loud, close to the microphone or further away. It continuously monitors the input level and applies controlled amplification or attenuation to keep the output within an optimal range.
AGC is needed because real-world speech levels vary enormously, and downstream algorithms - such as ASR, DNN based classifiers and encoding/compression for transmission - perform best when the signal stays within a stable, expected amplitude range.
Without AGC:
Quiet voices may be lost in the noise floor or quantisation
Loud voices may clip or distort.
This will reduce both intelligibility and machine-recognition accuracy.
AGC should only adjust its gain when voice is present, which requires voice/speech activity detection. The adaptation time should be adjustable, because changing the gain too quickly distorts the audio.
Hint: Consider using an additional compressor/limiter as a last line of defence against (digital) clipping. The AGC adaptation speed can then be kept moderate, and loud signals will not completely overload the signal processing chain.
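The points above (adapt only during speech, keep the adaptation moderate, protect against clipping) can be combined in a short sketch. The function name `agc`, the target level, the step size and the maximum gain are assumptions, and a hard clamp stands in for a real compressor/limiter.

```python
import math

def agc(frames, speech_flags, target_rms=0.1, step=0.1, max_gain=20.0, limit=1.0):
    """Frame-based AGC that adapts only while speech is flagged.

    Returns (processed_frames, final_gain).
    """
    gain = 1.0
    out = []
    for frame, speech in zip(frames, speech_flags):
        if speech:
            rms = math.sqrt(sum(s * s for s in frame) / len(frame))
            if rms > 0.0:
                desired = min(target_rms / rms, max_gain)
                gain += step * (desired - gain)   # move slowly toward the target
        # Last line of defence: clamp to full scale instead of wrapping around.
        out.append([max(-limit, min(limit, gain * s)) for s in frame])
    return out, gain

quiet_talker = [[0.01] * 80 for _ in range(100)]
_, settled_gain = agc(quiet_talker, [True] * 100)
_, held_gain = agc(quiet_talker, [False] * 100)
clipped, _ = agc([[2.0, -3.0]], [False])
```

A smaller `step` gives smoother gain changes at the cost of slower reaction, which is the trade-off the adjustable adaptation time is meant to expose.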
Direction of Arrival (DOA)¶
A Direction of Arrival (DOA) algorithm estimates the angle or direction from which a sound source - typically a human talker - is reaching a microphone array. DOA is used to steer beamformers, improve spatial filtering, guide camera tracking, or support human-machine interfaces. It works by analysing the time differences, phase differences, or spatial energy distribution of the sound as it reaches multiple microphones.
Adaptive/steered beamforming frontends need to estimate the direction of arrival as a control parameter, and the beam-direction is often exposed to the user space which gives a DOA for free.
The most common DOA estimation schemes are outlined below.
Time Difference of Arrival (TDOA)¶
Uses the time delays between microphones to compute the direction.
Simple and robust for linear or circular arrays.
Often implemented using GCC-PHAT.
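The TDOA principle for a single microphone pair can be sketched with a plain time-domain cross-correlation (used here for brevity instead of GCC-PHAT) followed by the far-field conversion from delay to angle. The function names, the 60 mm spacing, the 48 kHz sample rate and the speed of sound of 343 m/s are assumptions made for this example.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s, assumed

def estimate_delay(reference, delayed, max_lag):
    """Lag in samples of `delayed` relative to `reference` that maximises
    their cross-correlation."""
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        score = sum(reference[n] * delayed[n + lag]
                    for n in range(len(reference)) if 0 <= n + lag < len(delayed))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def delay_to_angle_deg(lag_samples, sample_rate, spacing_m):
    """Far-field angle from broadside for a two-microphone pair."""
    sine = SPEED_OF_SOUND * lag_samples / (sample_rate * spacing_m)
    return math.degrees(math.asin(max(-1.0, min(1.0, sine))))

random.seed(1)
mic_a = [random.uniform(-1.0, 1.0) for _ in range(256)]
mic_b = [0.0, 0.0, 0.0] + mic_a[:-3]        # same signal, 3 samples later
lag = estimate_delay(mic_a, mic_b, max_lag=8)
angle = delay_to_angle_deg(lag, sample_rate=48000, spacing_m=0.06)
```

With whole-sample lags the angular resolution is coarse for closely spaced microphones, so practical systems interpolate the correlation peak or work in the frequency domain.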
Steered Response Power (SRP / SRP-PHAT)¶
Scans many potential directions and finds the one with the strongest spatial energy.
Very robust in reverberant rooms.
Works with arbitrary microphone geometries.
Multiple Signal Classification (MUSIC)¶
High-resolution subspace method.
Excellent angular precision, even with multiple simultaneous sources.
Computationally expensive.
Eigenvalue / Beamspace Methods¶
Variants of MUSIC or ESPRIT adapted for specific array types or low-power hardware.
Good accuracy with reduced computational effort.
DNN-Based DOA Estimation¶
Uses neural networks trained on multi-mic features (inter-channel phase, energy).
Very robust to noise and reverberation.
Useful in complex or moving environments.
Computationally costly.
Loudspeaker Engine (Far-End-DSP)¶
Fig. 18 Speaker engine basics¶
A loudspeaker engine, often referred to as Far-End Signal Enhancement or Far-End DSP, is the processing block responsible for optimising the audio that is played through a device’s loudspeaker system. Its main goal is to ensure that the far-end audio (the voice coming from the remote talker) sounds clean, stable, and undistorted, while also providing the best possible reference signal for acoustic echo cancellation (AEC). By improving loudspeaker output quality and avoiding distortions, the loudspeaker engine enables both better user experience and reliable barge-in for ASR systems.
The key functions and requirements of a loudspeaker engine are:
Provide everything needed to achieve the best possible sound quality from a given loudspeaker setup.
Prevent loudspeaker non-linearities and enclosure resonances that could degrade AEC performance by creating unpredictable/unknown echo components.
Because it enhances the far-end signal played through the loudspeaker, this block is also known as Far-End DSP. It shapes, stabilises, and controls the outgoing audio so that what the microphones pick up is predictable, allowing the AEC to cancel it effectively.
The key features of a loudspeaker engine include:
Low latency, down to only two samples from input to output, ensuring smooth communication and fast AEC adaptation.
Dynamic EQ and multi-band compression to prevent loudspeaker non-linearities and remove cabinet resonances. This results in cleaner AEC reference signals and enables higher playback levels, which is essential for reliable ASR barge-in even during loud far-end playback.
Support for smart amplifiers, enabling temperature, excursion, and distortion protection while pushing the loudspeaker to optimal performance.