Capturing Non-Voice Audio¶

Microphone arrays for capturing non-voice audio differ significantly from arrays optimised for speech and human-machine communication. Speech-focused systems are typically tuned for the 200 Hz to 7 kHz range, rely on directionality for talker localisation and often incorporate beamforming, voice activity detection (VAD), acoustic echo cancellation (AEC) and denoising algorithms tailored to human vocal characteristics. In contrast, an array designed for ambient, environmental or machine-generated sounds must handle a much wider variety of signal types, spectral shapes, dynamics and event durations. This places different demands on the microphones, the acoustic design and the processing chain.

Non-voice sound can include mechanical vibrations, wildlife calls, animal footsteps, glass breaking, vehicle noise, structural impacts, rotating machinery signatures or even ultrasonic cues. Systems must therefore be engineered for high dynamic range, wide bandwidth and reliable capture, rather than naturalness of speech reproduction.

Basic Requirements for Non-Voice Audio Capture¶

Wider Bandwidth¶

While human speech sits largely between 100 Hz and 8 kHz, non-voice events may require:

Infrasound (below 20 Hz) for structural or geophysical monitoring
Ultrasound (above 20 kHz) for wildlife detection, machinery bearing analysis or air-leak identification
Broadband noise up to 48-96 kHz for advanced audio classification.

Thus the microphones must support extended frequency ranges and maintain consistent sensitivity across that spectrum.

High Dynamic Range and Sensitivity¶

Environmental and mechanical systems can produce:
Very quiet signals (insects, distant wildlife, leaf movement)
Very loud signals (impacts, machinery faults, alarms).

A high signal-to-noise ratio (SNR), a capsule with low self-noise and sufficient maximum SPL handling are essential. Microphones with self-noise below 20 dBA and overload points above 120 dB SPL are common in such applications.
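As a rough illustration of what these figures imply, the sketch below computes the usable dynamic range from the two quoted limits and the resolution an ideal converter would need to cover it. It assumes the textbook rule that an ideal N-bit quantiser offers about 6.02 N + 1.76 dB for a full-scale sine, and it treats the A-weighted self-noise and the unweighted overload point as directly comparable, which is only approximately true. The function names are illustrative.

```python
import math

def dynamic_range_db(max_spl_db, self_noise_db):
    """Usable range between the self-noise floor and the overload point."""
    return max_spl_db - self_noise_db

def min_adc_bits(range_db):
    """Smallest ideal converter resolution covering range_db.

    Assumption: ideal quantiser, full-scale sine, SNR = 6.02 * N + 1.76 dB.
    Real converters need extra margin for headroom and non-ideal behaviour.
    """
    return math.ceil((range_db - 1.76) / 6.02)

range_db = dynamic_range_db(max_spl_db=120.0, self_noise_db=20.0)
print(f"dynamic range: {range_db:.0f} dB, ideal resolution: {min_adc_bits(range_db)} bits")
```

With the limits quoted above, the range is 100 dB, which an ideal 16-bit path would not quite cover.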

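Consider using ‘stacked’ microphones: summing the signals of several microphones placed as close together as possible improves the SNR by about 3 dB per doubling of the microphone count. The acoustic signal is common to all capsules and adds coherently (+6 dB per doubling), whereas the non-correlated self-noise adds only in power (+3 dB per doubling).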
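The 3 dB-per-doubling figure for stacked microphones can be checked numerically. The following is a minimal simulation, assuming an identical test tone at every capsule and independent Gaussian self-noise per capsule. The tone frequency, sample rate and noise level are arbitrary illustrative choices.

```python
import math
import random

def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def stacked_snr_db(num_mics, num_samples=20000, noise_rms=0.1, seed=0):
    """SNR in dB after summing num_mics co-located microphones."""
    rng = random.Random(seed)
    # Common acoustic signal: a 1 kHz tone sampled at 48 kHz (assumed values).
    tone = [math.sin(2 * math.pi * 1000 * n / 48000) for n in range(num_samples)]
    # The tone adds coherently: amplitude scales with the microphone count.
    signal_sum = [num_mics * s for s in tone]
    # Self-noise is independent per capsule, so only its power adds.
    noise_sum = [0.0] * num_samples
    for _ in range(num_mics):
        for n in range(num_samples):
            noise_sum[n] += rng.gauss(0.0, noise_rms)
    return 20 * math.log10(rms(signal_sum) / rms(noise_sum))

baseline = stacked_snr_db(1)
for count in (2, 4, 8):
    gain = stacked_snr_db(count) - baseline
    print(f"{count} mics: SNR gain of about {gain:.1f} dB")
```

In practice the gain is smaller once the capsule spacing becomes comparable to the wavelength, because the signal then no longer adds fully coherently.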

Calibrated Response and Matching¶

Non-voice analysis often requires accurate spectral measurements and consistent array behaviour.

Therefore:

Microphones must be tightly matched in sensitivity and phase
The array geometry must be stable, rigid and reproducible
Enclosure resonances must be minimised (e.g., via stiffeners and ribs)

Matching microphone characteristics is crucial for localisation, event detection and reliable multi-mic processing. High-quality MEMS microphones are often manufactured to tight tolerances, and a fully digital interface (PDM or I2S) can help keep channel-to-channel variation low without a piece-by-piece matching process.
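Where residual sensitivity differences still matter, they can be measured and trimmed in software. The sketch below is one possible approach: it assumes every channel has recorded the same calibration tone and uses the median channel level as the reference. The function name, the 1 dB tolerance and the example RMS values are hypothetical.

```python
import math
import statistics

def matching_report(channel_rms, tolerance_db=1.0):
    """Compare per-channel levels of a shared calibration tone.

    Returns one dict per channel with the deviation from the median level,
    the linear trim gain that removes it, and a tolerance flag.
    """
    levels_db = [20 * math.log10(r) for r in channel_rms]
    reference_db = statistics.median(levels_db)
    report = []
    for channel, level_db in enumerate(levels_db):
        deviation_db = level_db - reference_db
        report.append({
            "channel": channel,
            "deviation_db": deviation_db,
            "trim_gain": 10 ** (-deviation_db / 20),
            "in_tolerance": abs(deviation_db) <= tolerance_db,
        })
    return report

# Hypothetical RMS readings from four channels exposed to the same tone.
measured_rms = [0.100, 0.105, 0.095, 0.130]
for entry in matching_report(measured_rms):
    print(entry)
```

This only corrects broadband sensitivity. Phase and frequency-dependent mismatch would need a per-band or filter-based calibration.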

Robustness and Environmental Resistance¶

Applications may require:

Weatherproofing (IP ratings)
Dust, humidity and temperature stability
Shock and vibration resistance
Long-term reliability in harsh outdoor settings.

MEMS microphones with protective acoustic mesh, waterproof membranes or specialised capsules may be used depending on conditions.

Flexible Signal Processing¶

The (pre-)processing is often very application-specific because non-voice signals vary widely. The processing chain often includes:

Spectral analysis (FFT, MFCCs, spectral flux)
Direction of arrival (DOA) estimation
Advanced noise modelling.

Unlike speech-oriented systems, these arrays do not necessarily require VAD, AEC or voice-specific beamforming but may use wideband spatial filtering or energy-based localisation.
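As an illustration of wideband DOA estimation, the sketch below estimates the time difference of arrival between two microphones by time-domain cross-correlation and converts it to an angle measured from broadside. It assumes a far-field source, a single microphone pair and a speed of sound of 343 m/s. The sample rate, spacing and synthetic noise source are illustrative choices, and a production system would more likely use an FFT-based method such as GCC-PHAT.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at roughly 20 degrees C

def estimate_delay_samples(x, y, max_lag):
    """Lag (in samples) that maximises the cross-correlation of x and y.

    A positive result means y is a delayed copy of x.
    """
    n = len(x)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(max(0, -lag), min(n, n - lag)):
            score += x[i] * y[i + lag]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def doa_degrees(delay_samples, sample_rate, mic_spacing):
    """Far-field arrival angle from broadside for a two-microphone pair."""
    tau = delay_samples / sample_rate
    sine = max(-1.0, min(1.0, SPEED_OF_SOUND * tau / mic_spacing))
    return math.degrees(math.asin(sine))

# Synthetic broadband source that reaches microphone B 14 samples after A.
rng = random.Random(1)
sample_rate = 96000
mic_spacing = 0.10  # metres
true_delay = 14
source = [rng.gauss(0.0, 1.0) for _ in range(2048 + true_delay)]
mic_a = source[true_delay:]
mic_b = source[:2048]

# Only lags that are physically possible for this spacing are searched.
max_lag = math.ceil(mic_spacing / SPEED_OF_SOUND * sample_rate)
lag = estimate_delay_samples(mic_a, mic_b, max_lag)
print(lag, round(doa_degrees(lag, sample_rate, mic_spacing), 1))
```

The angular resolution is limited by the one-sample delay grid, which is one reason higher sample rates or interpolation around the correlation peak are often used.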

Event Detection Algorithms and AI-Based Classifiers¶

Event detection on embedded audio platforms involves identifying acoustic events (such as impacts, alarms, animal calls or machine anomalies) within a continuous audio stream.

Traditional DSP-based approaches rely on lightweight features such as energy thresholds, zero-crossing rates, spectral flux, or band-specific power, combined with state machines or statistical models (e.g., Gaussian mixture models (GMM)). These algorithms are computationally inexpensive and suitable for low-power hardware, but they may struggle with complex or highly variable acoustic environments.
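A DSP-style detector of this kind can be sketched in a few lines. The example below combines a frame energy threshold with hysteresis and a small state machine, and reports the mean zero-crossing rate of each event as a crude tonal-versus-noisy cue. The frame length, thresholds, hangover count and synthetic test signal are illustrative assumptions, not recommended values.

```python
import math
import random

def frame_features(frame):
    """Return (energy in dB relative to full scale, zero-crossing rate)."""
    energy_db = 10 * math.log10(sum(s * s for s in frame) / len(frame) + 1e-12)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return energy_db, crossings / (len(frame) - 1)

def detect_events(samples, frame_len=256, on_db=-30.0, off_db=-40.0, hangover_frames=2):
    """Return a list of (start_sample, end_sample, mean_zcr) tuples."""
    events = []
    active, start, quiet_frames, zcrs = False, 0, 0, []
    for idx in range(0, len(samples) - frame_len + 1, frame_len):
        energy_db, zcr = frame_features(samples[idx:idx + frame_len])
        if not active and energy_db > on_db:
            # Idle -> active: energy rose above the "on" threshold.
            active, start, quiet_frames, zcrs = True, idx, 0, [zcr]
        elif active:
            if energy_db < off_db:
                quiet_frames += 1
            else:
                quiet_frames = 0
                zcrs.append(zcr)
            if quiet_frames >= hangover_frames:
                # Active -> idle: the event ends where the quiet run began.
                end = idx + frame_len - quiet_frames * frame_len
                events.append((start, end, sum(zcrs) / len(zcrs)))
                active = False
    if active:
        events.append((start, len(samples), sum(zcrs) / len(zcrs)))
    return events

# Synthetic stream: a low noise floor with a 1 kHz burst at samples 1024 to 2047.
rng = random.Random(0)
sample_rate = 16000
samples = [rng.gauss(0.0, 0.001) for _ in range(4096)]
for n in range(1024, 2048):
    samples[n] += 0.5 * math.sin(2 * math.pi * 1000 * n / sample_rate)

print(detect_events(samples))
```

Using separate on and off thresholds plus a hangover avoids chopping one event into several when the energy briefly dips.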

Recent machine-learning-based classifiers extend these capabilities by learning the characteristic patterns of events directly from data. On embedded systems, this typically involves compact models such as small convolutional neural networks (CNNs), recurrent neural networks (RNNs), transformers with a reduced parameter count, or classical ML methods such as SVMs or random forests operating on MFCCs or log-mel spectrograms. These models can detect subtle or non-linear patterns that DSP rules cannot capture and can be trained for a wide range of tasks, from wildlife monitoring to predictive maintenance.

To meet embedded constraints, models must be quantised, pruned or knowledge-distilled so that they run within tight limits on memory, latency and power. Combined with efficient feature extraction pipelines and hardware accelerators (for example, the VPU on XCORE), modern embedded ML enables reliable real-time event detection even on remotely operated low-power devices.
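To make the quantisation step concrete, the sketch below applies symmetric per-tensor int8 quantisation to a handful of weights and measures the round-trip error. It is a generic illustration in plain Python rather than the scheme of any particular toolchain or accelerator, and the weight values are made up.

```python
def quantise_int8(weights):
    """Symmetric per-tensor quantisation to integers in [-127, 127].

    Returns the integer codes and the scale needed to map them back.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantise(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

# Hypothetical weights from one layer of a small model.
weights = [0.82, -0.41, 0.05, -1.27, 0.0, 0.33]
codes, scale = quantise_int8(weights)
restored = dequantise(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Storing each weight in one byte instead of a four-byte float reduces the memory footprint by a factor of four, at the cost of a rounding error of at most half a quantisation step per weight.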