Low Power Listening Concepts¶

Low-power voice-activated systems often use a staged processing approach to balance energy efficiency, responsiveness, and recognition accuracy. The idea is to keep power consumption very low during idle periods while still maintaining fast activation when speech or a wake word is detected.

Each stage must be designed carefully so that the system spends as little time as possible in the higher-power modes.
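Why time spent in the higher-power modes matters can be seen from a duty-cycle-weighted average. The following is a minimal sketch; the function name and all power figures are hypothetical and chosen for illustration only, not measured values for any device.

```python
def average_power_mw(stages):
    """Return the duty-cycle-weighted average power in milliwatts.

    stages: iterable of (power_mw, time_fraction) pairs.
    The time fractions must sum to 1.
    """
    stages = list(stages)
    total_fraction = sum(fraction for _, fraction in stages)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(power * fraction for power, fraction in stages)


# Hypothetical figures for illustration only (not measured values):
EXAMPLE_BUDGET = [
    (0.05, 0.97),   # stage 1: smart microphone listening
    (5.0, 0.02),    # stage 2: low-power VAD
    (100.0, 0.01),  # stage 3: full-power ASR
]

print(round(average_power_mw(EXAMPLE_BUDGET), 4))
```

With these assumed numbers, the full-power stage contributes 1.0 mW of the roughly 1.15 mW average even though it is active only 1% of the time, which is why false wake-ups into later stages are expensive.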

[Image: ../../_images/image42.svg]
Fig. 22: Four (possible) steps of waking up a system from low-power listening to full operation¶

1st Stage: Smart Microphone with Basic Level Detector or VAD¶

The first stage typically operates at ultra-low power, often implemented directly inside a smart microphone module. It performs only very simple tasks:

Monitoring sound level thresholds
Detecting basic activity such as speech-like patterns
Optionally performing a primitive VAD (Voice Activity Detection).

Its goal is not to recognise words but simply to determine whether something interesting enough to proceed to the next stage is happening in the acoustic scene.

This stage filters out silence and background noise while keeping the rest of the system asleep. Because it consumes only microwatts to milliwatts, it significantly extends battery life in portable or always-on devices.
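The decision rule of a basic level detector can be illustrated as follows. This is only a sketch of the logic: a smart microphone implements it in dedicated low-power hardware, and the function names, the float sample format and the -40 dBFS threshold are assumptions made for this example.

```python
import math


def frame_level_dbfs(frame):
    """RMS level of a frame of float samples in [-1.0, 1.0], in dBFS."""
    if not frame:
        raise ValueError("frame must not be empty")
    mean_square = sum(s * s for s in frame) / len(frame)
    if mean_square == 0.0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square)


def level_detect(frame, threshold_dbfs=-40.0):
    """Return True if the frame is loud enough to wake the next stage.

    The default threshold is an illustrative assumption, not a recommendation.
    """
    return frame_level_dbfs(frame) >= threshold_dbfs
```

A frame of silence keeps the system asleep (`level_detect([0.0] * 160)` is `False`), while a frame at half of full scale exceeds the threshold and would trigger the second stage.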

For wake-up, only one microphone within an array needs to be capable of VAD or level detection. However, the microphones that make up an array should all be of the same type (and ideally from a similar production lot). It is therefore advisable either to add a separate smart microphone for wake-up, with a regular array that is powered up together with the DSP, or to use identical smart microphones at all array positions.

2nd Stage: XCORE in Low-Power Mode Running VAD¶

Once the microphone indicates potential speech activity, the XCORE device wakes into a low-power monitoring mode. At this stage, the system performs a more accurate VAD pass.

This stage confirms whether the detected activity actually resembles human speech, ruling out impulsive noise, mechanical clicks, pets, or background ambience.
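One simple way to reject impulsive noise is to require several consecutive speech-like frames before reporting activity. The sketch below combines a frame energy gate, a zero-crossing-rate window and a minimum run length. It is an illustrative toy, not the VAD used on XCORE; the class name and all thresholds are assumptions, and production VADs typically rely on spectral features or small neural networks.

```python
import math


class SimpleVad:
    """Toy VAD: energy gate + zero-crossing-rate window + minimum run length."""

    def __init__(self, energy_threshold=1e-3, zcr_range=(0.02, 0.35),
                 min_active_frames=5):
        # All default values are illustrative assumptions.
        self.energy_threshold = energy_threshold
        self.zcr_range = zcr_range
        self.min_active_frames = min_active_frames
        self._run = 0  # number of consecutive speech-like frames seen so far

    def process(self, frame):
        """Feed one frame (at least 2 float samples); return True once
        speech-like activity has been sustained for min_active_frames."""
        energy = sum(s * s for s in frame) / len(frame)
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0.0) != (b < 0.0)
        )
        zcr = crossings / (len(frame) - 1)
        low, high = self.zcr_range
        speech_like = energy >= self.energy_threshold and low <= zcr <= high
        self._run = self._run + 1 if speech_like else 0
        return self._run >= self.min_active_frames


# Demo: five frames of a sustained tone (a stand-in for voiced speech,
# 500 Hz at an assumed 16 kHz sample rate), followed by one isolated click.
tone = [0.5 * math.sin(math.pi * (n + 0.5) / 16.0) for n in range(160)]
click = [0.0] * 160
click[80] = 1.0

vad = SimpleVad()
decisions = [vad.process(tone) for _ in range(5)] + [vad.process(click)]
print(decisions)
```

The sustained signal is reported only on its fifth frame, and the single click never triggers because it has no zero crossings and also resets the run counter.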

Key functions:

Robust Voice Activity Detection
Optional DOA (Direction of Arrival) to confirm talker location
Noise gating or early filtering.

This stage reduces false triggers and avoids waking the full system unnecessarily.
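The optional DOA check listed among the key functions can be sketched for a two-microphone pair: estimate the inter-microphone delay by cross-correlation, then convert it to an angle with the far-field relation sin(θ) = c·τ/d. This is a minimal sketch under stated assumptions (two microphones, far-field source, a brute-force time-domain search); the function names, the sign convention and the default speed of sound are choices made for this example.

```python
import math


def estimate_delay_samples(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `ref`.

    A positive result means `other` receives the sound later than `ref`.
    Brute-force cross-correlation over lags in [-max_lag, max_lag].
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, sample in enumerate(ref):
            m = n + lag
            if 0 <= m < len(other):
                score += sample * other[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def doa_degrees(delay_samples, sample_rate_hz, mic_spacing_m,
                speed_of_sound_m_s=343.0):
    """Convert a delay to an angle from broadside, assuming a far-field source."""
    sin_theta = speed_of_sound_m_s * delay_samples / (sample_rate_hz * mic_spacing_m)
    sin_theta = max(-1.0, min(1.0, sin_theta))  # clamp against estimation noise
    return math.degrees(math.asin(sin_theta))
```

For example, a zero delay corresponds to a talker directly in front of the pair (0 degrees). A system could compare the estimated angle against an expected talker sector before waking the third stage.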

The transition from the second to the third stage only requires changing the internal processor configuration:

Increase CPU clock
Enable second core / crossbar switches
Enable IO-Tiles
Power up the external flash memory and start reading the AI/ASR model.

3rd Stage: XCORE in Full-Power Mode Running Automatic Speech Recognition¶

Once the system is confident that speech is present, the XCORE ramps up to full processing power. It now runs wake-word detection or lightweight on-device ASR.

Typical tasks include:

Detecting hotwords (“Hey Device”, other application-specific triggers)
Running small DNN models for command recognition
Applying beamforming, denoising or echo suppression to clean the audio
Passing higher-quality features or audio frames to downstream processors.

This stage is where true understanding begins. If the system is a voice-control system with all processing on XCORE, the control logic and actuator control run directly in this stage.

4th Stage (Optional): Host SoC Wakes Up for Advanced Tasks¶

In larger systems, once the hotword or command has been validated, the main SoC or application processor is woken from a low-power or sleep state. This stage handles tasks that require more memory, processing, or connectivity, such as:

Full large-vocabulary ASR
Natural language understanding (NLU)
Cloud communication or networking
Device control logic
UI interaction, display activation or actuator control.
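The four stages described above can be summarised as a small state machine. The sketch below is illustrative only: the stage and event names are invented for this example, and falling back to the first stage whenever a stage rejects the input is an assumption, since the fallback path is not specified here.

```python
from enum import Enum


class Stage(Enum):
    MIC_LEVEL_DETECT = 1  # smart microphone only
    LOW_POWER_VAD = 2     # XCORE in low-power mode
    FULL_POWER_ASR = 3    # XCORE in full-power mode
    HOST_AWAKE = 4        # optional host SoC


class Event(Enum):
    ACOUSTIC_ACTIVITY = "acoustic_activity"  # level detector or basic VAD fired
    SPEECH_CONFIRMED = "speech_confirmed"    # robust VAD confirmed speech
    HOTWORD_VALIDATED = "hotword_validated"  # hotword or command recognised
    REJECTED = "rejected"                    # current stage found nothing of interest


# Forward transitions, one per stage boundary.
TRANSITIONS = {
    (Stage.MIC_LEVEL_DETECT, Event.ACOUSTIC_ACTIVITY): Stage.LOW_POWER_VAD,
    (Stage.LOW_POWER_VAD, Event.SPEECH_CONFIRMED): Stage.FULL_POWER_ASR,
    (Stage.FULL_POWER_ASR, Event.HOTWORD_VALIDATED): Stage.HOST_AWAKE,
}


def next_stage(stage, event):
    """Return the stage that follows `stage` when `event` occurs."""
    if event is Event.REJECTED:
        # Assumption: any rejection returns the system to the lowest-power stage.
        return Stage.MIC_LEVEL_DETECT
    # Events that do not apply to the current stage are ignored.
    return TRANSITIONS.get((stage, event), stage)
```

Each stage can only advance the system by one step, so a later, more expensive stage is never entered unless every cheaper stage before it has agreed that the input is worth processing.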

Hints¶

Several manufacturers offer MEMS microphones with integrated circuitry for level or speech-activity detection, based on different concepts, with current consumption in the low microampere range. For the lowest-power stage, all PMIC circuitry powering the DSP and/or host must also be shut down or kept in a low-power state. In most cases, the power required to keep the rest of the system in standby is an order of magnitude higher than that consumed by the microphone itself.
Once the DSP is running, the incoming audio needs to be buffered so that hotwords can be verified and syllable onsets are not cut off in later ASR stages. The overall system design should provide a mechanism to read these samples from the start-up buffer into the main application, for example via a serial bus command.
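The start-up buffering described above is commonly realised as a fixed-size ring buffer that always holds the most recent audio. The following is a minimal sketch of that idea; the class name, the 16 kHz sample rate and the one-second depth are assumptions for illustration, and the transport to the main application (for example the serial bus command) is not modelled.

```python
from collections import deque


class PreRollBuffer:
    """Keeps the most recent samples so that later stages can see the
    audio that arrived before they were fully awake."""

    def __init__(self, sample_rate_hz=16000, seconds=1.0):
        # A deque with maxlen silently discards the oldest samples when full.
        self._samples = deque(maxlen=int(sample_rate_hz * seconds))

    def write(self, frame):
        """Append a frame of samples, overwriting the oldest if necessary."""
        self._samples.extend(frame)

    def drain(self):
        """Return all buffered samples, oldest first, and empty the buffer."""
        samples = list(self._samples)
        self._samples.clear()
        return samples
```

When the hotword detector or a downstream ASR stage starts, it would first call `drain()` to obtain the buffered history and then continue with live frames, so that the onset of the utterance is not lost.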