The XCORE-VOICE Solution consists of example designs and a C-based SDK for the development of audio front-end applications to support far-field voice use cases on the xcore.ai family of chips (XU316). The XCORE-VOICE examples are currently based on FreeRTOS or bare-metal, leveraging the flexibility of the xcore.ai platform and providing designers with a familiar environment to customize and develop products.
XCORE-VOICE example designs include turn-key solutions that ease product development for smart home applications such as light switches, thermostats, and home appliances. The xcore.ai architecture provides powerful signal processing and accelerated AI capabilities; combined with the XCORE-VOICE framework, it allows designers to incorporate keyword detection, event detection, or advanced local dictionary support to create a complete voice interface solution. Bridging designs, including PDM microphone to host aggregation, are also included to showcase xcore.ai as an interfacing and bridging solution for deployment in existing systems.
The C SDK is composed of the following components:
Peripheral IO libraries, including UART, I2C, I2S, SPI, QSPI, PDM microphones, and USB. These libraries support bare-metal and RTOS application development.
Libraries core to DSP applications, including vectorized math and voice processing DSP. These libraries support bare-metal and RTOS application development.
Libraries for speech recognition applications. These libraries support bare-metal and RTOS application development.
Libraries that enable multi-core FreeRTOS development on xcore including a wide array of RTOS drivers and middleware.
Pre-built and validated audio processing pipelines.
Code Examples - Examples showing a variety of xcore features based on bare-metal and FreeRTOS programming.
Documentation - Tutorials, references and API guides.
The XCORE-VOICE Solution takes advantage of the flexible, software-defined xcore.ai architecture to support numerous far-field voice use cases through the available example designs and the ability to construct user-defined audio pipelines from the software components and libraries in the C-based SDK.
These include:
Voice Processing components
Two PDM microphone interfaces
Digital signal processing pipeline
Full duplex, stereo, Acoustic Echo Cancellation (AEC)
Reference audio via I2S with automatic bulk delay insertion
Point noise suppression via interference canceller
Switchable stationary noise suppressor
Programmable Automatic Gain Control (AGC)
Flexible audio output routing and filtering
Support for Sensory, Cyberon or other 3rd party Automatic Speech Recognition (ASR) software
Device Interface components
Full speed USB2.0 compliant device supporting USB Audio Class (UAC) 2.0
Flexible Peripheral Interfaces
Programmable digital general-purpose inputs and outputs
Example Designs utilizing above components
Far-Field Voice Local Command
Low Power Far-Field Voice Local Command
Far-Field Voice Assistance
Firmware Management
Boot from QSPI Flash
Default firmware image for power-on operation
Option to boot from a local host processor via SPI
Device Firmware Update (DFU) via USB or I2C
Power Consumption
FFD/FFVA: 300-350mW (Typical)
Low Power FFD: 110mW (Full-Power), 54mW (Low-Power), <50mW possible with Sensory’s LPSD under certain conditions.
Obtaining the Hardware
The XK-VOICE-L71 DevKit and Hardware Manual can be obtained from the XK-VOICE-L71 product information page.
The XCORE-AI-EXPLORER DevKit and Hardware Manual used in the Microphone Aggregation example can be obtained from the XK-VOICE-L71 product information page.
Obtaining the Software
Development Tools
It is recommended that you download and install the latest release of the XTC Tools. XTC Tools 15.3.1 or newer are required. If you already have the XTC Toolchain installed, you can check the version with the following command:
xcc --version
Application Demonstrations
If you only want to run the example designs, pre-built firmware and other software can be downloaded from the XCORE-VOICE product information page.
Source Code
If you wish to modify the example designs, a zip archive of all source code can be downloaded from the XCORE-VOICE product information page.
If you have previously cloned the repository or downloaded a zip file of source code, the following commands can be used to update and fetch the submodules:
This is the far-field voice local command (FFD) example design. Two examples are provided; both include speech recognition and a local dictionary. One example uses the Sensory TrulyHandsfree™ (THF) libraries, and the other uses the Cyberon DSpotter™ libraries.
When a wakeword phrase is detected followed by a command phrase, the application will output an audio response and a discrete message over I2C and UART.
Sensory’s THF and Cyberon’s DSpotter™ libraries ship with an expiring development license. The Sensory library suspends recognition after 11.4 hours or 107 recognition events, and the Cyberon library suspends recognition after 100 recognition events. Once the limit is reached, a device reset is required to resume normal operation. To perform a reset, either power cycle the device or press the SW2 button.
Production software runs on a special device. Contact Cyberon, Sensory or XMOS sales for information about production use of the device.
Requirements
XK-VOICE-L71 board
Powered speaker(s) with 3.5mm jack connection (OPTIONAL)
Speak one of the wakewords followed by one of the commands from the lists below.
There are three LED states:
Flashing Green = Waiting for Wake Word
Solid Red & Green = Waiting for or Processing Command
Fast Flashing Red = Evaluation period has expired
After a reset, the application waits for the wakeword (flashing green). Upon recognizing ‘Hello XMOS’ or ‘Hello Cyberon’ (DSpotter™ model only), it waits for a command (solid red & green).
After a period of inactivity, or after successful command processing, the application returns to waiting for the wakeword (flashing green).
Sensory TrulyHandsfree™ and Cyberon DSpotter™ models detect the same commands, as listed below.
Wakewords
Hello XMOS
Hello Cyberon (DSpotter™ model only)
Dictionary Commands
Switch on the TV
Switch off the TV
Channel up
Channel down
Volume up
Volume down
Switch on the lights
Switch off the lights
Brightness up
Brightness down
Switch on the fan
Switch off the fan
Speed up the fan
Slow down the fan
Set higher temperature
Set lower temperature
Low Power Far-field Voice Local Command
Overview
This is the XCORE-VOICE low power far-field local control example design, demonstrating:
Low power control/handling
Small wake word model in SRAM
2-microphone far-field voice control with I2C or UART interface
Audio pipeline including interference cancelling and noise suppression
16-phrase English language speech recognition
Example designs
Demonstration
The low power far-field voice local command (Low Power FFD) example design targets low power
speech recognition using Sensory’s TrulyHandsfree™ (THF) speech recognition and local dictionary.
When the small wake word model running on tile 1 recognizes a wake word utterance, the device
transitions to full power mode where tile 0’s command model begins receiving audio samples,
continuing the command recognition process. On command recognition, the application outputs a
discrete message over I2C and UART.
Sensory’s THF software ships with an expiring development license. It will suspend recognition
after 11.4 hours or 107 recognition events; after which, a device reset is required to resume
normal operation. To perform a reset, either power cycle the device or press the SW2 button.
Note that SW2 is only functional while in full power mode (this application is configured to hold
the device in full-power mode on such license expiration events).
Required Hardware
XK-VOICE-L71 board
XTAG4 debug adapter
2x USB-Micro B cables
Host computer for programming
Hardware Setup
This example design requires an XTAG4 and XK-VOICE-L71 board.
Connect the XTAG4 to the debug header, as shown below.
Connect both USB Micro-B connections on the XTAG4 and XK-VOICE-L71 to the programming host computer.
Running the Demonstration
Flashing the Firmware
Connect the XTAG4 via USB to the host computer running the XTC tools, and power on the board directly via USB.
On the host computer, open an XTC Tools Command Prompt.
Being returned to the prompt means flashing has completed, and the XTAG4 may be disconnected.
Speech Recognition
Speak one of the wake words followed by one of the commands from the lists below.
There are four LED states:
Solid Red = Low Power. Waiting for wake word.
Blinking Green = Full power. Waiting for command.
Solid Red & Green = Full power. Processing command.
Flickering Red = Full power. End of evaluation (device reset required).
On startup, the application enters low power mode and waits for the wake word. Upon wake word
recognition, the device enters full power mode and waits for a command. Upon command recognition,
the device will queue the command for processing. On each wake word or command recognition, a timer
is reset (per tile). On expiration of the intent engine’s timer, the device will request a transition
to low power. The other tile may reject the request if its own timer has not expired or for other
application-specific reasons.
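The power-mode behavior described above can be sketched as a small state machine. This is a minimal illustration only, not the application's actual implementation: the type and function names (`lp_ctrl_t`, `lp_on_wake_word`, `lp_on_command`, `lp_tick`) and the 5-second timeout are hypothetical. It also assumes that the intent engine's timer starts when the device enters full power, and it ignores timer wraparound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INACTIVITY_TIMEOUT_MS 5000u /* hypothetical value, not taken from the SDK */

typedef enum { POWER_LOW, POWER_FULL } power_state_t;

/* Hypothetical controller state: one inactivity deadline per tile. */
typedef struct {
    power_state_t state;
    uint32_t wake_deadline_ms;   /* timer on the wake word tile */
    uint32_t intent_deadline_ms; /* timer on the intent engine (command) tile */
} lp_ctrl_t;

/* On startup the application is in low power mode, waiting for the wake word. */
void lp_init(lp_ctrl_t *c)
{
    assert(c != NULL);
    c->state = POWER_LOW;
    c->wake_deadline_ms = 0;
    c->intent_deadline_ms = 0;
}

/* Wake word recognized: enter full power if needed and reset the wake tile timer. */
void lp_on_wake_word(lp_ctrl_t *c, uint32_t now_ms)
{
    assert(c != NULL);
    if (c->state == POWER_LOW) {
        c->state = POWER_FULL;
        /* Assumption: the command timer starts on entering full power. */
        c->intent_deadline_ms = now_ms + INACTIVITY_TIMEOUT_MS;
    }
    c->wake_deadline_ms = now_ms + INACTIVITY_TIMEOUT_MS;
}

/* Command recognized: reset the intent engine's timer. */
void lp_on_command(lp_ctrl_t *c, uint32_t now_ms)
{
    assert(c != NULL);
    c->intent_deadline_ms = now_ms + INACTIVITY_TIMEOUT_MS;
}

/* Periodic check. Returns true if the device transitioned to low power. */
bool lp_tick(lp_ctrl_t *c, uint32_t now_ms)
{
    assert(c != NULL);
    if (c->state != POWER_FULL || now_ms < c->intent_deadline_ms) {
        return false; /* no low power request yet */
    }
    if (now_ms < c->wake_deadline_ms) {
        return false; /* request rejected: the other tile's timer has not expired */
    }
    c->state = POWER_LOW;
    return true;
}
```

In this sketch a second wake word at 2000 ms extends only the wake tile's deadline, so a low power request at 5000 ms is rejected and the transition happens once both timers have expired.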
These are the XCORE-VOICE far-field voice assistant example designs demonstrating:
2-microphone far-field voice assistant front-end
Audio pipeline including echo cancellation, interference cancelling and noise suppression
Stereo reference input and voice assistant output each supported as I2S or USB (UAC2.0)
This application can be used out of the box as a voice processor solution, or extended to run local wakeword engines.
These applications feature a full-duplex acoustic echo cancellation stage, which can be provided with reference audio via I2S or USB audio. An audio output ASR stream is also available via I2S or USB audio.
Connect either end of the ribbon cable to the XTAG4, and the other end to the XK-VOICE-L71 board as shown (Image shows piggybacked connection to RPi. Standalone operation is also supported):
Open a music player on host PC, and play a stereo file.
Check music is playing through powered speakers.
Adjust volume using music player or speakers.
Open Audacity and configure to communicate with kit. Input Device: XCORE-VOICE Voice Processor and Output Device: XCORE-VOICE Voice Processor
Set recording channels to 2 (Stereo) in Device
Set Project Rate to 48000Hz in Selection Toolbar.
Click Record (press ‘r’) to start capturing audio streamed from the XCORE-VOICE device.
Talk over music; move around the room while talking.
Stop music player.
Click Stop (press space) to stop recording. Audacity records single audio channel streamed from the XCORE-VOICE kit including extracted voice signal.
Click dropdown menu next to Audio Track, and select Split Stereo To Mono.
Click Solo on left channel of split processed audio. Increase Gain slider if necessary.
Click Play (press space) to playback processed audio.
Only your voice is audible. Playback music is removed by acoustic echo cancellation; voice is isolated by interference canceller; background noise is removed by noise suppression algorithms.
Copyright (c) 2017 Amazon.com, Inc., licensed under the MIT License
Sensory TrulyHandsfree™
The Sensory TrulyHandsfree™ speech recognition library is Copyright (C) 1995-2022 Sensory Inc. and is provided as an expiring development license. Commercial licensing is granted by Sensory Inc.
Cyberon DSpotter™
For any licensing questions about Cyberon DSpotter™ speech recognition library please contact Cyberon Corporation.
It is recommended that you download and install the latest release of the XTC Tools. XTC Tools 15.3.1 or newer are required for building, running, flashing and debugging the example applications.
CMake 3.21 or newer and Git are also required for building the example applications.
A standard C/C++ compiler is required to build applications for the host PC. Windows users may use Build Tools for Visual Studio command-line interface.
It is recommended to use Ninja as the build system for native Windows firmware builds.
To install Ninja, follow the installation instructions at https://ninja-build.org/ or, on Windows,
install it with winget by running the following commands in PowerShell:
# Install
winget install Ninja-build.ninja
# Reload user Path
$env:Path=[System.Environment]::GetEnvironmentVariable("Path","User")
XCORE-VOICE host builds should also work using other Windows GNU development environments like GNU Make, MinGW or Cygwin.
This is the XCORE-VOICE automated speech recognition (ASR) porting example design. This example can be used by 3rd-party ASR developers and ISVs to port their ASR library to xcore.ai.
The example reads a 1 channel, 16-bit, 16kHz wav file, slices it up into bricks, and calls the ASR library with each brick. The default brick length is 240 samples but this is configurable. ASR ports that implement the public API defined in modules/asr/asr.h can easily be added to current and future XCORE-VOICE example designs that support speech recognition.
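The brick slicing described above can be sketched as follows. This is an illustration under stated assumptions, not the example's actual source: `process_in_bricks` and `brick_handler_t` are hypothetical names, the handler stands in for the ASR library's processing call, and dropping a trailing partial brick is an assumption. At 16 kHz, the default brick of 240 samples corresponds to 15 ms of audio.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BRICK_SIZE_SAMPLES 240 /* default brick length stated in the text */

typedef int16_t asr_sample_t;  /* 16-bit audio sample */

/* Hypothetical callback standing in for the ASR library's per-brick call. */
typedef void (*brick_handler_t)(void *ctx, const asr_sample_t *brick, size_t n);

/* Pass `total` samples to `handler` one full brick at a time.
 * Assumption: a trailing partial brick is dropped.
 * Returns the number of bricks handed to the handler. */
size_t process_in_bricks(const asr_sample_t *samples, size_t total,
                         brick_handler_t handler, void *ctx)
{
    assert(samples != NULL && handler != NULL);
    size_t bricks = 0;
    for (size_t off = 0; off + BRICK_SIZE_SAMPLES <= total; off += BRICK_SIZE_SAMPLES) {
        handler(ctx, samples + off, BRICK_SIZE_SAMPLES);
        bricks++;
    }
    return bricks;
}

/* Example handler: accumulates the number of samples seen into *ctx. */
void count_samples(void *ctx, const asr_sample_t *brick, size_t n)
{
    (void)brick;
    *(size_t *)ctx += n;
}
```

For a 1000-sample buffer this hands four bricks (960 samples) to the handler and leaves 40 samples unprocessed.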
An oversimplified ASR port example is provided. This ASR port recognizes the “Hello XMOS” keyword if any acoustic activity is observed in 75 consecutive bricks.
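A toy detector in the spirit of this description might look like the sketch below. It is not the code shipped with the example: the names are hypothetical, and the activity measure (mean absolute amplitude above a fixed threshold) is an assumption chosen for illustration. Only the figure of 75 consecutive bricks comes from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ACTIVITY_BRICKS_REQUIRED 75 /* from the text */
#define ACTIVITY_THRESHOLD 500      /* hypothetical mean absolute amplitude threshold */

/* Hypothetical detector state: length of the current run of active bricks. */
typedef struct {
    unsigned consecutive_active;
} toy_detector_t;

/* Process one brick. Returns true on the brick that completes a run of
 * ACTIVITY_BRICKS_REQUIRED active bricks, then starts counting again. */
bool toy_detector_process(toy_detector_t *d, const int16_t *brick, size_t n)
{
    assert(d != NULL && brick != NULL && n > 0);

    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (uint64_t)abs((int)brick[i]);
    }
    bool active = (sum / n) > ACTIVITY_THRESHOLD;

    if (!active) {
        d->consecutive_active = 0; /* any quiet brick breaks the run */
        return false;
    }
    d->consecutive_active++;
    if (d->consecutive_active >= ACTIVITY_BRICKS_REQUIRED) {
        d->consecutive_active = 0;
        return true;
    }
    return false;
}
```

With 240-sample bricks at 16 kHz, 75 consecutive bricks correspond to 1.125 seconds of continuous activity.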
Connect the XTAG4 to the debug header, as shown below.
Connect the micro USB ports of the XTAG4 and the XK-VOICE-L71 to the programming host.
Deploying the Firmware with Linux or macOS
This document explains how to deploy the software using CMake and Make.
Building the Host Server
This application requires a host application to serve files to the device. The served file must be named test.wav. This filename is defined in src/app_conf.h.
Run the following commands in the root folder to build the host application using your native Toolchain:
Note
Permissions may be required to install the host applications.
The host application, xscope_host_endpoint, will be installed at /opt/xmos/bin, and may be moved if desired. You may wish to add this directory to your PATH variable.
Before running the host application, you may need to add the location of xscope_endpoint.so to your LD_LIBRARY_PATH environment variable. This environment variable will be set if you run the host application in the XTC Tools command-line environment. For more information see Configuring the command-line environment.
Building the Firmware
With your Python environment activated, run the following commands in the root folder to build the firmware:
Flashing the Model
The model file is part of the data partition file. The data partition file includes a file used to calibrate the flash followed by the model.
Run the following commands in the build folder to create the data partition:
make make_data_partition_example_asr
Then run the following commands in the build folder to flash the data partition:
Running the Firmware
In a second console, run the following command in the examples/speech_recognition folder to run the host server:
xscope_host_endpoint 12345
Deploying the Firmware with Native Windows
This document explains how to deploy the software using CMake and Ninja. If you are not using the native Windows MSVC build tools and are instead using a Linux emulation tool, refer to Deploying the Firmware with Linux or macOS.
To install Ninja, follow the installation instructions at https://ninja-build.org/ or, on Windows,
install it with winget by running the following commands in PowerShell:
# Install
winget install Ninja-build.ninja
# Reload user Path
$env:Path=[System.Environment]::GetEnvironmentVariable("Path","User")
Building the Host Server
This application requires a host application to serve files to the device. The served file must be named test.wav. This filename is defined in src/app_conf.h.
Run the following commands in the root folder to build the host application using your native Toolchain:
Note
Permissions may be required to install the host applications.
Note
A C/C++ compiler, such as Visual Studio or MinGW, must be included in the path.
Before building the host application, you will need to add the path to the XTC Tools to your environment.
The host application, xscope_host_endpoint.exe, will install at <USERPROFILE>\.xmos\bin, and may be moved if desired. You may wish to add this directory to your PATH variable.
Before running the host application, you may need to add the location of xscope_endpoint.dll to your PATH. This environment variable will be set if you run the host application in the XTC Tools command-line environment. For more information see Configuring the command-line environment.
Building the Firmware
With your Python environment activated, run the following commands in the root folder to build the firmware:
Implementing the ASR API
Begin your ASR port by creating a new folder under modules/asr/. The asr.h and device_memory.h files include comments detailing the public API methods and parameters. ASR ports that implement the public API defined in these files can easily be added to current and future XCORE-VOICE example designs that support speech recognition.
Pay close attention to the functions:
- asr_printf
- devmem_malloc
- devmem_free
- devmem_read_ext
- devmem_read_ext_async
- devmem_read_ext_wait
ASR libraries should call asr_printf instead of printf or xcore’s debug_printf.
ASR libraries must not call malloc directly to allocate dynamic memory. Instead call the devmem_malloc and devmem_free functions. This allows the application to provide alternative implementations of these functions - like pvPortMalloc and vPortFree in a FreeRTOS application.
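The indirection described above can be illustrated with a small sketch. The names here (`example_devmem_ctx_t`, `example_devmem_malloc`, `example_devmem_free`) are hypothetical and are not the declarations in device_memory.h; the sketch only shows the idea of routing allocations through application-supplied hooks. A counting wrapper around the C library heap stands in for an application-specific allocator such as the FreeRTOS pair mentioned in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical context holding application-supplied allocator hooks. */
typedef struct {
    void *(*malloc_fn)(size_t size);
    void (*free_fn)(void *ptr);
} example_devmem_ctx_t;

/* The ASR library calls these wrappers instead of malloc/free directly. */
void *example_devmem_malloc(example_devmem_ctx_t *ctx, size_t size)
{
    assert(ctx != NULL && ctx->malloc_fn != NULL);
    return ctx->malloc_fn(size);
}

void example_devmem_free(example_devmem_ctx_t *ctx, void *ptr)
{
    assert(ctx != NULL && ctx->free_fn != NULL);
    ctx->free_fn(ptr);
}

/* Stand-in application allocator: wraps the C heap and tracks live blocks.
 * An RTOS application could plug in its own heap functions here instead. */
size_t live_allocations = 0;

void *counting_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL) {
        live_allocations++;
    }
    return p;
}

void counting_free(void *ptr)
{
    if (ptr != NULL) {
        live_allocations--;
        free(ptr);
    }
}
```

Because the library only sees the context, the application can swap allocators without any change to the ASR port's code.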
The devmem_read_ext function is provided to load data directly from external memory (QSPI flash or LPDDR) into SRAM. This is the recommended
way to load coefficients or blocks of data from a model. It is far more efficient to load the data into SRAM and perform any math on the
data while it is in SRAM. The devmem_read_ext function has a signature similar to memcpy. The caller is responsible for
allocating the destination buffer.
Like devmem_read_ext, the devmem_read_ext_async function loads data directly from external memory (QSPI flash or LPDDR) into SRAM. It differs in that it does not block the caller’s thread; instead, it loads the data in another thread. A free core must be available when calling devmem_read_ext_async, or an exception will be raised. devmem_read_ext_async returns a handle that can later be used to wait for the load to complete. Call devmem_read_ext_wait to block the caller’s thread until the load is complete. Currently, each call to devmem_read_ext_async must be followed by a call to devmem_read_ext_wait; you cannot have more than one read in flight at a time.
Note
XMOS provides an arithmetic and DSP library which leverages the XS3 Vector Processing Unit (VPU) to accelerate costly operations on vectors of 16- or 32-bit data. Included are functions for block floating-point arithmetic, fast Fourier transforms, discrete cosine transforms, linear filtering and more. See the XMath Programming Guide for more information.
Note
To minimize SRAM scratch space usage, some ASR ports load coefficients into SRAM in chunks. This is useful when performing a routine such as a vector matrix multiply as this operation can be performed on a portion of the matrix at a time.
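The chunked approach in the note above can be sketched as a matrix-vector product that keeps only a few rows of the matrix in scratch memory at a time. This is an illustrative sketch: `example_read_ext` is a memcpy-based stand-in for a synchronous external-memory read (on hardware that role is played by devmem_read_ext, whose exact signature is not reproduced here), and `CHUNK_ROWS` and `matvec_chunked` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_ROWS 4 /* hypothetical: number of matrix rows held in scratch at once */

/* Stand-in for a synchronous external-memory read with a memcpy-like shape. */
static void example_read_ext(void *dest, const void *src, size_t n)
{
    memcpy(dest, src, n);
}

/* Compute y = M * x, where M (rows x cols, row-major) lives in "external" memory.
 * `scratch` must hold CHUNK_ROWS * cols samples and is allocated by the caller. */
void matvec_chunked(const int16_t *ext_matrix, size_t rows, size_t cols,
                    const int16_t *x, int32_t *y, int16_t *scratch)
{
    assert(ext_matrix != NULL && x != NULL && y != NULL && scratch != NULL);

    for (size_t r0 = 0; r0 < rows; r0 += CHUNK_ROWS) {
        /* The last chunk may contain fewer than CHUNK_ROWS rows. */
        size_t n = (rows - r0 < CHUNK_ROWS) ? (rows - r0) : CHUNK_ROWS;

        /* Load one chunk of coefficients into scratch memory. */
        example_read_ext(scratch, ext_matrix + r0 * cols, n * cols * sizeof(int16_t));

        /* Do the math while the coefficients are in scratch memory. */
        for (size_t r = 0; r < n; r++) {
            int32_t acc = 0;
            for (size_t c = 0; c < cols; c++) {
                acc += (int32_t)scratch[r * cols + c] * x[c];
            }
            y[r0 + r] = acc;
        }
    }
}
```

The scratch buffer size is fixed by `CHUNK_ROWS * cols` regardless of how many rows the full matrix has, which is the trade-off the note describes: more reads, less SRAM.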
When the port of the new ASR is complete, you can use the example in examples/speech_recognition to test it.
Note
You may also need to modify BRICK_SIZE_SAMPLES in app_conf.h to match the number of audio samples expected per process for your ASR port. In other example designs, this is defined by appconfINTENT_SAMPLE_BLOCK_LENGTH. This is set to 240 in the existing example designs.
In the current source code, the model data (and optional grammar data) are set in examples/speech_recognition/src/process_file.c. Modify these variables to reflect your data. The remainder of the API should be familiar to ASR developers. The API can be extended if necessary.
To flash your model, modify the --data argument passed to the xflash command in the Flashing the Model section.
See examples/speech_recognition/asr_example/asr_example_model.h to see how the model’s flash address is defined.
Placing Models in SRAM
Small models (near or under 100kB in size) may be placed in SRAM. See examples/speech_recognition/asr_example/asr_example_model.c for more information on placing your model in SRAM.
Model type or version is not compatible with the ASR library.
enumerator ASR_MODEL_CORRUPT
Model malformed.
enumerator ASR_NOT_INITIALIZED
Not Initialized.
enumerator ASR_EVALUATION_EXPIRED
Evaluation period has expired.
typedef void *asr_port_t
Typedef to the ASR port context struct.
An ASR port can store any data needed in the context. The context pointer is passed to all API methods and can be cast to any struct defined by the ASR port.
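A port's use of the opaque context might be sketched like this. Only the two typedefs come from the reference text; `my_port_ctx_t` and the `my_port_*` functions are hypothetical and do not correspond to functions declared in asr.h. To stay consistent with the rule against calling malloc directly, the caller supplies the storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void *asr_port_t;     /* opaque context handle, as documented above */
typedef int16_t asr_sample_t; /* base type of an audio sample */

/* Hypothetical private state of an ASR port. */
typedef struct {
    uint32_t bricks_processed;
    int32_t last_peak; /* largest absolute sample value in the last brick */
} my_port_ctx_t;

/* Initialize caller-provided storage and return it as an opaque handle. */
asr_port_t my_port_init(my_port_ctx_t *storage)
{
    assert(storage != NULL);
    storage->bricks_processed = 0;
    storage->last_peak = 0;
    return (asr_port_t)storage;
}

/* Every API method receives the handle and casts it back to the port's struct. */
void my_port_process(asr_port_t port, const asr_sample_t *samples, size_t n)
{
    assert(port != NULL && samples != NULL);
    my_port_ctx_t *ctx = (my_port_ctx_t *)port;

    int32_t peak = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t v = samples[i] < 0 ? -(int32_t)samples[i] : (int32_t)samples[i];
        if (v > peak) {
            peak = v;
        }
    }
    ctx->last_peak = peak;
    ctx->bricks_processed++;
}
```

The application never needs to know the layout of `my_port_ctx_t`; it only passes the handle back on each call.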
typedef int16_t asr_sample_t
Typedef representing the base type of an audio sample.
Synchronous extended memory read function that allows the application
to provide an alternative implementation. Blocks the caller’s thread until the read is completed.
Call devmem_read_ext instead of any other functions to read memory from flash, LPDDR or SDRAM. Modules are free to use memcpy if the dest and src are both SRAM addresses.
Parameters:
ctx – A pointer to the device memory context.
dest – A pointer to the destination array where the content is to be read.
src – A pointer to the word-aligned address of data to be read.
Ports of the Sensory and Cyberon speech recognition libraries are provided.
Speech Recognition Ports
Filename/Directory              Description
modules/asr directory           include folder for ASR modules and ports
modules/asr/sensory directory   contains the Sensory library and associated port code
modules/asr/Cyberon directory   contains the Cyberon library and associated port code
modules/asr/CMakeLists.txt      CMakeLists file for adding ASR port targets
Far-field Voice Local Command
Overview
This is the far-field voice local command (FFD) example design. Three examples are provided, all of which include speech recognition and a local dictionary. One example uses the Sensory TrulyHandsfree™ (THF) libraries, and the other two use the Cyberon DSpotter™ libraries. The two Cyberon DSpotter™ examples differ in the audio source fed into the intent engine: one uses the microphone array, and the other uses the I2S interface.
The examples using the microphone array as the audio source include an audio pipeline with the following stages:
Interference Canceler (IC) + Voice To Noise Ratio Estimator (VNR)
Noise Suppressor (NS)
Adaptive Gain Control (AGC)
The FFD examples provide several options to inform the host of a possible intent detected by the intent engine. The device can notify the host by:
sending the intent ID over a UART interface upon detecting the intent
sending the intent ID over an I2C master interface upon detecting the intent
allowing the host to poll the last detected intent ID over the I2C slave interface
playing an audio message over an I2S interface
When a wakeword phrase is detected followed by a command phrase, the application will output an audio response and a discrete message over I2C and UART.
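The wake word followed by command behaviour can be pictured as a small state machine. The sketch below is an illustration under stated assumptions, not the application's actual implementation: the names `intent_sm_t` and `intent_sm_step()` are hypothetical, ID 1 is taken as the wake word (as in the demo dictionaries), the 5000 ms window mirrors the documented default of `appconfINTENT_RESET_DELAY_MS`, and disarming after one command is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAKEWORD_ID     1      /* "Hello XMOS" in the demo dictionaries */
#define RESET_DELAY_MS  5000u  /* default of appconfINTENT_RESET_DELAY_MS */

typedef struct {
    bool     armed;        /* true while a command phrase is accepted */
    uint32_t armed_at_ms;  /* time at which the wake word was heard */
} intent_sm_t;

/* Feed one recognised ID into the state machine.
 * Returns the command ID to report to the host, or 0 if nothing is reported. */
int32_t intent_sm_step(intent_sm_t *sm, int32_t id, uint32_t now_ms)
{
    /* Close the command window once the reset delay has elapsed. */
    if (sm->armed && (uint32_t)(now_ms - sm->armed_at_ms) > RESET_DELAY_MS) {
        sm->armed = false;
    }

    if (id == WAKEWORD_ID) {
        sm->armed = true;
        sm->armed_at_ms = now_ms;
        return 0;
    }

    if (sm->armed && id > WAKEWORD_ID) {
        sm->armed = false; /* assumption: one command per wake word */
        return id;
    }

    return 0; /* command heard without a preceding wake word */
}
```

A command heard outside the window is dropped in this sketch, which is the behaviour that a raw-output mode would bypass.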
Sensory’s THF and Cyberon’s DSpotter™ libraries ship with an expiring development license. The Sensory one will suspend recognition after 11.4 hours or 107 recognition events, and the Cyberon one will suspend recognition after 100 recognition events. After the maximum number of recognitions is reached, a device reset is required to resume normal operation. To perform a reset, either power cycle the device or press the SW2 button.
This example application is supported on the XK-VOICE-L71 board.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Supported Hardware$$$Setting up the Hardware£££doc/programming_guide/ffd/hardware.html#setting-up-the-hardware
This example design requires an XTAG4 and XK-VOICE-L71 board.
This example application features audio playback responses. Speakers can be connected to the LINE OUT on the XK-VOICE-L71.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Configuring the Firmware£££doc/programming_guide/ffd/deploying/configuration.html#configuring-the-firmware
The default application performs as described in the Overview. There are numerous compile time options that can be added to change the example design without requiring code changes. To change the options explained in the table below, add the desired configuration variables to the APP_COMPILE_DEFINITIONS cmake variable in the .cmake file located in the examples/ffd/ folder.
If options are changed, the application firmware must be rebuilt.
FFD Compile Options
Compile Option | Description | Default Value
appconfINTENT_ENABLED | Enables/disables the intent engine, primarily for debug. | 1
appconfINTENT_RESET_DELAY_MS | Sets the period, in milliseconds, after the wake up phrase has been heard during which a valid command phrase is accepted | 5000
appconfINTENT_RAW_OUTPUT | Set to 1 to output all keywords found, skipping the internal wake up and command state machine | 0
appconfAUDIO_PLAYBACK_ENABLED | Enables/disables the audio playback command response | 1
appconfINTENT_UART_OUTPUT_ENABLED | Enables/disables the UART intent message | 1
appconfINTENT_UART_DEBUG_INFO_ENABLED | Enables/disables the UART intent debug information | 0
appconfI2C_MASTER_DAC_ENABLED | Enables/disables configuring the DAC over I2C master | 1
appconfINTENT_I2C_MASTER_OUTPUT_ENABLED | Enables/disables sending the intent message over I2C master | 1
appconfINTENT_I2C_MASTER_DEVICE_ADDR | Sets the address of the I2C device receiving the intent via the I2C master interface | 0x01
appconfINTENT_I2C_SLAVE_POLLED_ENABLED | Enables/disables allowing another device to poll the intent message via I2C slave | 0
appconfI2C_SLAVE_DEVICE_ADDR | Sets the address of the I2C device receiving the intent via the I2C slave interface | 0x42
appconfINTENT_I2C_REG_ADDRESS | Sets the address of the I2C register that stores the intent message; this value can be read via the I2C slave interface | 0x01
appconfUART_BAUD_RATE | Sets the baud rate for the UART tx intent interface | 9600
appconfUSE_I2S_INPUT | Uses the I2S audio source instead of the microphone array audio source. | 0
appconfI2S_MODE | Selects the I2S mode; supported values are appconfI2S_MODE_MASTER and appconfI2S_MODE_SLAVE | master
appconfI2S_AUDIO_SAMPLE_RATE | Selects the sample rate of the I2S interface; supported values are 16000 and 48000 | 16000
appconfRECOVER_MCLK_I2S_APP_PLL | Enables/disables the recovery of the MCLK from the Software PLL application; this removes the need to use an external MCLK. | 0
appconfINTENT_TRANSPORT_DELAY_MS | Sets the delay, in milliseconds, between the host wake up request and the I2C and UART keyword code transmission | 50
appconfINTENT_QUEUE_LEN | Sets the maximum number of detected intents to hold while waiting for the host to wake up | 10
appconfINTENT_WAKEUP_EDGE_TYPE | Sets the host wake up pin GPIO edge type. 0 for rising edge, 1 for falling edge | 0
appconfAUDIO_PIPELINE_SKIP_IC_AND_VNR | Set to 1 to skip the IC and VNR stages | 0
appconfAUDIO_PIPELINE_SKIP_NS | Set to 1 to skip the NS stage | 0
appconfAUDIO_PIPELINE_SKIP_AGC | Set to 1 to skip the AGC stage | 0
Note
The example_ffd_i2s_input_cyberon has different default values from the ones in the table above.
The list of updated values can be found in the APP_COMPILE_DEFINITIONS list in examples\ffd\ffd_i2s_input_cyberon.cmake.
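Compile options of this kind are commonly consumed in C as preprocessor macros with guarded fallbacks, so that a definition passed on the compiler command line takes precedence. The sketch below illustrates that pattern using two defaults from the table above. It is an assumption about how such options can be wired up rather than a copy of the application's configuration header, and `example_intent_reset_delay_ms()` and `example_intent_queue_len()` are hypothetical helpers.

```c
#include <stdint.h>

/* Fallback defaults, taken from the FFD Compile Options table. A value
 * supplied by the build system (for example through the
 * APP_COMPILE_DEFINITIONS list) overrides these because of the #ifndef
 * guards. */
#ifndef appconfINTENT_RESET_DELAY_MS
#define appconfINTENT_RESET_DELAY_MS 5000
#endif

#ifndef appconfINTENT_QUEUE_LEN
#define appconfINTENT_QUEUE_LEN 10
#endif

/* Hypothetical accessors, used here only to show that the options resolve
 * to compile-time constants. */
uint32_t example_intent_reset_delay_ms(void)
{
    return (uint32_t)appconfINTENT_RESET_DELAY_MS;
}

uint32_t example_intent_queue_len(void)
{
    return (uint32_t)appconfINTENT_QUEUE_LEN;
}
```

Since the values are fixed at compile time, changing an option has no effect until the firmware is rebuilt.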
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Configuring the Firmware$$$Configuring the I2C interfaces£££doc/programming_guide/ffd/deploying/configuration.html#configuring-the-i2c-interfaces
The I2C interfaces are used to configure the DAC and to communicate with the host. The I2C interface can be configured as a master or a slave.
The DAC must be configured at bootup via the I2C master interface.
The I2C master is used when the FFD example asynchronously sends intent messages to the host. The I2C slave is used when the host wants to read intent messages from the FFD example through polling.
Note
The I2C interface cannot operate as both master and slave simultaneously. The FFD example design uses the I2C master interface to configure the DAC at device initialisation.
However, if the host reads intent messages from the FFD example using the I2C slave interface, the I2C master interface will be disabled after the DAC configuration is complete.
To send the intent ID via the I2C master interface when a command is detected, set the following variables:
appconfINTENT_I2C_MASTER_OUTPUT_ENABLED to 1.
appconfINTENT_I2C_MASTER_DEVICE_ADDR to the desired address used by the I2C slave device.
appconfINTENT_I2C_SLAVE_POLLED_ENABLED to 0, this will disable the I2C slave interface.
To configure the FFD example so that the host can poll for the intent via the I2C slave interface, set the following variables:
appconfINTENT_I2C_SLAVE_POLLED_ENABLED to 1.
appconfI2C_SLAVE_DEVICE_ADDR to the desired address used by the I2C master device.
appconfINTENT_I2C_REG_ADDRESS to the desired register read by the I2C master device.
appconfINTENT_I2C_MASTER_OUTPUT_ENABLED to 0, this will disable the I2C master interface after initialization.
The handling of the I2C slave registers is done in the examples\ffd\src\i2c_reg_handling.c file. The variable appconfINTENT_I2C_REG_ADDRESS is used in the callback function read_device_reg().
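As a rough picture of what a polled register read involves, the sketch below stores the last detected intent and returns it when the host reads the configured register address. The names `example_store_intent()` and `example_read_reg()` and their signatures are hypothetical; the real callback in i2c_reg_handling.c may differ. The register address mirrors the documented default of `appconfINTENT_I2C_REG_ADDRESS`.

```c
#include <stddef.h>
#include <stdint.h>

#define INTENT_I2C_REG_ADDRESS 0x01u /* default of appconfINTENT_I2C_REG_ADDRESS */

/* Last intent ID detected by the intent engine (0 means none yet). */
static uint8_t last_intent_id = 0;

/* Hypothetical hook called when a new intent is detected. */
void example_store_intent(uint8_t id)
{
    last_intent_id = id;
}

/* Hypothetical register-read handler. Writes one byte to *out and returns 1
 * when the host reads the intent register, or returns 0 for any other
 * register address. */
size_t example_read_reg(uint8_t reg_addr, uint8_t *out)
{
    if (out == NULL || reg_addr != INTENT_I2C_REG_ADDRESS) {
        return 0;
    }
    *out = last_intent_id;
    return 1;
}
```

In this sketch the host always sees the most recent intent, so an intent that arrives between two polls overwrites the previous one.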
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Configuring the Firmware$$$Configuring the I2S interface£££doc/programming_guide/ffd/deploying/configuration.html#configuring-the-i2s-interface
The I2S interface is used to play the audio command response to the DAC, and/or to receive the audio samples from the host. The I2S interface can be configured as either a master or a slave.
To configure the I2S interface, set the following variables:
appconfI2S_ENABLED to 1.
appconfI2S_MODE to the desired mode, either appconfI2S_MODE_MASTER or appconfI2S_MODE_SLAVE.
appconfI2S_AUDIO_SAMPLE_RATE to the desired sample rate, either 16000 or 48000.
appconfRECOVER_MCLK_I2S_APP_PLL to 1 if an external MCLK is not available, otherwise set it to 0.
appconfAUDIO_PLAYBACK_ENABLED to 1, if the intent audio is to be played back.
appconfUSE_I2S_INPUT to 1, if the I2S audio source is to be used instead of the microphone array audio source.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Deploying the Firmware with Linux or macOS£££doc/programming_guide/ffd/deploying/linux_macos.html#deploying-the-firmware-with-linux-or-macos
This document explains how to deploy the software using CMake and Make.
Note
In the commands below <speech_engine> can be either sensory or cyberon, depending on the choice of the speech recognition engine and model.
Note
The Cyberon speech recognition engine is integrated in two examples. The example_ffd_cyberon uses the microphone array as the audio source, and the example_ffd_i2s_input_cyberon uses the I2S interface as the audio source.
In the rest of this section, we use only the example_ffd_<speech_engine> as an example.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Deploying the Firmware with Linux or macOS$$$Building the Host Applications£££doc/programming_guide/ffd/deploying/linux_macos.html#building-the-host-applications
This application requires a host application to create the flash data partition. Run the following commands in the root folder to build the host application using your native Toolchain:
Note
Permissions may be required to install the host applications.
cmake -B build_host
cd build_host
make install
The host applications will be installed at /opt/xmos/bin, and may be moved if desired. You may wish to add this directory to your PATH variable.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Deploying the Firmware with Linux or macOS$$$Building the Firmware£££doc/programming_guide/ffd/deploying/linux_macos.html#building-the-firmware
After having your python environment activated, run the following commands in the root folder to build the firmware:
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Deploying the Firmware with Linux or macOS$$$Running the Firmware£££doc/programming_guide/ffd/deploying/linux_macos.html#running-the-firmware
Before running the firmware, the filesystem and model must be flashed to the
data partition.
Within the root of the build folder, run:
make flash_app_example_ffd_<speech_engine>
After this command completes, the application will be running.
After flashing the data partition, the application can be run without
reflashing. If changes are made to the data partition components, the
application must be reflashed.
From the build folder run:
xrun --xscope example_ffd_<speech_engine>.xe
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Deploying the Firmware with Linux or macOS$$$Debugging the Firmware£££doc/programming_guide/ffd/deploying/linux_macos.html#debugging-the-firmware
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Deploying the Firmware with Native Windows£££doc/programming_guide/ffd/deploying/native_windows.html#deploying-the-firmware-with-native-windows
This document explains how to deploy the software using CMake and Ninja. If you are not using native Windows MSVC build tools and are instead using a Linux emulation tool such as WSL, refer to Deploying the Firmware with Linux or macOS.
To install Ninja follow install instructions at https://ninja-build.org/ or on Windows
install with winget by running the following commands in PowerShell:
# Install
winget install Ninja-build.ninja
# Reload user Path
$env:Path=[System.Environment]::GetEnvironmentVariable("Path","User")
Note
In the commands below <speech_engine> can be either sensory or cyberon, depending on the choice of the speech recognition engine and model.
Note
The Cyberon speech recognition engine is integrated in two examples. The example_ffd_cyberon uses the microphone array as the audio source, and the example_ffd_i2s_input_cyberon uses the I2S interface as the audio source.
In the rest of this section, we use only the example_ffd_<speech_engine> as an example.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Deploying the Firmware with Native Windows$$$Building the Host Applications£££doc/programming_guide/ffd/deploying/native_windows.html#building-the-host-applications
This application requires a host application to create the flash data partition. Run the following commands in the root folder to build the host application using your native Toolchain:
Note
Permissions may be required to install the host applications.
Note
A C/C++ compiler, such as Visual Studio or MinGW, must be included in the path.
Before building the host application, you will need to add the path to the XTC Tools to your environment.
The host applications will be installed at %USERPROFILE%\.xmos\bin, and may be moved if desired. You may wish to add this directory to your PATH variable.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Deploying the Firmware with Native Windows$$$Building the Firmware£££doc/programming_guide/ffd/deploying/native_windows.html#building-the-firmware
After having your python environment activated, run the following commands in the root folder to build the firmware:
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Deploying the Firmware with Native Windows$$$Running the Firmware£££doc/programming_guide/ffd/deploying/native_windows.html#running-the-firmware
Before running the firmware, the filesystem and model must be flashed to the data partition.
Within the root of the build folder, run:
ninja flash_app_example_ffd_<speech_engine>
After this command completes, the application will be running.
After flashing the data partition, the application can be run without reflashing. If changes are made to the data partition components, the application must be reflashed.
From the build folder run:
xrun --xscope example_ffd_<speech_engine>.xe
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Deploying the Firmware with Native Windows$$$Debugging the Firmware£££doc/programming_guide/ffd/deploying/native_windows.html#debugging-the-firmware
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software£££doc/programming_guide/ffd/modifying.html#modifying-the-software
The FFD example design is highly customizable. This section describes how to modify the application.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Host Integration£££doc/programming_guide/ffd/host_integration.html#host-integration
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Overview£££doc/programming_guide/ffd/host_integration.html#overview
This section describes the connections that would need to be made to an external host for plug and play integration with existing devices.
When an intent is found, the XCORE device will check if the host is awake, by checking the Host Status GPIO pin. If the host is awake the intent code will be transmitted over I2C and/or UART.
If the host is not awake, the XCORE device will trigger a transition of the Wakeup GPIO pin. This can be configured to be a rising or falling edge. The XCORE device will then wait for a fixed period of time, set at compile time, before transmitting the intent over the I2C and/or UART interface. This behavior can be changed as desired by modifying the intent handling code.
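The wake-up decision described above can be sketched as a pure function that works out what needs to happen before an intent is transmitted. This is only an illustration: `host_notify_plan_t` and `plan_host_notify()` are hypothetical names, the pin levels before and after the edge are assumptions, and the 50 ms delay mirrors the documented default of `appconfINTENT_TRANSPORT_DELAY_MS`.

```c
#include <stdbool.h>
#include <stdint.h>

#define TRANSPORT_DELAY_MS 50u /* default of appconfINTENT_TRANSPORT_DELAY_MS */

typedef struct {
    bool     drive_wakeup_edge;    /* true if the Wakeup pin must be toggled */
    int      wakeup_level_before;  /* pin level before the edge */
    int      wakeup_level_after;   /* pin level after the edge */
    uint32_t delay_before_send_ms; /* wait time before I2C/UART transmission */
} host_notify_plan_t;

/* host_awake: state read from the Host Status pin.
 * edge_type:  0 for a rising edge, 1 for a falling edge
 *             (as for appconfINTENT_WAKEUP_EDGE_TYPE). */
host_notify_plan_t plan_host_notify(bool host_awake, int edge_type)
{
    host_notify_plan_t plan = { false, 0, 0, 0u };

    if (!host_awake) {
        plan.drive_wakeup_edge    = true;
        plan.wakeup_level_before  = (edge_type == 0) ? 0 : 1;
        plan.wakeup_level_after   = (edge_type == 0) ? 1 : 0;
        plan.delay_before_send_ms = TRANSPORT_DELAY_MS;
    }
    return plan;
}
```

Keeping the decision separate from the GPIO and transport calls makes it straightforward to replace the fixed delay with, for example, polling the Host Status pin.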
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$UART£££doc/programming_guide/ffd/host_integration.html#uart
UART Connections
FFD Connection | Host Connection
J4:24 | UART RX
J4:20 | GND
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$I2C£££doc/programming_guide/ffd/host_integration.html#i2c
I2C Connections
FFD Connection | Host Connection
J4:3 | SDA
J4:5 | SCL
J4:9 | GND
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$GPIO£££doc/programming_guide/ffd/host_integration.html#gpio
GPIO Connections
FFD Connection | Host Connection
J4:19 | Wake up input
J4:21 | Host Status output
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Audio Pipeline£££doc/programming_guide/ffd/audio_pipeline.html#audio-pipeline
The audio pipeline in FFD processes two channel PDM microphone input into a single output channel, intended for use by an ASR engine.
The audio pipeline consists of 3 stages.
FFD Audio Pipeline
Stage | Description | Input Channel Count | Output Channel Count
1 | Interference Canceller and Voice Noise Ratio | 2 | 1
2 | Noise Suppression | 1 | 1
3 | Automatic Gain Control | 1 | 1
See the Voice Framework User Guide for more information.
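When stages are reordered or replaced, it can help to check that the channel counts still chain together. The sketch below encodes the three stages from the table and verifies that each stage's input matches the previous stage's output. The types and the function `pipeline_is_consistent()` are hypothetical and exist only for illustration; they are not part of the SDK.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    const char *name;
    unsigned    in_channels;
    unsigned    out_channels;
} stage_desc_t;

/* Channel counts from the FFD Audio Pipeline table. */
static const stage_desc_t ffd_stages[] = {
    { "IC + VNR", 2, 1 },
    { "NS",       1, 1 },
    { "AGC",      1, 1 },
};

#define FFD_STAGE_COUNT (sizeof(ffd_stages) / sizeof(ffd_stages[0]))

/* Returns true if the stages can be chained from source_channels inputs to
 * sink_channels outputs without a channel-count mismatch. */
bool pipeline_is_consistent(const stage_desc_t *stages, size_t count,
                            unsigned source_channels, unsigned sink_channels)
{
    if (stages == NULL || count == 0) {
        return false;
    }

    unsigned channels = source_channels;
    for (size_t i = 0; i < count; i++) {
        if (stages[i].in_channels != channels) {
            return false;
        }
        channels = stages[i].out_channels;
    }
    return channels == sink_channels;
}
```

For the default design, the source is the two-channel microphone input and the sink is the single-channel ASR engine input.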
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Software Description£££doc/programming_guide/ffd/software_description.html#software-description
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Overview£££doc/programming_guide/ffd/software_desc/overview.html#overview
The estimated power usage of the example application ranges from 100 to 141 mW. The actual figure depends on component tolerances and on any user-added code or compile options.
FFD Resources
Resource | Tile 0 | Tile 1
Total Memory Free | 145k | 208k
Runtime Heap Memory Free | 38k | 42k
FFD CPU Usage
Core ID | Typical Mean CPU Usage (%) | Standard Deviation CPU Usage (%) | Typical Min CPU Usage (%, 10 ms rolling) | Typical Max CPU Usage (%, 10 ms rolling)
tile[0], core[0] | 0.006 | 0.345 | 0.000 | 21.030
tile[0], core[1] | 0.072 | 2.031 | 0.000 | 80.690
tile[0], core[2] | 0.082 | 2.287 | 0.000 | 100.000
tile[0], core[3] | 1.666 | 2.906 | 0.000 | 54.560
tile[0], core[4] | 65.925 | 27.828 | 0.000 | 91.220
tile[1], core[0] | 0.014 | 0.540 | 0.000 | 27.440
tile[1], core[1] | 99.990 | 0.505 | 74.000 | 100.000
tile[1], core[2] | 99.990 | 0.507 | 73.870 | 100.000
tile[1], core[3] | 18.272 | 13.259 | 0.000 | 98.220
tile[1], core[4] | 17.231 | 11.048 | 0.000 | 37.260
Note that these are typical usage statistics for a representative run of the application on hardware. Core allocations may shift run-to-run in a scheduled RTOS.
These statistics are generated by slicing the representative run into 10 ms chunks and calculating % time per chunk not spent in the FreeRTOS IDLE tasks.
Therefore, the underlying distribution of these 10 ms bins should not be assumed to be Normal; this has implications on e.g. the interpretation of the Standard Deviation given here.
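The binning method described above can be illustrated with a short host-side calculation. This is a sketch under stated assumptions, not the tool used to produce the table: the input is assumed to be the idle time per 10 ms bin in microseconds, and `cpu_stats_t` and `cpu_usage_stats()` are hypothetical names. The function reports the variance; the standard deviation is its square root.

```c
#include <stddef.h>
#include <stdint.h>

#define BIN_US 10000u /* one 10 ms bin, in microseconds */

typedef struct {
    double mean;     /* mean CPU usage across bins, in percent */
    double variance; /* population variance of the per-bin usage */
    double min;      /* lowest per-bin usage, in percent */
    double max;      /* highest per-bin usage, in percent */
} cpu_stats_t;

/* idle_us[i] is the time spent in the IDLE task during bin i.
 * Returns 0 on success, -1 on invalid input. */
int cpu_usage_stats(const uint32_t *idle_us, size_t bins, cpu_stats_t *out)
{
    if (idle_us == NULL || out == NULL || bins == 0) {
        return -1;
    }

    double sum = 0.0, sum_sq = 0.0, min = 100.0, max = 0.0;

    for (size_t i = 0; i < bins; i++) {
        if (idle_us[i] > BIN_US) {
            return -1; /* idle time cannot exceed the bin length */
        }
        /* Usage is the share of the bin not spent idle. */
        double usage = 100.0 * (double)(BIN_US - idle_us[i]) / (double)BIN_US;
        sum += usage;
        sum_sq += usage * usage;
        if (usage < min) min = usage;
        if (usage > max) max = usage;
    }

    out->mean = sum / (double)bins;
    out->variance = sum_sq / (double)bins - out->mean * out->mean;
    out->min = min;
    out->max = max;
    return 0;
}
```

A core that alternates between fully idle and fully busy bins has a mean of 50% yet never actually runs at 50%, which is why the standard deviation should not be read as if the bins were normally distributed.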
FFD Power Usage
Power State | Power (mW)
Always | 114
The description of the software is split up by folder:
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$audio_pipeline_init£££doc/programming_guide/ffd/software_desc/src.html#audio-pipeline-init
This function has the role of creating the audio pipeline, with two optional application pointers which are provided to the application in the audio_pipeline_input() and audio_pipeline_output() callbacks.
In FFD, the audio pipeline is initialized with no additional arguments, and instantiates a 3 stage pipeline on tile 1, as described in:
Audio Pipeline
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$audio_pipeline_input£££doc/programming_guide/ffd/software_desc/src.html#audio-pipeline-input
This function has the role of providing the audio pipeline with the input frames.
In FFD, the input is received from the rtos_mic_array driver.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$audio_pipeline_output£££doc/programming_guide/ffd/software_desc/src.html#audio-pipeline-output
This function has the role of receiving the processed audio pipeline output.
In FFD, the output is sent to the intent engine.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Main£££doc/programming_guide/ffd/software_desc/src.html#main
If replacing the existing model, these are the only two functions that are required to be populated.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$intent_engine_create£££doc/programming_guide/ffd/software_desc/intent_engine.html#intent-engine-create
This function has the role of creating the model running task and providing a pointer, which can be used by the application to handle the output intent result. In the case of the default configuration, the application provides a FreeRTOS Queue object.
The ASR engine is on tile 0 in both FFD and FFVA, but the audio pipeline output is on tile 1 for FFD and on tile 0 for FFVA.
The call to intent_engine_intertile_task_create() will create two threads on tile 0. One thread is the ASR engine thread. The other thread is an intertile rx thread, which will interface with the audio pipeline output.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$intent_engine_ready_sync£££doc/programming_guide/ffd/software_desc/intent_engine.html#intent-engine-ready-sync
This function is called by both tiles and serves to ensure that tile 0 is ready to receive
audio samples before starting the audio pipeline. This is a preventative measure to avoid dropping
samples at startup.
If replacing the existing handler code, this is the only function that is required to be populated.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$intent_handler_create£££doc/programming_guide/ffd/software_desc/intent_handler.html#intent-handler-create
This function has the role of creating the keyword handling task for the ASR engine. In the case of the Sensory and Cyberon models, the application provides a FreeRTOS Queue object. This handler is on the same tile as the speech recognition engine, tile 0.
The call to intent_handler_create() will create one thread on tile 0. This thread will receive ID packets from the ASR engine over a FreeRTOS Queue object and output over various IO interfaces based on configuration.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Software Modifications£££doc/programming_guide/ffd/software_modifications.html#software-modifications
The FFD example design consists of three major software blocks: the audio pipeline, the keyword spotter, and the keyword handler. This section describes how to replace any or all of these subsystems.
It is highly recommended to be familiar with the application as a whole before attempting to replace these functional units. This information can be found here:
Software Description
See Software Description for more details on the memory footprint and CPU usage of the major software components.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Replacing XCORE-VOICE DSP Block£££doc/programming_guide/ffd/software_modifications.html#replacing-xcore-voice-dsp-block
The audio pipeline can be replaced by making changes to the audio_pipeline.c file.
It is up to the user to ensure that the input and output frames of the audio pipeline remain the same, or the remainder of the application will not function properly.
This section will walk through an example of replacing the XMOS NS stage with a custom stage foo.
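As a rough idea of what a drop-in stage such as foo could look like, the sketch below applies a fixed gain with saturation to a frame in place, leaving the frame length and sample format unchanged. Everything here is hypothetical: `foo_ctx_t`, `foo_init()`, `foo_process()`, the Q16 gain format and the use of 32-bit samples are assumptions for illustration and do not reflect the actual frame structure in audio_pipeline.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical DSP context for the custom stage. */
typedef struct {
    int32_t gain_q16; /* gain in Q16 fixed point: 0x10000 is unity */
} foo_ctx_t;

void foo_init(foo_ctx_t *ctx, int32_t gain_q16)
{
    ctx->gain_q16 = gain_q16;
}

/* Clamp a 64-bit intermediate to the 32-bit sample range. */
static int32_t saturate_i32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Process one frame in place. The frame keeps the same length and format,
 * so the stages before and after it are unaffected. */
void foo_process(const foo_ctx_t *ctx, int32_t *frame, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        int64_t scaled = ((int64_t)frame[i] * ctx->gain_q16) / 65536;
        frame[i] = saturate_i32(scaled);
    }
}
```

The context is initialised once before the pipeline starts, and the processing function is then called once per frame from the stage that previously ran the NS.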
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Declaration and Definition of DSP Context£££doc/programming_guide/ffd/software_modifications.html#declaration-and-definition-of-dsp-context
It is also possible to add or remove stages. Refer to the RTOS Framework documentation on the generic pipeline sw_service.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Replacing Example Design Interfaces£££doc/programming_guide/ffd/software_modifications.html#replacing-example-design-interfaces
You may want a different output interface to talk to a host, or no host at all, with the intent handled locally on the XCORE device.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Different Peripheral IO£££doc/programming_guide/ffd/software_modifications.html#different-peripheral-io
To add or remove a peripheral IO, modify the bsp_config accordingly. Refer to documentation inside the RTOS Framework on how to instantiate different RTOS peripheral drivers.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Direct Control£££doc/programming_guide/ffd/software_modifications.html#direct-control
In a single controller system, the XCORE can be used to control peripherals directly.
The proc_keyword_res task can be modified as follows:
Intent Handler (intent_handler.c)
static void proc_keyword_res(void *args)
{
    QueueHandle_t q_intent = (QueueHandle_t) args;
    int32_t id = 0;

    while (1) {
        xQueueReceive(q_intent, &id, portMAX_DELAY);

        /* User logic here */
    }
}
This code example will receive the ID of each intent, and can be populated by any user application logic. User logic can use other RTOS drivers to control various peripherals, such as screens, motors, lights, etc, based on the intent engine outputs.
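The user logic placed in that loop could be as simple as a switch on the intent ID. The sketch below maps the lighting commands from the Sensory English demo dictionary (return codes 9 to 12) onto a hypothetical light state. `light_state_t`, `apply_intent()` and the brightness range of 0 to 10 are assumptions for illustration; the Cyberon model uses different return codes for the same utterances.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return codes from the Sensory English demo dictionary. */
enum {
    ID_LIGHTS_ON       = 9,
    ID_LIGHTS_OFF      = 10,
    ID_BRIGHTNESS_UP   = 11,
    ID_BRIGHTNESS_DOWN = 12
};

/* Hypothetical state of a directly controlled light. */
typedef struct {
    bool on;
    int  brightness; /* assumed range: 0 to 10 */
} light_state_t;

/* Apply one intent to the light. Returns true if the ID was handled. */
bool apply_intent(light_state_t *light, int32_t id)
{
    switch (id) {
    case ID_LIGHTS_ON:
        light->on = true;
        return true;
    case ID_LIGHTS_OFF:
        light->on = false;
        return true;
    case ID_BRIGHTNESS_UP:
        if (light->brightness < 10) light->brightness++;
        return true;
    case ID_BRIGHTNESS_DOWN:
        if (light->brightness > 0) light->brightness--;
        return true;
    default:
        return false; /* not a lighting command */
    }
}
```

In the task above, a function like this would be called with the received `id` after each `xQueueReceive()`, followed by whatever driver call updates the real hardware.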
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Speech Recognition - Sensory£££doc/programming_guide/ffd/speech_recognition_sensory.html#speech-recognition-sensory
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$License£££doc/programming_guide/ffd/speech_recognition_sensory.html#license
The Sensory TrulyHandsfree™ (THF) speech recognition library is Copyright (C) 1995-2022 Sensory Inc., All Rights Reserved.
Sensory THF software requires a commercial license granted by Sensory Inc.
This software ships with an expiring development license. It will suspend recognition after 11.4 hours
or 107 recognition events.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Overview£££doc/programming_guide/ffd/speech_recognition_sensory.html#overview
The Sensory THF speech recognition engine runs proprietary models to identify keywords in an audio stream. Models can be generated using VoiceHub.
Two models are provided: one in US English and one in Mainland Mandarin. The US English model is used by default. To modify the software to use the Mandarin model, see the comment at the top of the ffd_sensory.cmake file. Make sure to run the following commands to rebuild and re-flash the data partition:
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Dictionary command table£££doc/programming_guide/ffd/speech_recognition_sensory.html#dictionary-command-table
English Language Demo
Utterances | Type | Return code (decimal)
Hello XMOS | keyword | 1
Switch on the TV | command | 3
Switch off the TV | command | 4
Channel up | command | 5
Channel down | command | 6
Volume up | command | 7
Volume down | command | 8
Switch on the lights | command | 9
Switch off the lights | command | 10
Brightness up | command | 11
Brightness down | command | 12
Switch on the fan | command | 13
Switch off the fan | command | 14
Speed up the fan | command | 15
Slow down the fan | command | 16
Set higher temperature | command | 17
Set lower temperature | command | 18
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Application Integration£££doc/programming_guide/ffd/speech_recognition_sensory.html#application-integration
In-depth information on out-of-the-box integration can be found here: Host Integration
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Speech Recognition - Cyberon£££doc/programming_guide/ffd/speech_recognition_cyberon.html#speech-recognition-cyberon
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$License£££doc/programming_guide/ffd/speech_recognition_cyberon.html#license
Cyberon DSpotter™ software requires a commercial license granted by Cyberon Corporation.
This software ships with an expiring development license. It will suspend recognition after 100 recognition events.
Production versions of the DSpotter™ library are unrestricted when running on a specially licensed XMOS device. Please contact Cyberon or XMOS sales for further information.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Overview£££doc/programming_guide/ffd/speech_recognition_cyberon.html#overview
The Cyberon DSpotter™ speech recognition engine runs proprietary models to identify keywords in an audio stream.
One model for US English is provided. For any technical questions or additional models please contact Cyberon.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Dictionary command table£££doc/programming_guide/ffd/speech_recognition_cyberon.html#dictionary-command-table
English Language Demo
Utterances | Type | Return code (decimal)
Hello XMOS | keyword | 1
Hello Cyberon | keyword | 1
Switch on the TV | command | 2
Switch off the TV | command | 3
Channel up | command | 4
Channel down | command | 5
Volume up | command | 6
Volume down | command | 7
Switch on the lights | command | 8
Switch off the lights | command | 9
Brightness up | command | 10
Brightness down | command | 11
Switch on the fan | command | 12
Switch off the fan | command | 13
Speed up the fan | command | 14
Slow down the fan | command | 15
Set higher temperature | command | 16
Set lower temperature | command | 17
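The Cyberon and Sensory demo dictionaries list the same command utterances in the same order but with different return codes (commands start at 2 for Cyberon and at 3 for Sensory). Host or handler code that must work with either engine can normalise the codes first. The sketch below derives an engine-independent command index from the two tables; `engine_t` and `command_index()` are hypothetical names and are not part of the SDK.

```c
#include <stdint.h>

typedef enum {
    ENGINE_SENSORY,
    ENGINE_CYBERON
} engine_t;

#define DEMO_COMMAND_COUNT 16 /* "Switch on the TV" to "Set lower temperature" */

/* Map an engine-specific return code to a command index in the range
 * 0 (Switch on the TV) to 15 (Set lower temperature).
 * Returns -1 for keywords and for codes outside the demo dictionary. */
int command_index(engine_t engine, int32_t code)
{
    /* First command code: 3 in the Sensory table, 2 in the Cyberon table. */
    int32_t first = (engine == ENGINE_SENSORY) ? 3 : 2;
    int32_t index = code - first;

    if (index < 0 || index >= DEMO_COMMAND_COUNT) {
        return -1;
    }
    return (int)index;
}
```

The mapping holds only for the demo models listed in this guide; a custom model needs its own table.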
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Local Command$$$Modifying the Software$$$Application Integration£££doc/programming_guide/ffd/speech_recognition_cyberon.html#application-integration
In-depth information on out-of-the-box integration can be found here: Host Integration
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command£££doc/programming_guide/low_power_ffd/low_power_ffd.html#low-power-far-field-voice-local-command
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Overview£££doc/programming_guide/low_power_ffd/overview.html#overview
The low power far-field voice local command (Low Power FFD) example design targets low power
speech recognition using Sensory’s TrulyHandsfree™ (THF) speech recognition and local dictionary.
When the small wake word model running on tile 1 recognizes a wake word utterance, the device
transitions to full power mode where tile 0’s command model begins receiving audio samples,
continuing the command recognition process. On command recognition, the application outputs a
discrete message over I2C and UART.
Tile 0’s command model, in combination with a timer, determines when to request a transition to low
power. Tile 1 may accept or reject this request based on its own timer that is reset on wake word
recognitions and potentially other application-specific events. The figure below illustrates the
general behavior.
When in low power mode, tile 0 is effectively disabled along with any peripheral/IO associated with
that tile.
Sensory’s THF software ships with an expiring development license. Recognition is suspended
after 11.4 hours or 107 recognition events, after which a device reset is required to resume
normal operation. To perform a reset, either power cycle the device or press the SW2 button.
Note that SW2 is only functional while in full power mode (this application is configured to hold
the device in full power mode when the license expires).
More information on the Sensory speech recognition library can be found here: Speech Recognition
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Supported Hardware£££doc/programming_guide/low_power_ffd/hardware.html#supported-hardware
This example application is supported on the XK-VOICE-L71 board.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Supported Hardware$$$Setting up the Hardware£££doc/programming_guide/low_power_ffd/hardware.html#setting-up-the-hardware
This example design requires an XTAG4 and XK-VOICE-L71 board.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Supported Hardware$$$xTAG£££doc/programming_guide/low_power_ffd/hardware.html#xtag
The xTAG is used to program and debug the device.
Connect the xTAG to the debug header, as shown below.
Connect the micro USB ports of the XTAG4 and the XK-VOICE-L71 to the programming host.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Configuring the Firmware£££doc/programming_guide/low_power_ffd/deploying/configuration.html#configuring-the-firmware
The default application performs as described in the Overview. There
are numerous compile time options that can be added to change the example design without requiring
code changes. To change the options explained in the table below, add the desired configuration
variables to the APP_COMPILE_DEFINITIONS CMake variable located in the example’s CMake file
here.
If options are changed, the application firmware must be rebuilt.
Low Power FFD Compile Options
Compile Option | Description | Default Value
appconfINTENT_RESET_DELAY_MS | Sets the period, in milliseconds, after the wake word phrase or a subsequent command/wake word phrase has been heard during which a valid command phrase is accepted | 4000
appconfINTENT_UART_OUTPUT_ENABLED | Enables/disables the UART intent message | 1
appconfINTENT_I2C_MASTER_OUTPUT_ENABLED | Enables/disables sending the intent message over I2C master | 1
appconfUART_BAUD_RATE | Sets the baud rate for the UART tx intent interface | 9600
appconfINTENT_I2C_MASTER_DEVICE_ADDR | Sets the I2C slave address to transmit the intent to | 0x01
appconfINTENT_TRANSPORT_DELAY_MS | Sets the delay, in milliseconds, between the host wake up request and the I2C and UART keyword code transmission | 50
appconfINTENT_QUEUE_LEN | Sets the maximum number of detected intents to hold while waiting for the host to wake up | 10
appconfINTENT_WAKEUP_EDGE_TYPE | Sets the host wake up pin GPIO edge type. 0 for rising edge, 1 for falling edge | 0
appconfAUDIO_PIPELINE_SKIP_IC_AND_VNR | Skips the IC and VNR when set to 1 | 0
appconfAUDIO_PIPELINE_SKIP_NS | Skips the NS when set to 1 | 0
appconfAUDIO_PIPELINE_SKIP_AGC | Skips the AGC when set to 1 | 0
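The sketch below shows one common way such compile options are consumed in C: a default is supplied behind an #ifndef guard so that a definition passed at build time takes precedence. Whether the application's app_conf.h is structured this way is an assumption; the default values are taken from the table, while the static checks and the helper wakeup_pin_level_after_edge are hypothetical illustrations.

```c
#include <assert.h>

/* Defaults taken from the compile options table. A -D definition supplied at
 * build time takes precedence because of the #ifndef guards. */
#ifndef appconfINTENT_RESET_DELAY_MS
#define appconfINTENT_RESET_DELAY_MS 4000
#endif
#ifndef appconfINTENT_TRANSPORT_DELAY_MS
#define appconfINTENT_TRANSPORT_DELAY_MS 50
#endif
#ifndef appconfINTENT_QUEUE_LEN
#define appconfINTENT_QUEUE_LEN 10
#endif
#ifndef appconfINTENT_WAKEUP_EDGE_TYPE
#define appconfINTENT_WAKEUP_EDGE_TYPE 0 /* 0 = rising edge, 1 = falling edge */
#endif

/* Hypothetical compile-time sanity checks (not the project's own checks). */
_Static_assert(appconfINTENT_QUEUE_LEN > 0,
               "intent queue needs at least one slot");
_Static_assert(appconfINTENT_WAKEUP_EDGE_TYPE == 0 ||
               appconfINTENT_WAKEUP_EDGE_TYPE == 1,
               "edge type must be 0 or 1");

/* Hypothetical helper: the level the wake up pin rests at after the edge.
 * A rising edge ends high (1); a falling edge ends low (0). */
int wakeup_pin_level_after_edge(int edge_type)
{
    assert(edge_type == 0 || edge_type == 1);
    return edge_type == 0 ? 1 : 0;
}
```

With this pattern, adding a definition such as appconfINTENT_RESET_DELAY_MS=6000 to the build changes the value without touching the source.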
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Deploying the Firmware with Linux or macOS£££doc/programming_guide/low_power_ffd/deploying/linux_macos.html#deploying-the-firmware-with-linux-or-macos
This document explains how to deploy the software using CMake and Make.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Deploying the Firmware with Linux or macOS$$$Building the Host Applications£££doc/programming_guide/low_power_ffd/deploying/linux_macos.html#building-the-host-applications
This application requires a host application to create the flash data partition. Run the following
commands in the root folder to build the host application using your native toolchain:
Note
Permissions may be required to install the host applications.
cmake -B build_host
cd build_host
make install
The host applications will be installed at /opt/xmos/bin, and may be moved if desired. You may
wish to add this directory to your PATH variable.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Deploying the Firmware with Linux or macOS$$$Building the Firmware£££doc/programming_guide/low_power_ffd/deploying/linux_macos.html#building-the-firmware
After having your python environment activated, run the following commands in the root folder to build the firmware:
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Deploying the Firmware with Linux or macOS$$$Running the Firmware£££doc/programming_guide/low_power_ffd/deploying/linux_macos.html#running-the-firmware
Before running the firmware, the filesystem and command model must be flashed to the data partition.
Within the root of the build folder, run:
make flash_app_example_low_power_ffd_sensory
After this command completes, the application will be running.
After flashing the data partition, the application can be run without reflashing. If changes are
made to the data partition components, the application must be reflashed.
From the build folder run:
xrun --xscope example_low_power_ffd_sensory.xe
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Deploying the Firmware with Linux or macOS$$$Debugging the Firmware£££doc/programming_guide/low_power_ffd/deploying/linux_macos.html#debugging-the-firmware
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Deploying the Firmware with Native Windows£££doc/programming_guide/low_power_ffd/deploying/native_windows.html#deploying-the-firmware-with-native-windows
This document explains how to deploy the software using CMake and Ninja. If you are not using
native Windows MSVC build tools and instead using a Linux emulation tool such as WSL, refer to
Deploying the Firmware with Linux or macOS.
To install Ninja follow install instructions at https://ninja-build.org/ or on Windows
install with winget by running the following commands in PowerShell:
# Install
winget install Ninja-build.ninja
# Reload user Path
$env:Path=[System.Environment]::GetEnvironmentVariable("Path","User")
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Deploying the Firmware with Native Windows$$$Building the Host Applications£££doc/programming_guide/low_power_ffd/deploying/native_windows.html#building-the-host-applications
This application requires a host application to create the flash data partition. Run the following
commands in the root folder to build the host application using your native toolchain:
Note
Permissions may be required to install the host applications.
Note
A C/C++ compiler, such as Visual Studio or MinGW, must be included in the path.
Before building the host application, you will need to add the path to the XTC Tools to your environment.
The host applications will be installed at %USERPROFILE%\.xmos\bin, and may be moved if desired.
You may wish to add this directory to your PATH variable.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Deploying the Firmware with Native Windows$$$Building the Firmware£££doc/programming_guide/low_power_ffd/deploying/native_windows.html#building-the-firmware
After having your python environment activated, run the following commands in the root folder to build the firmware:
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Deploying the Firmware with Native Windows$$$Running the Firmware£££doc/programming_guide/low_power_ffd/deploying/native_windows.html#running-the-firmware
Before running the firmware, the filesystem and command model must be flashed to the data partition.
Within the root of the build folder, run:
ninja flash_app_example_low_power_ffd_sensory
After this command completes, the application will be running.
After flashing the data partition, the application can be run without reflashing. If changes are
made to the data partition components, the application must be reflashed.
From the build folder run:
xrun --xscope example_low_power_ffd_sensory.xe
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Deploying the Firmware with Native Windows$$$Debugging the Firmware£££doc/programming_guide/low_power_ffd/deploying/native_windows.html#debugging-the-firmware
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software£££doc/programming_guide/low_power_ffd/modifying.html#modifying-the-software
The low-power FFD example design is highly customizable. This section describes how to modify the application.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Host Integration£££doc/programming_guide/low_power_ffd/host_integration.html#host-integration
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Overview£££doc/programming_guide/low_power_ffd/host_integration.html#overview
This section describes the connections that would need to be made to an external host for plug and
play integration with existing devices.
When an intent is found, the XCORE device will check if the host is awake, by checking the Host
Status GPIO pin. If the host is awake the intent code will be transmitted over I2C and/or UART.
If the host is not awake, the XCORE device will trigger a transition of the Wakeup GPIO pin. This
can be configured to be a rising or falling edge. The XCORE device will then wait for a fixed
period of time, set at compile time, before transmitting the intent over the I2C and/or UART
interface. This behavior can be changed as desired by modifying the intent handling code.
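The wake-then-transmit flow described above can be sketched as follows. This is a minimal off-target illustration, not the application's intent handling code: the host_io_t structure, deliver_intent and all mock_ names are hypothetical, and the hardware accesses are replaced by stubs that only record what happened.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware hooks so the flow can be exercised off-target. */
typedef struct {
    bool (*host_is_awake)(void);    /* reads the Host Status GPIO pin */
    void (*wakeup_edge)(void);      /* drives the configured edge on the Wakeup pin */
    void (*delay_ms)(uint32_t ms);  /* fixed wait, set at compile time */
    void (*send_i2c)(uint8_t code); /* NULL when I2C output is disabled */
    void (*send_uart)(uint8_t code);/* NULL when UART output is disabled */
} host_io_t;

/* Deliver one intent code: wake the host first if it is not awake. */
void deliver_intent(const host_io_t *io, uint8_t code, uint32_t transport_delay_ms)
{
    assert(io != NULL && io->host_is_awake && io->wakeup_edge && io->delay_ms);
    if (!io->host_is_awake()) {
        io->wakeup_edge();
        io->delay_ms(transport_delay_ms);
    }
    if (io->send_i2c) {
        io->send_i2c(code);
    }
    if (io->send_uart) {
        io->send_uart(code);
    }
}

/* Recording stubs standing in for the real GPIO, I2C and UART drivers. */
bool mock_awake;
int mock_wakeups, mock_i2c_sent, mock_uart_sent;
uint32_t mock_delay_total_ms;

bool mock_host_is_awake(void) { return mock_awake; }
void mock_wakeup_edge(void) { mock_wakeups++; }
void mock_delay_ms(uint32_t ms) { mock_delay_total_ms += ms; }
void mock_send_i2c(uint8_t code) { (void)code; mock_i2c_sent++; }
void mock_send_uart(uint8_t code) { (void)code; mock_uart_sent++; }

const host_io_t mock_io = {
    mock_host_is_awake, mock_wakeup_edge, mock_delay_ms,
    mock_send_i2c, mock_send_uart,
};
```

Passing the hooks in a structure keeps the decision logic separate from the drivers, which is one way to change the behavior without touching the transport code.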
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$UART£££doc/programming_guide/low_power_ffd/host_integration.html#uart
UART Connections
Low Power FFD Connection | Host Connection
J4:24 | UART RX
J4:20 | GND
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$I2C£££doc/programming_guide/low_power_ffd/host_integration.html#i2c
I2C Connections
Low Power FFD Connection | Host Connection
J4:3 | SDA
J4:5 | SCL
J4:9 | GND
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$GPIO£££doc/programming_guide/low_power_ffd/host_integration.html#gpio
GPIO Connections
Low Power FFD Connection | Host Connection
J4:19 | Wake up input
J4:21 | Host Status output
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Audio Pipeline£££doc/programming_guide/low_power_ffd/audio_pipeline.html#audio-pipeline
The audio pipeline in Low Power FFD processes a two-channel PDM microphone input into a single output channel, intended for use by an ASR engine.
The audio pipeline consists of 3 stages.
FFD Audio Pipeline
Stage | Description | Input Channel Count | Output Channel Count
1 | Interference Canceller and Voice Noise Ratio | 2 | 1
2 | Noise Suppression | 1 | 1
3 | Automatic Gain Control | 1 | 1
See the Voice Framework User Guide for more information.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Software Description£££doc/programming_guide/low_power_ffd/software_description.html#software-description
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Overview£££doc/programming_guide/low_power_ffd/software_desc/overview.html#overview
The approximate resource utilizations for Low Power FFD are shown in the table below.
Low Power FFD Resources
Resource | Tile 0 | Tile 1
Unused CPU Time (600MHz / 200MHz) | 50% | 10%
Total Memory Free | 19.1k | 5.3k
Runtime Heap Memory Free | 219k | 12.4k
The estimated (core) power usage for Low Power FFD is shown in the table below. Additional power
savings may be possible using Sensory’s Low Power Sound Detect (LPSD) option, which approaches sub-50mW
operation in Low Power mode. These measurements will vary based on component tolerances and any
user-added code and/or compile options.
Low Power FFD Power Usage
Power State | Core Power (mW)
Low Power | 54
Full Power | 110
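To relate the two figures to a usage pattern, the sketch below estimates average core power as a time-weighted mix of the two states. This is a simplification under stated assumptions: it uses only the two estimated values from the table, ignores any energy spent during transitions, and the function name average_core_power_mw is hypothetical.

```c
#include <assert.h>

/* Estimated core power figures from the power usage table, in mW. */
#define LOW_POWER_MW  54.0
#define FULL_POWER_MW 110.0

/* Time-weighted average core power for a given fraction of time spent in
 * full power mode (0.0 to 1.0). Transition costs are not modelled. */
double average_core_power_mw(double full_power_fraction)
{
    assert(full_power_fraction >= 0.0 && full_power_fraction <= 1.0);
    return full_power_fraction * FULL_POWER_MW +
           (1.0 - full_power_fraction) * LOW_POWER_MW;
}
```

For example, a device that spends a quarter of its time in full power mode averages 0.25 × 110 + 0.75 × 54 = 68 mW under this model.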
The description of the software is split up by folder:
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$bsp_config£££doc/programming_guide/low_power_ffd/software_desc/bsp_config.html#bsp-config
This folder contains bsp_configs for the Low Power FFD application. More information on bsp_configs
can be found in the RTOS Framework documentation.
Low Power FFD bsp_config
Filename/Directory | Description
dac directory | DAC ports for supported bsp_configs (not used in example, disabled)
XK_VOICE_L71 directory | default Low Power FFD application bsp_config
bsp_config.cmake | cmake for adding Low Power FFD bsp_configs
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$filesystem_support£££doc/programming_guide/low_power_ffd/software_desc/filesystem_support.html#filesystem-support
This folder contains filesystem contents for the Low Power FFD application.
Low Power FFD filesystem_support
Filename/Directory | Description
demo.txt | A file for demonstrative purposes containing the text “Hello World!”. This file is not used or interacted with in this application.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$model£££doc/programming_guide/low_power_ffd/software_desc/model.html#model
This folder contains the Sensory wake word and command model files for the Low Power FFD application.
Note
Only a subset of the files below are used. See low_power_ffd.cmake for the files used by the
application. Also note the nibble-swapped net-file is manually generated, via the nibble_swap
tool found in lib_qspi_fast_read.
The command model’s net-file, in binary-form (nibble swapped, for supporting fast flash reads)
command-pc62w-6.1.0-op10-prod-net.c | The command model’s net-file, in source form
command-pc62w-6.1.0-op10-prod-search.bin | The command model’s search-file, in binary form
command-pc62w-6.1.0-op10-prod-search.c | The command model’s search-file, in source form
command-pc62w-6.1.0-op10-prod-search.h | The command model’s search header-file
command.snsr | The command model’s Sensory THF/TNL SDK “snsr” file
wakeword-pc60w-6.1.0-op10-prod-net.bin | The wake word model’s net-file, in binary form
wakeword-pc60w-6.1.0-op10-prod-net.c | The wake word model’s net-file, in source form
wakeword-pc60w-6.1.0-op10-prod-search.bin | The wake word model’s search-file, in binary form
wakeword-pc60w-6.1.0-op10-prod-search.c | The wake word model’s search-file, in source form
wakeword-pc60w-6.1.0-op10-prod-search.h | The wake word model’s search header-file
wakeword.snsr | The wake word model’s Sensory THF/TNL SDK “snsr” file
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$src£££doc/programming_guide/low_power_ffd/software_desc/src.html#src
This folder contains the core application source.
FFD src
Filename/Directory | Description
gpio_ctrl directory | contains general purpose input handling and LED handling tasks
intent_engine directory | contains intent engine code
intent_handler directory | contains intent handling code
power directory | contains low power control logic and related audio buffer
rtos_conf directory | contains default FreeRTOS configuration headers
wakeword directory | contains wake word detection code
app_conf_check.h | header to validate app_conf.h
app_conf.h | header to describe app configuration
config.xscope | xscope configuration file
ff_appconf.h | default fatfs configuration header
main.c | main application source file
device_memory_impl.c | contains XCORE device memory functions for supporting ASR functionality
device_memory_impl.h | header for the device memory implementation
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Audio Pipeline£££doc/programming_guide/low_power_ffd/software_desc/src.html#audio-pipeline
The audio pipeline module provides the application with three API functions:
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$audio_pipeline_init£££doc/programming_guide/low_power_ffd/software_desc/src.html#audio-pipeline-init
This function has the role of creating the audio pipeline, with two optional application pointers
which are provided to the application in the audio_pipeline_input() and audio_pipeline_output() callbacks.
In Low Power FFD, the audio pipeline is initialized with no additional arguments, and instantiates a
3 stage pipeline on tile 1, as described in:
Audio Pipeline
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$audio_pipeline_input£££doc/programming_guide/low_power_ffd/software_desc/src.html#audio-pipeline-input
This function has the role of providing the audio pipeline with the input frames.
In Low Power FFD, the input is received from the rtos_mic_array driver.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$audio_pipeline_output£££doc/programming_guide/low_power_ffd/software_desc/src.html#audio-pipeline-output
This function has the role of receiving the processed audio pipeline output.
In Low Power FFD, the output is sent to both the wake word handler and the intent engine. Because
the intent engine is suspended in low power mode, and resuming full power operation takes a finite
time, a ring buffer is placed between the audio output received from this routine and the intent
engine’s stream buffer.
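The role of such a buffer can be illustrated with a minimal ring buffer sketch that keeps the most recent samples while the consumer is unavailable. This is not the application's low_power_audio_buffer implementation: the type sample_ring_t, the function names, the capacity, and the policy of overwriting the oldest sample when full are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_CAPACITY 8 /* hypothetical size, in samples */

typedef struct {
    int32_t data[RING_CAPACITY];
    size_t head;  /* index of the oldest sample */
    size_t count; /* number of valid samples */
} sample_ring_t;

void ring_init(sample_ring_t *r)
{
    assert(r != NULL);
    r->head = 0;
    r->count = 0;
}

/* Enqueue samples; when the buffer is full the oldest sample is overwritten. */
void ring_push(sample_ring_t *r, const int32_t *samples, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t tail = (r->head + r->count) % RING_CAPACITY;
        r->data[tail] = samples[i];
        if (r->count < RING_CAPACITY) {
            r->count++;
        } else {
            r->head = (r->head + 1) % RING_CAPACITY;
        }
    }
}

/* Dequeue up to max samples, oldest first; returns the number copied. */
size_t ring_pop(sample_ring_t *r, int32_t *out, size_t max)
{
    size_t n = (max < r->count) ? max : r->count;
    for (size_t i = 0; i < n; i++) {
        out[i] = r->data[r->head];
        r->head = (r->head + 1) % RING_CAPACITY;
    }
    r->count -= n;
    return n;
}
```

On the return to full power, the buffered samples would be drained into the consumer before live samples, so that speech spoken during the transition is not lost.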
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Main£££doc/programming_guide/low_power_ffd/software_desc/src.html#main
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$startup_task£££doc/programming_guide/low_power_ffd/software_desc/src.html#startup-task
This function has the role of launching tasks on each tile. For those familiar with XCORE, it is comparable to the main par loop in an XC main.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$vApplicationMinimalIdleHook£££doc/programming_guide/low_power_ffd/software_desc/src.html#vapplicationminimalidlehook
This is a FreeRTOS callback. Calling “waiteu” without events configured saves both MIPS and power on XCORE.
vApplicationMinimalIdleHook (main.c)
asm volatile("waiteu");
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$tile_common_init£££doc/programming_guide/low_power_ffd/software_desc/src.html#tile-common-init
This function is the common tile initialization, which initializes the bsp_config, creates the startup task, and starts the FreeRTOS kernel.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$main_tile0£££doc/programming_guide/low_power_ffd/software_desc/src.html#main-tile0
This function is the application C entry point on tile 0, provided by the SDK.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$main_tile1£££doc/programming_guide/low_power_ffd/software_desc/src.html#main-tile1
This function is the application C entry point on tile 1, provided by the SDK.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$src/gpio_ctrl£££doc/programming_guide/low_power_ffd/software_desc/gpio_ctrl.html#src-gpio-ctrl
This folder contains the GPIO and LED related functionality for the Low Power FFD application.
Low Power FFD gpio_ctrl
Filename/Directory | Description
gpi_ctrl.c | The general purpose input control source file. Implements SW2 reset logic.
gpi_ctrl.h | The general purpose input control header file.
leds.c | The LED task source file. Handles the application’s LED indications.
leds.h | The LED task header file.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$src/intent_engine£££doc/programming_guide/low_power_ffd/software_desc/intent_engine.html#src-intent-engine
This folder contains the intent engine module for the low power FFD application.
Low Power FFD Intent Engine
Filename/Directory | Description
intent_engine_io.c | contains additional IO intent engine code
intent_engine_support.c | contains general intent engine support code
intent_engine.c | contains the implementation of default intent engine code
intent_engine.h | header for intent engine code
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Major Components£££doc/programming_guide/low_power_ffd/software_desc/intent_engine.html#major-components
The intent engine module provides the application with the following primary API functions:
These APIs provide the functionality needed to feed audio pipeline samples into the ASR engine.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$intent_engine_create£££doc/programming_guide/low_power_ffd/software_desc/intent_engine.html#intent-engine-create
This function has the role of creating the model running task and providing a pointer, which can be
used by the application to handle the output intent result. In the case of the default configuration,
the application provides a FreeRTOS Queue object.
In Low Power FFD, the audio pipeline output is on tile 1 and the ASR engine on tile 0.
intent_engine_create snippet (intent_engine_io.c)
intent_engine_intertile_task_create(priority);
The call to intent_engine_intertile_task_create() will create two threads on tile 0. One thread is
the ASR engine thread. The other thread is an intertile RX thread, which will interface with the
audio pipeline output.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$intent_engine_ready_sync£££doc/programming_guide/low_power_ffd/software_desc/intent_engine.html#intent-engine-ready-sync
This function is called by both tiles and serves to ensure that tile 0 is ready to receive
audio samples before starting the audio pipeline. This is a preventative measure to avoid dropping
samples at startup.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$intent_engine_sample_push£££doc/programming_guide/low_power_ffd/software_desc/intent_engine.html#intent-engine-sample-push
This function has the role of sending the ASR output channel from the audio pipeline to the intent engine.
In Low Power FFD, the audio pipeline output is on tile 1 and the ASR engine on tile 0.
The call to intent_engine_samples_send_remote() will send the audio samples to the previously
configured intertile RX thread.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$intent_engine_process_asr_result£££doc/programming_guide/low_power_ffd/software_desc/intent_engine.html#intent-engine-process-asr-result
This function can be replaced by the application to handle the intent in a completely different manner.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Low Power Components£££doc/programming_guide/low_power_ffd/software_desc/intent_engine.html#low-power-components
The following APIs are the intent engine mechanisms needed by the power control task.
In this implementation, it is the responsibility of tile 0 (intent engine tile) to determine when
to request a transition into low power mode; however, tile 1 may reject the request. When tile 1
accepts the request (via LOW_POWER_ACK), the power control task calls intent_engine_low_power_accept.
When tile 1 rejects the request (via LOW_POWER_NAK), the power control task calls
intent_engine_full_power_request.
Note
There is an additional LOW_POWER_HALT response where the power control task calls
intent_engine_halt. This is primarily for end-of-evaluation handling logic for the underlying
ASR engine and is not needed for a normal application.
After tile 1 accepts the low power request, tile 0 begins preparations for entering low power by
locking various resources and waiting for any enqueued commands to finish up. The helper functions
below are provided for this purpose.
Before tile 1 sends LOW_POWER_ACK it also stops pushing audio samples via intent_engine_sample_push.
After receiving the low power response, the application may clear the stream buffer and keyword
queue to avoid processing stale samples/commands when returning to full power mode. The functions
below provide this functionality.
Since it is possible that a command is spoken/recognized between the time when tile 0 requests
low power and when tile 1 responds to the request, the application should not reset these
buffer entities until it has received LOW_POWER_ACK; otherwise, recognized commands may be lost.
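The response handling described above can be sketched as a simple dispatch. The response names LOW_POWER_ACK, LOW_POWER_NAK and LOW_POWER_HALT come from the text; the enum type, the power_trace_t counters and handle_low_power_response are hypothetical stand-ins, and the comments mark where the real intent engine functions would be called, since their signatures are not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type for tile 1's reply to a low power request. */
typedef enum {
    LOW_POWER_ACK,
    LOW_POWER_NAK,
    LOW_POWER_HALT
} low_power_response_t;

/* Counters standing in for the real side effects, so the flow is observable. */
typedef struct {
    int accepted;
    int rejected;
    int halted;
    int buffers_cleared;
} power_trace_t;

void handle_low_power_response(low_power_response_t rsp, power_trace_t *t)
{
    assert(t != NULL);
    switch (rsp) {
    case LOW_POWER_ACK:
        /* Where intent_engine_low_power_accept would be called. */
        t->accepted++;
        /* Only now is it safe to drop stale samples and queued keywords. */
        t->buffers_cleared++;
        break;
    case LOW_POWER_NAK:
        /* Where intent_engine_full_power_request would be called.
         * Buffers are left intact so pending commands are not lost. */
        t->rejected++;
        break;
    case LOW_POWER_HALT:
        /* Where intent_engine_halt would be called (end of evaluation). */
        t->halted++;
        break;
    }
}
```

Keeping the buffer reset inside the ACK branch reflects the constraint that a command recognized between the request and the response must still be processed.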
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Evaluation Specific Components£££doc/programming_guide/low_power_ffd/software_desc/intent_engine.html#evaluation-specific-components
The following functions are provided for the primary purpose of facilitating the evaluation of the
ASR model. The provided ASR models have evaluation periods which will end due to various factors.
When the evaluation period ends, the application logic halts the intent engine via intent_engine_halt.
This is primarily to ensure the device remains in full power mode, so that functionality that may be
exclusive to tile 0 continues to operate.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$src/intent_handler£££doc/programming_guide/low_power_ffd/software_desc/intent_handler.html#src-intent-handler
This folder contains ASR output handling modules for the Low Power FFD application.
FFD Intent handler
Filename/Directory | Description
intent_handler.c | contains the implementation of default intent handling code
intent_handler.h | header for intent handler code
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Major Components£££doc/programming_guide/low_power_ffd/software_desc/intent_handler.html#major-components
The intent handling module provides the application with one API function:
If replacing the existing handler code, this is the only function that is required to be populated.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$intent_handler_create£££doc/programming_guide/low_power_ffd/software_desc/intent_handler.html#intent-handler-create
This function has the role of creating the keyword handling task for the ASR engine. In the case of
the Sensory model, the application provides a FreeRTOS Queue object. This handler is on the same
tile as the Sensory engine, tile 0.
The call to intent_handler_create() will create one thread on tile 0. This thread will receive ID
packets from the ASR engine over a FreeRTOS Queue object and output over various IO interfaces based
on configuration.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$src/power£££doc/programming_guide/low_power_ffd/software_desc/power.html#src-power
This folder contains the low power control logic and supporting logic.
Low Power FFD power
Filename/Directory | Description
low_power_audio_buffer.c | Implementation of an audio sample ring buffer. Aids in responsiveness to commands during a transition to full power mode.
low_power_audio_buffer.h | Header for the low power audio buffer.
power_control.c | Implementation of the power control logic.
power_control.h | Header for power control logic.
power_state.c | Implementation of Tile 1 power state logic.
power_state.h | Header for power state logic.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Major Components£££doc/programming_guide/low_power_ffd/software_desc/power.html#major-components
The power control module provides the application with the following primary API functions:
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$power_control_task_create£££doc/programming_guide/low_power_ffd/software_desc/power.html#power-control-task-create
Creates and starts the power control task. To be called by each tile.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$power_control_exit_low_power£££doc/programming_guide/low_power_ffd/software_desc/power.html#power-control-exit-low-power
Applicable only for Tile 1. Begins a transition to full power mode and is intended to be called by
the power_state_set() routine.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$power_control_state_get£££doc/programming_guide/low_power_ffd/software_desc/power.html#power-control-state-get
Applicable only for Tile 1. Gets the current power state.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$power_control_halt£££doc/programming_guide/low_power_ffd/software_desc/power.html#power-control-halt
Applicable only for Tile 1. Halts the power control task. This is provided primarily for
end-of-evaluation logic, but serves to terminate the low power logic. When halted, the system
remains in full power mode.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$power_control_req_low_power£££doc/programming_guide/low_power_ffd/software_desc/power.html#power-control-req-low-power
Applicable only for Tile 0. Requests a transition to low power mode.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$power_control_ind_complete£££doc/programming_guide/low_power_ffd/software_desc/power.html#power-control-ind-complete
Applicable only for Tile 0. Indicates that the last step in preparing for a low power transition
has completed, allowing the power control task to continue with its final steps. This is primarily to
ensure the LED indications are up to date before driver locks are taken (which include GPIO/LED control).
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Power State Components£££doc/programming_guide/low_power_ffd/software_desc/power.html#power-state-components
The power state module provides the base power state datatype (power_state_t) used by
other low power logic, along with the following primary API functions:
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$power_state_init£££doc/programming_guide/low_power_ffd/software_desc/power.html#power-state-init
Initializes the power state module. Responsible for initializing the underlying timer that effectively
determines whether a low power request by Tile 0 is accepted or rejected.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$power_state_set£££doc/programming_guide/low_power_ffd/software_desc/power.html#power-state-set
Used by Tile 1’s application to signal full power events (such as wake word detection or other
application-specific events). Used by Tile 1’s power control logic to signal low power only after
Tile 0 has requested low power mode and the local timer has expired.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$power_state_timer_expired_get£££doc/programming_guide/low_power_ffd/software_desc/power.html#power-state-timer-expired-get
Used by Tile 1’s power control logic to determine whether to accept or reject a low power request by Tile 0.
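The accept/reject behaviour described for power_state_set and power_state_timer_expired_get can be modelled as follows. This is a host-side sketch only: the model_* names, the millisecond time base and the hold time are hypothetical and do not reflect the actual module's signatures or timer implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_HOLD_MS 5000u /* hypothetical time to stay in full power after an event */

typedef enum { MODEL_POWER_LOW, MODEL_POWER_FULL } model_power_state_t;

typedef struct {
    model_power_state_t state;
    uint32_t deadline_ms; /* time at which the full power timer expires */
} model_power_ctx_t;

/* A full power event (for example a wake word detection) restarts the timer. */
static void model_full_power_event(model_power_ctx_t *ctx, uint32_t now_ms)
{
    assert(ctx != NULL);
    ctx->state = MODEL_POWER_FULL;
    ctx->deadline_ms = now_ms + MODEL_HOLD_MS;
}

/* Wrap-safe comparison of the current time against the deadline. */
static bool model_timer_expired(const model_power_ctx_t *ctx, uint32_t now_ms)
{
    assert(ctx != NULL);
    return (int32_t)(now_ms - ctx->deadline_ms) >= 0;
}

/* A low power request is accepted only once the timer has expired. */
static bool model_request_low_power(model_power_ctx_t *ctx, uint32_t now_ms)
{
    if (!model_timer_expired(ctx, now_ms)) {
        return false; /* rejected: a full power event happened too recently */
    }
    ctx->state = MODEL_POWER_LOW;
    return true;
}
```

In this model a request arriving 2 seconds after a wake word is rejected, while the same request arriving after the hold time is accepted.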
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$src/wakeword£££doc/programming_guide/low_power_ffd/software_desc/wakeword.html#src-wakeword
This folder contains the wake word recognition functionality for the Low Power FFD application.
Low Power FFD wakeword

| Filename/Directory | Description |
| --- | --- |
| wakeword.c | The wake word engine source file. Responsible for the transfer of audio samples into the ASR and handling of wake word detection events. |
| wakeword.h | The wake word engine header file. |
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Major Components£££doc/programming_guide/low_power_ffd/software_desc/wakeword.html#major-components
The wakeword module provides the application with two API functions:
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$wakeword_init£££doc/programming_guide/low_power_ffd/software_desc/wakeword.html#wakeword-init
This function performs the required initialization for the wakeword_handler() function to
operate. This involves initializing an instance of devmem_manager_t for use by the ASR abstraction
layer and initialization of the ASR unit itself. It is to be called once during startup before any
call to wakeword_handler() occurs.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$wakeword_handler£££doc/programming_guide/low_power_ffd/software_desc/wakeword.html#wakeword-handler
This function performs wake word detection logic and reports back to the caller a result, indicating
whether a wake word was recognized. Note: this routine is called by audio_pipeline_output(), meaning
this routine’s logic should be kept to a minimum to ensure timing requirements are met.
In this implementation a single wake word ID of 1 is defined. Minimal adaptation is needed to support
other models supporting other IDs or more than one valid wake word.
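One possible adaptation for models with more than one valid wake word is a small lookup table of accepted IDs. The sketch below is an assumption about how this could be structured; valid_wakeword_ids and wakeword_id_is_valid are hypothetical names, and only the ID value 1 comes from this implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* IDs treated as wake words. Only 1 is defined in this example design;
 * further IDs would be appended for other models (assumption). */
static const int32_t valid_wakeword_ids[] = { 1 };

#define NUM_WAKEWORD_IDS (sizeof(valid_wakeword_ids) / sizeof(valid_wakeword_ids[0]))

static_assert(NUM_WAKEWORD_IDS >= 1, "at least one wake word ID is required");

/* Returns true if the ID reported by the ASR is one of the accepted wake words. */
static bool wakeword_id_is_valid(int32_t id)
{
    for (size_t i = 0; i < NUM_WAKEWORD_IDS; i++) {
        if (valid_wakeword_ids[i] == id) {
            return true;
        }
    }
    return false;
}
```

A linear scan over a handful of IDs keeps the per-frame cost small, which matters because the handler runs in the audio pipeline output path.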
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Software Modifications£££doc/programming_guide/low_power_ffd/software_modifications.html#software-modifications
The Low Power FFD example design consists of four major software blocks: the audio pipeline,
ASR engine (wake word and intent engines), intent handler, and power control. This section will go
into detail on how to replace each subsystem.
It is highly recommended to be familiar with the application as a whole before attempting to replace
these functional units. This information can be found in Software Description.
See Software Description for more details on the memory footprint and
CPU usage of the major software components.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Replacing XCORE-VOICE DSP Block£££doc/programming_guide/low_power_ffd/software_modifications.html#replacing-xcore-voice-dsp-block
The audio pipeline can be replaced by making changes to the audio_pipeline.c file.
It is up to the user to ensure that the input and output frames of the audio pipeline remain the
same, or the remainder of the application will not function properly.
This section will walk through an example of replacing the XMOS NS stage with a custom stage, foo.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Declaration and Definition of DSP Context£££doc/programming_guide/low_power_ffd/software_modifications.html#declaration-and-definition-of-dsp-context
typedef struct foo_stage_ctx {
    /* Your required state context here */
} foo_stage_ctx_t;

static foo_stage_ctx_t foo_stage_state = {};
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$DSP Function£££doc/programming_guide/low_power_ffd/software_modifications.html#dsp-function
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Runtime Initialization£££doc/programming_guide/low_power_ffd/software_modifications.html#runtime-initialization
Replace:
XMOS NS (audio_pipeline.c)
ns_init(&ns_stage_state.state);
With:
Foo (audio_pipeline.c)
foo_init(&foo_stage_state.state);
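As an illustration of what a custom stage might provide, the sketch below defines a state type, an initializer matching the foo_init(&foo_stage_state.state) call above, and a per-frame processing function. The fixed-gain behaviour, the foo_state_t layout and the foo_process name are hypothetical; a real replacement stage must keep the frame format expected by the rest of the pipeline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state for the custom stage: a fixed gain in Q30 format. */
typedef struct {
    int32_t gain_q30;
} foo_state_t;

typedef struct foo_stage_ctx {
    foo_state_t state;
} foo_stage_ctx_t;

/* One-time initialization, called before the pipeline starts. */
static void foo_init(foo_state_t *s)
{
    assert(s != NULL);
    s->gain_q30 = INT32_C(1) << 29; /* 0.5 in Q30 */
}

/* Process one frame in place, so the frame size and layout stay unchanged. */
static void foo_process(foo_state_t *s, int32_t *frame, size_t len)
{
    assert(s != NULL && frame != NULL);
    for (size_t i = 0; i < len; i++) {
        int64_t scaled = (int64_t)frame[i] * s->gain_q30;
        frame[i] = (int32_t)(scaled / (INT64_C(1) << 30));
    }
}
```

Processing in place is one simple way to guarantee that the stage's output frame has the same shape as its input.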
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Audio Pipeline Setup£££doc/programming_guide/low_power_ffd/software_modifications.html#audio-pipeline-setup
It is also possible to add or remove stages. Refer to the RTOS Framework documentation on the
generic pipeline sw_service.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Replacing ASR Engine Block£££doc/programming_guide/low_power_ffd/software_modifications.html#replacing-asr-engine-block
Replacing the keyword spotter engine has the potential to require significant changes due to various
feature extraction input requirements and varied output logic.
The generic intent engine API only requires two functions be declared:
Intent API (intent_engine.h)
/* Generic interface for intent engines */
int32_t intent_engine_create(uint32_t priority, void *args);
int32_t intent_engine_sample_push(asr_sample_t *buf, size_t frames);
Refer to the existing Sensory model implementation for details on how the output handler is set up,
how the audio is conditioned to the expected model format, and how it receives frames from the audio
pipeline.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Replacing Example Design Interfaces£££doc/programming_guide/low_power_ffd/software_modifications.html#replacing-example-design-interfaces
It may be desirable to use a different output interface to talk to a host, or to have no host at all
and handle the intent locally on the XCORE device.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Different Peripheral IO£££doc/programming_guide/low_power_ffd/software_modifications.html#different-peripheral-io
To add or remove a peripheral IO, modify the bsp_config accordingly. Refer to documentation inside
the RTOS Framework on how to instantiate different RTOS peripheral drivers.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Direct Control£££doc/programming_guide/low_power_ffd/software_modifications.html#direct-control
In a single controller system, the XCORE can be used to control peripherals directly.
The proc_keyword_res task can be modified as follows:
Intent Handler (intent_handler.c)
static void proc_keyword_res(void *args)
{
    QueueHandle_t q_intent = (QueueHandle_t) args;
    int32_t id = 0;

    while (1) {
        xQueueReceive(q_intent, &id, portMAX_DELAY);
        /* User logic here */
    }
}
This code example will receive the ID of each intent, and can be populated by any user application
logic. User logic can use other RTOS drivers to control various peripherals, such as screens,
motors, lights, etc, based on the intent engine outputs.
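As a sketch of what the user logic could look like, the function below maps two intent IDs to a light on/off state. The IDs 7 and 10 are taken from the Command Dictionary of the provided command model; the function name, the enum names and the idea of tracking state in a bool are assumptions, and a real application would call an RTOS GPIO driver instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Intent IDs from the provided command model's Command Dictionary. */
enum {
    INTENT_LIGHTS_ON = 7,   /* "Switch on the lights" */
    INTENT_LIGHTS_OFF = 10  /* "Switch off the lights" */
};

/* Hypothetical handler: updates the light state for the intents it knows
 * about and leaves it unchanged for all others. */
static void lights_apply_intent(int32_t id, bool *lights_on)
{
    assert(lights_on != NULL);
    switch (id) {
    case INTENT_LIGHTS_ON:
        *lights_on = true;
        break;
    case INTENT_LIGHTS_OFF:
        *lights_on = false;
        break;
    default:
        break; /* not a lights intent */
    }
}
```

Inside proc_keyword_res, such a handler would be called with the ID received from the queue on each loop iteration.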
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Replacing Example Power Control Logic£££doc/programming_guide/low_power_ffd/software_modifications.html#replacing-example-power-control-logic
Depending on the peripherals used in the end application, the requirements and handling of the
power control/state logic may need adaptation. The power control logic runs in a task that uses a
state machine common to both tiles. During steady state, each tile is expected to
remain in the same state. During transitions, each tile executes its own state transition logic.
The functions below may need adaptation for a given application.
Locking drivers (power_control.c)
static void driver_control_lock(void)
{
#if ON_TILE(POWER_CONTROL_TILE_NO)
    rtos_osal_mutex_get(&gpio_ctx_t0->lock, RTOS_OSAL_WAIT_FOREVER);
#else
    rtos_osal_mutex_get(&qspi_flash_ctx->mutex, RTOS_OSAL_WAIT_FOREVER);
    /* User logic here */
#endif
}
Unlocking drivers (power_control.c)
static void driver_control_unlock(void)
{
#if ON_TILE(POWER_CONTROL_TILE_NO)
    rtos_osal_mutex_put(&gpio_ctx_t0->lock);
#else
    /* User logic here */
    rtos_osal_mutex_put(&qspi_flash_ctx->mutex);
#endif
}
This implementation also includes function calls that are for evaluation/diagnosis purposes and may
be removed for end applications. This includes calls to:
led_indicate_awake
led_indicate_asleep
When removing these calls, the associated call to power_control_ind_complete must either be moved
to another location in the application (this is currently handled in led.c’s led_task) or logic
associated with TASK_NOTIF_MASK_LP_IND_COMPLETE should be removed/disabled. The power_control_ind_complete
routine provides a basic means for the power control task to wait for another asynchronous process
to complete before proceeding with the state transition logic.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Speech Recognition£££doc/programming_guide/low_power_ffd/speech_recognition.html#speech-recognition
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$License£££doc/programming_guide/low_power_ffd/speech_recognition.html#license
The Sensory TrulyHandsFree™ (THF) speech recognition library is Copyright (C) 1995-2022 Sensory Inc., All Rights Reserved.
Sensory THF software requires a commercial license granted by Sensory Inc.
This software ships with an expiring development license. It will suspend recognition after 11.4 hours
or 107 recognition events.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Overview£££doc/programming_guide/low_power_ffd/speech_recognition.html#overview
The Sensory THF speech recognition engine runs proprietary models to identify keywords in an audio stream. Models can be generated using VoiceHub.
Two models are provided for the purpose of Low Power FFD. The small wake word model running on tile 1
is approximately 67KB. The command model running on tile 0 is approximately 289KB. On tile 1, the
Sensory runtime and application supporting code consumes approximately 239KB of SRAM. On tile 0, the
Sensory runtime and application supporting code consumes approximately 210KB of SRAM.
With the command model in flash, the Sensory engine requires a core frequency of at least 450 MHz to
keep up with real time. Additionally, the intent engine that is responsible for processing the
commands must be on the same tile as the flash.
To run with a different model, see the Set Sensory model variables section of the low_power_ffd.cmake file. Several variables set there point to files that are part of the VoiceHub-generated model download. Change these variables to point to the files you downloaded. This can be done for both the wake word and command models. Because the command model “net.bin” file is placed in flash memory, it must first be nibble swapped. A utility for this is included in the host applications built during install. Run that application with the following command:
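Independently of the utility's exact invocation, the transformation itself can be sketched in a few lines. This assumes that "nibble swapped" means exchanging the high and low 4 bits of every byte; nibble_swap_buffer is a hypothetical name and this is not the source of the provided utility.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swap the high and low nibbles of each byte in place (assumed meaning of
 * "nibble swap"; not the code of the provided host utility). */
static void nibble_swap_buffer(uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        buf[i] = (uint8_t)((buf[i] << 4) | (buf[i] >> 4));
    }
}
```

Under this definition, applying the swap twice restores the original data, so the operation is its own inverse.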
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Wake Word Dictionary£££doc/programming_guide/low_power_ffd/speech_recognition.html#wake-word-dictionary
English Language Wake Words

| Return code (decimal) | Utterance |
| --- | --- |
| 1 | Hello XMOS |
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Command Dictionary£££doc/programming_guide/low_power_ffd/speech_recognition.html#command-dictionary
English Language Commands

| Return code (decimal) | Utterance |
| --- | --- |
| 1 | Switch on the TV |
| 2 | Channel up |
| 3 | Channel down |
| 4 | Volume up |
| 5 | Volume down |
| 6 | Switch off the TV |
| 7 | Switch on the lights |
| 8 | Brightness up |
| 9 | Brightness down |
| 10 | Switch off the lights |
| 11 | Switch on the fan |
| 12 | Speed up the fan |
| 13 | Slow down the fan |
| 14 | Set higher temperature |
| 15 | Set lower temperature |
| 16 | Switch off the fan |
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Low Power Far-field Voice Local Command$$$Modifying the Software$$$Application Integration£££doc/programming_guide/low_power_ffd/speech_recognition.html#application-integration
In-depth information on out-of-the-box integration can be found in Host Integration.
This is the XCORE-VOICE far-field voice assistant example design.
This application can be used out of the box as a voice processor solution, or expanded to run local wakeword engines.
This application features a full duplex acoustic echo cancellation stage, which can be provided reference audio via I2S or USB audio. An audio output ASR stream is also available via I2S or USB audio.
By default, there are two audio integration options. The INT (Integrated) configuration uses I2S for reference and output audio streams. The UA (USB Accessory) configuration uses USB UAC 2.0 for reference and output audio streams.
Connect the xTAG to the debug header, as shown below.
Connect the micro USB XTAG4 and micro USB XK-VOICE-L71 to the programming host.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Deploying the Firmware with Linux or macOS£££doc/programming_guide/ffva/deploying/linux_macos.html#deploying-the-firmware-with-linux-or-macos
This document explains how to deploy the software using CMake and Make.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Deploying the Firmware with Linux or macOS$$$Building the Host Applications£££doc/programming_guide/ffva/deploying/linux_macos.html#building-the-host-applications
This application requires a host application to create the flash data partition. Run the following commands in the root folder to build the host application using your native Toolchain:
Note
Permissions may be required to install the host applications.
cmake -B build_host
cd build_host
make install
The host applications will be installed at /opt/xmos/bin, and may be moved if desired. You may wish to add this directory to your PATH variable.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Deploying the Firmware with Linux or macOS$$$Building the Firmware£££doc/programming_guide/ffva/deploying/linux_macos.html#building-the-firmware
After having your python environment activated, run the following commands in the root folder to build the I2S firmware:
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Deploying the Firmware with Linux or macOS$$$Running the Firmware£££doc/programming_guide/ffva/deploying/linux_macos.html#running-the-firmware
Before the firmware is run, the filesystem must be loaded.
Inside of the build folder root, after building the firmware, run one of:
make flash_app_example_ffva_int_fixed_delay
make flash_app_example_ffva_int_cyberon_fixed_delay
make flash_app_example_ffva_ua_adec_altarch
Once flashed, the application will run.
After the filesystem has been flashed once, the application can be run without flashing. If changes are made to the filesystem image, the application must be reflashed.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Deploying the Firmware with Linux or macOS$$$Upgrading the Firmware£££doc/programming_guide/ffva/deploying/linux_macos.html#upgrading-the-firmware
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Deploying the Firmware with Linux or macOS$$$UA variant£££doc/programming_guide/ffva/deploying/linux_macos.html#ua-variant
The UA variants of this application contain DFU over the USB DFU Class V1.1 transport method.
To create an upgrade image from the build folder run:
make create_upgrade_img_example_ffva_ua_adec_altarch
Once the application is running, a USB DFU v1.1 tool can be used to perform various actions. This example will demonstrate with dfu-util commands. Installation instructions for the respective operating systems can be found here.
The DFU interprets the flash as 3 separate partitions, the read only factory image, the read/write upgrade image, and the read/write data partition containing the filesystem.
The factory image can be read back by running:
dfu-util -e -d ,20b1:4001 -a 0 -U readback_factory_img.bin
The factory image cannot be written to.
From the build folder, the upgrade image can be written by running:
dfu-util -e -d ,20b1:4001 -a 1 -D example_ffva_ua_adec_altarch_upgrade.bin
The upgrade image can be read back by running:
dfu-util -e -d ,20b1:4001 -a 1 -U readback_upgrade_img.bin
On system reboot, the upgrade image will always be loaded if valid. If the upgrade image is invalid, the factory image will be loaded. To revert back to the factory image, you can upload a file containing the word 0xFFFFFFFF.
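A file containing the single word 0xFFFFFFFF can be produced with any tool that writes four 0xFF bytes. The sketch below shows one way to do it; write_revert_image and the output file name used with it are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Write a 4-byte file containing the word 0xFFFFFFFF.
 * Returns 0 on success and -1 on failure. */
static int write_revert_image(const char *path)
{
    static const uint8_t word[4] = { 0xFF, 0xFF, 0xFF, 0xFF };

    assert(path != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL) {
        return -1;
    }
    int ok = (fwrite(word, 1, sizeof word, f) == sizeof word);
    if (fclose(f) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1;
}
```

Because all four bytes are identical, the result does not depend on byte order. The resulting file would then presumably be written with the same dfu-util command used for an upgrade image.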
The data partition image can be read back by running:
dfu-util -e -d ,20b1:4001 -a 2 -U readback_data_partition_img.bin
The data partition image can be written by running:
dfu-util -e -d ,20b1:4001 -a 2 -D readback_data_partition_img.bin
Note that the data partition will always be at the address specified in the initial flashing call.
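For host tooling that wraps these commands, the three partitions and their access rules can be captured in a small table. The sketch below uses the alternate setting numbers passed to dfu-util's -a option above; the dfu_alt_t type and dfu_alt_is_writable function are hypothetical names, not part of the firmware or of dfu-util.

```c
#include <assert.h>
#include <stdbool.h>

/* Alternate settings as used with "dfu-util -a" in the commands above. */
typedef enum {
    DFU_ALT_FACTORY = 0, /* read only factory image */
    DFU_ALT_UPGRADE = 1, /* read/write upgrade image */
    DFU_ALT_DATA = 2,    /* read/write data partition (filesystem) */
    DFU_ALT_COUNT
} dfu_alt_t;

static_assert(DFU_ALT_COUNT == 3, "the DFU exposes three partitions");

/* Returns true if the partition accepts a download (-D). */
static bool dfu_alt_is_writable(dfu_alt_t alt)
{
    switch (alt) {
    case DFU_ALT_UPGRADE:
    case DFU_ALT_DATA:
        return true;
    case DFU_ALT_FACTORY:
    default:
        return false;
    }
}
```

A wrapper script or tool could use such a check to refuse a download to alternate setting 0 before invoking dfu-util.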
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Deploying the Firmware with Linux or macOS$$$INT variant£££doc/programming_guide/ffva/deploying/linux_macos.html#int-variant
The INT variants of this application contain DFU over I2C.
To create an upgrade image from the build folder run:
make create_upgrade_img_example_ffva_int_fixed_delay
Once the application is running, the xvf_dfu tool can be used to perform various actions. Installation instructions for Raspbian OS can be found here.
Before running the xvf_dfu host application, the I2C_ADDRESS value in the file transport_config.yaml located in the same folder as the binary file xvf_dfu must be updated. This value must match the one set for appconf_CONTROL_I2C_DEVICE_ADDR in the platform_conf.h file.
The DFU interprets the flash as 3 separate partitions, the read only factory image, the read/write upgrade image, and the read/write data partition containing the filesystem.
The factory image can be read back by running:
xvf_dfu --upload-factory readback_factory_img.bin
The factory image cannot be written to.
From the build folder, the upgrade image can be written by running:
On system reboot, the upgrade image will always be loaded if valid. If the upgrade image is invalid, the factory image will be loaded. To revert back to the factory image, you can upload a file containing the word 0xFFFFFFFF.
The FFVA-INT variants include some version numbers:
APP_VERSION_MAJOR
APP_VERSION_MINOR
APP_VERSION_PATCH
These values are defined in the app_conf.h file, and they can be read by running:
xvf_dfu --version
The data partition image cannot be read or written using the xvf_dfu host application.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Deploying the Firmware with Linux or macOS$$$Debugging the Firmware£££doc/programming_guide/ffva/deploying/linux_macos.html#debugging-the-firmware
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Deploying the Firmware with Native Windows£££doc/programming_guide/ffva/deploying/native_windows.html#deploying-the-firmware-with-native-windows
This document explains how to deploy the software using CMake and Ninja. If you are not using native Windows MSVC build tools and instead using a Linux emulation tool, refer to Deploying the Firmware with Linux or macOS.
To install Ninja follow install instructions at https://ninja-build.org/ or on Windows
install with winget by running the following commands in PowerShell:
# Install
winget install Ninja-build.ninja
# Reload user Path
$env:Path=[System.Environment]::GetEnvironmentVariable("Path","User")
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Deploying the Firmware with Native Windows$$$Building the Host Applications£££doc/programming_guide/ffva/deploying/native_windows.html#building-the-host-applications
This application requires a host application to create the flash data partition. Run the following commands in the root folder to build the host application using your native Toolchain:
Note
Permissions may be required to install the host applications.
Note
A C/C++ compiler, such as Visual Studio or MinGW, must be included in the path.
Before building the host application, you will need to add the path to the XTC Tools to your environment.
The host applications will be installed at %USERPROFILE%\.xmos\bin, and may be moved if desired. You may wish to add this directory to your PATH variable.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Deploying the Firmware with Native Windows$$$Building the Firmware£££doc/programming_guide/ffva/deploying/native_windows.html#building-the-firmware
After having your python environment activated, run the following commands in the root folder to build the I2S firmware:
After the filesystem has been flashed once, the application can be run without flashing. If changes are made to the filesystem image, the application must be reflashed.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Deploying the Firmware with Native Windows$$$Upgrading the Firmware£££doc/programming_guide/ffva/deploying/native_windows.html#upgrading-the-firmware
The UA variants of this application contain DFU over the USB DFU Class V1.1 transport method.
In this section DFU over I2C for the INT variants is not covered. The INT variants require an I2C connection to the host, and Windows doesn’t support this feature.
To create an upgrade image from the build folder run:
Once the application is running, a USB DFU v1.1 tool can be used to perform various actions. This example will demonstrate with dfu-util commands. Installation instructions for the respective operating systems can be found here.
The DFU interprets the flash as 3 separate partitions, the read only factory image, the read/write upgrade image, and the read/write data partition containing the filesystem.
The factory image can be read back by running:
dfu-util -e -d ,20b1:4001 -a 0 -U readback_factory_img.bin
The factory image cannot be written to.
From the build folder, the upgrade image can be written by running:
dfu-util -e -d ,20b1:4001 -a 1 -D example_ffva_ua_adec_altarch_upgrade.bin
The upgrade image can be read back by running:
dfu-util -e -d ,20b1:4001 -a 1 -U readback_upgrade_img.bin
On system reboot, the upgrade image will always be loaded if valid. If the upgrade image is invalid, the factory image will be loaded. To revert back to the factory image, you can upload a file containing the word 0xFFFFFFFF.
The data partition image can be read back by running:
dfu-util -e -d ,20b1:4001 -a 2 -U readback_data_partition_img.bin
The data partition image can be written by running:
dfu-util -e -d ,20b1:4001 -a 2 -D readback_data_partition_img.bin
Note that the data partition will always be at the address specified in the initial flashing call.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Deploying the Firmware with Native Windows$$$Debugging the Firmware£££doc/programming_guide/ffva/deploying/native_windows.html#debugging-the-firmware
This example design can be integrated with existing solutions or modified to be a single controller solution.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Modifying the Software$$$Out of the Box Integration£££doc/programming_guide/ffva/design.html#out-of-the-box-integration
Out of the box integration varies based on configuration.
INT requires I2S connections to the host. Refer to the schematic, connecting the host reference audio playback to the ADC I2S and the host input audio to the DAC I2S. Out of the box, the INT configuration requires an externally generated MCLK of 12.288 MHz. 24.576 MHz is also supported and can be changed via the compile option MIC_ARRAY_CONFIG_MCLK_FREQ, found in ffva_int.cmake.
UA requires a USB connection to the host.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Modifying the Software$$$Support for ASR engine£££doc/programming_guide/ffva/design.html#support-for-asr-engine
The example_ffva_int_cyberon_fixed_delay configuration provides an example of how to include an ASR engine, the Cyberon DSPotter™.
Most of the considerations made in the section about the FFD devices are still valid for the FFVA example. The only notable difference is that the pipeline output in the FFVA example
is on the same tile as the ASR engine, i.e. tile 0.
Note
Both the audio pipeline and the ASR engine process use the same sample block length. appconfINTENT_SAMPLE_BLOCK_LENGTH and appconfAUDIO_PIPELINE_FRAME_ADVANCE are both 240.
The application consists of a PDM microphone input which is fed through the XCORE-VOICE DSP blocks. The output ASR channel is then output over I2S or USB.
The DFU process is internally managed by the DFU controller module within the firmware.
This module is tasked with overseeing the DFU state machine and executing DFU operations.
The states and transitions are represented in the diagram in Fig. 1. Note the following about this implementation:
the appIDLE and appDETACH states are not implemented, and the device is started in the dfuIDLE state
the device goes into the dfuIDLE state when a SET_ALTERNATE message is received
the device is rebooted when a DFU_DETACH command is received.
The DFU allows the following operations:
download of an upgrade image to the device
upload of factory and upgrade images from the device
reboot of the device.
The rest of this section describes the message sequence charts of the supported operations.
A message sequence chart of the download operation is below:
Message sequence chart of the download operation
Note
The end of the image transfer is indicated by a DFU_DNLOAD message of size 0.
Note
The DFU_DETACH message is used to trigger the reboot.
Note
For the I2C implementation, specification of the block number in download is not supported; all downloads must start with block number 0 and must be run to completion. The device will track this progress internally.
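On the host side, a download over I2C amounts to sending the image in order, block by block, followed by a zero-length DFU_DNLOAD. The sketch below packs one payload using the layout given in the DFU commands table (a 2-byte little-endian length followed by a 128-byte data buffer) and computes how many packets a complete download needs. The function names are hypothetical, and zero padding is just one choice for the unused bytes, which may hold arbitrary data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DNLOAD_DATA_LEN 128u                       /* data buffer size per packet */
#define DNLOAD_PAYLOAD_LEN (2u + DNLOAD_DATA_LEN)  /* length marker + data buffer */

/* Build one fixed-length DFU_DNLOAD payload carrying len bytes of data. */
static void dnload_pack(uint8_t payload[DNLOAD_PAYLOAD_LEN],
                        const uint8_t *data, uint16_t len)
{
    assert(len <= DNLOAD_DATA_LEN);
    assert(data != NULL || len == 0);
    memset(payload, 0, DNLOAD_PAYLOAD_LEN);   /* padding value is arbitrary */
    payload[0] = (uint8_t)(len & 0xFFu);      /* low byte of the length */
    payload[1] = (uint8_t)(len >> 8);         /* high byte of the length */
    if (len > 0) {
        memcpy(&payload[2], data, len);
    }
}

/* Packets needed for a full download: the data blocks plus the final
 * zero-length DFU_DNLOAD that marks the end of the transfer. */
static size_t dnload_packet_count(size_t image_len)
{
    return (image_len + DNLOAD_DATA_LEN - 1u) / DNLOAD_DATA_LEN + 1u;
}
```

For example, a 300-byte image needs three data packets (128, 128 and 44 bytes) and one terminating packet of length 0.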
A message sequence chart of the reboot operation is below:
Message sequence chart of the reboot operation
Note
The DFU_DETACH message is used to trigger the reboot.
A message sequence chart of the upload operation is below:
Message sequence chart of the upload operation
Note
The end of the image transfer is indicated by a DFU_UPLOAD message of size less than the transport medium maximum; this is 4096 bytes in UA and 128 bytes in INT.
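The short-transfer rule can be expressed as a pair of small helpers. This is a sketch of host-side logic under the assumption that an image whose size is an exact multiple of the maximum is terminated by one additional zero-length transfer; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define UPLOAD_MAX_UA 4096u  /* transport maximum for the UA variant */
#define UPLOAD_MAX_INT 128u  /* transport maximum for the INT variant */

/* A DFU_UPLOAD response shorter than the transport maximum ends the transfer. */
static bool upload_chunk_is_last(size_t chunk_len, size_t max_chunk)
{
    assert(max_chunk > 0);
    return chunk_len < max_chunk;
}

/* Number of DFU_UPLOAD transfers needed to read image_len bytes, including
 * the final short (possibly zero-length) transfer. */
static size_t upload_transfer_count(size_t image_len, size_t max_chunk)
{
    assert(max_chunk > 0);
    return image_len / max_chunk + 1u;
}
```

With the INT maximum of 128 bytes, a 300-byte image is read in three transfers of 128, 128 and 44 bytes.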
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Modifying the Software$$$DFU over USB implementation£££doc/programming_guide/ffva/design.html#dfu-over-usb-implementation
The UA variant of the device makes use of a USB connection for handling DFU operations.
This interface is a relatively standard, specification-compliant implementation.
The implementation is encapsulated within the tinyUSB library, which provides a USB stack for the sln_voice.
XCORE ® -VOICE Solutions$$$XCORE-VOICE Programming Guide$$$Example Designs$$$Far-field Voice Assistant$$$Modifying the Software$$$DFU over I2C implementation£££doc/programming_guide/ffva/design.html#dfu-over-i2c-implementation
The INT variant of the device presents a DFU interface that may be controlled
over I2C.
Fig. 5 shows the modules involved in
processing the DFU commands. The I2C task has a dedicated logical core so that it is always ready
to receive and send control messages. The DFU state machine is driven by the control commands. The DFU state
machine interacts with a separate RTOS task in
order to asynchronously perform flash read/write operations.
sln_voice Control Plane Components Diagram
Fig. 6 shows the interaction
between the Device Control module and the DFU Servicer.
In this diagram, boxes with the same colour reside in the same RTOS task.
sln_voice Device Control – Servicer Flow Chart
This diagram shows a critical aspect of the DFU control operation.
The Device Control module, having placed a command on a Servicer’s command
queue, waits on the Gateway queue for a response.
As a result, it ensures processing of a single control command at a time.
Limiting DFU control operation to a single command in-flight reduces the
complexity of the control protocol and eliminates several potential error
cases.
The FFVA-INT uses a packet protocol to receive control commands and send each
corresponding response.
Because packet transmission occurs over a very short-haul transport, as in
I2C, the protocol does not include fields for error detection or correction such as start-of-frame and
end-of-frame symbols, a cyclical redundancy check or an error correcting code.
Fig. 7 depicts the structure of each packet.
sln_voice Control Plane Packet Diagram
Packets containing a response from the FFVA-INT to the host application place
a status value in the first byte of the payload.
Mirroring the USB DFU specification, the INT DFU implementation supports a set of 9
control commands intended to drive the state machine, along with an additional 2
utility commands:
DFU commands

| Name | ID | Length | Payload Structure | Purpose |
|------|----|--------|-------------------|---------|
| DFU_DETACH | 0 | 1 | Payload unused | Write-only command. Restarts the device. Payload is required for protocol, but is discarded within the device. This command has a defined purpose in the USB DFU specification, but in a deviation to that specification it is used with I2C simply to reboot the device. Future versions of the XMOS DFU-by-device-control protocol (but not future versions of this product) may choose to alter the function of this command to more closely align with the USB DFU specification. |
| DFU_DNLOAD | 1 | 130 | 2 bytes length marker, followed by 128 bytes of data buffer | Write-only command. The first two bytes indicate how many bytes of data are being transmitted in this packet. These bytes are little-endian, so byte 0 represents the low byte and byte 1 represents the high byte of an unsigned 16b integer. The remaining 128 bytes are a data buffer for transfer to the device. All control command packets are a fixed length, and therefore all 128 bytes must be included in the command, even if unused. For example, a payload with length of 100 should have the first 100 bytes of data set, but must send an additional 28 bytes of arbitrary data. |
| DFU_UPLOAD | 2 | 130 | 2 bytes length marker, followed by 128 bytes of data buffer | Read-only command. The first two bytes indicate how many bytes of data are being transmitted in this packet. These bytes are little-endian, so byte 0 represents the low byte and byte 1 represents the high byte of an unsigned 16b integer. The remaining 128 bytes are a data buffer of data received from the device. All control command packets are a fixed length, and therefore this buffer will be padded to length 128 by the device before transmission. The device will, as per the USB DFU specification, mark the end of the upload process by sending a “short frame” - a packet with a length marker less than 128 bytes. |
| DFU_GETSTATUS | 3 | 5 | 1 byte representing device status, 3 bytes representing the requested timeout, 1 byte representing the next device state. | Read-only command. The first byte returns the device status code, as described in the USB DFU specification in the table in section 6.1.2. The next 3 bytes represent the amount of time the host should wait, in ms, before issuing any other commands. This timeout is used in the DNLOAD process to allow the device time to write to flash. This value is little-endian, so bytes 1, 2, and 3 represent the low, middle, and high bytes respectively of an unsigned 24b integer. The final byte returns the number of the state that the device will move into immediately following the return of this request, as described in the USB DFU specification in the table in section 6.1.2. |
| DFU_CLRSTATUS | 4 | 1 | Payload unused | Write-only command. Moves the device out of state 10, dfuERROR. Payload is required for protocol, but is discarded within the device. |
| DFU_GETSTATE | 5 | 1 | 1 byte representing current device state. | Read-only command. The first (and only) byte represents the number of the state that the device is currently in, as described in the USB DFU specification in the table in section 6.1.2. |
| DFU_ABORT | 6 | 1 | Payload unused | Write-only command. Aborts an ongoing upload or download process. Payload is required for protocol, but is discarded within the device. |
| DFU_SETALTERNATE | 64 | 1 | 1 byte representing either factory (0) or upgrade (1) DFU target images | Write-only command. Sets which of the factory or upgrade images should be targeted by any subsequent upload or download commands. Use of this command entirely resets the DFU state machine to initial conditions: the device will move to dfuIDLE, clear all error conditions, wipe all internal DFU data buffers, and reset all other DFU state apart from the DFU_TRANSFERBLOCK value. This command is included to emulate the SET_ALTERNATE request available in USB. |
| DFU_TRANSFERBLOCK | 65 | 2 | 2 bytes, representing the target transfer block for an upload process. | Read/write command. Sets/gets a 2 byte value specifying the transfer block number to use for a subsequent upload operation. A complete image may be conceptually divided into 128-byte blocks. These blocks may then be numbered from 0 upwards. Setting this value sets which block will be returned by a subsequent DFU_UPLOAD request. This value is initialised to 0, and autoincrements after each successful DFU_UPLOAD request has been serviced. Therefore, to read a whole image from the start, there is no need to issue this command - this command need only be used to select a specific section to read. Because this value is automatically incremented after a DFU_UPLOAD command is successfully serviced, reading it will give the value of the next block to be read (and this will be one greater than the previous block read, if it has not been altered in the interim). This value is reset to 0 at the successful completion of a DFU_UPLOAD process. It is not reset after a DFU_ABORT, nor after a DFU_SETALTERNATE call. This command is included to emulate the ability in a USB request to send values in the header of the request - the device control protocol used here does not allow sending any data with a read request such as DFU_UPLOAD. |
| DFU_GETVERSION | 88 | 3 | 3 bytes, representing major.minor.patch version of device | Read-only command. Bytes 0, 1, and 2 represent the major, minor, and patch versions respectively of the device. This is a utility command intended to provide an easy mechanism by which to verify that a firmware download has been successful. |
| DFU_REBOOT | 89 | 1 | Payload unused | Write-only command. Restarts the device. Payload is required for protocol, but is discarded within the device. This is a utility command intended to provide a clear and unambiguous interface for restarting the device. Use of this command should be preferred over DFU_DETACH for this purpose. |
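The fixed-length DFU_DNLOAD and DFU_UPLOAD payloads described in the table can be handled on the host with a few lines of C. The following is a minimal sketch rather than part of the device_control library: the function and macro names are hypothetical, and zero is chosen as the padding value only because the table states that the unused bytes are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DFU_DATA_BUFFER_LEN  128u                        /* data buffer size from the table */
#define DFU_XFER_PAYLOAD_LEN (2u + DFU_DATA_BUFFER_LEN)  /* length marker + buffer = 130 */

/* Hypothetical helper: fill a 130-byte DFU_DNLOAD payload.
 * Returns 0 on success, -1 if the arguments do not fit the fixed buffer. */
int dfu_pack_dnload(uint8_t out[DFU_XFER_PAYLOAD_LEN],
                    const uint8_t *data, size_t len)
{
    assert(out != NULL);
    if (len > DFU_DATA_BUFFER_LEN || (len > 0 && data == NULL)) {
        return -1;
    }
    out[0] = (uint8_t)(len & 0xFFu);         /* byte 0: low byte of the length */
    out[1] = (uint8_t)((len >> 8) & 0xFFu);  /* byte 1: high byte of the length */
    memset(&out[2], 0, DFU_DATA_BUFFER_LEN); /* padding is arbitrary; zero here */
    if (len > 0) {
        memcpy(&out[2], data, len);
    }
    return 0;
}

/* Hypothetical helper: decode the little-endian length marker of a
 * DFU_UPLOAD (or DFU_DNLOAD) payload. */
size_t dfu_payload_length(const uint8_t in[DFU_XFER_PAYLOAD_LEN])
{
    return (size_t)in[0] | ((size_t)in[1] << 8);
}

/* Hypothetical helper: a length marker below 128 is the "short frame"
 * that marks the end of an upload. */
int dfu_upload_is_last(const uint8_t in[DFU_XFER_PAYLOAD_LEN])
{
    return dfu_payload_length(in) < DFU_DATA_BUFFER_LEN;
}
```

A host would call dfu_pack_dnload() once per 128-byte chunk of the image and, when reading an image back, stop issuing DFU_UPLOAD requests once dfu_upload_is_last() reports a short frame.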
When writing a custom compliant host application, the use of XMOS’ fwk_rtos
library is advised; the device_control library provided there gives a host
API that can communicate effectively with the FFVA-INT. A description of the I2C bus activity
during the execution of the above DFU commands is provided below, in the
instance that usage of the device_control library is inconvenient or
impossible.
The FFVA-INT I2C address is set by default as 0x42. This may be
confirmed by examination of the appconf_CONTROL_I2C_DEVICE_ADDR define in the
platform_conf.h file. The I2C address may also be altered by editing this file.
The DFU resource has an internal “resource ID” of 0xF0. This maps to the
register that read/write operations on the DFU resource should target -
therefore, the register to write to will always be 0xF0.
To issue a write command (e.g. DFU_SETALTERNATE):
First, set up a write to the device address. For a default device
configuration, a write operation will always start by a write token to 0x42
(START, 7 bits of address [0x42], R/W bit [0 to specify write]), wait for ACK,
followed by specifying the register to write [Resource ID 0xF0]
(and again wait for ACK).
Then, write the command ID (in this example, 64 [0x40]) from the above table.
Then, write the total transfer size, including the register byte. In this
example, that will be 4 bytes (register byte, command ID, length byte, and 1
byte of payload), so write 0x04.
Finally, send the payload - e.g. 1 to set the alternate setting to “upgrade”.
The full sequence for this write command will therefore be START, 7 bits of
address [0x42], 0 (to specify write), hold for ACK, 0xF0, hold for ACK, 0x40,
hold for ACK, 0x04, hold for ACK, 0x01, hold for ACK, STOP.
To complete the transaction, the device must then be queried; set up a read to
0x42 (START, 7 bits of address [0x42], R/W bit [1 to specify read], wait for
ACK). The device will clock-stretch until it is ready, at which point it will
release the clock and transmit one byte of status information. This will be a
value from the enum control_ret_t from device_control_shared.h,
found in modules\rtos\modules\sw_services\device_control\api.
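As an illustration of the write sequence above, the bytes that follow the address byte can be assembled in a buffer before being handed to whatever I2C master driver the host provides. This is a sketch under stated assumptions: dfu_i2c_build_write() is a hypothetical helper, not an API of the device_control library, and the START, address, ACK and STOP handling is left to the host's I2C driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DFU_RESOURCE_ID 0xF0u  /* register targeted by DFU reads and writes */

/* Hypothetical helper: build the bytes sent after the I2C address byte for a
 * write command. Returns the number of bytes placed in buf, or 0 on error. */
size_t dfu_i2c_build_write(uint8_t *buf, size_t buf_len, uint8_t cmd_id,
                           const uint8_t *payload, size_t payload_len)
{
    /* register byte + command ID + length byte + payload */
    size_t total = 3u + payload_len;

    if (buf == NULL || total > buf_len || total > 0xFFu) {
        return 0;
    }
    if ((cmd_id & 0x80u) != 0 || (payload_len > 0 && payload == NULL)) {
        return 0;  /* bit 7 is reserved for marking read commands */
    }
    buf[0] = DFU_RESOURCE_ID;
    buf[1] = cmd_id;
    buf[2] = (uint8_t)total;  /* total transfer size, including the register byte */
    if (payload_len > 0) {
        memcpy(&buf[3], payload, payload_len);
    }
    return total;
}
```

For DFU_SETALTERNATE (ID 64) with a payload of 1, this yields the four bytes 0xF0, 0x40, 0x04, 0x01, matching the sequence given above.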
To issue a read command (e.g. DFU_GETSTATUS):
Set up a write to the device; as above, this will mean sending START,
7 bits of device address [0x42], 0 (to specify write), hold for ACK. Send the
DFU resource ID [0xF0], hold for ACK.
Then, write the command ID (in this example, 3), bitwise ORed with 0x80 (to
specify this as a read command) - in this example therefore 0x83 should be
sent, and hold for ACK.
Then, write the total length of the expected reply. In this example, the
command has a payload of 5 bytes. The device will also prepend the payload
with a status byte. Therefore, the expected reply length will be 6 bytes
[0x06]. Hold for ACK.
Then, issue a repeated START. Follow this with a read from the device:
the repeated START, 7 bits of device address [0x42], 1 (to specify read), hold
for ACK. The device will clock-stretch until it is ready. It will then send
a status byte (from the enum control_ret_t as described above), followed
by a payload of requested data - in this example, the device will send 5
bytes. ACK each received byte. After the last expected byte, issue a STOP.
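The read sequence can be sketched in the same way. The example below builds the three bytes written before the repeated START (setting bit 7 of the command ID, which gives 0x83 for DFU_GETSTATUS) and decodes the six-byte reply. All names are hypothetical and chosen for illustration; the leading status byte is stored as-is, without interpreting the control_ret_t values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DFU_RESOURCE_ID   0xF0u  /* register targeted by DFU reads and writes */
#define DFU_READ_FLAG     0x80u  /* bit 7 set marks a read command */
#define DFU_GETSTATUS_ID  3u
#define DFU_GETSTATUS_LEN 5u     /* payload length from the DFU commands table */

typedef struct {
    uint8_t  control_status;   /* leading control_ret_t byte */
    uint8_t  dfu_status;       /* device status code */
    uint32_t poll_timeout_ms;  /* 24-bit little-endian timeout */
    uint8_t  next_state;       /* state the device will move into */
} dfu_getstatus_reply_t;

/* Hypothetical helper: bytes written after the address byte of a read command. */
void dfu_i2c_build_read_header(uint8_t hdr[3], uint8_t cmd_id, uint8_t payload_len)
{
    assert(hdr != NULL);
    hdr[0] = DFU_RESOURCE_ID;
    hdr[1] = (uint8_t)(cmd_id | DFU_READ_FLAG);
    hdr[2] = (uint8_t)(payload_len + 1u);  /* payload plus the status byte */
}

/* Hypothetical helper: decode a DFU_GETSTATUS reply (status byte + 5 bytes).
 * Returns 0 on success, -1 if the reply is too short. */
int dfu_parse_getstatus(const uint8_t *reply, size_t reply_len,
                        dfu_getstatus_reply_t *out)
{
    if (reply == NULL || out == NULL || reply_len < DFU_GETSTATUS_LEN + 1u) {
        return -1;
    }
    out->control_status  = reply[0];
    out->dfu_status      = reply[1];
    out->poll_timeout_ms = (uint32_t)reply[2]
                         | ((uint32_t)reply[3] << 8)
                         | ((uint32_t)reply[4] << 16);
    out->next_state      = reply[5];
    return 0;
}
```

After a successful parse, the host would typically wait for poll_timeout_ms before issuing the next command, as described for DFU_GETSTATUS in the table.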
This application features 16 kHz and 48 kHz audio input and output. The XMOS DSP blocks operate on 16 kHz audio. Input streams are downsampled when needed. Output streams are upsampled when needed. In I2S modes, this function is called by the bsp_config to enable the I2S sample rate conversion.
The FFVA example design consists of three major software blocks: the audio interface, the audio pipeline, and a placeholder for a keyword handler. This section goes into detail on how to modify each of these subsystems.
It is highly recommended to be familiar with the application as a whole before attempting to replace these functional units.
See Memory and CPU Requirements for more details on the memory footprint and CPU usage of the major software components.
The audio pipeline can be replaced by making changes to the audio_pipeline.c file.
It is up to the user to ensure that the input and output frames of the audio pipeline remain the same, or the remainder of the application will not function properly.
This section will walk through an example of replacing the XMOS NS stage with a custom stage foo.
Declaration and Definition of DSP Context
It is also possible to add or remove stages. Refer to the RTOS Framework documentation on the generic pipeline sw_service.
Changing the ASR engine
The FFVA provides an example with a specific ASR engine. A different ASR engine can be used by updating and adding the necessary files in modules\asr.
Replacing Example Design Interfaces
It may be desirable to use different input or output interfaces to talk to a host.
One example use case may be to create a hybrid audio solution where reference frames or output audio streams are used over an interface other than I2S or USB.
Audio Pipeline Input (main.c)
void audio_pipeline_input(void *input_app_data,
                          int32_t **input_audio_frames,
                          size_t ch_count,
                          size_t frame_count)
{
    (void) input_app_data;
    int32_t **mic_ptr = (int32_t **)(input_audio_frames + (2 * frame_count));

    static int flushed;
    while (!flushed) {
        size_t received;
        received = rtos_mic_array_rx(mic_array_ctx,
                                     mic_ptr,
                                     frame_count,
                                     0);
        if (received == 0) {
            rtos_mic_array_rx(mic_array_ctx,
                              mic_ptr,
                              frame_count,
                              portMAX_DELAY);
            flushed = 1;
        }
    }

    rtos_mic_array_rx(mic_array_ctx,
                      mic_ptr,
                      frame_count,
                      portMAX_DELAY);

    /* Your ref input source here */
}
Refer to documentation inside the RTOS Framework on how to instantiate different RTOS peripheral drivers. Populate the above code snippet with your input frame source. Refer to the default application for an example of populating reference via I2S or USB.
Audio Pipeline Output (main.c)
int audio_pipeline_output(void *output_app_data,
                          int32_t **output_audio_frames,
                          size_t ch_count,
                          size_t frame_count)
{
    (void) output_app_data;

    /* Your output sink here */

#if appconfWW_ENABLED
    ww_audio_send(intertile_ctx,
                  frame_count,
                  (int32_t(*)[2])output_audio_frames);
#endif

    return AUDIO_PIPELINE_FREE_FRAME;
}
Refer to documentation inside the RTOS Framework on how to instantiate different RTOS peripheral drivers. Populate the above code snippet with your output frame sink. Refer to the default application for an example of outputting the ASR channel via I2S or USB.
To add or remove a peripheral IO, modify the bsp_config accordingly. Refer to documentation inside the RTOS Framework on how to instantiate different RTOS peripheral drivers.
This example is deprecated and will be moved into a separate
Application Note and may be removed in the next major release.
This example provides a bridge from 16 PDM microphones to either a
TDM16 slave or a USB Audio interface and targets the xcore-ai explorer board.
This application is to support cases where many microphone inputs need
to be sent to a host where signal processing will be performed. Please
see the other examples in sln_voice where signal processing is performed
within the xcore in firmware.
This example uses a modified mic_array with multiple decimator threads to
support 16 DDR microphones on a single 8 bit input port. The example is written as
‘bare-metal’ and runs directly on the XCORE device without an RTOS.
It is recommended to use Ninja or xmake as the make system under Windows.
Ninja has been observed to be faster than xmake, however xmake comes natively with XTC tools.
This firmware has been tested with Ninja version v1.11.1.
To install Ninja, activate your python environment, and run the following command:
$ pip install ninja
After having your python environment activated, run the following commands in the root folder to build the firmware:
The design consists of a number of tasks connected via the xcore-ai silicon communication channels.
The decimators in the microphone array are configured to produce a 48 kHz PCM output.
The 16 output channels are loaded into a 16 slot TDM slave peripheral running at 24.576 MHz bit
clock or a USB Audio Class 2 asynchronous interface and are optionally
amplified. The TDM build also provides a simple I2C slave interface to allow
gains to be controlled at run-time. The USB build supports USB Audio Class 2 compliant volume controls.
For the TDM build, a simple TDM16 master peripheral is included as well as a local
24.576 MHz clock source so that mic_array and TDM16 slave operation may be tested
standalone through the use of jumper cables. These may be removed when
integrating into a system with TDM16 master supplied.
The applications are written on bare metal and use logical cores (hardware threads)
to implement the functional blocks. The tasks are connected using channels provided in the
xcore-ai architecture. The thread diagrams are shown in Fig. 8 and Fig. 9.
Both the TDM and USB aggregator examples share a common PDM front end. This consists of an 8 bit port
with each data line connected to two PDM microphones each configured to provide data
on a different clock edge. The 3.072 MHz clock for the PDM microphones is provided by the xcore-ai
device on a 1 bit port and clocks all PDM microphones. The PDM clock is divided down from the 24.576 MHz
local MCLK.
The data collected by the 8 bit port is sent to the lib_mic_array block which de-interleaves
the PDM data streams and performs decimation of the PDM data down to 48 kHz 32 bit PCM samples.
Due to the large number of microphones the PDM capture stage uses four hardware threads on tile[0]; one for the microphone
capture and three for decimation. This is needed to divide the processing workload and meet timing comfortably.
Samples are forwarded to the next stage at a rate of 48 kHz resulting in a packet of 16
PCM samples per exchange.
The 16 channels of 48 kHz PCM streams are collected by Hub and are amplified using a
saturated gain stage. The initial gain is set to 100, since a gain of 1 sounds very
quiet due to the mic_array output being scaled to allow acoustic
overload of the microphones without clipping within the decimators. This value can be
overridden using the MIC_GAIN_INIT define in app_conf.h.
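A saturating gain stage of this kind can be sketched as follows. This is an illustration only, not the code used in the Hub task: it assumes the gain is a plain unsigned integer multiplier applied to 32 bit PCM samples, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: multiply one sample by an integer gain and clip the
 * result to the 32 bit range instead of letting it wrap around. */
int32_t apply_gain_saturating(int32_t sample, uint16_t gain)
{
    int64_t scaled = (int64_t)sample * (int64_t)gain;  /* cannot overflow 64 bits */

    if (scaled > INT32_MAX) {
        return INT32_MAX;
    }
    if (scaled < INT32_MIN) {
        return INT32_MIN;
    }
    return (int32_t)scaled;
}

/* Hypothetical helper: apply a per-channel gain to one frame, where a frame
 * holds one sample for each of the n channels. */
void apply_gain_frame(int32_t *frame, const uint16_t *gains, size_t n)
{
    assert(frame != NULL && gains != NULL);
    for (size_t ch = 0; ch < n; ch++) {
        frame[ch] = apply_gain_saturating(frame[ch], gains[ch]);
    }
}
```

Performing the multiplication in 64 bits and clamping afterwards means that a loud input produces clipping rather than the sign flips that integer overflow would cause.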
Additionally for the TDM configuration, the Hub task also checks for control packets
from I2C which may be used to dynamically update the individual gains at runtime.
A single hardware thread contains the task and a triple buffer scheme is used to ensure there is always
a free buffer available to write into regardless of the relative phase between the production
and consumption of microphone samples.
The Hub task has plenty of timing slack and is a suitable place for adding signal processing
if needed.
The TDM build supports a 16-slot TDM slave Tx peripheral from the fwk_io sub-module. In this application
it runs at 24.576 MHz bit clock which supports 16 channels of 32 bit, 48 kHz samples per frame.
The TDM component uses a single hardware thread.
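The clock figures quoted in this design are related by simple arithmetic, which the sketch below reproduces as a consistency check. The helper names are hypothetical; the input values are the ones stated in the text (16 slots, 32 bit samples, 48 kHz, a 24.576 MHz MCLK and a 3.072 MHz PDM clock).

```c
#include <assert.h>
#include <stdint.h>

/* Bit clock needed to carry `slots` samples of `bits_per_slot` bits in every
 * frame at the given sample rate. */
uint32_t tdm_bit_clock_hz(uint32_t slots, uint32_t bits_per_slot,
                          uint32_t sample_rate_hz)
{
    return slots * bits_per_slot * sample_rate_hz;
}

/* Integer divider that derives the PDM clock from MCLK. */
uint32_t pdm_clock_divider(uint32_t mclk_hz, uint32_t pdm_clk_hz)
{
    assert(pdm_clk_hz != 0);
    return mclk_hz / pdm_clk_hz;
}

/* Overall decimation ratio from the 1 bit PDM stream to PCM samples. */
uint32_t pdm_decimation_ratio(uint32_t pdm_clk_hz, uint32_t pcm_rate_hz)
{
    assert(pcm_rate_hz != 0);
    return pdm_clk_hz / pcm_rate_hz;
}
```

With the values above, 16 x 32 x 48000 gives the 24.576 MHz bit clock, the PDM clock is MCLK divided by 8, and the decimators reduce the sample rate by a factor of 64.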
For the purpose of debugging a simple TDM 16 Master Rx component is provided. This allows the transmitted
TDM frames from the application to be received and checked without having to connect an external
TDM Master. It may be deleted / disconnected without affecting the core application.
Note
The simple TDM 16 Master Rx component is not regression tested and is for evaluation of TDM 16 Slave Tx in this application only.
The xcore-ai device has a total resource count of 2 x 524288 Bytes of memory and 2 x 8 hardware threads across two tiles.
This application uses around half of the processing resources and a tiny fraction of the available memory
meaning there is plenty of space inside the chip for additional functionality if needed.
For the TDM build, there are 32 registers which control the gain of each of the 16 output
channels. The 8 bit registers contain the upper 8 bit and lower 8 bit of the
microphone gain respectively. The initial gain is set to 100, since 1 is
quiet due to the mic_array output being scaled to allow acoustic
overload of the microphones without clipping. Typically a gain of a few
hundred works for normal conditions. The gain is only applied after the
lower byte is written.
The gain applied is saturating so no overflow will occur, only clipping.
| Register | Value |
|----------|-------|
| 0 | Channel 0 upper gain byte |
| 1 | Channel 0 lower gain byte |
| 2 | Channel 1 upper gain byte |
| 3 | Channel 1 lower gain byte |
| 4 | Channel 2 upper gain byte |
| 5 | Channel 2 lower gain byte |
| 6 | Channel 3 upper gain byte |
| 7 | Channel 3 lower gain byte |
| 8 | Channel 4 upper gain byte |
| 9 | Channel 4 lower gain byte |
| 10 | Channel 5 upper gain byte |
| 11 | Channel 5 lower gain byte |
| 12 | Channel 6 upper gain byte |
| 13 | Channel 6 lower gain byte |
| 14 | Channel 7 upper gain byte |
| 15 | Channel 7 lower gain byte |
| 16 | Channel 8 upper gain byte |
| 17 | Channel 8 lower gain byte |
| 18 | Channel 9 upper gain byte |
| 19 | Channel 9 lower gain byte |
| 20 | Channel 10 upper gain byte |
| 21 | Channel 10 lower gain byte |
| 22 | Channel 11 upper gain byte |
| 23 | Channel 11 lower gain byte |
| 24 | Channel 12 upper gain byte |
| 25 | Channel 12 lower gain byte |
| 26 | Channel 13 upper gain byte |
| 27 | Channel 13 lower gain byte |
| 28 | Channel 14 upper gain byte |
| 29 | Channel 14 lower gain byte |
| 30 | Channel 15 upper gain byte |
| 31 | Channel 15 lower gain byte |
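The register map and the rule that a gain only takes effect once its lower byte is written can be modelled as shown below. This is a sketch of the behaviour described above, not the firmware's I2C handler: the type and function names are hypothetical, and holding the upper byte in a separate pending field is one possible way to defer the update.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_MIC_CHANNELS 16u
#define NUM_GAIN_REGS    (2u * NUM_MIC_CHANNELS)  /* 32 byte-wide registers */

typedef struct {
    uint8_t  upper_pending[NUM_MIC_CHANNELS];  /* upper bytes written, not yet applied */
    uint16_t gain[NUM_MIC_CHANNELS];           /* gains currently in use */
} gain_regs_t;

/* Hypothetical helper: set every channel to the same initial gain. */
void gain_regs_init(gain_regs_t *regs, uint16_t initial_gain)
{
    assert(regs != NULL);
    for (size_t ch = 0; ch < NUM_MIC_CHANNELS; ch++) {
        regs->gain[ch] = initial_gain;
        regs->upper_pending[ch] = (uint8_t)(initial_gain >> 8);
    }
}

/* Hypothetical helper: handle a one-byte register write from the I2C host.
 * Even registers hold the upper byte and odd registers the lower byte of
 * channel (reg / 2). Returns 0 on success, -1 for an invalid register. */
int gain_regs_write(gain_regs_t *regs, uint8_t reg, uint8_t value)
{
    assert(regs != NULL);
    if (reg >= NUM_GAIN_REGS) {
        return -1;
    }
    size_t ch = reg / 2u;
    if ((reg & 1u) == 0u) {
        regs->upper_pending[ch] = value;  /* stored, but not applied yet */
    } else {
        /* Writing the lower byte commits the full 16 bit gain. */
        regs->gain[ch] = (uint16_t)(((uint16_t)regs->upper_pending[ch] << 8) | value);
    }
    return 0;
}
```

Committing on the lower-byte write means the host should always write the upper byte first, so that the gain never passes through an unintended intermediate value.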
If using a Raspberry Pi as the I2C host you may use the following
commands:
$ i2cset -y 1 0x3c 0 0 #Set the upper gain byte of mic channel 0 to 0
$ i2cset -y 1 0x3c 1 50 #Set the lower gain byte of mic channel 0 to 50; the gain of 50 is applied now
$ i2cget -y 1 0x3c 0 #Get the upper byte of gain on mic channel 0
$ i2cget -y 1 0x3c 1 #Get the lower byte of gain on mic channel 0
$ i2cset -y 1 0x3c 16 1 #Set the upper gain byte of mic channel 8 to 1
$ i2cset -y 1 0x3c 17 0 #Set the lower gain byte of mic channel 8 to 0; the gain of 256 is applied now
This example is based on the RTOS framework and drivers. This choice simplifies the example design, but it leads to high latency in the system.
The main sources of latency are:
Large block size used for ASRC processing: this is necessary to minimise latency associated with the intertile context and thread switching overhead.
Large size of the buffer to which the ASRC output samples are written: a stable level (half full) must be reached before the start of streaming out over USB.
RTOS task scheduling overhead between the tasks.
bInterval of USB in the RTOS drivers is set to 4, i.e. one frame every 1 ms.
Block based implementation of the USB and I2S RTOS drivers.
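The relationship between bInterval and the 1 ms period follows from the USB 2.0 rule for high-speed isochronous endpoints, where the service interval is 2^(bInterval-1) microframes of 125 us each. The helper below is a hypothetical illustration of that calculation and is not part of the RTOS drivers.

```c
#include <assert.h>
#include <stdint.h>

#define USB_HS_MICROFRAME_US 125u  /* duration of one high-speed microframe */

/* Hypothetical helper: service interval in microseconds for a high-speed
 * isochronous endpoint. Returns 0 for values outside the valid range 1..16. */
uint32_t usb_hs_iso_interval_us(uint8_t b_interval)
{
    if (b_interval < 1u || b_interval > 16u) {
        return 0;
    }
    return (1u << (b_interval - 1u)) * USB_HS_MICROFRAME_US;
}
```

For bInterval = 4 this gives 8 microframes, that is 1000 us, which matches the one frame per millisecond stated in the list above.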
The expected latencies for USB at 48 kHz are as follows:
USB -> ASRC -> I2S: from 8 ms at I2S at 192 kHz to 22 ms at 44.1 kHz
I2S -> ASRC -> USB: from 13 ms at I2S at 192 kHz to 19 ms at 44.1 kHz
For a proposed implementation with lower latency, please refer to the bare-metal examples below:
This is the XCORE-VOICE Asynchronous Sampling Rate Converter (ASRC) example design.
The example system implements a stereo I2S Slave and a stereo Adaptive UAC2.0 interface and exchanges data between the two interfaces.
Since the two interfaces are operating in different clock domains, there is an ASRC block between them that converts from the input to the output sampling rate.
There are two ASRC blocks, one each in the I2S -> ASRC -> USB and USB -> ASRC -> I2S path, as illustrated in the ASRC example top level system diagram.
The diagram also shows the rate calculation path, which monitors and computes the instantaneous ratio between the ASRC input and output sampling rate.
The rate ratio is used by the ASRC task to dynamically adapt filter coefficients using spline interpolation in its filtering stage.
ASRC example top level system diagram
The I2S Slave interface is a stereo 32 bit interface supporting sampling rates from 44.1 kHz to 192 kHz.
The USB interface is a stereo, 32 bit, 48 kHz, High-Speed, USB Audio Class 2, Adaptive interface.
The ASRC algorithm implemented in the lib_src library is used for the ASRC processing.
The ASRC processing is block based and works on a block size of 244 samples per channel in the I2S -> ASRC -> USB path and 96 samples per channel in the USB -> ASRC -> I2S path.
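To relate these block sizes to time, the duration of audio covered by one block can be computed from the block length and the input sampling rate. The function below is a hypothetical helper written for this illustration; it only restates the arithmetic and does not model any other source of latency.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: time spanned by one block of samples, in microseconds
 * (rounded down). */
uint32_t asrc_block_duration_us(uint32_t samples_per_channel,
                                uint32_t sample_rate_hz)
{
    assert(sample_rate_hz != 0);
    return (uint32_t)(((uint64_t)samples_per_channel * 1000000u) / sample_rate_hz);
}
```

A 96-sample block at the 48 kHz USB rate spans 2 ms, while a 244-sample I2S block spans roughly 5.5 ms at 44.1 kHz and roughly 1.3 ms at 192 kHz.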
This example application is supported on the XK-VOICE-L71 board.
In addition to the XK-VOICE-L71 board, it requires an XTAG4 to program and debug the device.
To demonstrate the audio exchange between the I2S and USB interface, the XK-VOICE-L71 device needs to be connected to an I2S Master device.
To do this, connect the BCLK, MCLK, DOUT, DIN pins of the RASPBERRY PI HOST INTERFACE header (J4) on the XK-VOICE-L71 to the I2S Master.
The table XK-VOICE-L71 RPI host interface header (J4) connections lists the pins on the XK-VOICE-L71 RPI header and the signals on the I2S Master that they need to be connected to.
It is recommended to use Ninja or xmake as the make system under Windows.
Ninja has been observed to be faster than xmake, however xmake comes natively with XTC tools.
This firmware has been tested with Ninja version v1.11.1.
To install Ninja, activate your python environment, and run the following command:
$ pip install ninja
To build for the first time, activate your python environment, run cmake to create the
make files:
Following initial cmake build, for subsequent builds, as long as new source files are not added, just type:
$ ninja example_asrc_demo.xe
cmake needs to be rerun to discover any new source files added.
Running the app
To run the app, either xrun or xflash can be used. Connect the XK-VOICE-L71 board to the host and type the following
to run with real-time debug output enabled:
$ xrun --xscope example_asrc_demo.xe
or to flash the application so that it always boots after a power cycle:
When the example runs, the audio received by the device on the I2S Slave interface at the I2S interface sampling rate is
sample rate converted using the ASRC to the USB sampling rate and streamed out from the device over the USB interface. Similarly,
the audio streamed out by the USB host into the USB interface of the device is sample rate converted to the I2S interface sampling
rate and streamed out from the device over the I2S Slave interface.
This example supports dynamic changes of the I2S interface sampling frequency at runtime. It detects the I2S sampling rate change and reconfigures
the system for the new rate.
The ASRC demo application is a two tile application developed to run on the XK-VOICE-L71 board running at a core frequency of 600 MHz.
It is a FreeRTOS based application where all the application blocks are implemented as FreeRTOS tasks.
Each tile has 5 bare metal cores dedicated to running RTOS tasks and since all processing is done within RTOS tasks, each core has 120 MHz of bandwidth
available.
The tasks can roughly be categorised as belonging to the USB driver, I2S driver or the application code categories.
The actual ASRC processing happens in four tasks across the two tiles; the usb_audio_out_asrc task, i2s_audio_recv_asrc task, and two instances of asrc_one_channel task, one on each tile.
This is described in more detail in the Application components section below.
Most of the tasks are involved in the ASRC processing data path, while a few are involved in monitoring the input and output data rates
and computing the rate ratio, which is the ratio between the frequencies at the input and output of the ASRC tasks.
The rate ratio is provided to the ASRC tasks every asrc_process_frame() call. Details about the rate ratio calculation are described in the rate_server section below.
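As a simple illustration of what a rate ratio is, the sketch below expresses the ratio of the input to the output sampling frequency as a fixed-point number. The Q4.28 format and the function name are assumptions made for this example only; the representation actually used by the application and by lib_src should be taken from their own sources.

```c
#include <assert.h>
#include <stdint.h>

#define RATE_RATIO_FRAC_BITS 28u  /* assumed Q4.28 format, for illustration only */

/* Hypothetical helper: input rate divided by output rate as a Q4.28 value.
 * Returns 0 if the output rate is 0, and saturates at UINT32_MAX. */
uint32_t rate_ratio_q28(uint32_t in_rate_hz, uint32_t out_rate_hz)
{
    if (out_rate_hz == 0) {
        return 0;
    }
    uint64_t ratio = ((uint64_t)in_rate_hz << RATE_RATIO_FRAC_BITS) / out_rate_hz;
    return (ratio > UINT32_MAX) ? UINT32_MAX : (uint32_t)ratio;
}
```

Equal input and output rates give exactly 1.0 (1 << 28). In a running system the measured rates drift slightly around their nominal values, so the ratio moves by small amounts around the nominal figure.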
This application presents a stereo, 48 kHz, 32 bit, high-speed, Adaptive UAC2.0 USB interface.
It has two endpoints, Endpoint 0 for control and Endpoint 1 for bidirectional isochronous USB audio.
The USB application level driver is TinyUSB based.
The usb_xud_thread, usb_isr, usb_task and usb_adaptive_clk_manager implement the USB driver.
Together, these tasks handle the USB communication with the host and also monitor the average USB rate seen by the device.
The average USB rate is used for calculating the rate ratios that are
sent to the asrc_process_frame() function. This is described more in the rate_server section.
The usb_xud_thread runs XUD_Main which implements the USB HIL driver. It runs on a dedicated bare metal core so cannot be preempted by other RTOS tasks.
It interfaces with the USB app level thread (usb_task) via shared memory and dedicated channels between the XUD_Main and each endpoint.
XUD_Main notifies the connected endpoint of a USB transfer completion through an interrupt on the respective channel. This interrupt is serviced by the usb_isr routine.
usb_task implements the app level USB driver functionality. The app level USB driver is based on TinyUSB which hooks into the application by means of callback functions.
The usb_isr task is triggered by the interrupt and parses the data transferred from XUD and places it on a queue that the usb_task blocks on for further processing.
For example, on completion of an EP1 OUT transfer, the transfer completion gets notified on the usb_xud_thread -> usb_isr -> usb_task path,
and the usb_task calls the tud_audio_rx_done_post_read_cb() function to have the application process the data received from the host.
On completion of an EP1 IN transfer, the transfer completion again follows the usb_xud_thread -> usb_isr -> usb_task path, and usb_task calls the tud_audio_tx_done_pre_load_cb()
callback function to have the application load the EP1 IN data for the next transfer.
samples_to_host_stream_buf and samples_from_host_stream_buf are circular buffers shared between the application and the USB driver and allow for decoupling one from the other.
The data frame received over USB from the host is written to the samples_from_host_stream_buf by the TinyUSB callback function tud_audio_rx_done_post_read_cb(),
while the application reads USB_TO_I2S_ASRC_BLOCK_LENGTH samples of data out of it.
Similarly, the application writes the ASRC output block of data to the samples_to_host_stream_buf while the TinyUSB callback function tud_audio_tx_done_pre_load_cb()
reads from it to send one frame of data to the USB host.
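The decoupling provided by such circular buffers can be illustrated with a minimal ring buffer in which one side writes frames of one size and the other side reads blocks of another size. This is a generic sketch, not the implementation behind samples_to_host_stream_buf or samples_from_host_stream_buf: all names are hypothetical, and it omits the locking or notification a real producer and consumer running in different tasks would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int32_t *storage;   /* caller-provided sample storage */
    size_t   capacity;  /* number of samples that fit in storage */
    size_t   head;      /* index of the next sample to write */
    size_t   tail;      /* index of the next sample to read */
    size_t   count;     /* samples currently buffered */
} sample_ring_t;

void ring_init(sample_ring_t *rb, int32_t *storage, size_t capacity)
{
    assert(rb != NULL && storage != NULL && capacity > 0);
    rb->storage = storage;
    rb->capacity = capacity;
    rb->head = rb->tail = rb->count = 0;
}

/* Write n samples if they all fit; returns n on success, 0 otherwise. */
size_t ring_write(sample_ring_t *rb, const int32_t *src, size_t n)
{
    if (n > rb->capacity - rb->count) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        rb->storage[rb->head] = src[i];
        rb->head = (rb->head + 1) % rb->capacity;  /* wrap at the end */
    }
    rb->count += n;
    return n;
}

/* Read n samples if that many are buffered; returns n on success, 0 otherwise. */
size_t ring_read(sample_ring_t *rb, int32_t *dst, size_t n)
{
    if (n > rb->count) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        dst[i] = rb->storage[rb->tail];
        rb->tail = (rb->tail + 1) % rb->capacity;  /* wrap at the end */
    }
    rb->count -= n;
    return n;
}
```

Because reads and writes are all-or-nothing, the reader can ask for a whole ASRC input block and simply retry later if the writer has not yet delivered enough frames.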
usb_adaptive_clk_manager task is responsible for calculating the average USB rate as seen by the device. The average rate is calculated over a 16-second moving window.
The averaging smooths out any jitter seen in the USB SOF timestamps that are used for calculating the rate.
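A moving-window average of this kind can be sketched as follows. The example assumes one rate estimate is produced per second, so that 16 entries cover the 16-second window; that update interval, and all names used, are assumptions for illustration rather than details of usb_adaptive_clk_manager.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RATE_WINDOW_LEN 16u  /* assumed: one estimate per second, 16 s window */

typedef struct {
    uint32_t samples[RATE_WINDOW_LEN];  /* most recent rate estimates */
    uint64_t sum;                       /* running sum of the stored estimates */
    size_t   next;                      /* slot that the next estimate overwrites */
    size_t   filled;                    /* number of valid entries so far */
} rate_window_t;

void rate_window_init(rate_window_t *w)
{
    assert(w != NULL);
    memset(w, 0, sizeof *w);
}

/* Add a new rate estimate and return the average over the current window. */
uint32_t rate_window_update(rate_window_t *w, uint32_t rate_hz)
{
    assert(w != NULL);
    if (w->filled == RATE_WINDOW_LEN) {
        w->sum -= w->samples[w->next];  /* drop the oldest estimate */
    } else {
        w->filled++;
    }
    w->samples[w->next] = rate_hz;
    w->sum += rate_hz;
    w->next = (w->next + 1) % RATE_WINDOW_LEN;
    return (uint32_t)(w->sum / w->filled);
}
```

Keeping a running sum makes each update constant-time, and a single jittery estimate shifts the average by only one sixteenth of its error.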
This application presents a stereo 32 bit, I2S Slave interface that supports I2S sampling rates of 44.1, 48, 88.2, 96, 176.4 and 192 kHz.
The I2S driver supports tracking dynamic sampling rate (SR) changes and recalculates the nominal sampling rate after detecting a SR change event.
It also continuously monitors the timespan over which a fixed number of samples are received. This information is then used by the application for
calculating the average I2S rate seen by the device.
i2s_slave_thread, I2S send_buffer and receive_buffer and rtos_i2s_isr make up the I2S driver components.
i2s_slave_thread implements the I2S HIL driver. The HIL level driver calls into the application callback functions for i2s_init(), i2s_restart_check(), i2s_receive() and i2s_send().
These functions, in addition to handling I2S send and receive data, also detect sampling rate changes and gather information for tracking the average sampling rate.
I2S send_buffer and receive_buffer are circular buffers shared between the driver and the application and contain data received over I2S (receive_buffer) and data the application wants to send over I2S (send_buffer).
These buffers allow for decoupling the I2S HIL driver from the ASRC application. The driver reads from and writes to these buffers at the I2S sample rate while the application can read and write blocks of data to these buffers equal to the ASRC input or output block size.
The application calls rtos_i2s_rx() to read I2S_TO_USB_ASRC_BLOCK_LENGTH samples of data from the receive_buffer. The i2s_slave_thread independently calls i2s_receive() callback function to write a sample of data as it gets received over I2S.
Similarly, the application calls rtos_i2s_tx() to write ASRC output size block of data into the send_buffer. Meanwhile, the driver independently calls the callback function i2s_send() to read a sample of data to send over the I2S.
rtos_i2s_isr interrupt is used to ensure that the application calls to rtos_i2s_rx() and rtos_i2s_tx() block only on RTOS primitives when waiting for read data to be available or buffer space to be available when writing data.
usb_audio_out_asrc, i2s_audio_recv_asrc, asrc_one_channel_task, usb_to_i2s_intertile, i2s_to_usb_intertile and the rate_server tasks make up the non-driver components of the application.
usb_audio_out_asrc performs ASRC on data received from the USB host to the device. It waits to get notified by the TinyUSB callback function tud_audio_rx_done_post_read_cb() when there are one or more ASRC input blocks (96 USB samples) of data in the samples_from_host_stream_buf.
It does ASRC processing of the first channel while coordinating with the asrc_one_channel_task for processing the second channel in parallel and sends the processed output to the other tile on the inter-tile context.
i2s_audio_recv_asrc performs ASRC on data received over the I2S interface by the device. It blocks on the rtos_i2s_rx() function to receive one ASRC input block (244 I2S samples) of data from I2S and performs ASRC on one channel
while coordinating with the asrc_one_channel_task for processing the second channel in parallel. It then sends the processed output to the other tile on the inter-tile context.
asrc_one_channel_task performs ASRC on a single channel of data. There is one of these on each tile. It waits on an RTOS message queue for an ASRC input block to be available, does ASRC processing on the block and posts the completion notification on another message queue.
usb_to_i2s_intertile task receives the ASRC output data generated by usb_audio_out_asrc over the inter-tile context onto the I2S tile and writes it to the I2S send_buffer.
It has other rate-monitoring related responsibilities that are described in the rate_server section.
i2s_to_usb_intertile task receives the ASRC output data generated by i2s_audio_recv_asrc over the inter-tile context onto the USB tile and writes it to the USB samples_to_host_stream_buf.
It has other rate-monitoring related responsibilities that are described in the rate_server section.
The I2S -> ASRC -> USB data path diagram shows the application tasks involved in the I2S -> ASRC -> USB path processing and their interaction with each other.
I2S -> ASRC -> USB data path
The USB -> ASRC -> I2S data path diagram shows the application tasks involved in the USB -> ASRC -> I2S path processing and their interaction with each other.
The ASRC process_frame API requires the caller to calculate and send the instantaneous ratio between the ASRC input and output rate. The rate_server is responsible for calculating these rate ratios for both USB -> ASRC -> I2S and I2S -> ASRC -> USB directions.
Additionally, the application also monitors the average buffer fill levels of the buffers holding ASRC output to prevent any overflows or underflows of the respective buffer. A gradual drift in the buffer fill level indicates that the rate ratio is being under or over calculated by the rate_server.
This could happen either due to jitter in the actual rates or precision limitations when calculating the rates.
The average fill level of the buffer is monitored and a closed-loop error correction factor is calculated to keep the buffer level at an expected stable level.
The error estimated based on the buffer fill level is used to compute the estimated rate ratio from the initial rate ratio. This estimated rate ratio is then sent to the ASRC process_frame() API.
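One way to picture the closed-loop correction is a proportional adjustment of the nominal ratio. This is a sketch under stated assumptions, not the application's actual controller: the helper name, the gain value and the sign convention (ratio taken as input rate divided by output rate) are all hypothetical.

```c
#include <assert.h>

/* Hypothetical proportional gain; not taken from the application source. */
#define FILL_LEVEL_GAIN 1.0e-6

/* Hypothetical helper: correct the nominal input/output rate ratio using the
 * drift of the averaged ASRC output buffer fill level from its stable level.
 * A fill level above the stable level means too many output samples are being
 * produced, so the input/output ratio is nudged up to produce fewer. */
double estimate_rate_ratio(double nominal_ratio,
                           double avg_fill_level,
                           double stable_fill_level)
{
    double error = (avg_fill_level - stable_fill_level) * FILL_LEVEL_GAIN;
    return nominal_ratio * (1.0 + error);
}
```

When the averaged fill level equals the stable level the nominal ratio is returned unchanged; any sustained drift pushes the estimate in the direction that brings the buffer back towards the stable level.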
The rate_server runs on the I2S tile (tile 1) and is periodically triggered from the USB tile (tile 0) by the i2s_to_usb_intertile task. The rate_server is triggered once after every 16 frames are written to the samples_to_host_stream_buf.
The following information is needed for calculating the rate ratios:
1. The average I2S rate
2. The average USB rate
3. An error factor computed based on the USB samples_to_host_stream_buf fill level
4. An error factor computed based on the I2S send_buffer fill level
5. A USB mic_interface_open flag indicating if the USB host is streaming out from the device, since the rate ratio in the I2S -> ASRC -> USB direction is calculated only when the host is reading data from the device
6. A USB spkr_interface_open flag indicating if the USB host is streaming into the device, since the rate ratio in the USB -> ASRC -> I2S direction is calculated only when the host is sending data to the device
Of the above, the USB related information (2, 3, 5 and 6 above) is available on the USB tile. When triggering the rate_server, the i2s_to_usb_intertile task gets this information,
either calculating it or getting it through shared memory from other USB tasks on the same tile, and sends it to the rate_server over the inter-tile context using the structure below.
The I2S related information (1 and 4 above) is calculated in the rate_server itself, using data made available through shared memory by other tasks on this tile.
After calculating the rates, the rate_server sends the rate ratio for the USB -> ASRC -> I2S side to the i2s_to_usb_intertile task over the inter-tile context, and it is made available to the
usb_audio_out_asrc task through shared memory. The I2S -> ASRC -> USB side rate ratio is made available to the i2s_audio_recv_asrc task through shared memory, since that task runs on the same tile as the rate_server.
The Rate calculation code flow diagram shows the code flow during the rate ratio calculation process, focussing on the i2s_to_usb_intertile task that triggers the rate_server and the rate_server task where the rate ratios are calculated.
The I2S driver monitors the I2S nominal rate and provides this information to the application. When an I2S sampling rate change happens:
The ASRC instances on both tiles are re-initialised with the new sampling rate.
The buffers that are used for buffer-fill-level based correction are reset. Streaming out of them is paused while zeroes are sent out over both USB and I2S.
Once the buffers fill to a stable level, streaming out from them resumes.
The average buffer level calculation state is reset and the average buffer level calculation starts afresh.
New stable buffer levels are also calculated and the buffer levels are now corrected against these new stable averages.
Note that the device starts with the nominal I2S sampling rate set to zero. Device startup therefore follows the same path as an I2S sampling rate change where the sampling rate goes from zero to first detected nominal sampling rate.
Everything described above therefore also applies to the device startup behaviour.
Handling USB speaker interface close -> open events
When the USB host stops streaming to the device and then starts again, this event is detected through calls to the tud_audio_set_itf_close_EP_cb and tud_audio_set_itf_cb functions.
The ASRC output buffer in the USB -> ASRC -> I2S path (I2S send_buffer) is reset.
Zeroes are then sent over I2S until the buffer fills to a stable level, when we resume streaming out of this buffer to send samples over I2S.
The average buffer calculation state for the I2S send_buffer is also reset and a new stable average is calculated against which the average buffer levels are corrected.
Handling USB mic interface close -> open events
If the USB host stops streaming from the device and then starts again, this event is detected through calls to the tud_audio_set_itf_close_EP_cb and tud_audio_set_itf_cb functions.
The ASRC output buffer in the I2S -> ASRC -> USB path (USB samples_to_host_stream_buf) is reset.
Zeroes are streamed to the host until the buffer fills to a stable level, when we resume streaming out of this buffer to send samples over USB.
The average buffer calculation state for the USB samples_to_host_stream_buf is also reset and a new stable average is calculated against which the average buffer levels are corrected.
Out of the 524288 bytes of memory available per tile, this application uses approximately 262000 bytes of memory on Tile 0
and 208000 bytes of memory on Tile 1.
Profiling the CPU usage for this application using an RTOS friendly profiling tool is still TBD.
However, profiling some application tasks has taken place. These numbers along with some already existing profiling numbers for the drivers are listed in the Tile 0 tasks MIPS and Tile 1 tasks MIPS tables.
Each tile uses 5 bare-metal cores to run RTOS tasks, so each core has a fixed bandwidth of 120 MHz available.
Task | MIPS
ASRC in the I2S -> ASRC -> USB path for the worst case of 192 kHz to 48 kHz downsampling | 75
usb_to_i2s_intertile | 0.7
rate_server | 19
Memory and CPU Requirements
Memory
The table below lists the approximate memory requirements for the larger software components. All memory use estimates in the table below are based on the default configuration for the feature. Alternate configurations will require more or less memory. The estimates are provided as a guideline to help application developers judge the memory cost of extending the application or the benefit of removing an existing feature. It can be assumed that the memory requirement of each component not listed in the table below is under 5 kB.
Memory Requirements
Component | Memory Use (kB)
Stereo Adaptive Echo Canceler (AEC) | 275
Sensory Speech Recognition Engine | 180
Cyberon Speech Recognition Engine | 125
Interference Canceler (IC) + Voice To Noise Ratio Estimator (VNR) | 130
USB | 20
Noise Suppressor (NS) | 15
Adaptive Gain Control (AGC) | 11
CPU
The table below lists the approximate CPU requirements in MIPS for the larger software components. All CPU use estimates in the table below are based on the default configuration for the feature. Alternate configurations will require more or less MIPS. The estimates are provided as a guideline to help application developers judge the MIPS cost of extending the application or the benefit of removing an existing feature. It can be assumed that the CPU requirement of each component not listed in the table below is under 1%.
The following formula was used to convert CPU% to MIPS:
MIPS = (CPU% / 100%) * (600 MHz / 5 cores)
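The conversion can be written as a small helper. This is only an illustration of the formula above; the function and macro names are hypothetical.

```c
#include <assert.h>

#define TILE_CLOCK_MHZ 600.0  /* tile clock used in the formula above */
#define NUM_CORES      5.0    /* cores sharing the tile clock in the formula above */

/* Hypothetical helper: convert a CPU use percentage of one core to MIPS. */
double cpu_percent_to_mips(double cpu_percent)
{
    return (cpu_percent / 100.0) * (TILE_CLOCK_MHZ / NUM_CORES);
}
```

With these values a fully loaded core (100%) corresponds to 120 MIPS and 80% corresponds to 96 MIPS. 72% evaluates to 86.4 MIPS, which the table lists as 87, so the table values appear to be rounded up.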
CPU Requirements (@ 600 MHz)
Component | CPU Use (%) | MIPS Use
USB XUD | 100 | 120
I2S (slave mode) | 80 | 96
Stereo Adaptive Echo Canceler (AEC) | 80 | 96
Sensory Speech Recognition Engine | 80 | 96
Cyberon Speech Recognition Engine | 72 | 87
Interference Canceler (IC) + Voice To Noise Ratio Estimator (VNR) | |
The FFVA example design includes 2 basic configurations: INT and UA. The INT configuration is set up with I2S for input and output audio. The UA configuration is set up with USB for input and output audio. This HOWTO explains how to modify the FFVA example design for I2S input audio and USB output audio.
In the ffva_ua.cmake file, changing the appconfAEC_REF_DEFAULT to appconfAEC_REF_I2S will result in the expected input frames.
For integrating with I2S there are a few other differences from the default UA configuration. When integrating with an external Raspberry Pi BCLK and LRCLK, you will want the following FFVA_UA_COMPILE_DEFINITIONS:
appconfI2S_AUDIO_SAMPLE_RATE can also be 16000. Only 48 kHz and 16 kHz conversions are supported in FFVA.
The default FFVA INT device doesn’t require an external MCLK, but this setting can be changed by setting appconfEXTERNAL_MCLK=1. In this case the FFVA example application will sit at initialization until it can lock on to that clock source, so it MUST be active during boot.
Since the FFVA example application is not receiving reference audio through USB in this configuration, USB adaptive mode will not adapt to the input. By default, FFVA will output the configured nominal rate.
If you enable appconfAEC_REF_DEFAULT=appconfAEC_REF_I2S and appconfI2S_MODE=appconfI2S_MODE_MASTER, you need to invert I2S_DATA_IN and I2S_MIC_DATA in the bsp_config/XK_VOICE_L71/XK_VOICE_L71.xn file to have the reference audio play properly.
Lastly, with I2S enabled the DAC is always initialized by the FFVA example application. If FFVA cannot be the I2C host then it is up to the host to initialize the DAC, like in the AVS demo.
If you want to customize the XTC Tools commands like xflash and xrun, you can see what commands CMake is running by adding VERBOSE=1 to your build command line. For example:
make run_my_target VERBOSE=1
fatfs_mkimage: not found
This issue occurs when the fatfs_mkimage host utility cannot be found. The most common cause is an incomplete installation of XCORE-VOICE.
Ensure that the host applications build and install has been completed. Verify that the fatfs_mkimage binary is installed to a location on PATH, or that the default application installation folder is added to PATH.
One potential issue with the low power FFD application is a crash after adding new code:
xrun: Program received signal ET_ECALL, Application exception. [Switching to tile[1] core[1]] 0x0008a182 in pdm_rx_isr ()
This generally occurs when there is not enough processing time available on tile 1, or when interrupts were disabled for too long, causing the mic array driver to fail to meet timing. To resolve this, reduce the processing time, minimize context switching and other actions that require kernel locks, and/or increase the tile 1 core clock frequency.
The clock dividers are set high to minimize core power consumption. This can make debugging a challenge or impossible. Even adding a simple printf can cause critical timing to be missed. In order to debug with the low-power features enabled, temporarily modify the clock dividers in app_conf.h.
xcc2clang.exe: error: no such file or directory
Those strange characters at the beginning of the path are known as a byte-order mark (BOM). CMake adds them to the beginning of the response files it generates during the configure step. Why does it add them? Because the MSVC compiler toolchain requires them. However, some compiler toolchains, like gcc and xcc, do not ignore the BOM. Why did CMake think the compiler toolchain was MSVC and not the XTC toolchain? Because of a bug in which certain versions of CMake and certain versions of Visual Studio do not play nice together. The good news is that this appears to have been addressed in CMake version 3.22.3. Update to CMake version 3.22.3 or newer.
Copyright (c) 2017 Amazon.com, Inc., licensed under the MIT License
Sensory TrulyHandsfree™
The Sensory TrulyHandsfree™ speech recognition library is Copyright (C) 1995-2022 Sensory Inc. and is provided as an expiring development license. Commercial licensing is granted by Sensory Inc.
Cyberon DSpotter™
For any licensing questions about Cyberon DSpotter™ speech recognition library please contact Cyberon Corporation.
At the core of the Voice Framework are high-performance audio processing algorithms. The algorithms are connected in a pipeline that takes its input from a pair of microphones and executes a series of signal processing algorithms to extract a voice signal from a complex soundscape. The audio pipeline can accept a reference signal from a host system which is used to perform Acoustic Echo Cancellation (AEC) to remove audio being played by the host. The audio pipeline provides two different output channels: one that is optimized for Automatic Speech Recognition systems and the other for voice communications.
A flexible audio signal routing infrastructure and a range of digital inputs and outputs enable the Voice Framework to be integrated into a wide range of system configurations that can be configured at start up and during operation through a set of control registers. In addition, all source code is provided to allow for full customization or the addition of other audio processing algorithms.
lib_aec is a library which provides functions that can be put together to perform Acoustic Echo Cancellation (AEC)
on input mic data using the input reference data to model the room echo characteristics. lib_aec library functions
make use of functionality provided in lib_xcore_math to perform DSP operations. For more details refer to
AEC Overview.
lib_aec is included as part of the fwk_voice github repository
and all requirements for cloning and building fwk_voice apply. lib_aec is compiled as a static library as part of
the overall fwk_voice build. It depends on lib_xcore_math.
The API can be categorised into high level and low level functions.
High level API has fewer input arguments and is simpler. However, it provides limited options for calling functions in parallel
across multiple threads. Keeping API simplicity in mind, most of the high level API functions accept a pointer to the AEC state
structure as an input and modify the relevant part of the AEC state. API and example documentation provides more
details about the fields within the state modified when calling a given function. High level API functions allow
2 levels of parallelism:
Single level of parallelism where for a given function, main and shadow filter processing can happen in parallel.
Two levels of parallelism where, for a given function, processing across multiple channels as well as main and shadow filter can be done in parallel.
Low level API has more input arguments but allows more freedom for running in parallel across multiple threads. Low
level API function names begin with an aec_l2_ prefix.
Depending on the low level API used, functions can be run in parallel to work over a range of bins or a range of phases.
This API is still a work in progress and will be fully supported in the future.
This repo is obtained as part of cloning the parent fwk_voice repo. It is compiled as a static library as part of the fwk_voice
compilation process.
To include lib_aec in an application as a static library, link the generated libfwk_voice_module_lib_aec.a into the
application. Be sure to also add lib_aec/api as an include directory for the application.
The lib_aec library provides functions that can be put together to
perform Acoustic Echo Cancellation on input microphone data by using
input reference data to model the echo characteristics of the room.
The echo canceller takes in one or more channels of microphone (mic)
input and one or more channels of reference input data. The mic input is
the input captured by the device microphones. Reference input is the
audio that is played out of the device speakers. The echo canceller uses
the reference input to model the room echo characteristics for each
mic-loudspeaker pair and outputs an echo cancelled version of the mic
input. AEC uses adaptive filters, one per mic-speaker pair, to constantly
remove echo from the mic input. The filters continually adapt to the
acoustic environment to accommodate changes in the room created by
events such as doors opening or closing and people moving about.
Echo cancellation is performed on a frame by frame basis. Each frame is
made of 15msec chunks of data, which is 240 samples at 16kHz input
sampling frequency, per input channel. For example, for a 2 mic channel
and 2 reference channel input configuration, an input frame is made of
2x240 samples of mic data and 2x240 samples of reference data. Input
data is expected to be in fixed point 32bit 1.31 format. Further, in
this example, there will be a total of 4 adaptive filters;
\(\hat{H}_{y0x0}\), \(\hat{H}_{y0x1}\), \(\hat{H}_{y1x0}\)
and \(\hat{H}_{y1x1}\), monitoring the echo seen in mic channel 0
from reference channel 0 and 1 and echo seen in mic channel 1 from
reference channel 0 and 1.
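The frame size and sample format described above can be checked with a few lines of C. This is a minimal sketch: the macro and helper names are hypothetical and are not part of the lib_aec API, and the conversion truncates towards zero.

```c
#include <assert.h>
#include <stdint.h>

#define SAMPLE_RATE_HZ 16000u
#define FRAME_MS       15u
/* Samples per channel in one 15 ms frame at 16 kHz: 240. */
#define FRAME_SAMPLES  ((SAMPLE_RATE_HZ * FRAME_MS) / 1000u)

/* Hypothetical helper: convert a value in [-1.0, 1.0) to fixed point 1.31
 * format, saturating at the ends of the 32 bit range. */
int32_t float_to_q31(double v)
{
    double scaled = v * 2147483648.0;  /* multiply by 2^31 */
    if (scaled >= 2147483647.0) {
        return INT32_MAX;
    }
    if (scaled <= -2147483648.0) {
        return INT32_MIN;
    }
    return (int32_t)scaled;
}
```

For the 2 mic, 2 reference configuration above, one input frame could then be held in two buffers declared as int32_t mic[2][FRAME_SAMPLES] and int32_t ref[2][FRAME_SAMPLES].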
Microphone data is referred to as \(y\) when in time domain and
\(Y\) when in frequency domain. In general throughout the code,
names starting with lower case represent time domain and those beginning
with upper case represent frequency domain. For example \(error\) is
the filter error and \(Error\) is the spectrum of the filter error.
Reference input is referred to as \(x\) in time domain and \(X\)
when in frequency domain. Filter is referred to as \(\hat{h}\) in
time domain and \(\hat{H}\) in frequency domain.
A filter has multiple phases. The term phases refers to the tail length
of the filter. A filter with more phases or a longer tail length will be
able to model a more reverberant room response leading to better echo
cancellation.
There are 2 types of adaptive filters used in the AEC. These are
referred to as main filter and shadow filter. The main filter, as the
name suggests, is the filter that is used to generate the echo
cancelled output of the AEC. The shadow filter is a filter that is used to
quickly detect and respond to changes in the room transfer function.
There is one main filter and one shadow filter per \(x\)-\(y\)
pair. Typically the main filter has more phases than the shadow filter.
Fewer phases in the shadow filter enable it to rapidly detect and
respond to changes while more phases in main filter lead to deeper
convergence and hence better echo cancellation at the AEC output.
Before starting AEC processing or every time there’s a configuration
change, the user needs to call aec_init() to initialise the echo
canceller for a desired configuration. Once the AEC is initialised, the
library functions can be called in a logical order to perform echo
cancellation on a frame by frame basis. Refer to the aec_1_thread and
aec_2_threads examples to see how the functions are called to perform
echo cancellation using one thread or 2 threads.
counter for tracking shadow filter copy to main filter
struct aec_shared_state_t
#include <aec_state.h>
AEC shared state structure.
Data structures holding AEC persistent state that is common between main filter and shadow filter. aec_state_t::shared_state for both main and shadow filter point to the common aec_shared_state_t structure.
BFP array pointing to the reference input spectrum phases. The term phase refers to the spectrum data for a frame. Multiple phases means multiple frames of data.
For example, 10 phases would mean the 10 most recent frames of data. Each phase spectrum, pointed to by X_fifo[i][j]->data is stored as a length AEC_FD_FRAME_LENGTH, complex 32bit array.
The phases are ordered from most recent to least recent in the X_fifo. For example, for an AEC configuration of 2 x-channels and 10 phases per x channel, 10 frames of X data spectrum are stored in the X_fifo per x channel. For a given x channel, say x channel 0, X_fifo[0][0] points to the most recent frame’s X spectrum and X_fifo[0][9] points to the last phase, i.e. the least recent frame’s X spectrum.
BFP array pointing to time domain mic input processing block. The y data values are stored as length AEC_PROC_FRAME_LENGTH, 32bit integer array per y channel.
BFP array pointing to time domain reference input processing block. The x data values are stored as length AEC_PROC_FRAME_LENGTH, 32bit integer array per x channel.
BFP array pointing to time domain mic input values from the previous frame. These are put together with the new samples received in the current frame to make an AEC_PROC_FRAME_LENGTH processing block. The prev_y data values are stored as length (AEC_PROC_FRAME_LENGTH - AEC_FRAME_ADVANCE), 32bit integer array per y channel.
BFP array pointing to time domain reference input values from the previous frame. These are put together with the new samples received in the current frame to make an AEC_PROC_FRAME_LENGTH processing block. The prev_x data values are stored as length (AEC_PROC_FRAME_LENGTH - AEC_FRAME_ADVANCE), 32bit integer array per x channel.
BFP array pointing to sigma_XX values which are the weighted average of the X_energy signal. The sigma_XX data is stored as 32bit integer array of length AEC_FD_FRAME_LENGTH
Exponential moving average of the time domain mic signal energy. This is calculated by calculating energy per sample and summing across all samples. Stored in a y channels array with every value stored as a 32bit integer mantissa and exponent.
Exponential moving average of the time domain reference signal energy. This is calculated by calculating energy per sample and summing across all samples. Stored in a x channels array with every value stored as a 32bit integer mantissa and exponent.
Energy of the mic input spectrum. This is calculated by calculating the energy per bin and summing across all bins. Stored in a y channels array with every value stored as a 32bit integer mantissa and exponent.
Sum of the X_energy across all bins for a given x channel. Stored in a x channels array with every value stored as a 32bit integer mantissa and exponent.
Data structures holding AEC persistent state. There are 2 instances of aec_state_t maintained within AEC; one for main filter and one for shadow filter specific state.
BFP array pointing to estimated mic signal spectrum. The Y_data data values are stored as length AEC_FD_FRAME_LENGTH, complex 32bit array per y channel.
BFP array pointing to adaptive filter error signal spectrum. The Error data is stored as length AEC_FD_FRAME_LENGTH, complex 32bit array per y channel.
BFP array pointing to the adaptive filter spectrum. The filter spectrum is stored as a num_y_channels x total_phases_across_all_x_channels array where each H_hat[i][j] entry points to the spectrum of a single phase.
Number of phases in the filter refers to its tail length. A filter with more phases would be able to model a longer echo thereby causing better echo cancellation.
For example, for a 2 y-channels, 3 x-channels, 10 phases per x channel configuration, the filter spectrum phases are stored in a 2x30 array. For a given y channel, say y channel 0, H_hat[0][0] to H_hat[0][9] points to 10 phases of H_haty0x0, H_hat[0][10] to H_hat[0][19] points to 10 phases of H_haty0x1 and H_hat[0][20] to H_hat[0][29] points to 10 phases of H_haty0x2.
Each filter phase data which is pointed to by H_hat[i][j].data is stored as AEC_FD_FRAME_LENGTH complex 32bit array.
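The layout in the example above implies a simple mapping from an (x channel, phase) pair to the second index of H_hat. The helper below is a hypothetical illustration of that mapping and is not part of the lib_aec API.

```c
#include <assert.h>

/* Hypothetical helper: second index into H_hat[y_ch][...] for a given
 * x channel and phase, assuming the phases of each x channel are stored
 * contiguously, x channel 0 first, as in the example above. */
unsigned h_hat_phase_index(unsigned x_ch, unsigned phase,
                           unsigned phases_per_x_channel)
{
    assert(phase < phases_per_x_channel);
    return x_ch * phases_per_x_channel + phase;
}
```

With 10 phases per x channel, x channel 1 phase 0 maps to index 10 and x channel 2 phase 9 maps to index 29, matching the ranges given above.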
BFP array pointing to all phases of reference input spectrum across all x channels. Here, the reference input spectrum is saved in a 1 dimensional array of phases, with x channel 0 phases followed by x channel 1 phases and so on. For example, for a 2 x-channels, 10 phases per x channel configuration, X_fifo_1d[0] to X_fifo_1d[9] points to the 10 phases for channel 0 and X_fifo_1d[10] to X_fifo_1d[19] points to the 10 phases for channel 1.
Each X data spectrum phase pointed to by X_fifo_1d[i].data is stored as length AEC_FD_FRAME_LENGTH complex 32bit array.
BFP array pointing to the X_energy data which is the energy per bin of the X spectrum summed over all phases of the X data. X_energy data is stored as a length AEC_FD_FRAME_LENGTH, integer 32bit array per x channel.
BFP array pointing to time domain overlap data values which are used in the overlap add operation done while calculating the echo canceller time domain output. Stored as a length 32, 32 bit integer array per y channel.
Exponential moving average of the time domain adaptive filter error signal energy. Stored in an x channels array with every value stored as a 32bit integer mantissa and exponent.
Maximum X energy across all values of X_energy for a given x channel. Stored in an x channels array with every value stored as a 32bit integer mantissa and exponent.
pointer to the state data shared between main and shadow filter.
unsigned num_phases
Number of filter phases per x-y pair that AEC filter is configured for. This is the input argument num_main_filter_phases or num_shadow_filter_phases, depending on which filter the aec_state_t is instantiated for, passed in aec_init() call.
AEC_LIB_MAX_Y_CHANNELS
Maximum number of microphone input channels supported in the library. Microphone input to the AEC refers to the input from the device’s microphones from which AEC removes the echo created in the room by the device’s loudspeakers.
AEC functions follow the convention of using \(y\) and \(Y\) for referring to time domain and frequency domain representation of microphone input.
The num_y_channels passed into aec_init() call should be less than or equal to AEC_LIB_MAX_Y_CHANNELS. This define is only used for defining data structures in the aec_state. The library code implementation uses only the num_y_channels aec is initialised for in the aec_init() call.
AEC_LIB_MAX_X_CHANNELS
Maximum number of reference input channels supported in the library. Reference input to the AEC refers to a copy of the device’s speaker output audio that is also sent as an input to the AEC. It is used to model the echo characteristics between a mic-loudspeaker pair.
AEC functions follow the convention of using \(x\) and \(X\) for referring to time domain and frequency domain representation of reference input.
The num_x_channels passed into aec_init() call should be less than or equal to AEC_LIB_MAX_X_CHANNELS. This define is only used for defining data structures in the aec_state. The library code implementation uses only the num_x_channels aec is initialised for in the aec_init() call.
AEC_FRAME_ADVANCE
AEC frame size. This is the number of samples of new data that the AEC works on every frame. 240 samples at 16kHz is 15msec. Every frame, the echo canceller takes in 15msec of mic and reference data and generates 15msec of echo cancelled output.
AEC_PROC_FRAME_LENGTH
Time domain samples block length used internally in AEC’s block LMS algorithm
AEC_FD_FRAME_LENGTH
Number of bins of spectrum data computed when doing a DFT of an AEC_PROC_FRAME_LENGTH length time domain vector. The AEC_FD_FRAME_LENGTH spectrum values represent the bins from DC to Nyquist.
AEC_LIB_MAX_PHASES
Maximum total number of phases supported in the AEC library. Total phases are calculated by summing phases across adaptive filters for all x-y pairs.
For example, for a 2 y-channels, 2 x-channels, 10 phases per x channel configuration, there are 4 adaptive filters, H_haty0x0, H_haty0x1, H_haty1x0 and H_haty1x1, each filter having 10 phases, so the total number of phases is 40. When aec_init() is called to initialise the AEC, the num_y_channels, num_x_channels and num_main_filter_phases parameters passed in should be such that num_y_channels * num_x_channels * num_main_filter_phases is less than or equal to AEC_LIB_MAX_PHASES.
This define is only used when defining data structures within the AEC state structure. The AEC algorithm implementation uses the num_main_filter_phases and num_shadow_filter_phases values that are passed into aec_init().
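An application could check a candidate configuration against these limits before calling aec_init(). The sketch below is illustrative only: the EXAMPLE_ limits are placeholder values chosen for this example, not the values defined in the lib_aec headers, and the helper is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder limits for illustration only; the real limits are the
 * AEC_LIB_MAX_* defines in the lib_aec headers. */
#define EXAMPLE_MAX_Y_CHANNELS 2u
#define EXAMPLE_MAX_X_CHANNELS 2u
#define EXAMPLE_MAX_PHASES     40u

/* Hypothetical helper: true if the configuration respects the channel limits
 * and the total main filter phase count respects the phase limit. */
bool aec_config_fits(unsigned num_y_channels, unsigned num_x_channels,
                     unsigned num_main_filter_phases)
{
    return num_y_channels <= EXAMPLE_MAX_Y_CHANNELS
        && num_x_channels <= EXAMPLE_MAX_X_CHANNELS
        && (num_y_channels * num_x_channels * num_main_filter_phases)
               <= EXAMPLE_MAX_PHASES;
}
```

Under these placeholder limits, the 2 x 2 x 10 configuration from the example above uses exactly 40 phases and fits, while 11 phases per filter would not.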
AEC_UNUSED_TAPS_PER_PHASE
Overlap data length
AEC_FFT_PADDING
Extra 2 samples you need to allocate in time domain so that the full spectrum (DC to nyquist) can be stored after the in-place FFT. NOT USER MODIFIABLE.
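The relationship between the time domain block length, the number of spectrum bins and the 2 extra samples can be worked through as follows. The block length of 512 used here is an assumed value for illustration, and the helper names are hypothetical; the arithmetic holds for any even block length.

```c
#include <assert.h>

/* Assumed processing block length, used only for this illustration. */
#define EXAMPLE_PROC_FRAME_LENGTH 512u

/* Number of bins from DC to Nyquist for a real DFT of n samples (n even). */
unsigned num_spectrum_bins(unsigned n)
{
    return n / 2u + 1u;
}

/* Extra 32 bit words needed beyond the n real samples so that all bins,
 * each a complex value of 2 words, fit in the same buffer after an
 * in-place FFT. */
unsigned fft_padding_words(unsigned n)
{
    return 2u * num_spectrum_bins(n) - n;
}
```

For a 512 sample block this gives 257 bins, which occupy 514 words, so 2 words of padding are needed beyond the 512 input samples.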
This function initializes AEC data structures for a given configuration. The configuration parameters num_y_channels, num_x_channels, num_main_filter_phases and num_shadow_filter_phases are passed in as input arguments.
This function needs to be called at startup to first initialise the AEC and subsequently whenever the AEC configuration changes.
main_state, shadow_state and shared_state structures must start at double word aligned addresses.
main_mem_pool and shadow_mem_pool must point to memory buffers big enough to support main and shadow filter processing. AEC state aec_state_t and shared state aec_shared_state_t structures contain only the BFP data structures used in the AEC. The memory these BFP structures will point to needs to be provided by the user in the main and shadow filter memory pools. An example memory pool structure is present in aec_memory_pool_t and aec_shadow_filt_memory_pool_t.
main_mem_pool and shadow_mem_pool must also start at double word aligned addresses.
Example
#include "aec_memory_pool.h"
aec_state_t DWORD_ALIGNED main_state;
aec_state_t DWORD_ALIGNED shadow_state;
aec_shared_state_t DWORD_ALIGNED aec_shared_state;
uint8_t DWORD_ALIGNED aec_mem[sizeof(aec_memory_pool_t)];
uint8_t DWORD_ALIGNED aec_shadow_mem[sizeof(aec_shadow_filt_memory_pool_t)];
unsigned y_chans = 2, x_chans = 2;
unsigned main_phases = 10, shadow_phases = 5;
// There is one main and one shadow filter per x-y channel pair, so for this example there will be 4 main and 4
// shadow filters. Each main filter will have 10 phases and each shadow filter will have 5 phases.
aec_init(&main_state, &shadow_state, &aec_shared_state, aec_mem, aec_shadow_mem, y_chans, x_chans, main_phases, shadow_phases);
Parameters:
main_state – [inout] AEC state structure for holding main filter specific state
shadow_state – [inout] AEC state structure for holding shadow filter specific state
shared_state – [inout] Shared state structure for holding state that is common to main and shadow filter
main_mem_pool – [inout] Memory pool containing main filter memory buffers
shadow_mem_pool – [inout] Memory pool containing shadow filter memory buffers
num_y_channels – [in] Number of mic input channels
num_x_channels – [in] Number of reference input channels
num_main_filter_phases – [in] Number of phases in the main filter
num_shadow_filter_phases – [in] Number of phases in the shadow filter
Initialise AEC data structures for processing a new frame.
This is the first function that is called when a new frame is available for processing. It takes the new samples as input and combines the new samples and previous frame’s history to create a processing block on which further processing happens. It also initialises some data structures that need to be initialised at the beginning of a frame.
Note
y_data and x_data buffers memory is free to be reused after this function call.
This function calculates the energy of frequency domain data used in the AEC. Frequency domain data in AEC is in the form of complex 32bit vectors and energy is calculated as the squared magnitude of the input vector.
Calculate Discrete Fourier Transform (DFT) spectrum of an input time domain vector.
This function calculates the spectrum of a real 32bit time domain vector. It calculates an N point real DFT where N is the length of the input vector to output a complex N/2+1 length complex 32bit vector. The N/2+1 complex output values represent spectrum samples from DC up to the Nyquist frequency.
The DFT calculation is done in place. After this function call the input and output BFP structures data fields point to the same memory. Since DFT is calculated in place, use of the input BFP struct is undefined after this function.
To allow for an in-place transform from N real 32bit values to N/2+1 complex 32bit values, the input vector should have 2 extra real 32bit samples worth of memory. This means that input->data should point to a buffer of length input->length+2.
After this function input->data and output->data point to the same memory address.
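The buffer sizing described above follows from simple arithmetic, sketched here with hypothetical helper names that are not part of lib_aec. An N point real DFT yields N/2+1 complex bins, each bin needs two 32bit words, and the total is therefore N+2 words, which is why the time domain buffer needs 2 extra samples.

```c
#include <assert.h>
#include <stddef.h>

// Number of complex bins (DC up to Nyquist) produced by an N point real DFT.
static size_t dft_num_bins(size_t n)
{
    assert(n >= 2 && n % 2 == 0);
    return n / 2 + 1;
}

// Number of 32bit words needed to hold the spectrum in place:
// every complex bin stores a real and an imaginary word.
static size_t dft_inplace_words(size_t n)
{
    return 2 * dft_num_bins(n);
}

// Length of the real vector recovered by the inverse DFT of a spectrum
// that holds `bins` complex values.
static size_t idft_num_samples(size_t bins)
{
    assert(bins >= 2);
    return 2 * (bins - 1);
}
```

For an example length of N = 512, this gives 257 bins stored in 514 words, and the inverse transform of 257 bins returns 512 samples.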
Calculate inverse Discrete Fourier Transform (DFT) of an input spectrum.
This function calculates a N point inverse real DFT of a complex 32bit where N is 2*(length-1) where length is the length of the input vector. The output is a real 32bit vector of length N.
The inverse DFT calculation is done in place. After this operation the input and the output BFP structures data fields point to the same memory. Since the calculation is done in place, use of input BFP struct after this function is undefined.
After this function input->data and output->data point to the same memory address.
XFIFO is a FIFO of the most recent X frames, where X is spectrum of one frame of reference input. There’s a common X FIFO that is shared between main and shadow filters. It holds num_main_filter_phases most recent X frames and the shadow filter uses num_shadow_filter_phases most recent frames out of it.
This function calculates the energy per X sample index summed across the X FIFO phases. This function also calculates the maximum energy across all sample indices of the output energy vector.
Note
This function implements some speed optimisations which introduce quantisation error. To stop quantisation error build up, in every call of this function, energy for one sample index, which is specified in the recalc_bin argument, is recalculated without the optimisations. There are a total of AEC_FD_FRAME_LENGTH samples in the energy vector, so recalc_bin keeps cycling through indexes 0 to AEC_PROC_FRAME_LENGTH/2.
Parameters:
state – [inout] AEC state. state->X_energy[ch] and state->max_X_energy[ch] are updated
ch – [in] channel index for which energy calculations are done
recalc_bin – [in] The sample index for which energy is recalculated to eliminate quantisation errors
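The cycling of recalc_bin described in the note can be sketched as a wrap-around counter that the caller advances once per frame. The helper below is hypothetical and only illustrates the indexing; it assumes the energy vector has proc_frame_length/2 + 1 entries, consistent with the index range 0 to AEC_PROC_FRAME_LENGTH/2 given above.

```c
#include <assert.h>

// Advance the bin that is recalculated without optimisations, wrapping
// back to 0 after the last bin (index proc_frame_length / 2).
static unsigned next_recalc_bin(unsigned recalc_bin, unsigned proc_frame_length)
{
    unsigned num_bins = proc_frame_length / 2 + 1;
    assert(recalc_bin < num_bins);
    return (recalc_bin + 1) % num_bins;
}
```

Calling this once per frame means every bin is recalculated exactly once in each run of num_bins frames, which bounds how long quantisation error can accumulate in any one bin.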
This function updates the X FIFO by removing the oldest X frame from it and adding the current X frame to it. This function also calculates sigmaXX, which is the exponential moving average of the current X frame energy.
Parameters:
state – [inout] AEC state structure. state->shared_state->X_fifo[ch] and state->shared_state->sigma_XX[ch] are updated.
ch – [in] X channel index for which to update X FIFO
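The FIFO update and moving average described above can be illustrated with a simplified floating point sketch. It is not the library implementation: the real X FIFO holds complex BFP spectra, whereas this example stores per-bin energies directly, and the type names, sizes and the smoothing factor alpha are all assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

#define EX_PHASES 3  // hypothetical FIFO depth
#define EX_BINS   4  // hypothetical number of bins per frame

typedef struct {
    float x_fifo[EX_PHASES][EX_BINS]; // x_fifo[0] is the newest frame
    float sigma_xx[EX_BINS];          // EWMA of the per-bin frame energy
} ex_x_state_t;

// Push the newest frame into the FIFO, dropping the oldest one, and
// update the exponential moving average with smoothing factor alpha.
static void ex_update_x_fifo(ex_x_state_t *s,
                             const float frame_energy[EX_BINS],
                             float alpha)
{
    assert(alpha >= 0.0f && alpha <= 1.0f);
    // Shift every stored frame one slot towards the tail; the last
    // (oldest) frame is overwritten and therefore removed.
    memmove(&s->x_fifo[1], &s->x_fifo[0],
            (EX_PHASES - 1) * sizeof(s->x_fifo[0]));
    memcpy(s->x_fifo[0], frame_energy, sizeof(s->x_fifo[0]));
    for (int b = 0; b < EX_BINS; b++) {
        s->sigma_xx[b] = alpha * s->sigma_xx[b]
                       + (1.0f - alpha) * frame_energy[b];
    }
}
```

A production implementation would more likely rotate pointers than copy frame data, but the copy keeps the ordering easy to follow.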
Calculate error spectrum and estimated mic signal spectrum.
This function calculates the error spectrum (Error) and the estimated mic input spectrum (Y_hat). Y_hat is calculated as the sum of all phases of the adaptive filter multiplied by the respective phases of the reference input spectrum. Error is calculated by subtracting Y_hat from the mic input spectrum Y.
Parameters:
state – [inout] AEC state structure. state->Error[ch] and state->Y_hat[ch] are updated
ch – [in] mic channel index for which to compute Error and Y_hat
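The Error and Y_hat computation described above can be sketched in plain floating point for a single reference channel. This is an illustration of the arithmetic only: the library operates on block floating point vectors across all x channels, and the types and function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float re; float im; } ex_cplx_t;

// Complex multiplication of two values.
static ex_cplx_t ex_cmul(ex_cplx_t a, ex_cplx_t b)
{
    ex_cplx_t r = { a.re * b.re - a.im * b.im,
                    a.re * b.im + a.im * b.re };
    return r;
}

// h_hat and x_fifo are laid out as [phase][bin], flattened phase-major.
// For each bin: Y_hat = sum over phases of H_hat * X, Error = Y - Y_hat.
static void ex_calc_error(size_t num_phases, size_t num_bins,
                          const ex_cplx_t *h_hat, const ex_cplx_t *x_fifo,
                          const ex_cplx_t *y,
                          ex_cplx_t *y_hat, ex_cplx_t *error)
{
    assert(num_phases > 0 && num_bins > 0);
    for (size_t b = 0; b < num_bins; b++) {
        ex_cplx_t acc = { 0.0f, 0.0f };
        for (size_t p = 0; p < num_phases; p++) {
            ex_cplx_t prod = ex_cmul(h_hat[p * num_bins + b],
                                     x_fifo[p * num_bins + b]);
            acc.re += prod.re;
            acc.im += prod.im;
        }
        y_hat[b] = acc;
        error[b].re = y[b].re - acc.re;
        error[b].im = y[b].im - acc.im;
    }
}
```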
This function calculates the average coherence between mic input signal (y) and estimated mic signal (y_hat). A metric is calculated using y and y_hat and the moving average (coh) and a slow moving average (coh_slow) of that metric is calculated. The coherence values are used to distinguish between situations when filter adaption should continue or freeze and update mu accordingly.
Parameters:
state – [inout] AEC state structure. state->shared_state->coh_mu_state[ch].coh and state->shared_state->coh_mu_state[ch].coh_slow are updated
ch – [in] mic channel index for which to calculate average coherence
This function is responsible for windowing the filter error signal and creating AEC filter output that can be propagated to downstream stages. output is calculated by overlapping and adding current frame’s windowed error signal with the previous frame windowed error. This is done to smooth discontinuities in the output as the filter adapts.
Parameters:
state – [inout] AEC state structure. state->error[ch]
output – [out] pointer to the output buffer
ch – [in] mic channel index for which to calculate output
This function calculates the normalisation spectrum of the reference input signal. This normalised spectrum is later used during filter adaption to scale the adaption to the size of the input signal. The normalisation spectrum is calculated as a time and frequency smoothed energy of the reference input spectrum.
The normalisation spectrum is calculated differently for main and shadow filter, so a flag indicating whether this calculation is being done for the main or shadow filter is passed as an input to the function
Parameters:
state – [inout] AEC state structure. state->inv_X_energy[ch] is updated
ch – [in] reference channel index for which to calculate normalisation spectrum
is_shadow – [in] flag indicating filter type. 0: Main filter, 1: Shadow filter
Compare and update filters. Calculate the adaption step size mu.
This function has 2 responsibilities. First, it compares the energies in the error spectrums of the main and shadow filter with each other and with the mic input spectrum energy, and makes an estimate of how well the filters are performing. Based on this, it optionally modifies the filters by either resetting the filter coefficients or copying one filter into another. Second, it uses the coherence values calculated in aec_calc_coherence as well as information from filter comparison done in step 1 to calculate the adaption step size mu.
Parameters:
main_state – [inout] AEC state structure for the main filter
shadow_state – [inout] AEC state structure for the shadow filter
This function calculates a parameter referred to as T that is later used to scale the reference input spectrum in the filter update step. T is a function of the adaption step size mu, normalisation spectrum inv_X_energy and the filter error spectrum Error.
Parameters:
state – [inout] AEC state structure. state->T[x_ch] is updated
This function updates the adaptive filter spectrum (H_hat). It calculates the delta update that is applied to the filter by scaling the X FIFO with the T values computed in aec_compute_T() and applies the delta update to H_hat. A gradient constraint FFT is then applied to constrain the length of each phase of the filter to avoid wrapping when calculating y_hat.
Parameters:
state – [inout] AEC state structure. state->H_hat[y_ch] is updated
The X FIFO BFP structure is maintained in 2 forms: as a 2 dimensional [x_channels][num_phases] array and as a 1 dimensional [x_channels * num_phases] array. This is done in order to optimally access the X FIFO as needed in different functions. After the X FIFO is updated with the current X frame, this function is called in order to copy the 2 dimensional BFP structure into its 1 dimensional counterpart.
Parameters:
state – [inout] AEC state structure. state->X_fifo_1d is updated
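The relationship between the two X FIFO layouts can be sketched as an index mapping. The helper is hypothetical, and the channel-major ordering shown here is an assumption based on the [x_channels * num_phases] shape given above rather than a statement about the library internals.

```c
#include <assert.h>
#include <stddef.h>

// Map a (channel, phase) position in the 2 dimensional FIFO to its
// position in the flattened 1 dimensional array, assuming all phases of
// channel 0 come first, followed by all phases of channel 1, and so on.
static size_t ex_fifo_1d_index(size_t x_ch, size_t phase, size_t num_phases)
{
    assert(phase < num_phases);
    return x_ch * num_phases + phase;
}
```

With 10 phases per channel, the first phase of channel 1 lands at index 10 and its last phase at index 19.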
Calculate a correlation metric between the microphone input and estimated microphone signal.
This function calculates a metric of resemblance between the mic input and the estimated mic signal. The correlation metric, along with reference signal energy is used to infer presence of near and far end signals in the AEC mic input.
Parameters:
state – [in] AEC state structure. state->y and state->y_hat are used to calculate the correlation metric
ch – [in] mic channel index for which to calculate the metric
This function implements a quick check for detecting activity on the input channels. It detects signal presence by checking if the maximum sample in the time domain input frame is above a given threshold.
Parameters:
input_data – [in] Pointer to input data frame. Input is assumed to be in Q1.31 fixed point format.
active_threshold – [in] Threshold for detecting signal activity
num_channels – [in] Number of input data channels
Returns:
0 if no signal activity on the input channels, 1 if activity detected on the input channels
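The activity check described above amounts to a peak search over the input frame. The sketch below is a simplified stand-in, not the library function: it takes a flat array of Q1.31 samples and a plain integer threshold, both of which are assumptions made to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Returns 1 if any sample magnitude across all channels exceeds the
// threshold, otherwise 0. input_data holds num_channels frames of
// samples_per_channel Q1.31 samples, stored one channel after another.
static int ex_detect_input_activity(const int32_t *input_data,
                                    size_t samples_per_channel,
                                    size_t num_channels,
                                    int32_t active_threshold)
{
    assert(active_threshold >= 0);
    for (size_t i = 0; i < samples_per_channel * num_channels; i++) {
        // Widen before negating so that INT32_MIN does not overflow.
        int64_t magnitude = input_data[i];
        if (magnitude < 0) {
            magnitude = -magnitude;
        }
        if (magnitude > active_threshold) {
            return 1;
        }
    }
    return 0;
}
```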
lib_aec is present as part of fwk_voice. Get the latest version of fwk_voice from
https://github.com/xmos/fwk_voice. lib_aec is present within the modules/lib_aec directory in fwk_voice
lib_ns is a library which performs Noise Suppression (NS) by estimating the noise and
subtracting it from the input frame. lib_ns library functions make use of functionality
provided in lib_xcore_math to perform DSP operations. For more details, refer to NS Overview.
lib_ns is included as part of the fwk_voice github repository and all requirements for cloning
and building fwk_voice apply. lib_ns is compiled as a static library as part of the overall
fwk_voice build. It depends on lib_xcore_math.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Noise Suppression Library$$$Getting and Building£££modules/voice/modules/lib_ns/doc/src/getting_started.html#getting-and-building
This module is part of the parent fwk_voice repo clone. It is compiled as a static library as part of
fwk_voice compilation process.
To include lib_ns in an application as a static library, the generated libfwk_voice_module_lib_ns.a can then be linked
into the application. Add lib_ns/api to the include directories when building the application.
The lib_ns library provides an API to implement Noise
Suppression within an application.
The noise suppressor estimates the probability of speech presence and dynamically
adapts its coefficients to estimate the noise levels to subtract from the input.
The filter will automatically reset its noise estimations every 10 frames.
The NS takes as input a frame of data from an audio channel. This could be the
microphone input or the output of another module in the application.
Noise Suppression is performed on a frame-by-frame basis. Each frame consists of
15ms of data, which is 240 samples at 16kHz input sampling frequency. Input data is
expected to be in a fixed-point 32-bit 1.31 format.
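The frame size and sample format described above can be sketched with a few helpers. These names are hypothetical and not part of lib_ns; they show the frame length arithmetic and one common way of moving between 16-bit PCM, Q1.31 and floating point values.

```c
#include <assert.h>
#include <stdint.h>

#define EX_SAMPLE_RATE_HZ 16000
#define EX_FRAME_MS       15
// 15 ms of audio at 16 kHz is 240 samples.
#define EX_FRAME_SAMPLES  (EX_SAMPLE_RATE_HZ * EX_FRAME_MS / 1000)

// Convert a 16-bit PCM sample to Q1.31 by scaling it into the upper
// 16 bits. The product always fits in a 32-bit signed integer.
static int32_t ex_pcm16_to_q31(int16_t sample)
{
    return (int32_t)sample * 65536;
}

// Interpret a Q1.31 value as a real number in the range [-1.0, 1.0).
static double ex_q31_to_double(int32_t q31)
{
    return (double)q31 / 2147483648.0;
}

// Fill one NS-sized frame from 16-bit PCM input.
static void ex_fill_frame(int32_t frame[EX_FRAME_SAMPLES],
                          const int16_t pcm[EX_FRAME_SAMPLES])
{
    assert(frame != 0 && pcm != 0);
    for (int i = 0; i < EX_FRAME_SAMPLES; i++) {
        frame[i] = ex_pcm16_to_q31(pcm[i]);
    }
}
```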
Before processing any frames, the application must configure and initialise the
NS instance by calling ns_init(). Then for each frame,
ns_process_frame() will update the NS instance’s internal state and produce
the output frame by applying the NS algorithm to the input frame.
If multiple channels need to be processed by the application, or multiple outputs
are required, an independent instance of the NS must be run for each channel.
This function initialises the NS state with the provided configuration. It must be called at startup to initialise the NS before processing any frames, and can be called at any time after that to reset the NS instance, returning the internal NS state to its defaults.
This function updates the NS’s internal state based on the input 1.31 frame, and returns an output 1.31 frame containing the result of the NS algorithm applied to the input.
The input and output pointers can be equal to perform the processing in-place.
Length of the frame of data on which the NS will operate.
NS_PROC_FRAME_LENGTH
Time domain samples block length used internally.
NS_PROC_FRAME_BINS
Number of bins of spectrum data computed when doing a DFT of a NS_PROC_FRAME_LENGTH length time domain vector. The NS_PROC_FRAME_BINS spectrum values represent the bins from DC to Nyquist.
NS_INT_EXP
The exponent used internally to keep q1.31 format.
NS_WINDOW_LENGTH
The length of the window applied in time domain
struct ns_state_t
#include <ns_state.h>
NS state structure.
This structure holds the current state of the NS instance and members are updated each time that ns_process_frame() runs. Many of these members are exponentially-weighted moving averages (EWMA) which influence the behaviour of the NS filter. The user should not directly modify any of these members.
lib_ns is present as part of fwk_voice. Get the latest version of fwk_voice from
https://github.com/xmos/fwk_voice. lib_ns is present within the modules/lib_ns directory in fwk_voice.
To use the functions in this library in an application, include ns_api.h in the application source file.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Automatic Gain Control Library£££modules/voice/modules/lib_agc/doc/index.html#automatic-gain-control-library
lib_agc is a library which performs Automatic Gain Control (AGC), with support for Loss Control.
For more details, refer to AGC Overview.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Automatic Gain Control Library$$$Repository Structure£££modules/voice/modules/lib_agc/doc/src/getting_started.html#repository-structure
modules/lib_agc - The actual lib_agc library directory within https://github.com/xmos/fwk_voice/.
Within lib_agc
api/ - Headers containing the public API for lib_agc.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Automatic Gain Control Library$$$Requirements£££modules/voice/modules/lib_agc/doc/src/getting_started.html#requirements
lib_agc is included as part of the fwk_voice github repository and all requirements for cloning
and building fwk_voice apply. lib_agc is compiled as a static library as part of the overall
fwk_voice build. It depends on lib_xcore_math.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Automatic Gain Control Library$$$Getting and Building£££modules/voice/modules/lib_agc/doc/src/getting_started.html#getting-and-building
This module is part of the parent fwk_voice repo clone. It is compiled as a static library as part of
fwk_voice compilation process.
To include lib_agc in an application as a static library, the generated libfwk_voice_module_lib_agc.a can then be linked
into the application. Add lib_agc/api to the include directories when building the application.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Automatic Gain Control Library$$$AGC Overview£££modules/voice/modules/lib_agc/doc/src/overview.html#agc-overview
The lib_agc library provides an API to implement Automatic Gain Control within
an application. The goal of the AGC algorithm is to provide consistent output
levels for voice audio.
The gain control can adapt to maintain the amplitude of the peak of the frame
within an upper and lower bound configured for the AGC instance. When used in an
application with a Voice to Noise Ratio estimator (VNR), the AGC will adapt only when
voice activity is detected, so that speech in the input signal is amplified
above other sounds.
The AGC also has a Loss Control feature which can be used when the application
has an Acoustic Echo Canceller (AEC). This feature uses data from the AEC to
adjust the gain applied to reduce residual echoes by attenuating the audio when
near-end speech is not present.
The AGC takes as input a frame of data from an audio channel. This could be the
microphone input or the output of another module in the application.
Gain control is performed on a frame-by-frame basis. Each frame consists of 15ms
of data, which is 240 samples at 16kHz input sampling frequency. Input data is
expected to be in a fixed-point 32-bit 1.31 format.
Before processing any frames, the application must configure and initialise the
AGC instance by calling agc_init(). Then for each frame,
agc_process_frame() will update the AGC instance’s internal state and produce
the output frame by applying the AGC algorithm to the input frame.
The gain values in this module for AGC gain and Loss Control gain are
multiplicative factors that are applied to scale the input frame. Therefore, a
fixed gain value of 1.0 (without loss control) will create no change to the input.
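The multiplicative nature of the gain can be illustrated with a sketch that scales a Q1.31 frame. It is not the AGC implementation: it uses a plain float gain and hard saturation, whereas the library has its own gain representation and an optional soft-clipping stage, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Scale one Q1.31 sample by a multiplicative gain, saturating at the
// limits of the 32-bit range instead of wrapping around.
static int32_t ex_apply_gain_q31(int32_t sample, float gain)
{
    double scaled = (double)sample * (double)gain;
    if (scaled > (double)INT32_MAX) {
        return INT32_MAX;
    }
    if (scaled < (double)INT32_MIN) {
        return INT32_MIN;
    }
    return (int32_t)scaled;
}

// Apply the same gain to a whole frame. output and input may be the
// same buffer, so the frame can be processed in place.
static void ex_apply_gain_frame(int32_t *output, const int32_t *input,
                                size_t num_samples, float gain)
{
    assert(gain >= 0.0f);
    for (size_t i = 0; i < num_samples; i++) {
        output[i] = ex_apply_gain_q31(input[i], gain);
    }
}
```

A gain of 1.0 returns every sample unchanged, matching the statement above that a fixed gain of 1.0 creates no change to the input.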
If multiple channels need to be processed by the application, or multiple outputs
are required, an independent instance of the AGC must be run for each channel.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Automatic Gain Control Library$$$API Reference£££modules/voice/modules/lib_agc/doc/src/reference/index.html#api-reference
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Automatic Gain Control Library$$$API Reference$$$AGC API Functions£££modules/voice/modules/lib_agc/doc/src/reference/api.html#agc-api-functions
This function initialises the AGC state with the provided configuration. It must be called at startup to initialise the AGC before processing any frames, and can be called at any time after that to reset the AGC instance, returning the internal AGC state to its defaults.
This function updates the AGC’s internal state based on the input frame and meta-data, and returns an output containing the result of the AGC algorithm applied to the input.
The input and output pointers can be equal to perform the processing in-place.
output – [out] Array to return the resulting frame of data
input – [in] Array of frame data on which to perform the AGC
meta_data – [in] Meta-data structure with VNR/AEC data
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Automatic Gain Control Library$$$API Reference$$$AGC Pre-Defined Profiles£££modules/voice/modules/lib_agc/doc/src/reference/profiles.html#agc-pre-defined-profiles
groupagc_profiles
Defines
AGC_PROFILE_ASR
AGC profile tuned for Automatic Speech Recognition (ASR).
AGC_PROFILE_FIXED_GAIN
AGC profile tuned to apply a fixed gain.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Automatic Gain Control Library$$$API Reference$$$AGC API Structure Definitions£££modules/voice/modules/lib_agc/doc/src/reference/defines.html#agc-api-structure-definitions
groupagc_defs
Defines
AGC_FRAME_ADVANCE
Length of the frame of data on which the AGC will operate.
AGC_META_DATA_NO_VNR
If the application has no VNR, adapt_on_vnr must be disabled in the configuration. This pre-processor definition can be assigned to the vnr_flag in agc_meta_data_t in that situation to make it clear in the code that there is no VNR.
AGC_META_DATA_NO_AEC
If the application has no AEC, lc_enabled must be disabled in the configuration. This pre-processor definition can be assigned to the aec_ref_power and aec_corr_factor in agc_meta_data_t in that situation to make it clear in the code that there is no AEC.
struct agc_config_t
#include <agc_api.h>
AGC configuration structure.
This structure contains configuration settings that can be changed to alter the behaviour of the AGC instance.
Members with the “lc_” prefix are parameters for the Loss Control feature.
Public Members
int adapt
Boolean to enable AGC adaption; if enabled, the gain to apply will adapt based on the peak of the input frame and the upper/lower threshold parameters.
int adapt_on_vnr
Boolean to enable adaption based on the VNR meta-data; if enabled, adaption will always be performed when voice activity is detected. This must be disabled if the application doesn’t have a VNR.
int soft_clipping
Boolean to enable soft-clipping of the output frame.
Loss control gain to apply when far-end activity only is detected.
struct agc_state_t
#include <agc_api.h>
AGC state structure.
This structure holds the current state of the AGC instance and members are updated each time that agc_process_frame() runs. Many of these members are exponentially-weighted moving averages (EWMA) which influence the adaption of the AGC gain or the loss control feature. The user should not directly modify any of these members, except the config.
The current configuration of the AGC. Any member of this configuration structure can be modified and that change will take effect on the next run of agc_process_frame().
EWMA of the far-end correlation for detecting double-talk.
struct agc_meta_data_t
#include <agc_api.h>
AGC meta data structure.
This structure holds meta-data about the current frame to be processed, and must be updated to reflect the current frame before calling agc_process_frame().
Public Members
int vnr_flag
Boolean to indicate the detection of voice activity in the current frame.
Correlation factor between the microphone input and the AEC’s estimated microphone signal.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Automatic Gain Control Library$$$API Reference$$$AGC Header Files£££modules/voice/modules/lib_agc/doc/src/reference/header_files.html#agc-header-files
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Automatic Gain Control Library$$$API Reference$$$agc_api.h£££modules/voice/modules/lib_agc/doc/src/reference/header_files.html#agc-api-h
This header should be included in application source code to gain access to the lib_agc public functions API.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Automatic Gain Control Library$$$API Reference$$$agc_profiles.h£££modules/voice/modules/lib_agc/doc/src/reference/header_files.html#agc-profiles-h
This header contains pre-defined profiles for AGC configurations. These profiles can be used to initialise the agc_config_t data for use with agc_init().
This header is automatically included by agc_api.h.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Automatic Gain Control Library$$$On GitHub£££modules/voice/modules/lib_agc/doc/src/reference/header_files.html#on-github
lib_agc is present as part of fwk_voice. Get the latest version of fwk_voice from
https://github.com/xmos/fwk_voice. lib_agc is present within the modules/lib_agc directory in fwk_voice.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Automatic Gain Control Library$$$API£££modules/voice/modules/lib_agc/doc/src/reference/header_files.html#api
To use the functions in this library in an application, include agc_api.h in the application source file.
lib_adec is a library which provides functions for measuring and correcting delay offsets between the reference
and loudspeaker signals.
lib_adec depends on lib_aec and lib_xcore_math libraries. For more details about the ADEC, refer to
ADEC Overview
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Automatic Delay Estimation and Correction Library$$$Getting and Building£££modules/voice/modules/lib_adec/doc/src/getting_started.html#getting-and-building
lib_adec is included as part of the fwk_voice github repository
and all requirements for cloning and building fwk_voice apply. lib_adec is compiled as a static library as part of
the overall fwk_voice build. To include lib_adec in an application as a static library, link the generated libfwk_voice_module_lib_adec.a into the application. Be sure to also add lib_adec/api as an include directory for the application.
The ADEC module provides functions to estimate and automatically correct for delay offsets between the reference and the
loudspeakers.
Acoustic echo cancellation is an adaptive filtering process which compares the reference audio to that received from the
microphones. It models the reverberation time of a room, i.e. the time it takes for acoustic reflections to decay to
insignificance. The time window modelled by the AEC is finite, and to maximise its performance it is important to ensure
that the reference audio is presented to the AEC time aligned to the audio being reproduced by the loudspeakers. The
reference audio path delay and the audio reproduction path delay may be significantly different, requiring additional
delay to be inserted into one of the two paths, to correct this delay difference.
The ADEC module provides functionality for
Measuring the current delay
Using the measured delay along with AEC performance related metadata collected from the echo canceller to monitor AEC and make decisions about reconfiguring the AEC and correcting bulk delay offsets.
The metadata collected from AEC contains statistics such as the ERLE, the peak power seen in the adaptive filter and the
peak power to average power ratio of the adaptive filter.
The ADEC algorithm works in 2 modes - normal mode and delay estimation mode.
In its normal mode, the ADEC monitors the AEC performance and requests small delay corrections. Using the statistics from the AEC, the ADEC estimates a metric called the
AEC goodness, which is an estimate of how well the echo canceller is performing. Based on the estimated AEC goodness and the current measured delay, the ADEC can
request a delay correction to be applied at the input of the echo canceller.
If the AEC is seen as consistently bad, the ADEC transitions to a delay estimation mode and requests:
A special delay to be applied at the AEC input that will enable measuring the actual delay in both delay scenarios: microphone input arriving at the AEC earlier in time than the reference input, as well as microphone input arriving later in time than the reference input.
A restart of the AEC in a new configuration that has more adaptive filter phases, in order to have a longer filter tail length that is suitable for delay estimation.
Once the ADEC has a measure of the new delay, it requests a delay correction and a reconfiguration of the AEC back to its normal
configuration, then returns to its normal mode of monitoring AEC performance and correcting small delay offsets.
Before processing any frames, the application must configure and initialise the ADEC instance by calling adec_init(). Then for each frame, adec_estimate_delay() will estimate the current delay and adec_process_frame() will use the current frame’s AEC statistics and the estimated delay to monitor the AEC and request possible AEC and delay configuration changes.
This function initialises ADEC state for a given configuration. It must be called at startup to initialise the ADEC data structures before processing any frames, and can be called at any time after that to reset the ADEC instance, returning the internal ADEC state to its defaults.
Example with ADEC configured for delay estimation only at startup
adec_state_t adec_state;
adec_config_t adec_conf;
adec_conf.bypass = 1; // Bypass automatic DE correction
adec_conf.force_de_cycle_trigger = 1; // Force a delay correction cycle, so that delay correction happens once after initialisation
adec_init(&adec_state, &adec_conf);
// Application needs to ensure that adec_state->adec_config.force_de_cycle_trigger is set to 0 after ADEC has requested a transition to delay estimation mode once in order to ensure that delay is corrected only at startup.
Example with ADEC configured for automatic delay estimation and correction
Perform ADEC processing on an input frame of data.
This function takes information about the latest AEC processed frame and the latest measured delay estimate as input, and decides if a delay correction between input microphone and reference signals is required. If a correction is needed, it outputs a new requested input delay, optionally accompanied with a request for AEC restart in a different configuration. It updates the internal ADEC state structure to reflect the current state of the ADEC process.
This function measures the microphone signal delay wrt the reference signal. It does so by looking for the phase with the peak energy among all AEC filter phases and uses the peak energy phase index as the estimate of the microphone delay. Along with the measured delay, it also outputs information about the peak phase energy that can then be used to gauge the AEC filter convergence and the reliability of the measured delay.
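The peak search described above can be sketched as an argmax over per-phase filter energies. This is an illustration only: the helper names are hypothetical, energies are plain floats rather than the library types, and the conversion to samples assumes that each filter phase spans a fixed number of time domain samples (240 in the example, which is an assumption and not taken from the text).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Return the index of the filter phase with the highest energy.
static size_t ex_peak_phase_index(const float *phase_energy, size_t num_phases)
{
    assert(num_phases > 0);
    size_t peak = 0;
    for (size_t p = 1; p < num_phases; p++) {
        if (phase_energy[p] > phase_energy[peak]) {
            peak = p;
        }
    }
    return peak;
}

// Convert a phase index to a delay in samples, assuming every phase
// covers samples_per_phase time domain samples.
static int32_t ex_delay_samples(size_t peak_phase, int32_t samples_per_phase)
{
    return (int32_t)peak_phase * samples_per_phase;
}
```

The energy at the peak index, compared with the energy in the other phases, gives a rough indication of how sharply the filter has converged and therefore how far the measured delay can be trusted.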
Number of frames we look back to smooth the peak to average filter power ratio history.
ADEC_PEAK_LINREG_HISTORY_SIZE
Number of frames of peak power history we look at while computing the AEC goodness metric. NOT USER MODIFIABLE.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Automatic Delay Estimation and Correction Library$$$API Reference$$$ADEC Data Structure and Enum definitions£££modules/voice/modules/lib_adec/doc/src/reference/types.html#adec-data-structure-and-enum-definitions
groupadec_types
Enums
enum adec_mode_t
Values:
enumerator ADEC_NORMAL_AEC_MODE
ADEC processing mode in which it monitors AEC performance and requests small delay corrections.
enumerator ADEC_DELAY_ESTIMATOR_MODE
ADEC processing mode for bulk delay correction, in which it measures a new delay offset.
struct adec_config_t
#include <adec_state.h>
ADEC configuration structure.
This is used to provide configuration when initialising ADEC at startup. A copy of this structure is present in the ADEC state structure and available to be modified by the application for run time control of ADEC configuration.
Public Members
int32_t bypass
Bypass ADEC decision making process. When set to 1, ADEC evaluates the current input frame metrics but doesn’t make any delay correction or AEC reset and reconfiguration requests.
int32_t force_de_cycle_trigger
Force trigger a delay estimation cycle. When set to 1, ADEC bypasses the ADEC monitoring process and transitions to delay estimation mode for measuring delay offset.
struct de_output_t
#include <adec_state.h>
Delay estimator output structure.
Public Members
int32_t measured_delay_samples
Estimated microphone delay in time domain samples.
Flag indicating if ADEC is requesting an input delay correction
int32_trequested_mic_delay_samples
Mic delay in samples requested by ADEC. Relevant when delay_change_request_flag is 1. Note that this value is a signed integer. A positive requested_mic_delay_samples requires the microphone to be delayed so the application needs to delay the input mic signal by requested_mic_delay_samples samples. A negative requested_mic_delay_samples means ADEC is requesting the input mic signal to be moved earlier in time. This, the application should do my delaying the input reference signal by abs(requested_mic_delay_samples) samples.
int32_treset_aec_flag
flag indicating ADEC’s request for a reset of part of the AEC state to get AEC filter to start adapting from a 0 filter. ADEC requests this when a small delay correction needs to be applied that doesn’t require a full reset of the AEC.
int32_tdelay_estimator_enabled_flag
Flag indicating if AEC needs to be run configured in delay estimation mode.
int32_trequested_delay_samples_debug
Requested delay samples without clamping to +- MAX_DELAY_SAMPLES. Used only for debugging.
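The sign convention of requested_mic_delay_samples can be sketched as a small helper that turns the signed request into two non-negative path delays. This is only an illustration: the type ex_delay_plan_t and the function ex_plan_delay are hypothetical names, not part of lib_adec, and the application is assumed to own separate delay buffers for the mic and reference paths.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: how many samples to delay each path. */
typedef struct {
    uint32_t mic_delay_samples; /* delay applied to the microphone input */
    uint32_t ref_delay_samples; /* delay applied to the reference input */
} ex_delay_plan_t;

/* Map ADEC's signed request onto two non-negative path delays. */
static ex_delay_plan_t ex_plan_delay(int32_t requested_mic_delay_samples)
{
    /* INT32_MIN cannot be negated; a clamped request should never reach it. */
    assert(requested_mic_delay_samples != INT32_MIN);

    ex_delay_plan_t plan = {0u, 0u};
    if (requested_mic_delay_samples >= 0) {
        /* Positive request: delay the microphone signal. */
        plan.mic_delay_samples = (uint32_t)requested_mic_delay_samples;
    } else {
        /* Negative request: delay the reference signal by the magnitude. */
        plan.ref_delay_samples = (uint32_t)(-requested_mic_delay_samples);
    }
    return plan;
}
```

An application would typically apply such a plan only on frames where delay_change_request_flag is 1.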
struct aec_to_adec_t
#include <adec_state.h>
Input structure containing current frame’s information from AEC.
Flag indicating if there is activity on reference input channels.
struct adec_state_t
#include <adec_state.h>
ADEC state structure.
This structure holds the current state of the ADEC instance and members are updated each time that adec_process_frame() runs. Many of these members are statistics from tracking the AEC performance. The user should not directly modify any of these members, except the config.
ADEC’s mode of operation. Can be operating in normal AEC or delay estimation mode.
int32_t gated_milliseconds_since_mode_change
Milliseconds elapsed since a delay change was last requested. Used to ensure that delay corrections are not requested too early, before the AEC filter has had enough time to converge.
int32_t last_measured_delay
Last measured delay.
int32_t peak_power_history_idx
Index storing the head of the peak_power_history circular buffer.
int32_t peak_power_history_valid
Flag indicating whether the peak_power_history buffer has been filled at least once.
int32_t sf_copy_flag
Flag indicating if shadow to main filter copy has happened at least once in the AEC.
int32_t convergence_counter
Counter indicating the number of frames the AEC shadow filter has been attempting to converge.
int32_t shadow_flag_counter
Counter indicating the number of frames the AEC shadow filter has been better than the main filter.
lib_adec is present as part of fwk_voice. Get the latest version of fwk_voice from
https://github.com/xmos/fwk_voice. lib_adec is present within the modules/lib_adec directory in fwk_voice
lib_ic is a library which provides functions that together perform Interference Cancellation (IC)
on two channel input mic data by adapting to and modelling the room transfer characteristics. lib_ic library functions
make use of functionality provided in lib_aec for the core normalised LMS blocks which in turn uses
lib_xcore_math to perform DSP low-level optimised operations. For more details refer to IC Overview.
lib_ic is included as part of the fwk_voice github repository
and all requirements for cloning and building fwk_voice apply. lib_ic is compiled as a static library as part of
overall fwk_voice build. It depends on lib_aec and lib_xcore_math.
The API is presented as three simple functions: initialisation, filtering and adaption. Initialisation is called once
at startup, and filtering and adaption are called once per frame of samples. The performance requirement is relatively low (around 12 MIPS),
so the library is supplied as a single-threaded implementation only.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Interference Canceller Library$$$Getting and Building£££modules/voice/modules/lib_ic/doc/src/getting_started.html#getting-and-building
This repo is obtained as part of the parent fwk_voice repo clone. It is
compiled as a static library as part of fwk_voice compilation process.
To include lib_ic in an application as a static library, the generated libfwk_voice_module_lib_ic.a can then be linked into the
application. Be sure to also add lib_ic/api as an include directory for the application.
The Interference Canceller (IC) suppresses static noise from point sources such as cooker hoods, washing machines,
or radios for which there is no reference audio signal available. When the Voice to Noise Ratio estimator (VNR) input
indicates the absence of voice, the IC adapts to remove noise from point sources in the environment. When the VNR
signal indicates the presence of voice, the IC suspends adaptation which allows the voice source to be passed but
maintains suppression of the interfering noise sources which have been previously adapted to.
It can offer much greater, and automatic, cancellation of broad-band noise sources when compared to beam forming
techniques.
It is designed to work at a sample rate of 16kHz and has a fixed configuration of two input microphones and a single
output channel.
The interference canceller is based on an AEC architecture and attempts to cancel one microphone signal from the other in
the absence of voice. In this way, it builds an estimate of the difference in transfer functions between the two
microphones for any present noise sources. Since the transfer function includes spatial information about the noise
sources, applying this filter to the mic input allows any signals originating from the noise source to be cancelled.
The IC uses an adaptive filter which continually adapts to the acoustic environment to accommodate changes in the room
created by events such as doors opening or closing and people moving about. However, it will hold the current transfer
function in the presence of voice meaning it does not adapt to desired audio sources, which can be a person speaking.
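The adapt-or-hold behaviour described above can be illustrated with a minimal time-domain normalised LMS sketch. It is not the library's implementation, which works on frequency-domain blocks with far longer filters; the names ex_nlms_t and ex_nlms_step, the four-tap length and the step size are all assumptions chosen for illustration. Following the naming used in this overview, y is the signal containing the noise and x is the reference.

```c
#include <assert.h>
#include <stddef.h>

#define EX_TAPS 4 /* hypothetical filter length, far shorter than the real IC */

typedef struct {
    float h_hat[EX_TAPS];  /* estimated transfer function between the mics */
    float x_hist[EX_TAPS]; /* most recent reference samples, newest first */
} ex_nlms_t;

/* Process one sample. Returns error = y - estimate.
 * The coefficients are updated only when adapt is non-zero. */
static float ex_nlms_step(ex_nlms_t *s, float y, float x, int adapt)
{
    assert(s != NULL);
    const float mu = 0.5f;     /* hypothetical adaption step size */
    const float delta = 1e-6f; /* keeps the normalisation finite for silent input */

    /* Shift the new reference sample into the history. */
    for (size_t i = EX_TAPS - 1; i > 0; --i) {
        s->x_hist[i] = s->x_hist[i - 1];
    }
    s->x_hist[0] = x;

    /* Estimate the noise in y from x and measure the reference energy. */
    float y_hat = 0.0f;
    float energy = 0.0f;
    for (size_t i = 0; i < EX_TAPS; ++i) {
        y_hat += s->h_hat[i] * s->x_hist[i];
        energy += s->x_hist[i] * s->x_hist[i];
    }
    float error = y - y_hat;

    /* Adapt only when no voice is present; otherwise hold h_hat. */
    if (adapt) {
        float step = mu * error / (energy + delta);
        for (size_t i = 0; i < EX_TAPS; ++i) {
            s->h_hat[i] += step * s->x_hist[i];
        }
    }
    return error;
}
```

Calling ex_nlms_step with adapt set to 0 still subtracts the estimate from y, which mirrors how previously learned noise sources remain suppressed while voice is present.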
The cancellation is performed on a frame by frame basis. Each frame is made of 15msec chunks of data, which is 240
new samples at 16kHz input sampling frequency, per input channel. This is combined with previous audio data to form
a 512 sample frame which allows for sufficient overlap for effective operation of the filter.
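The framing arithmetic can be sketched as a sliding window: each call discards the oldest 240 samples and appends 240 new ones, so 272 samples of history are retained. The names below (EX_FRAME_ADVANCE, EX_FRAME_LENGTH, ex_form_frame) are hypothetical, and the exact ordering of samples inside the library's frame buffer is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_FRAME_ADVANCE 240 /* new samples per frame: 15 ms at 16 kHz */
#define EX_FRAME_LENGTH  512 /* processing frame length stated above */

/* Slide the window: drop the oldest 240 samples, append the 240 new ones. */
static void ex_form_frame(int32_t frame[EX_FRAME_LENGTH],
                          const int32_t new_samples[EX_FRAME_ADVANCE])
{
    assert(frame != NULL && new_samples != NULL);

    /* 512 - 240 = 272 samples of previous audio are kept. */
    const size_t keep = EX_FRAME_LENGTH - EX_FRAME_ADVANCE;

    memmove(frame, frame + EX_FRAME_ADVANCE, keep * sizeof(frame[0]));
    memcpy(frame + keep, new_samples, EX_FRAME_ADVANCE * sizeof(frame[0]));
}
```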
The first channel of input microphone data is referred to as y when in time domain and Y when in frequency
domain. The second channel of input microphone data is referred to as x when in time domain and X when in frequency
domain. The y signal is effectively used as the signal containing noise that needs to be cancelled and the x signal
is the reference from which the transfer function is estimated and consequently the noise signal estimated before it
is subtracted from y.
In general throughout the code, names starting with lower case represent time domain and those beginning with
upper case represent frequency domain. For example error is the filter error and Error is the spectrum of
the filter error. The filter coefficient array is referred to as h_hat in time domain and H_hat in frequency domain.
The filter has multiple phases each of 15ms. The term phases refers to the tail length of the filter. A filter with more phases or a
longer tail length will be able to model a more reverberant room response leading to better interference cancellation
but, as with all normalised LMS based architectures, will be slower to converge in the case of a transfer function change.
Before starting the IC processing the user must call ic_init() to initialise the IC. If the configuration parameters are
to be set to non-default values, modify them after calling ic_init() or in the lib_ic API Definitions file.
Once the IC is initialised, the library functions can be called in order to perform interference cancellation on
a frame by frame basis.
Initialise IC and VNR data structures and set parameters according to ic_defines.h.
This is the first function that must be called after creating an ic_state_t instance.
Parameters:
state – [inout] pointer to IC state structure
Returns:
Error status of the VNR inference engine initialisation that is done as part of ic_init. 0 if no error, one of TfLiteStatus error enum values in case of error.
This should be called once per new frame of IC_FRAME_ADVANCE samples. The y_data array contains the microphone data that is to have the noise subtracted from it and x_data is the noise reference source which is internally delayed before being fed into the adaptive filter. Note that the y_data input array is internally delayed by the call to ic_filter() and so contains the delayed y_data afterwards. Typically it does not matter which mic channel is connected to x or y_data as long as the separation is appropriate. The performance of this filter has been optimised for a 71mm mic separation distance.
Parameters:
state – [inout] pointer to IC state structure
y_data – [inout] array reference of mic 0 input buffer. Modified during call
x_data – [in] array reference of mic 1 input buffer
output – [out] array reference containing IC processed output buffer
Calculate voice to noise ratio estimation for the input and output of the IC.
This function can be called after each call to ic_filter. It will calculate voice to noise ratio which can be used to give information to ic_adapt and to the AGC.
Parameters:
state – [inout] pointer to IC state structure
input_vnr_pred – [inout] voice to noise estimate of the IC input
output_vnr_pred – [inout] voice to noise estimate of the IC output
Adapts the IC filter according to previous frame’s statistics and VNR input.
This function should be called after each call to ic_filter. Filter and adapt functions are separated so that the external VNR can operate on each frame.
Parameters:
state – [inout] pointer to IC state structure
vnr – [in] VNR Voice-to-Noise ratio estimation
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Interference Canceller Library$$$API Reference$$$lib_ic API State Structure£££modules/voice/modules/lib_ic/doc/src/reference/state.html#lib-ic-api-state-structure
group ic_state
Enums
enum adaption_config_e
Values:
enumerator IC_ADAPTION_AUTO
enumerator IC_ADAPTION_FORCE_ON
enumerator IC_ADAPTION_FORCE_OFF
enum control_flag_e
Values:
enumerator HOLD
enumerator ADAPT
enumerator ADAPT_SLOW
enumerator UNSTABLE
enumerator FORCE_ADAPT
enumerator FORCE_HOLD
struct ic_config_params_t
#include <ic_state.h>
IC configuration structure.
This structure contains configuration settings that can be changed to alter the behaviour of the IC instance. An instance of this structure is automatically included as part of the IC state.
It controls the behaviour of the main filter and normalisation thereof. The initial values for these configuration parameters are defined in ic_defines.h and are initialised by ic_init().
Public Members
uint8_t bypass
Boolean to control bypassing of filter stage and adaption stage. When set, the delayed y audio samples are passed unprocessed to the output. It is recommended to perform an initialisation of the instance after bypass is set as the room transfer function may have changed during that time.
int32_t gamma_log2
Up scaling factor for X energy calculation used for normalisation.
uint32_t sigma_xx_shift
Down scaling factor for X energy used for normalisation.
Delta value used in denominator to avoid large values when calculating inverse X energy.
struct ic_adaption_controller_config_t
#include <ic_state.h>
IC adaption controller configuration structure.
This structure contains configuration settings that can be changed to alter the behaviour of the adaption controller. This includes processing of the raw VNR probability input and optional stability controller logic. It is automatically included as part of the IC state and initialised by ic_init().
The initial values for these configuration parameters are defined in ic_defines.h.
Enum which controls the way mu and leakage_alpha are being adjusted.
struct ic_adaption_controller_state_t
#include <ic_state.h>
IC adaption controller state structure.
This structure contains state used for the instance of the adaption controller logic. It is automatically included as part of the IC state and initialised by ic_init().
Configuration parameters for the adaption controller.
struct ic_state_t
#include <ic_state.h>
IC state structure.
This is the main state structure for an instance of the Interference Canceller. Before use it must be initialised using the ic_init() function. It contains everything needed for the IC instance including configuration and internal state of both the filter, adaption logic and adaption controller.
BFP array pointing to the frequency domain T used for adapting the filter coefficients (H). Note there is no associated storage because we re-use the x input array as a memory optimisation.
Initial MU value applied on startup. MU controls the adaption rate of the IC and is normally adjusted by the adaption rate controller during operation.
IC_INIT_EMA_ALPHA
Alpha used for calculating y_ema_energy, x_ema_energy and error_ema_energy.
IC_INIT_LEAKAGE_ALPHA
Alpha used for leaking away H_hat, allowing filter to slowly forget adaption. This value is adjusted by the adaption rate controller if instability is detected.
IC_FILTER_PHASES
The number of filter phases supported by the IC. Each filter phase represents 15ms of filter length. Hence a 10 phase filter will allow cancellation of noise sources with up to 150ms of echo tail length. There is a tradeoff between adaption speed and maximum cancellation of the filter; increasing the number of phases will increase the maximum cancellation at the cost of increased xCORE resource usage and slower adaption times.
IC_Y_CHANNEL_DELAY_SAMPS
This is the delay, in samples that one of the microphone signals is delayed in order for the filter to be effective. A larger number increases the delay through the filter but may improve cancellation. The group delay through the IC filter is 32 + this number of samples.
IC_INIT_SIGMA_XX_SHIFT
Down scaling factor for X energy calculation used for normalisation.
IC_INIT_GAMMA_LOG2
Up scaling factor for X energy calculation used for LMS normalisation.
IC_INIT_DELTA
Delta value used in denominator to avoid large values when calculating inverse X energy.
IC_INIT_FAST_RATIO_THRESHOLD
Fast ratio threshold to detect instability.
IC_INIT_ENERGY_ALPHA
Alpha for EMA input/output energy calculation.
IC_INIT_HIGH_INPUT_VNR_HOLD_LEAKAGE_ALPHA
Leakage alpha used in case vnr detects high voice probability.
IC_INIT_INSTABILITY_RECOVERY_LEAKAGE_ALPHA
Leakage alpha used in the case where instability is detected. This allows the filter to stabilise without completely forgetting the adaption.
IC_INIT_ADAPT_COUNTER_LIMIT
Limits number of frames for which mu and leakage_alpha could be adapted.
IC_INIT_INPUT_VNR_THRESHOLD
VNR input threshold which decides whether to hold or adapt the filter.
IC_INIT_INPUT_VNR_THRESHOLD_HIGH
VNR high threshold to leak the filter if the speech level is high.
IC_INIT_INPUT_VNR_THRESHOLD_LOW
VNR low threshold to adapt faster when the speech level is low.
IC_INIT_VNR_PRED_ALPHA
Alpha for EMA VNR prediction calculation.
IC_INIT_INPUT_VNR_PRED
Initial value for the input VNR prediction.
IC_INIT_OUTPUT_VNR_PRED
Initial value for the output VNR prediction.
IC_Y_CHANNELS
Number of Y channels input. This is fixed at 1 for the IC. The Y channel is delayed and used to generate the estimated noise signal to subtract from X. In practical terms it does not matter which microphone is X and which is Y. NOT USER MODIFIABLE.
IC_X_CHANNELS
Number of X channels input. This is fixed at 1 for the IC. The X channel is the microphone from which the estimated noise signal is subtracted. In practical terms it does not matter which microphone is X and which is Y. NOT USER MODIFIABLE.
IC_FRAME_LENGTH
Time domain samples block length used internally in the IC’s block LMS algorithm. NOT USER MODIFIABLE.
IC_FRAME_ADVANCE
IC new samples frame size. This is the number of samples of new data that the IC works on every frame. 240 samples at 16kHz is 15msec. Every frame, the IC takes in 15msec of mic data and generates 15msec of interference cancelled output. NOT USER MODIFIABLE.
IC_FD_FRAME_LENGTH
Number of bins of spectrum data computed when doing a DFT of a IC_FRAME_LENGTH length time domain vector. The IC_FD_FRAME_LENGTH spectrum values represent the bins from DC to Nyquist. NOT USER MODIFIABLE.
FFT_PADDING
Extra 2 samples you need to allocate in time domain so that the full spectrum (DC to nyquist) can be stored after the in-place FFT. NOT USER MODIFIABLE.
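The relationship between the frame length, the number of spectrum bins and the FFT padding can be worked through as follows. This sketch assumes the 512-sample frame length given in the overview and that each complex bin occupies two int32_t values; the function names are hypothetical and are not library symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Bins from DC to Nyquist (inclusive) for a real DFT of n samples, n even. */
static size_t ex_fd_bins(size_t n)
{
    assert(n % 2 == 0);
    return n / 2 + 1;
}

/* Extra int32_t slots needed so that an n-sample time domain buffer
 * can hold every bin after an in-place FFT. */
static size_t ex_fft_padding(size_t n)
{
    /* Each complex bin stores a real and an imaginary int32_t. */
    return 2 * ex_fd_bins(n) - n;
}
```

For a 512-sample frame this gives 257 bins, which need 514 int32_t values, hence the 2 extra samples of padding.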
This header contains definitions for data structures used in lib_ic. It also contains the configuration sub-structures which control the operation of the interference canceller during run-time.
lib_ic is present as part of fwk_voice. Get the latest version of fwk_voice from
https://github.com/xmos/fwk_voice. The lib_ic module can be found in the modules/lib_ic directory in fwk_voice.
To use the functions in this library in an application, include ic_api.h in the application source file
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library£££modules/voice/modules/lib_vnr/doc/index.html#voice-to-noise-ratio-estimator-library
lib_vnr is a library which estimates the ratio of speech signal in noise for an input audio stream.
lib_vnr library functions use lib_xcore_math to perform DSP using low-level optimised operations, and lib_tflite_micro and lib_nn to perform inference using an optimised TensorFlow Lite model.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$Repository Structure£££modules/voice/modules/lib_vnr/doc/src/getting_started.html#repository-structure
modules/lib_vnr - The lib_vnr library directory within https://github.com/xmos/fwk_voice/.
Within lib_vnr:
api/ - Header files containing the public API for lib_vnr.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$Requirements£££modules/voice/modules/lib_vnr/doc/src/getting_started.html#requirements
lib_vnr is included as part of the fwk_voice github repository and all requirements for cloning and building fwk_voice apply. It depends on lib_xcore_math
and the xmos-ai-tools python package.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$API Structure£££modules/voice/modules/lib_vnr/doc/src/getting_started.html#api-structure
The API is split into 2 parts; feature extraction and inference. The feature extraction API processes an input audio frame to extract features that are input to the inference stage.
The inference API has functions for running inference using the VNR TensorFlow Lite model to predict the speech to noise ratio.
Both feature extraction and inference APIs have initialisation functions that are called only once at device initialisation and processing functions that are called every frame.
The performance requirement is relatively low, around 5 MIPS for initialisation and 3 MIPS for processing, so the library is supplied as a single-threaded implementation only.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$Getting and Building£££modules/voice/modules/lib_vnr/doc/src/getting_started.html#getting-and-building
The VNR estimator module is obtained as part of the parent fwk_voice repo clone. It is present in fwk_voice/modules/lib_vnr
Both feature extraction and the inference parts of lib_vnr can be compiled as static libraries. The application can link against libfwk_voice_module_lib_vnr_features.a
and/or libfwk_voice_module_lib_vnr_inference.a and add lib_vnr/api/features and/or lib_vnr/api/inference and lib_vnr/api/common as include directories.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$VNR Inference Model£££modules/voice/modules/lib_vnr/doc/src/getting_started.html#vnr-inference-model
The VNR estimator module uses a neural network model to predict the SNR of speech in noise for incoming data. The model used is a pre trained TensorFlow Lite model
that has been optimised for the XCORE architecture using the xmos-ai-tools xformer.
The optimised model is compiled as part of the VNR Inference Engine. Changing the model at runtime is not supported.
If changing to a different model, the application needs to generate the model related files and recompile.
This process is automated through the build system, as described below.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$VNR Inference Model$$$Integrating a TensorFlow Lite model into the VNR module£££modules/voice/modules/lib_vnr/doc/src/getting_started.html#integrating-a-tensorflow-lite-model-into-the-vnr-module
To integrate the new TensorFlow Lite model into the VNR module:
Put an unoptimised model into fwk_voice/modules/lib_vnr/python/model/model_output/trained_model.tflite
Rerun the build tool of your choice (make or ninja, for example)
This will use xmos-ai-tools to optimise .tflite model for xcore and generate .cpp and .h files
into fwk_voice/modules/lib_vnr/src/inference/model/. Those generated files will be picked by the build system and compiled into the VNR module.
The process described above only generates an optimised model that would run on a single core.
Any new models replacing the existing one should have the same set of input features,
input and output size, and data types as the existing model.
If changes to the features are made, the feature extraction code must be updated.
Note that the VNR is used to control the IC behavior, and so its performance may also change.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$VNR Overview£££modules/voice/modules/lib_vnr/doc/src/overview.html#vnr-overview
The VNR (Voice to Noise Ratio) estimator predicts the signal to noise ratio of a speech signal in noise, using a pre-trained neural network. The VNR neural network model outputs a value between 0 and 1, with 1 indicating the strongest speech, and 0, the weakest speech compared to noise in a frame of audio data.
The VNR module processes VNR_FRAME_ADVANCE new audio pcm samples every frame. The time domain input is transformed to frequency domain using a 512 point DFT. A MEL filterbank is then applied to compress the DFT output spectrum into fewer data points. The MEL filter outputs of VNR_PATCH_WIDTH most recent frames are normalised and fed as input features to the VNR prediction model which runs an inference over the features to output the VNR estimate value.
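The idea of keeping the MEL outputs of the most recent frames as one feature patch can be sketched with a small rolling buffer. The dimensions EX_MEL_FILTERS and EX_PATCH_WIDTH are hypothetical stand-ins for VNR_MEL_FILTERS and VNR_PATCH_WIDTH, the function name is invented for illustration, and the normalisation step mentioned above is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_MEL_FILTERS 24 /* hypothetical stand-in for VNR_MEL_FILTERS */
#define EX_PATCH_WIDTH 4  /* hypothetical stand-in for VNR_PATCH_WIDTH */

/* Drop the oldest frame of MEL outputs and append the newest one,
 * so the patch always holds the EX_PATCH_WIDTH most recent frames. */
static void ex_push_mel_frame(int32_t patch[EX_PATCH_WIDTH][EX_MEL_FILTERS],
                              const int32_t mel[EX_MEL_FILTERS])
{
    assert(patch != NULL && mel != NULL);

    /* Move rows 1..WIDTH-1 down to rows 0..WIDTH-2. */
    memmove(patch, patch + 1, (EX_PATCH_WIDTH - 1) * sizeof(patch[0]));

    /* Write the newest frame into the last row. */
    memcpy(patch[EX_PATCH_WIDTH - 1], mel, sizeof(patch[0]));
}
```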
VNR estimations can be very helpful in voice processing pipelines. Applications for VNR include intelligent power management, control of adaptive
filters for reducing noise sources and improved performance of AGC (Automatic Gain Control) blocks that provide a more natural listening experience.
The VNR API is split into 2 parts; feature extraction and inference. This is done to allow multiple sets of features to use the same inference engine.
The VNR feature extraction is further split into 2 parts; a function to form the input frame that the feature extraction can run on, and a function to do the actual feature extraction. The function for forming the input frame starts from VNR_FRAME_ADVANCE new pcm samples and creates the DFT output that is used as input to the MEL filterbank. This has been separated from the rest of the feature extraction to support cases where the VNR might be using the DFT output computed in another module for extracting features.
The pre-trained TensorFlow Lite model used for VNR inference, which has been optimised for XCORE, is compiled as part of the VNR inference static library. There’s no support for providing a new model to the inference engine at run time.
Before starting the feature extraction, the user must call vnr_input_state_init() and vnr_feature_state_init() to initialise the form input frame and feature extraction state. Before starting inference, the user must call vnr_inference_init() to initialise the inference engine.
There are no user configurable parameters within the VNR and so no arguments are required and no configuration structures need be tuned.
Once the VNR is initialised, the vnr_form_input_frame(), vnr_extract_features() and vnr_inference() functions should be called on a frame by frame basis.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$API Reference£££modules/voice/modules/lib_vnr/doc/src/reference/index.html#api-reference
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$API Reference$$$lib_vnr feature extraction API Functions£££modules/voice/modules/lib_vnr/doc/src/reference/api.html#lib-vnr-feature-extraction-api-functions
Create the input frame for processing through the VNR estimator.
This function takes in VNR_FRAME_ADVANCE new samples, combines them with previous frame’s samples to form a VNR_PROC_FRAME_LENGTH samples input frame of time domain data, and outputs the DFT spectrum of the input frame. The DFT spectrum is output in the BFP structure and data memory provided by the user.
The frequency spectrum output from this function is processed through the VNR feature extraction stage.
If sharing the DFT spectrum calculated in some other module, vnr_form_input_frame() is not needed.
input_state – [inout] pointer to the VNR input state structure
X – [out] pointer to a variable of type bfp_complex_s32_t that the user allocates. The user doesn’t need to initialise this bfp variable. After this function, X is updated to point to the DFT output spectrum and can be passed as input to the feature extraction stage.
X_data – [out] pointer to VNR_FD_FRAME_LENGTH values of type complex_s32_t that the user allocates. After this function, the DFT spectrum values are written to this array, and X->data points to X_data memory.
new_x_frame – [in] Pointer to VNR_FRAME_ADVANCE new time domain samples
This function takes in the DFT spectrum of the VNR input frame and does the feature extraction. The features are written to the feature_patch BFP structure and feature_patch_data memory provided by the user. The features output from this function are passed as input to the VNR inference engine.
Parameters:
vnr_feature_state – [inout] Pointer to the VNR feature extraction state structure
feature_patch – [out] Pointer to the bfp_s32_t structure allocated by the user. The user doesn’t need to initialise this BFP structure before passing it to this function. After this function call feature_patch will be updated and will point to the extracted features. It can then be passed to the inference stage.
feature_patch_data – [out] Pointer to the VNR_PATCH_WIDTH * VNR_MEL_FILTERS int32_t values allocated by the user. The extracted features will be written to the feature_patch_data array and the BFP structure’s feature_patch->data will point to this array.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$API Reference$$$lib_vnr inference engine API Functions£££modules/voice/modules/lib_vnr/doc/src/reference/api.html#lib-vnr-inference-engine-api-functions
group vnr_inference_api
Functions
int32_t vnr_inference_init()
Initialise the inference_engine object and load the VNR model into the inference engine.
This function calls lib_tflite_micro functions to initialise the inference engine and load the VNR model into it. It is called once at startup. The memory required for the inference engine object as well as the tensor arena size required for inference is statically allocated as global buffers in the VNR module. The VNR model is compiled as part of the VNR module.
This function invokes the inference engine. It takes in a set of features corresponding to an input frame of data and outputs the VNR prediction value. The VNR output is a single value ranging between 0 and 1 returned in float_s32_t format, with 0 being the lowest SNR and 1 being the strongest possible SNR in speech compared to noise.
Parameters:
vnr_output – [out] VNR prediction value.
features – [in] Input feature vector. Note that this is not passed as a const pointer and the feature memory is overwritten as part of the inference computation.
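Since the VNR prediction is returned in float_s32_t format, an application may want to convert it to a plain floating-point number before comparing it with a threshold. The sketch below assumes float_s32_t is a mantissa and exponent pair whose value is mant * 2^exp; it uses a local stand-in type ex_float_s32_t rather than the lib_xcore_math definition, and the helper names and threshold are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for a mantissa/exponent value: mant * 2^exp. */
typedef struct {
    int32_t mant;
    int exp;
} ex_float_s32_t;

/* Convert to double by scaling the mantissa by powers of two (exact). */
static double ex_to_double(ex_float_s32_t v)
{
    double result = (double)v.mant;
    for (int e = v.exp; e > 0; --e) {
        result *= 2.0;
    }
    for (int e = v.exp; e < 0; ++e) {
        result *= 0.5;
    }
    return result;
}

/* Returns 1 when the prediction is at or above the given threshold. */
static int ex_voice_likely(ex_float_s32_t vnr_output, double threshold)
{
    /* The prediction ranges from 0 to 1, so the threshold should too. */
    assert(threshold >= 0.0 && threshold <= 1.0);
    return ex_to_double(vnr_output) >= threshold;
}
```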
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$API Reference$$$lib_vnr #defines common to feature extraction and inference£££modules/voice/modules/lib_vnr/doc/src/reference/common_defines.html#lib-vnr-defines-common-to-feature-extraction-and-inference
group vnr_defines
Defines
VNR_MEL_FILTERS
Number of filters in the MEL filterbank used in the VNR feature extraction.
VNR_PATCH_WIDTH
Number of frames that make up a full set of features for the inference to run on.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$API Reference$$$lib_vnr feature extraction #defines and data structure definitions£££modules/voice/modules/lib_vnr/doc/src/reference/state.html#lib-vnr-feature-extraction-defines-and-data-structure-definitions
group vnr_features_state
Defines
VNR_PROC_FRAME_LENGTH
Time domain samples block length used internally in VNR DFT computation. NOT USER MODIFIABLE.
VNR_FRAME_ADVANCE
VNR new samples frame size. This is the number of samples of new data that the VNR processes every frame. 240 samples at 16kHz is 15msec. NOT USER MODIFIABLE.
VNR_FD_FRAME_LENGTH
Number of bins of spectrum data computed when doing a DFT of a VNR_PROC_FRAME_LENGTH length time domain vector. The VNR_FD_FRAME_LENGTH spectrum values represent the bins from DC to Nyquist. NOT USER MODIFIABLE.
Feature buffer containing the most recent VNR_PATCH_WIDTH frames’ MEL frequency spectrum.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$API Reference$$$lib_vnr Header Files£££modules/voice/modules/lib_vnr/doc/src/reference/header_files.html#lib-vnr-header-files
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$API Reference$$$vnr_features_api.h£££modules/voice/modules/lib_vnr/doc/src/reference/header_files.html#vnr-features-api-h
This header contains lib_vnr features extraction API functions.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$API Reference$$$vnr_inference_api.h£££modules/voice/modules/lib_vnr/doc/src/reference/header_files.html#vnr-inference-api-h
This header contains lib_vnr inference engine API functions.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$API Reference$$$vnr_defines.h£££modules/voice/modules/lib_vnr/doc/src/reference/header_files.html#vnr-defines-h
This header contains the lib_vnr public #defines that are common to both feature extraction and inference.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$API Reference$$$vnr_features_state.h£££modules/voice/modules/lib_vnr/doc/src/reference/header_files.html#vnr-features-state-h
This header contains lib_vnr feature extraction related public #defines and data structure definitions.
XCORE ® -VOICE Solutions$$$Audio Processing$$$Audio Features$$$Voice To Noise Ratio Estimator Library$$$On GitHub£££modules/voice/modules/lib_vnr/doc/src/reference/header_files.html#on-github
lib_vnr is present as part of fwk_voice. Get the latest version of fwk_voice from
https://github.com/xmos/fwk_voice. The lib_vnr module can be found in the modules/lib_vnr directory in fwk_voice.
XCORE ® -VOICE Solutions$$$Build System User Guide£££modules/rtos/doc/build_system_guide/index.html#build-system-user-guide
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Build System£££modules/rtos/doc/build_system_guide/introduction.html#build-system
This document describes the CMake-based build system used by applications based on the XMOS RTOS framework. The build system is designed so that a user does not have to be a CMake expert. However, some familiarity with CMake is helpful. You can familiarize yourself by reading the CMake Tutorial or CMake documentation. Reviewing these is optional and can be saved for later.
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Build System$$$Overview£££modules/rtos/doc/build_system_guide/introduction.html#overview
An xcore RTOS project can be seen as an integration of several modules. For example, for a FreeRTOS application that captures audio from PDM microphones and outputs it to a DAC, there could be the following modules:
Several core modules (for debug prints, etc…)
The FreeRTOS kernel and drivers
PDM microphone array driver for receiving audio samples
I2C driver for configuring the DAC
I2S driver for outputting to the DAC
Application code tying it all together
When a project is compiled, the build system will build all libraries and source files required for the application. For this to happen, your CMakeLists.txt file will need to specify:
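Among other things, this includes the libraries the application links against. As a minimal sketch for the PDM-microphone-to-DAC example above (the target name my_app is an assumption; the aliases are those listed under Targets in this guide):

```cmake
# Hypothetical application target; "my_app" is an assumed name.
# Each alias corresponds to one of the modules listed above.
target_link_libraries(my_app
    PUBLIC
        core::general              # core modules (debug prints, etc.)
        rtos::freertos             # FreeRTOS kernel and commonly used RTOS libraries
        rtos::drivers::mic_array   # PDM microphone array driver
        rtos::drivers::i2c         # I2C driver for configuring the DAC
        rtos::drivers::i2s         # I2S driver for outputting to the DAC
)
```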
It is very common for target link alias libraries, like rtos::freertos in the snippet above, to include common sets of target link libraries. The snippet above could be simplified because the rtos::freertos alias includes many commonly used drivers and peripheral IO libraries as a dependency.
Application target link libraries can be further simplified using existing bsp_configs. These provide their dependent link libraries enabling applications to simplify their target link libraries list. The snippet above could be simplified because the rtos::bsp_config::xcore_ai_explorer alias includes core::general, rtos::freertos, and all required drivers and peripheral IO libraries used by the bsp_config. More information on bsp_configs can be found in the RTOS Programming Guide.
XMOS libraries and frameworks provide several target aliases. Being aware of the Targets will simplify your application CMakeLists.txt.
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Example CMakeLists.txt£££modules/rtos/doc/build_system_guide/cmakelists.html#example-cmakelists-txt
CMake is a powerful tool that gives the developer a great deal of flexibility in how their projects are built. As a result, CMakeLists.txt files can accomplish the same function in multiple ways.
Below is an example CMakeLists.txt that shows both required and conventional commands for a basic FreeRTOS project. This example can be used as a starting point for your application, but it is recommended to copy a CMakeLists.txt from an XMOS reference design or other example application that closely resembles your application.
## Specify your application sources by globbing the src folder
file(GLOB_RECURSE APP_SOURCES src/*.c)

## Specify your application include paths
set(APP_INCLUDES src)

## Specify your compiler flags
set(APP_COMPILER_FLAGS
    -Os
    -report
    -fxscope
    -mcmodel=large
    ${CMAKE_CURRENT_SOURCE_DIR}/src/config.xscope
    ${CMAKE_CURRENT_SOURCE_DIR}/XCORE-AI-EXPLORER.xn
)

## Specify any compile definitions
set(APP_COMPILE_DEFINITIONS
    configENABLE_DEBUG_PRINTF=1
    PLATFORM_USES_TILE_0=1
    PLATFORM_USES_TILE_1=1
)

## Set your link libraries
set(APP_LINK_LIBRARIES
    rtos::bsp_config::xcore_ai_explorer
)

## Set your link options
set(APP_LINK_OPTIONS
    -report
    ${CMAKE_CURRENT_SOURCE_DIR}/XCORE-AI-EXPLORER.xn
    ${CMAKE_CURRENT_SOURCE_DIR}/src/config.xscope
)

## Create your targets

## Create the target for the portion of application code that will execute on tile[0]
set(TARGET_NAME tile0_my_app)
add_executable(${TARGET_NAME} EXCLUDE_FROM_ALL)
target_sources(${TARGET_NAME} PUBLIC ${APP_SOURCES})
target_include_directories(${TARGET_NAME} PUBLIC ${APP_INCLUDES})
target_compile_definitions(${TARGET_NAME} PUBLIC ${APP_COMPILE_DEFINITIONS} THIS_XCORE_TILE=0)
target_compile_options(${TARGET_NAME} PRIVATE ${APP_COMPILER_FLAGS})
target_link_libraries(${TARGET_NAME} PUBLIC ${APP_LINK_LIBRARIES})
target_link_options(${TARGET_NAME} PRIVATE ${APP_LINK_OPTIONS})
unset(TARGET_NAME)

## Create the target for the portion of application code that will execute on tile[1]
set(TARGET_NAME tile1_my_app)
add_executable(${TARGET_NAME} EXCLUDE_FROM_ALL)
target_sources(${TARGET_NAME} PUBLIC ${APP_SOURCES})
target_include_directories(${TARGET_NAME} PUBLIC ${APP_INCLUDES})
target_compile_definitions(${TARGET_NAME} PUBLIC ${APP_COMPILE_DEFINITIONS} THIS_XCORE_TILE=1)
target_compile_options(${TARGET_NAME} PRIVATE ${APP_COMPILER_FLAGS})
target_link_libraries(${TARGET_NAME} PUBLIC ${APP_LINK_LIBRARIES})
target_link_options(${TARGET_NAME} PRIVATE ${APP_LINK_OPTIONS})
unset(TARGET_NAME)

## Merge tile[0] and tile[1] binaries into a single binary using an XMOS CMake macro
merge_binaries(my_app tile0_my_app tile1_my_app 1)

## Optionally create run and debug targets using XMOS CMake macros
create_run_target(my_app)
create_debug_target(my_app)
For more information, see the documentation for each of the CMake commands used in the example above.
See Macros for more information on the XMOS CMake macros.
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Targets£££modules/rtos/doc/build_system_guide/targets.html#targets
The following library target aliases can be used in your application CMakeLists.txt. An example of how to add aliases to your target link libraries is shown below:
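As a minimal sketch, aliases can be collected in a list variable and passed to target_link_libraries. The target name my_app and the particular selection of aliases are illustrative assumptions; any alias from the tables in this section can be listed the same way.

```cmake
# Composite aliases cover the common cases; individual aliases add
# anything else the application needs. "my_app" is a hypothetical target.
set(APP_LINK_LIBRARIES
    core::general
    rtos::freertos
    rtos::sw_services::fatfs   # individual software service alias
)

target_link_libraries(my_app PUBLIC ${APP_LINK_LIBRARIES})
```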
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Targets$$$General£££modules/rtos/doc/build_system_guide/targets.html#general
Several aliases are provided that specify a collection of libraries with similar functions. These composite target libraries provide a concise alternative to specifying all the individual targets that are commonly required.
Composite Target Libraries

| Target | Description |
| --- | --- |
| core::general | Commonly used core libraries |
| io::general | Commonly used peripheral libraries |
| io::audio | Commonly used peripheral libraries for audio applications |
| rtos::freertos | Commonly used RTOS libraries |
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Targets$$$Core£££modules/rtos/doc/build_system_guide/targets.html#core
If you prefer, you can specify individual core library targets.
Core Libraries

| Target | Description |
| --- | --- |
| framework_core_clock_control | Clock control API |
| framework_core_utils | General utilities used by most applications |
| framework_core_legacy_compat | For compatibility with XC |
| lib_xcore_math | VPU-optimized math library |
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Targets$$$Peripherals£££modules/rtos/doc/build_system_guide/targets.html#peripherals
If you prefer, you can specify individual peripheral libraries.
Peripheral Libraries

| Target | Description |
| --- | --- |
| lib_i2c | I2C library |
| lib_spi | SPI library |
| lib_uart | UART library |
| lib_qspi_io | QSPI library |
| lib_xud | XUD USB library |
| lib_i2s | I2S library |
| lib_mic_array | Microphone Array library |
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Targets$$$RTOS£££modules/rtos/doc/build_system_guide/targets.html#rtos
Several aliases are provided that specify a collection of RTOS libraries with similar functions. These composite target libraries provide a concise alternative to specifying all the individual targets that are commonly required.
Composite RTOS Libraries

| Target | Description |
| --- | --- |
| rtos::freertos | All libraries used by most FreeRTOS applications |
| rtos::drivers::all | All RTOS Driver libraries |
| rtos::freertos_usb | All libraries to support development with TinyUSB |
| rtos::sw_services::general | Most commonly used RTOS software service libraries |
| rtos::iot | All IoT libraries |
| rtos::wifi | All WiFi libraries |
These board support libraries simplify development with a specific board.
Board Support Libraries

| Target | Description |
| --- | --- |
| rtos::bsp_config::xcore_ai_explorer | xcore.ai Explorer RTOS board support library |
If you prefer, you can specify individual RTOS driver libraries.
Individual RTOS Driver Libraries

| Target | Description |
| --- | --- |
| rtos::drivers::uart | UART RTOS driver library |
| rtos::drivers::i2c | I2C RTOS driver library |
| rtos::drivers::i2s | I2S RTOS driver library |
| rtos::drivers::spi | SPI RTOS driver library |
| rtos::drivers::qspi_io | QSPI RTOS driver library |
| rtos::drivers::mic_array | Microphone Array RTOS driver library |
| rtos::drivers::usb | USB RTOS driver library |
| rtos::drivers::dfu_image | RTOS DFU driver library |
| rtos::drivers::gpio | GPIO RTOS driver library |
| rtos::drivers::l2_cache | L2 Cache RTOS driver library |
| rtos::drivers::clock_control | Clock control RTOS driver library |
| rtos::drivers::trace | Trace RTOS driver library |
| rtos::drivers::swmem | SwMem RTOS driver library |
| rtos::drivers::wifi | WiFi RTOS driver library |
| rtos::drivers::intertile | Intertile RTOS driver library |
| rtos::drivers::rpc | Remote procedure call RTOS driver library |
If you prefer, you can specify individual software service libraries.
Individual Software Service Libraries

| Target | Description |
| --- | --- |
| rtos::sw_services::fatfs | FatFS library |
| rtos::sw_services::usb | USB library |
| rtos::sw_services::device_control | Device control library |
| rtos::sw_services::usb_device_control | USB device control library |
| rtos::sw_services::wifi_manager | WiFi manager library |
| rtos::sw_services::tls_support | TLS library |
| rtos::sw_services::dhcp | DHCP library |
| rtos::sw_services::json | JSON library |
| rtos::sw_services::http | HTTP library |
| rtos::sw_services::sntpd | SNTP daemon library |
| rtos::sw_services::mqtt | MQTT library |
The following libraries for building host applications are also provided by the SDK.
Host (x86) Libraries

| Target | Description |
| --- | --- |
| rtos::sw_services::device_control_host_usb | Host USB device control library |
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Macros£££modules/rtos/doc/build_system_guide/macros.html#macros
Several CMake macros and functions are provided to make building for XCORE easier. These macros are located in the file tools/cmake_utils/xmos_macros.cmake and are documented below.
To see what XTC Tools commands the macros and functions are running, add VERBOSE=1 to your build command line. For example:
make run_my_target VERBOSE=1
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Macros$$$Common Macros£££modules/rtos/doc/build_system_guide/macros.html#common-macros
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Macros$$$Common Macros$$$merge_binaries£££modules/rtos/doc/build_system_guide/macros.html#merge-binaries
merge_binaries combines multiple xcore applications into one by extracting a tile elf and recombining it into another binary. This is used in multitile RTOS applications to enable building unique instances of the FreeRTOS kernel and task sets on a per tile basis.
This macro takes an output target name, a base target, a target containing a tile to merge, and the tile number to merge.
This macro can be called in two ways. The 4 argument version is for when the application has only 1 node and therefore only the tile needs to be specified.
# create target OUT by replacing tile number 0 in BASE with tile 0 in OTHER
merge_binaries(${OUT} ${BASE} ${OTHER} 0)
The 5 argument version is for multi-node applications. IMPORTANT: the node number is not the “Node Id” from the xn file, but rather the index of the node in the JTAGChain, which is defined in the xn file.
# create target OUT by replacing tile 1 on node 0 in BASE with tile 1 on
# node 0 in OTHER
merge_binaries(${OUT} ${BASE} ${OTHER} 0 1)
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Macros$$$Common Macros$$$create_run_target£££modules/rtos/doc/build_system_guide/macros.html#create-run-target
create_run_target creates a run target for <TARGET_NAME> with xscope output.
create_run_target(<TARGET_NAME>)
create_run_target allows you to run a binary with the following command instead of invoking xrun --xscope.
make run_my_target
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Macros$$$Common Macros$$$create_debug_target£££modules/rtos/doc/build_system_guide/macros.html#create-debug-target
create_debug_target creates a debug target for <TARGET_NAME>.
create_debug_target(<TARGET_NAME>)
create_debug_target allows you to debug a binary with the following command instead of invoking xgdb. This target implicitly sets up the xscope debug interface as well.
make debug_my_target
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Macros$$$Common Macros$$$create_filesystem_target£££modules/rtos/doc/build_system_guide/macros.html#create-filesystem-target
create_filesystem_target creates a filesystem file for <TARGET_NAME> using the files in the <FILESYSTEM_INPUT_DIR> directory. <IMAGE_SIZE> specifies the size (in bytes) of the filesystem. The filesystem output filename will end in _fat.fs. Optional argument <OPTIONAL_DEPENDS_TARGETS> can be used to specify other dependency targets, such as filesystem generators.
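A minimal sketch of a call, assuming the arguments are passed positionally in the order they are named above (target name, input directory, image size, then any optional dependencies). The target name, directory, and 1 MiB size are illustrative only; the exact signature should be confirmed in xmos_macros.cmake.

```cmake
# Assumed positional order: <TARGET_NAME> <FILESYSTEM_INPUT_DIR> <IMAGE_SIZE>
# All values below are hypothetical.
create_filesystem_target(
    my_app                                   # target the filesystem belongs to
    ${CMAKE_CURRENT_LIST_DIR}/filesystem     # directory whose files are packed
    1048576                                  # image size in bytes (1 MiB)
)
# Per the description above, the output filename ends in _fat.fs.
```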
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Macros$$$Common Macros$$$create_data_partition_directory£££modules/rtos/doc/build_system_guide/macros.html#create-data-partition-directory
create_data_partition_directory creates a directory populated with all components related to the data partition. The name of the data partition output folder will end in _data_partition.
The optional argument <OPTIONAL_DEPENDS_TARGETS> can be used to specify other dependency targets.
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Macros$$$Common Macros$$$create_flash_app_target£££modules/rtos/doc/build_system_guide/macros.html#create-flash-app-target
create_flash_app_target creates a flash target for <TARGET_NAME> with optional arguments <BOOT_PARTITION_SIZE>, <DATA_PARTITION_CONTENTS>, and <OPTIONAL_DEPENDS_TARGETS>. <BOOT_PARTITION_SIZE> specifies the size in bytes of the boot partition. <DATA_PARTITION_CONTENTS> specifies the optional binary contents of the data partition. <OPTIONAL_DEPENDS_TARGETS> specifies CMake targets that should be dependencies of the resulting create_flash_app_target target. This may be used to create recipes that generate the data partition contents.
create_flash_app_target allows you to flash a factory image binary and optional data partition with the following command instead of invoking xflash.
make flash_app_my_target
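A minimal sketch of a call, assuming the arguments are positional and follow the order in which they are described above. The target name, partition size, and data file path are hypothetical, and the optional dependency argument is omitted here.

```cmake
# Assumed positional order:
#   <TARGET_NAME> <BOOT_PARTITION_SIZE> <DATA_PARTITION_CONTENTS> <OPTIONAL_DEPENDS_TARGETS>
# The size and file path are hypothetical values.
set(MY_APP_BOOT_PARTITION_SIZE 0x100000)   # 1 MiB boot partition, in bytes
set(MY_APP_DATA_PARTITION ${CMAKE_CURRENT_LIST_DIR}/data/my_app_data.bin)

create_flash_app_target(
    my_app
    ${MY_APP_BOOT_PARTITION_SIZE}
    ${MY_APP_DATA_PARTITION}
)
```

Following the naming pattern of the command above, the resulting target would then be invoked as make flash_app_my_app.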
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Macros$$$Less Common Macros£££modules/rtos/doc/build_system_guide/macros.html#less-common-macros
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Macros$$$Less Common Macros$$$create_install_target£££modules/rtos/doc/build_system_guide/macros.html#create-install-target
create_install_target creates an install target for <TARGET_NAME>.
create_install_target(<TARGET_NAME>)
create_install_target will copy <TARGET_NAME>.xe to the ${PROJECT_SOURCE_DIR}/dist directory.
make install_my_target
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Macros$$$Less Common Macros$$$create_run_xscope_to_file_target£££modules/rtos/doc/build_system_guide/macros.html#create-run-xscope-to-file-target
create_run_xscope_to_file_target creates a run target for <TARGET_NAME>. <XSCOPE_FILE> specifies the file to save to (no extension).
create_run_xscope_to_file_target allows you to run a binary with the following command instead of invoking xrun --xscope-file.
make run_xscope_to_file_my_target
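A minimal sketch of a call, assuming the two arguments are passed positionally as <TARGET_NAME> followed by <XSCOPE_FILE>. Both names below are hypothetical.

```cmake
# Assumed positional order: <TARGET_NAME> <XSCOPE_FILE>
# "my_app" and "my_app_trace" are hypothetical; the file is named without an extension.
create_run_xscope_to_file_target(my_app my_app_trace)
```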
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Macros$$$Less Common Macros$$$create_upgrade_img_target£££modules/rtos/doc/build_system_guide/macros.html#create-upgrade-img-target
create_upgrade_img_target creates an xflash image upgrade target for a provided binary, for use in DFU.
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Macros$$$Less Common Macros$$$create_erase_all_target£££modules/rtos/doc/build_system_guide/macros.html#create-erase-all-target
create_erase_all_target creates an xflash erase-all target for the target XN file <TARGET_FILEPATH>. The full file path to the XN file must be specified.
create_erase_all_target allows you to erase flash with the following command instead of invoking xflash.
make erase_all_my_target
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Macros$$$Less Common Macros$$$query_tools_version£££modules/rtos/doc/build_system_guide/macros.html#query-tools-version
query_tools_version populates the following CMake variables:
XMOS Ltd. is the owner or licensee of this design, code, or Information (collectively, the “Information”) and is providing it to you “AS IS” with no warranty of any kind, express or implied and shall have no liability in relation to its use. XMOS Ltd makes no representation that the Information, or any particular implementation thereof, is or will be free from any claims of infringement and again, shall have no liability in relation to any such claims.
XMOS, XCORE, VocalFusion and the XMOS logo are registered trademarks of XMOS Ltd. in the United Kingdom and other countries and may not be used without written permission. Company and product names mentioned in this document are the trademarks or registered trademarks of their respective owners.
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Licenses£££modules/rtos/doc/shared/legal.html#licenses
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Licenses$$$XMOS£££modules/rtos/doc/shared/legal.html#xmos
All original source code is licensed under the XMOS License.
XCORE ® -VOICE Solutions$$$Build System User Guide$$$Licenses$$$Third-Party£££modules/rtos/doc/shared/legal.html#third-party
Additional third party code is included under the following copyrights and licenses:
The xcore platform provides a range of powerful, flexible and economical crossover processors for use in wide-ranging applications. The xcore platform provides:
At the heart of the platform, the Architecture & Hardware Guide describes the multicore processors. Multiple xcore processors can themselves be “networked” together with seamless communications.
The Programming Guide describes how logical cores of an xcore processor can act independently to behave like highly responsive hardware peripherals, or can work as a team to apply all available CPU cycles onto a single compute task.
The xcore processors are accompanied by the XTC Tools. As well as providing a powerful toolchain for application development, the toolkit assists with application deployment and upgrade.
Traditionally, xcore multi-core processors have been programmed using the XC language. The XC language allows the programmer to statically place tasks on the available hardware cores and wire them together with channels to provide inter-process communication. The XC language also exposes “events,” which are unique to the xcore architecture and are a useful alternative to interrupts.
Using the XC language, it is possible to write dedicated application software with deterministic timing and very low latency between I/O and tasks.
While XC elegantly enables the intrinsic, unique capabilities of the xcore architecture, there often needs to be higher level application type software running alongside it. The programming model that makes the lower level deterministic software possible may not be best suited for many higher level parts of an application that do not require deterministic timing. Where strict real-time execution is not required, higher level abstractions can be used to manage finite hardware resources, and provide a more familiar programming environment.
A symmetric multiprocessing (SMP) real time operating system (RTOS) can be used to simplify xcore application designs, as well as to preserve the hard real-time benefits provided by the xcore architecture for the lower level software functions that require it.
This document assumes familiarity with real time operating systems in general. Familiarity with FreeRTOS specifically should not be required, but will be helpful. For up-to-date documentation on FreeRTOS, see the documentation section on the FreeRTOS website.
To support this new programming model for xcore, XMOS has extended the popular and free FreeRTOS kernel to support SMP. This allows the kernel’s scheduler to be started on any number of available xcore logical cores per tile, leaving the remaining cores free to support other program elements that combine to create complete systems. Once the scheduler is started, FreeRTOS threads are placed on cores dynamically at runtime, rather than statically at compile time. All the usual FreeRTOS rules for thread scheduling are followed, except that rather than only running the single highest priority thread that is ready at any given time, multiple threads may run simultaneously. The threads chosen to run are always the highest priority threads that are ready. When more threads of a single priority are ready to run than there are cores available, they are scheduled in a round robin fashion. Dynamic scheduling allows FreeRTOS to optimize physical core usage based on priority and availability at runtime, opening up the potential for using tile-wide MIPS more efficiently than could be manually specified in a static compile time setting.
One of xcore’s primary strengths is its guarantee of deterministic behavior and timing. RTOS threads can also benefit from this determinism provided by the xcore architecture. An RTOS thread with interrupts disabled and a high enough priority behaves just like a bare-metal thread. An SMP RTOS kernel does not need to preempt a high priority thread because it has many other cores on which to schedule lower priority threads. Using an SMP RTOS allows developers to concentrate on the specific requirements of their application without worrying about what effect they might have on non-preemptable thread response times. Furthermore, modification of the program in the future is much easier because the developer does not have to worry about affecting existing responsiveness with changes in unrelated areas. The non-preemptable threads will not be affected by adding lower-priority functionality.
Another xcore strength is its performance. xcore.ai provides lightning fast general purpose compute, AI acceleration, powerful DSP and instantaneous I/O control. RTOS threads can also benefit from the performance provided by the xcore architecture, allowing an application developer to dynamically shift performance usage from one application feature to another.
The standard FreeRTOS kernel supports dynamic task priorities, while the FreeRTOS-SMP kernel adds the following additional APIs:
vTaskCoreAffinitySet
vTaskCoreAffinityGet
vTaskPreemptionDisable
vTaskPreemptionEnable
Together, these APIs enable a developer to take full advantage of xcore’s performance.
Some additional configuration options are also available to the FreeRTOS-SMP Kernel:
To further leverage the xcore hardware and the FreeRTOS programming model, XMOS provides support for asymmetric multiprocessing (AMP) per tile. Each XMOS chip contains at least two tiles, each of which has its own set of logical xcore cores, IO, memory space, and more. XMOS provides a build method and a variety of software drivers that allow an application to be created as an AMP system containing multiple SMP FreeRTOS kernels.
To help ease development of xcore applications using an SMP RTOS, XMOS provides several SMP RTOS compatible drivers. These include, but are not necessarily limited to:
Documentation on each of these drivers can be found under the RTOS Drivers section in the RTOS framework documentation pages.
It is worth noting that most of these drivers utilize a lightweight RTOS abstraction layer, meaning that they are not dependent on FreeRTOS. Conceivably they should work on any SMP RTOS, provided an abstraction layer for it is provided. This abstraction layer is found under the path modules/rtos/modules/osal. At the moment the only available SMP RTOS for xcore is the XMOS SMP FreeRTOS, but more may become available in the future.
The RTOS framework also includes some higher level RTOS compatible software services, some of which call the aforementioned drivers. These include, but are not necessarily limited to:
DHCP server
FAT filesystem
HTTP parser
JSON parser
MQTT client
SNTP client
TLS
USB stack
WiFi connection manager
Documentation on several software services can be found under the RTOS Services section in the RTOS framework documentation pages.
This document is intended to help you start your first FreeRTOS application on xcore. We assume you have read FreeRTOS Application Programming and that you are familiar with FreeRTOS.
A fully functional example application can be found in the RTOS framework under the path examples/freertos/explorer_board. This application is a reference for how to use an RTOS driver or software service, and serves as an example of how to structure an SMP RTOS application for xcore. Additional code to initialize the SoC platform for this example is provided by a board support configuration library under modules/rtos/modules/board_support/XCORE-AI-EXPLORER_2V0/platform.
This example application runs two instances of SMP FreeRTOS, one on each of the processor’s two tiles. Because each tile has its own memory which is not shared between them, this can be viewed as a single asymmetric multiprocessing (AMP) system that comprises two SMP systems. A FreeRTOS thread that is created on one tile will never be scheduled to run on the other tile. Similarly, an RTOS object that is created on a tile, such as a queue, can only be accessed by threads and ISRs that run on that tile and never by code running on the other tile.
That said, the example application is programmed and built as a single coherent application, which will be familiar to programmers who have previously programmed for the xcore in the XC programming language. Data that must be shared between threads running on different tiles is sent via a channel using the RTOS intertile driver, which under the hood uses a streaming channel between the tiles.
Most of the I/O interface drivers in fact provide a mechanism to share driver instances between tiles that utilizes this intertile driver. For those familiar with XC programming, this can be viewed as a C alternative to XC interfaces.
For example, a SPI interface might be available on tile 0. Normally, initialization code that runs on tile 0 sets this interface up and then starts the driver. Without any further initialization, code that runs on tile 1 will be unable to access this interface directly, due both to not having direct access to tile 0’s memory, as well as not having direct access to tile 0’s ports. The drivers, however, provide some additional initialization functions that can be used by the application to share the instance on tile 0 with tile 1. After this initialization is done, code running on tile 1 may use the instance with the same driver API as tile 0, almost as if it was actually running on tile 0.
The example application referenced above, as well as the RTOS driver documentation, should be consulted to see exactly how to initialize and share driver instances. Additionally, not all IO is capable of being shared between tiles directly through the driver API due to timing constraints.
The RTOS framework provides the ON_TILE(t) preprocessor macro. This macro may be used by applications to ensure certain code is included only on a specific tile at compile time. In the example application, there is a single task that is created on both tiles that starts the drivers and creates the remaining application tasks. While this function is written as a single function, various parts are inside #if ON_TILE() blocks. For example, consider the following code snippet found inside the i2c_init() function:
When this function is compiled for tile I2C_TILE_NO, only the first block is included. When it is compiled for the other tile, only the second block is included. When the application is run, tile I2C_TILE_NO performs the initialization of the I2C master driver host, while the other tile initializes the I2C master driver client. Because the I2C driver instance is shared between the two tiles, I2C_TILE_NO may in fact be set to either zero or one, which demonstrates how driver instances may be shared between tiles.
The RTOS framework provides a single XC file that provides the main() function. This provided main() function calls main_tile0() through main_tile3(), depending on the number of tiles that the application requires and the number of tiles provided by the target xcore processor. The application must provide each of these tile entry point functions. Each one is provided with up to three channel ends that are connected to each of the other tiles.
The example application provides both main_tile0() and main_tile1(). Each one calls a common initialization function that initializes all the drivers for the interfaces specific to its tile. These functions also call the initialization functions to share these driver instances between the tiles. These initialization functions are found in the platform/platform_init.c source file.
Each tile then creates the startup_task() task and starts the FreeRTOS scheduler. The startup_task() completes the driver instance sharing and then starts all of the driver instances. The driver startup functions are found in the platform/platform_start.c source file.
Consult the RTOS driver documentation for the details on what exactly each of the RTOS API functions called by this application does.
XCORE ® -VOICE Solutions$$$RTOS Programming Guide$$$Tutorials$$$RTOS Application Design$$$Board Support Configurations£££modules/rtos/doc/programming_guide/tutorials/application_design.html#board-support-configurations
xcore leverages its architecture to provide a flexible chip in which many peripherals typically implemented in silicon are instead implemented in software. This allows a chip to be reconfigured to provide the specific IO required for a given application, resulting in a low-cost yet silicon-efficient solution. Board support configurations (bsp_configs) describe the hardware IO that exists on a given board. The bsp_configs provide the application programmer with an API to initialize and start the hardware configuration, as well as the supported RTOS driver contexts. The programming model in this FreeRTOS architecture is:
.xn files provide the mapping of ports, pins, and links
bsp_configs specify, setup, and start hardware IO and provide the application with RTOS driver contexts
applications use the bsp_config init/start code as well as RTOS driver contexts, similar to conventional microcontroller programming models.
To support any generic bsp_config, applications should call platform_init() before starting the scheduler, and then platform_start() after the scheduler is running and before any RTOS drivers are used.
The bsp_configs provided with the RTOS framework in modules/rtos/modules/bsp_config are an excellent starting point. They provide the most common peripheral drivers that are supported by the boards that support RTOS framework based applications. For advanced users, it is recommended that you copy one of these bsp_configs into your application project and customize it as needed.
custom_config.cmake provides the CMake target of the configuration. This target should link the required RTOS framework libraries to support the configuration it defines.
custom_config_xn_file.xn provides various hardware parameters including but not limited to the chip package, IO mapping, and network information.
platform_conf.h provides default configuration of all header defined configuration macros. These may be overridden by compile definitions or application headers.
driver_instances.h provides the declaration of all RTOS drivers in the configuration. It may define XCORE hardware resources, such as ports and clockblocks. It may also define tile placements.
driver_instances.c provides the definition of all RTOS drivers in the configuration.
platform_init.h provides the declaration of platform_init(chanend_t other_tile_c) and platform_start(void)
platform_init.c provides the initialization of all drivers defined in the configuration through the definition of platform_init(chanend_t other_tile_c). This code is run before the scheduler is started and therefore will not be able to access all RTOS driver functionalities nor kernel objects.
platform_start.c provides the starting of all drivers defined in the configuration through the definition of platform_start(void). It may also perform any initialization setup, such as configuring the app_pll or setting up an on board DAC. This code is run once the kernel is running and is therefore subject to preemption and other dynamic scheduling SMP programming considerations.
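As a rough illustration of how these files fit together, a custom_config.cmake might define an interface library and expose it through an alias. This is a sketch under stated assumptions, not the layout used by the provided bsp_configs: the target name, alias name, the platform/ subdirectory, and the selection of linked libraries are all hypothetical.

```cmake
## custom_config.cmake: hypothetical bsp_config for a board called "my_board"

# Interface library carrying the sources, include paths, and dependencies
add_library(my_app_bsp_config_my_board INTERFACE)

target_sources(my_app_bsp_config_my_board
    INTERFACE
        ${CMAKE_CURRENT_LIST_DIR}/platform/driver_instances.c
        ${CMAKE_CURRENT_LIST_DIR}/platform/platform_init.c
        ${CMAKE_CURRENT_LIST_DIR}/platform/platform_start.c
)

# Makes platform_conf.h, driver_instances.h, and platform_init.h visible
target_include_directories(my_app_bsp_config_my_board
    INTERFACE
        ${CMAKE_CURRENT_LIST_DIR}/platform
)

# Link the RTOS framework libraries this configuration requires
# (the selection here is illustrative)
target_link_libraries(my_app_bsp_config_my_board
    INTERFACE
        core::general
        rtos::freertos
        rtos::drivers::i2c
        rtos::drivers::i2s
        rtos::drivers::mic_array
)

# Namespaced alias for applications to list in their link libraries
add_library(my_app::bsp_config::my_board ALIAS my_app_bsp_config_my_board)
```

An application would then list my_app::bsp_config::my_board in its link libraries in the same way that rtos::bsp_config::xcore_ai_explorer is used.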
One of these features if the -report option, which will Display a summary of resource usage. One of the outputs of this report is memory usage, split into the stack, code, and data requirements of the program. Unlike most XC applications, FreeRTOS makes heavy use of dynamic memory allocation. The FreeRTOS heap will appear as Data in the XTC Tools report. The heap size is determined by the compile time definition configTOTAL_HEAP_SIZE, which can be found in an application’s FreeRTOSConfig.h.
For AMP SMP FreeRTOS builds, which are created using the CMake macro merge_binaries(), there are multiple application builds, one per tile, which are then combined. While building a given AMP application, the console output will contain the build report of each individual tile.
As an example, consider building the example_freertos_explorer_board target.
Because the tile 1 portion of the tile1 target build replaces the tile 1 portion in the tile0 target build.
The XTC Tools also provide a method to examine the resource usage of a binary post build. This method will only work if used on the intermediate binaries.
Note: Because the resulting example_freertos_explorer_board.xe binary was created by merging into tile0_example_freertos_explorer_board.xe, the results of xobjdump --resources example_freertos_explorer_board.xe will be exactly the same as those of xobjdump --resources tile0_example_freertos_explorer_board.xe and will not account for the actual tile 1 requirements.
Applications using the RTOS Framework are built using CMake. The RTOS framework provides many libraries, drivers and software services, all of which can be included by the application’s CMakeLists.txt file. The application’s CMakeLists can specify precisely which drivers and software services within the SDK should be included through the use of various CMake target aliases.
XCORE ® -VOICE Solutions$$$RTOS Programming Guide$$$Tutorials$$$Board Support Configurations£££modules/rtos/doc/programming_guide/tutorials/bsp_config.html#board-support-configurations
xcore leverages its architecture to provide a flexible chip where many peripherals that are typically implemented in silicon are instead implemented in software. This allows a chip to be reconfigured to provide the specific IO required for a given application, resulting in a low-cost and silicon-efficient solution. Board support configurations (bsp_configs) describe the hardware IO that exists on a given board. The bsp_configs provide the application programmer with an API to initialize and start the hardware configuration, as well as the supported RTOS driver contexts. The programming model in this FreeRTOS architecture is:
.xn files provide the mapping of ports, pins, and links
bsp_configs specify, setup, and start hardware IO and provide the application with RTOS driver contexts
applications use the bsp_config init/start code as well as RTOS driver contexts, similar to conventional microcontroller programming models.
To support any generic bsp_config, applications should call platform_init() before starting the scheduler, and then platform_start() after the scheduler is running and before any RTOS drivers are used.
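The call order described above can be sketched on a host machine. This is only an illustration: the stub bodies, the step counters, and start_scheduler_stub() are hypothetical stand-ins for the real bsp_config functions and for starting the FreeRTOS scheduler, and chanend_t is redefined here so the sketch builds without the XTC headers.

```c
#include <stddef.h>

/* Hypothetical stand-ins so the sketch builds on a host. In a real
 * application these come from platform_init.h and FreeRTOS. */
typedef unsigned chanend_t;

static int step = 0;
static int init_step, start_step, driver_use_step;

static void platform_init(chanend_t other_tile_c) { (void) other_tile_c; init_step = ++step; }
static void platform_start(void) { start_step = ++step; }
static void use_rtos_drivers(void) { driver_use_step = ++step; }

/* Stand-in for the first RTOS task: the scheduler is running here,
 * so platform_start() is called before any driver is used. */
static void startup_task(void *arg)
{
    (void) arg;
    platform_start();
    use_rtos_drivers();
}

/* Stand-in for starting the scheduler: simply runs the startup task once. */
static void start_scheduler_stub(void) { startup_task(NULL); }

/* Per-tile entry point: initialize first, then start the scheduler. */
void app_main_tile(chanend_t other_tile_c)
{
    platform_init(other_tile_c);
    start_scheduler_stub();
}
```

Calling app_main_tile() records platform_init() as step 1, platform_start() as step 2, and the first driver use as step 3.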
The bsp_configs provided with the RTOS framework in modules/rtos/modules/bsp_config are an excellent starting point. They provide the most common peripheral drivers that are supported by the boards that support RTOS framework based applications. For advanced users, it is recommended that you copy one of these bsp_configs into your application project and customize it as needed.
XCORE ® -VOICE Solutions$$$RTOS Programming Guide$$$Tutorials$$$Board Support Configurations$$$Creating Custom bsp_configs£££modules/rtos/doc/programming_guide/tutorials/bsp_config.html#creating-custom-bsp-configs
To enable hardware portability, a minimal bsp_config should contain the following:
custom_config.cmake provides the CMake target of the configuration. This target should link the required RTOS framework libraries to support the configuration it defines.
custom_config_xn_file.xn provides various hardware parameters including but not limited to the chip package, IO mapping, and network information.
platform_conf.h provides default configuration of all header defined configuration macros. These may be overridden by compile definitions or application headers.
driver_instances.h provides the declaration of all RTOS drivers in the configuration. It may define XCORE hardware resources, such as ports and clockblocks. It may also define tile placements.
driver_instances.c provides the definition of all RTOS drivers in the configuration.
platform_init.h provides the declaration of platform_init(chanend_t other_tile_c) and platform_start(void).
platform_init.c provides the initialization of all drivers defined in the configuration through the definition of platform_init(chanend_t other_tile_c). This code is run before the scheduler is started and therefore will not be able to access all RTOS driver functionalities nor kernel objects.
platform_start.c provides the starting of all drivers defined in the configuration through the definition of platform_start(void). It may also perform any initialization setup, such as configuring the app_pll or setting up an on board DAC. This code is run once the kernel is running and is therefore subject to preemption and other dynamic scheduling SMP programming considerations.
This driver provides the application with the boot partition and data partition layout of the flash used by the second stage bootloader. The driver provides a subset of the functionality of libquadflash enabling the application to use any transport method and the RTOS qspi flash driver to read the factory image, read/write a single upgrade image, and read/write the data partition.
    unsigned addr = rtos_dfu_image_get_factory_addr(dfu_image_ctx);
    unsigned size = rtos_dfu_image_get_factory_size(dfu_image_ctx);
    unsigned char *buf = pvPortMalloc(sizeof(unsigned char) * size);
    rtos_qspi_flash_read(qspi_flash_ctx, (uint8_t *)buf, addr, size);
    // buf now contains the factory image contents
It is advised to perform this operation in blocks rather than full image size to reduce memory usage. Once the buffer is populated from flash, it can be sent over the desired transport method, such as USB, I2C, etc.
    unsigned addr = rtos_dfu_image_get_upgrade_addr(dfu_image_ctx);
    unsigned size = rtos_dfu_image_get_upgrade_size(dfu_image_ctx);
    unsigned char *buf = pvPortMalloc(sizeof(unsigned char) * size);
    rtos_qspi_flash_read(qspi_flash_ctx, (uint8_t *)buf, addr, size);
    // buf now contains the upgrade image contents
It is advised to perform this operation in blocks rather than full image size to reduce memory usage. Once the buffer is populated from flash, it can be sent over the desired transport method, such as USB, I2C, etc.
    // Assuming buf contains the image data
    // and size contains the size in bytes
    unsigned addr = rtos_dfu_image_get_upgrade_addr(dfu_image_ctx);
    unsigned data_partition_base_addr = rtos_dfu_image_get_data_partition_addr(dfu_image_ctx);
    unsigned bytes_avail = data_partition_base_addr - addr;
    size_t sector_size = rtos_qspi_flash_sector_size_get(qspi_flash_ctx);

    if (size < bytes_avail) {
        unsigned char *tmp_buf = pvPortMalloc(sizeof(unsigned char) * sector_size);
        unsigned cur_offset = 0; // offset into buf, relative to addr

        while (cur_offset < size) {
            // Copy at most one sector per iteration
            unsigned length = (size - cur_offset) >= sector_size ? sector_size : (size - cur_offset);

            rtos_qspi_flash_lock(qspi_flash_ctx);
            {
                // Read-modify-write so that bytes beyond length in the last sector are preserved
                rtos_qspi_flash_read(qspi_flash_ctx, (uint8_t *)tmp_buf, addr + cur_offset, sector_size);
                memcpy(tmp_buf, buf + cur_offset, length);
                rtos_qspi_flash_erase(qspi_flash_ctx, addr + cur_offset, sector_size);
                rtos_qspi_flash_write(qspi_flash_ctx, (uint8_t *)tmp_buf, addr + cur_offset, sector_size);
            }
            rtos_qspi_flash_unlock(qspi_flash_ctx);
            cur_offset += length;
        }
        vPortFree(tmp_buf);
    } else {
        rtos_printf("Insufficient space for upgrade image\n");
    }
It is advised to perform this operation in blocks rather than full image size to reduce memory usage. The buffer can be populated over the desired transport method, such as USB, I2C, etc.
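The sector-by-sector read-modify-write pattern used when writing an image can be exercised on a host with a small in-memory model. In this sketch the flash array, SECTOR_SIZE, and the flash_read(), flash_erase(), and flash_write() helpers are hypothetical stand-ins for the QSPI flash and the RTOS qspi flash driver calls, and locking is omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 16u                  /* hypothetical; a real driver reports this at run time */
#define FLASH_SIZE  (8u * SECTOR_SIZE)

static uint8_t flash[FLASH_SIZE];        /* stand-in for the QSPI flash contents */

static void flash_read(uint8_t *dst, unsigned addr, size_t len) { memcpy(dst, &flash[addr], len); }
static void flash_erase(unsigned addr, size_t len) { memset(&flash[addr], 0xFF, len); }
static void flash_write(const uint8_t *src, unsigned addr, size_t len) { memcpy(&flash[addr], src, len); }

/* Writes size bytes from buf to the sector-aligned address addr, one sector
 * at a time. Returns 0 on success, or -1 if the image would reach past
 * limit_addr. */
int write_image(unsigned addr, unsigned limit_addr, const uint8_t *buf, size_t size)
{
    uint8_t sector[SECTOR_SIZE];
    size_t offset = 0;

    if (addr > limit_addr || size > limit_addr - addr) {
        return -1;
    }

    while (offset < size) {
        size_t len = size - offset;
        if (len > SECTOR_SIZE) {
            len = SECTOR_SIZE;
        }
        /* Read the whole sector first so bytes beyond len survive the erase */
        flash_read(sector, addr + (unsigned) offset, SECTOR_SIZE);
        memcpy(sector, buf + offset, len);
        flash_erase(addr + (unsigned) offset, SECTOR_SIZE);
        flash_write(sector, addr + (unsigned) offset, SECTOR_SIZE);
        offset += len;
    }
    return 0;
}
```

Only one sector-sized buffer is needed regardless of the image size, which is why the block-wise approach keeps memory usage low.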
XCORE ® -VOICE Solutions$$$RTOS Programming Guide$$$Tutorials$$$RTOS Application DFU$$$Reading the Data Partition Image£££modules/rtos/doc/programming_guide/tutorials/application_dfu_usage.html#reading-the-data-partition-image
To read back the data partition image:
    unsigned addr = rtos_dfu_image_get_data_partition_addr(dfu_image_ctx);
    // The data partition runs from addr to the end of the flash
    unsigned size = rtos_qspi_flash_size_get(qspi_flash_ctx) - addr;
    unsigned char *buf = pvPortMalloc(sizeof(unsigned char) * size);
    rtos_qspi_flash_read(qspi_flash_ctx, (uint8_t *)buf, addr, size);
    // buf now contains the data partition image contents
It is advised to perform this operation in blocks rather than full image size to reduce memory usage. The data partition will likely be too large to read into SRAM in a single read operation. Once the buffer is populated from flash, it can be sent over the desired transport method, such as USB, I2C, etc.
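A block-wise read can be sketched as follows. This is a host-side illustration only: mock_flash and mock_flash_read() are hypothetical stand-ins for the flash and for the RTOS qspi flash read call, and the send callback stands in for whichever transport is used.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_FLASH_SIZE 4096u
static uint8_t mock_flash[MOCK_FLASH_SIZE];   /* hypothetical stand-in for the QSPI flash */

/* Stand-in for the driver read call, without the driver context argument */
static void mock_flash_read(uint8_t *dst, unsigned addr, size_t len)
{
    memcpy(dst, &mock_flash[addr], len);
}

typedef void (*send_fn_t)(const uint8_t *data, size_t len, void *arg);

/* Example transport callback: only counts the bytes it is given */
void count_bytes(const uint8_t *data, size_t len, void *arg)
{
    (void) data;
    *(size_t *) arg += len;
}

/* Reads [addr, addr + size) in chunks of at most block_size bytes and passes
 * each chunk to send(). Returns the number of chunks sent. */
size_t read_in_blocks(unsigned addr, size_t size, uint8_t *block, size_t block_size,
                      send_fn_t send, void *arg)
{
    size_t chunks = 0;
    size_t offset = 0;

    if (block_size == 0) {
        return 0;
    }

    while (offset < size) {
        size_t len = size - offset;
        if (len > block_size) {
            len = block_size;
        }
        mock_flash_read(block, addr + (unsigned) offset, len);
        send(block, len, arg);
        offset += len;
        chunks++;
    }
    return chunks;
}
```

With a 256 byte block, a 1000 byte region is delivered in four chunks, and only the 256 byte block buffer has to be held in SRAM.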
XCORE ® -VOICE Solutions$$$RTOS Programming Guide$$$Tutorials$$$RTOS Application DFU$$$Writing the Data Partition Image£££modules/rtos/doc/programming_guide/tutorials/application_dfu_usage.html#writing-the-data-partition-image
To overwrite the current data partition image:
    // Assuming buf contains the image data
    // and size contains the size in bytes
    unsigned addr = rtos_dfu_image_get_data_partition_addr(dfu_image_ctx);
    unsigned end_addr = rtos_qspi_flash_size_get(qspi_flash_ctx);
    unsigned bytes_avail = end_addr - addr;
    size_t sector_size = rtos_qspi_flash_sector_size_get(qspi_flash_ctx);

    if (size < bytes_avail) {
        unsigned char *tmp_buf = pvPortMalloc(sizeof(unsigned char) * sector_size);
        unsigned cur_offset = 0; // offset into buf, relative to addr

        while (cur_offset < size) {
            // Copy at most one sector per iteration
            unsigned length = (size - cur_offset) >= sector_size ? sector_size : (size - cur_offset);

            rtos_qspi_flash_lock(qspi_flash_ctx);
            {
                // Read-modify-write so that bytes beyond length in the last sector are preserved
                rtos_qspi_flash_read(qspi_flash_ctx, (uint8_t *)tmp_buf, addr + cur_offset, sector_size);
                memcpy(tmp_buf, buf + cur_offset, length);
                rtos_qspi_flash_erase(qspi_flash_ctx, addr + cur_offset, sector_size);
                rtos_qspi_flash_write(qspi_flash_ctx, (uint8_t *)tmp_buf, addr + cur_offset, sector_size);
            }
            rtos_qspi_flash_unlock(qspi_flash_ctx);
            cur_offset += length;
        }
        vPortFree(tmp_buf);
    } else {
        rtos_printf("Insufficient space for data partition image\n");
    }
It is advised to perform this operation in blocks rather than full image size to reduce memory usage. The buffer can be populated over the desired transport method, such as USB, I2C, etc.
Starts an RTOS GPIO driver instance. This must only be called by the tile that owns the driver instance. It may be called either before or after starting the RTOS, but must be called before any of the core GPIO driver functions are called with this instance.
rtos_gpio_init() must be called on this GPIO driver instance prior to calling this.
Parameters:
ctx – A pointer to the GPIO driver instance to start.
Initializes an RTOS GPIO driver instance. There should only be one per tile. This instance represents all the GPIO ports owned by the calling tile. This must only be called by the tile that owns the driver instance. It may be called either before or after starting the RTOS, but must be called before calling rtos_gpio_start() or any of the core GPIO driver functions with this instance.
Parameters:
ctx – A pointer to the GPIO driver instance to initialize.
RTOS_GPIO_ISR_CALLBACK_ATTR
This attribute must be specified on all RTOS GPIO interrupt callback functions provided by the application.
struct rtos_gpio_isr_info_t
#include <rtos_gpio.h>
Struct to hold interrupt state data for GPIO ports.
The members in this struct should not be accessed directly.
struct rtos_gpio_struct
#include <rtos_gpio.h>
Struct representing an RTOS GPIO driver instance.
The members in this struct should not be accessed directly.
Configures a port in drive mode. Output values will be driven on the pins. This is the default drive state of a port. This has the side effect of disabling the port’s internal pull-up and pull-down resistors.
Parameters:
ctx – A pointer to the GPIO driver instance to use.
Configures a port in drive low mode. When the output value is 0 the pin is driven low, otherwise no value is driven. This has the side effect of enabling the port’s internal pull-up resistor.
Parameters:
ctx – A pointer to the GPIO driver instance to use.
Configures a port in drive high mode. When the output value is 1 the pin is driven high, otherwise no value is driven. This has the side effect of enabling the port’s internal pull-down resistor.
Parameters:
ctx – A pointer to the GPIO driver instance to use.
port_id – The GPIO port to set to drive mode high.
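The three drive modes described above can be summarized with a small model. This is a sketch, not driver code: drive_mode_t, pin_state_t, and pin_state() are hypothetical names, and the resulting level assumes that nothing else is driving the pin.

```c
/* Hypothetical names for the three drive modes described above */
typedef enum { MODE_DRIVE, MODE_DRIVE_LOW, MODE_DRIVE_HIGH } drive_mode_t;

typedef struct {
    int driven;   /* 1 if the port actively drives the pin */
    int level;    /* resulting pin level when nothing else drives the pin */
} pin_state_t;

/* Models one pin for a given drive mode and output bit */
pin_state_t pin_state(drive_mode_t mode, int out_bit)
{
    pin_state_t s;
    out_bit = out_bit ? 1 : 0;

    switch (mode) {
    case MODE_DRIVE_LOW:
        s.driven = (out_bit == 0);   /* a 1 releases the pin, the pull-up holds it high */
        break;
    case MODE_DRIVE_HIGH:
        s.driven = (out_bit == 1);   /* a 0 releases the pin, the pull-down holds it low */
        break;
    default:
        s.driven = 1;                /* both values are driven */
        break;
    }
    s.level = out_bit;
    return s;
}
```

In drive low mode an output of 1 leaves the pin undriven, so another device on a shared line can still pull it low.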
The following functions may be used to share a GPIO driver instance with other xcore tiles. Tiles that the driver instance is shared with may call any of the core functions listed above.
Initializes an RTOS GPIO driver instance on a client tile. This allows a tile that does not own the actual driver instance to use a driver instance on another tile. This will be called instead of rtos_gpio_init(). The host tile that owns the actual instance must simultaneously call rtos_gpio_rpc_host_init().
Parameters:
gpio_ctx – A pointer to the GPIO driver instance to initialize.
rpc_config – A pointer to an RPC config struct. This must have the same scope as gpio_ctx.
host_intertile_ctx – A pointer to the intertile driver instance to use for performing the communication between the client and host tiles. This must have the same scope as gpio_ctx.
Performs additional initialization on a GPIO driver instance to allow client tiles to use the GPIO driver instance. Each client tile that will use this instance must simultaneously call rtos_gpio_rpc_client_init().
Parameters:
gpio_ctx – A pointer to the GPIO driver instance to share with clients.
rpc_config – A pointer to an RPC config struct. This must have the same scope as gpio_ctx.
client_intertile_ctx – An array of pointers to the intertile driver instances to use for performing the communication between the host tile and each client tile. This must have the same scope as gpio_ctx.
remote_client_count – The number of client tiles to share this driver instance with.
Configures the RPC for a GPIO driver instance. This must be called by both the host tile and all client tiles.
On the client tiles this must be called after calling rtos_gpio_rpc_client_init(). After calling this, the client tile may immediately begin to call the core GPIO functions on this driver instance. It does not need to wait for the host to call rtos_gpio_start().
Parameters:
gpio_ctx – A pointer to the GPIO driver instance to configure the RPC for.
intertile_port – The port number on the intertile channel to use for transferring the RPC requests and responses for this driver instance. This port must not be shared by any other functions. The port must be the same for the host and all its clients.
host_task_priority – The priority to use for the task on the host tile that handles RPC requests from the clients.
Starts an RTOS I2C master driver instance. This must only be called by the tile that owns the driver instance. It may be called either before or after starting the RTOS, but must be called before any of the core I2C master driver functions are called with this instance.
rtos_i2c_master_init() must be called on this I2C master driver instance prior to calling this.
Parameters:
i2c_master_ctx – A pointer to the I2C master driver instance to start.
Initializes an RTOS I2C master driver instance. This must only be called by the tile that owns the driver instance. It may be called either before or after starting the RTOS, but must be called before calling rtos_i2c_master_start() or any of the core I2C master driver functions with this instance.
Parameters:
i2c_master_ctx – A pointer to the I2C master driver instance to initialize.
p_scl – The port containing SCL. This may be either the same as or different than p_sda.
scl_bit_position – The bit number of the SCL line on the port p_scl.
scl_other_bits_mask – A value that is ORed into the port value driven to p_scl both when SCL is high and low. The bit representing SCL (as well as SDA if they share the same port) must be set to 0.
p_sda – The port containing SDA. This may be either the same as or different than p_scl.
sda_bit_position – The bit number of the SDA line on the port p_sda.
sda_other_bits_mask – A value that is ORed into the port value driven to p_sda both when SDA is high and low. The bit representing SDA (as well as SCL if they share the same port) must be set to 0.
tmr – This is unused and should be set to 0. This will be removed.
kbits_per_second – The speed of the I2C bus. The maximum value allowed is 400.
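The way a bit position and an "other bits" mask combine into a port value can be sketched as below. Both helper functions are hypothetical and only restate the parameter descriptions above; they are not part of the driver API.

```c
#include <stdint.h>

/* Value driven on the SCL port for a given SCL level (0 or 1): the SCL bit
 * at its position, ORed with the mask that holds the other pins' values. */
uint32_t scl_port_value(unsigned scl_level, unsigned scl_bit_position,
                        uint32_t scl_other_bits_mask)
{
    return ((uint32_t) (scl_level & 1u) << scl_bit_position) | scl_other_bits_mask;
}

/* Returns 1 if the mask leaves the SCL bit clear, and the SDA bit too when
 * SDA shares the same port. */
int other_bits_mask_is_valid(uint32_t mask, unsigned scl_bit_position,
                             int sda_on_same_port, unsigned sda_bit_position)
{
    uint32_t reserved = (uint32_t) 1 << scl_bit_position;
    if (sda_on_same_port) {
        reserved |= (uint32_t) 1 << sda_bit_position;
    }
    return (mask & reserved) == 0;
}
```

For example, with SCL on bit 0 and a mask of 0xC, the port is driven with 0xD while SCL is high and 0xC while it is low.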
struct rtos_i2c_master_struct
#include <rtos_i2c_master.h>
Struct representing an RTOS I2C master driver instance.
The members in this struct should not be accessed directly.
Parameters:
ctx – A pointer to the I2C master driver instance to use.
device_addr – The address of the device to write to.
buf – The buffer containing data to write.
n – The number of bytes to write.
num_bytes_sent – The function will set this value to the number of bytes actually sent. On success, this will be equal to n but it will be less if the slave sends an early NACK on the bus and the transaction fails.
send_stop_bit – If this is non-zero then a stop bit will be sent on the bus after the transaction. This is usually required for normal operation. If this parameter is zero then the stop bit will be omitted. In this case, no other task can use the component until a stop bit has been sent.
Return values:
I2C_ACK – if the write was acknowledged by the device.
Parameters:
ctx – A pointer to the I2C master driver instance to use.
device_addr – The address of the device to read from.
buf – The buffer to fill with data.
n – The number of bytes to read.
send_stop_bit – If this is non-zero then a stop bit will be sent on the bus after the transaction. This is usually required for normal operation. If this parameter is zero then the stop bit will be omitted. In this case, no other task can use the component until a stop bit has been sent.
Return values:
I2C_ACK – if the read was acknowledged by the device.
This function will cause a stop bit to be sent on the bus. It should be used to complete/abort a transaction if the send_stop_bit argument was not set when calling the rtos_i2c_master_read() or rtos_i2c_master_write() functions.
Parameters:
ctx – A pointer to the I2C master driver instance to use.
This function writes to an 8-bit addressed, 8-bit register in an I2C device. The function writes the data by sending the register address followed by the register data to the device at the specified device address.
Parameters:
ctx – A pointer to the I2C master driver instance to use.
device_addr – The address of the device to write to.
reg_addr – The address of the register to write to.
data – The 8-bit value to write.
Return values:
I2C_REGOP_DEVICE_NACK – if the address is NACKed.
I2C_REGOP_INCOMPLETE – if not all data was ACKed.
I2C_REGOP_SUCCESS – on successful completion of the write.
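The register write sequence and its three return values can be modelled on a host. In this sketch the enum is a local stand-in that mirrors the documented return values, and mock_dev_t, mock_write(), and write_reg() are hypothetical helpers, not the driver implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-in that mirrors the documented return values */
typedef enum {
    I2C_REGOP_SUCCESS,
    I2C_REGOP_DEVICE_NACK,
    I2C_REGOP_INCOMPLETE
} i2c_regop_res_t;

/* Mock device: ACKs its address only if present, then ACKs at most
 * acks_left bytes. */
typedef struct {
    int present;
    size_t acks_left;
    uint8_t regs[256];
} mock_dev_t;

/* Returns -1 if the address is NACKed, else the number of bytes ACKed */
static int mock_write(mock_dev_t *dev, const uint8_t *buf, size_t n)
{
    size_t acked = 0;

    if (!dev->present) {
        return -1;
    }
    while (acked < n && dev->acks_left > 0) {
        dev->acks_left--;
        acked++;
    }
    if (acked == 2) {
        dev->regs[buf[0]] = buf[1];   /* register address, then register data */
    }
    return (int) acked;
}

/* Sends the register address followed by the data in one write transaction */
i2c_regop_res_t write_reg(mock_dev_t *dev, uint8_t reg_addr, uint8_t data)
{
    uint8_t buf[2] = { reg_addr, data };
    int acked = mock_write(dev, buf, 2);

    if (acked < 0) {
        return I2C_REGOP_DEVICE_NACK;
    }
    if (acked < 2) {
        return I2C_REGOP_INCOMPLETE;
    }
    return I2C_REGOP_SUCCESS;
}
```

Checking the return value distinguishes a missing device from one that accepted the address but rejected part of the payload.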
This function reads from an 8-bit addressed, 8-bit register in an I2C device. The function reads the data by sending the register address followed by reading the register data from the device at the specified device address.
Note that no stop bit is transmitted between the write and the read. The operation is performed as one transaction using a repeated start.
Parameters:
ctx – A pointer to the I2C master driver instance to use.
device_addr – The address of the device to read from.
reg_addr – The address of the register to read from.
data – A pointer to the byte to fill with data read from the register.
Return values:
I2C_REGOP_DEVICE_NACK – if the device NACKed.
I2C_REGOP_SUCCESS – on successful completion of the read.
The following functions may be used to share an I2C master driver instance with other xcore tiles. Tiles that the driver instance is shared with may call any of the core functions listed above.
Initializes an RTOS I2C master driver instance on a client tile. This allows a tile that does not own the actual driver instance to use a driver instance on another tile. This will be called instead of rtos_i2c_master_init(). The host tile that owns the actual instance must simultaneously call rtos_i2c_master_rpc_host_init().
Parameters:
i2c_master_ctx – A pointer to the I2C master driver instance to initialize.
rpc_config – A pointer to an RPC config struct. This must have the same scope as i2c_master_ctx.
host_intertile_ctx – A pointer to the intertile driver instance to use for performing the communication between the client and host tiles. This must have the same scope as i2c_master_ctx.
Performs additional initialization on an I2C master driver instance to allow client tiles to use the I2C master driver instance. Each client tile that will use this instance must simultaneously call rtos_i2c_master_rpc_client_init().
Parameters:
i2c_master_ctx – A pointer to the I2C master driver instance to share with clients.
rpc_config – A pointer to an RPC config struct. This must have the same scope as i2c_master_ctx.
client_intertile_ctx – An array of pointers to the intertile driver instances to use for performing the communication between the host tile and each client tile. This must have the same scope as i2c_master_ctx.
remote_client_count – The number of client tiles to share this driver instance with.
Configures the RPC for an I2C master driver instance. This must be called by both the host tile and all client tiles.
On the client tiles this must be called after calling rtos_i2c_master_rpc_client_init(). After calling this, the client tile may immediately begin to call the core I2C master functions on this driver instance. It does not need to wait for the host to call rtos_i2c_master_start().
Parameters:
i2c_master_ctx – A pointer to the I2C master driver instance to configure the RPC for.
intertile_port – The port number on the intertile channel to use for transferring the RPC requests and responses for this driver instance. This port must not be shared by any other functions. The port must be the same for the host and all its clients.
host_task_priority – The priority to use for the task on the host tile that handles RPC requests from the clients.
Function pointer type for application provided RTOS I2C slave start callback functions.
These callback functions are optionally called by an I2C slave driver’s thread when it is first started. This gives the application a chance to perform startup initialization from within the driver’s thread.
Param ctx:
A pointer to the associated I2C slave driver instance.
Param app_data:
A pointer to application specific data provided by the application. Used to share data between this callback function and the application.
Function pointer type for application provided RTOS I2C slave transmit start callback functions.
These callback functions are called when an I2C slave driver instance needs to transmit data to a master device. This callback must provide the data to transmit and the length.
Param ctx:
A pointer to the associated I2C slave driver instance.
Param app_data:
A pointer to application specific data provided by the application. Used to share data between this callback function and the application.
Param data:
A pointer to the data buffer to transmit to the master. The driver sets this to its internal data buffer, which has a size of RTOS_I2C_SLAVE_BUF_LEN, prior to calling this callback. This may be set to a different buffer by the callback. The callback must fill this buffer with the data to send to the master.
Return:
The number of bytes to transmit to the master from data. If the master reads more bytes than this, the driver will wrap around to the start of the buffer and send it again.
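The wrap-around behaviour described for the returned length can be illustrated with a one-line model. The helper tx_byte_at() is hypothetical and only shows which byte the master would see at each read position; len is assumed to be non-zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte seen by the master at zero-based read position pos when the callback
 * returned len bytes in data. Positions at or beyond len wrap around to the
 * start of the buffer. len must be non-zero. */
uint8_t tx_byte_at(const uint8_t *data, size_t len, size_t pos)
{
    return data[pos % len];
}
```

If the callback returns three bytes and the master reads five, the fourth and fifth bytes repeat the first and second.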
Function pointer type for application provided RTOS I2C slave transmit done callback functions.
These callback functions are optionally called when an I2C slave driver instance is done transmitting data to a master device. A pointer to the data sent and the actual number of bytes sent are provided to the callback.
The application may want to use this, for example, if the buffer that was sent was malloc’d. This callback can be used to free the buffer.
Param ctx:
A pointer to the associated I2C slave driver instance.
Param app_data:
A pointer to application specific data provided by the application. Used to share data between this callback function and the application.
Param data:
A pointer to the data transmitted to the master.
Param len:
The number of bytes transmitted to the master from data.
Function pointer type for application provided function to check bytes received from master individually.
This callback function is called once per byte received from the master device.
The application may want to use this, for example, to check byte by byte and force a NACK for an unexpected payload.
The user provided functions must be marked with RTOS_I2C_SLAVE_RX_BYTE_CHECK_CALLBACK_ATTR.
Param ctx:
A pointer to the associated I2C slave driver instance.
Param app_data:
A pointer to application specific data provided by the application. Used to share data between this callback function and the application.
Param data:
A copy of the most recent byte of data transmitted from the master.
Param cur_status:
A pointer to the current ACK/NACK response for this byte. The application may change this to I2C_SLAVE_ACK or I2C_SLAVE_NACK. If cur_status is returned as an invalid value, the driver will implicitly NACK.
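A byte check callback that NACKs once a payload grows beyond an expected length might look like the sketch below. The parameter list follows the descriptions above, but the exact typedef should be taken from rtos_i2c_slave.h. The enum, the opaque context type, and the empty attribute definition are host-side stand-ins, and rx_state_t is a hypothetical application structure passed through app_data. The attribute name used is the one listed among the driver's macros.

```c
#include <stdint.h>

/* Host-side stand-ins. On target these come from the driver headers. */
typedef enum { I2C_SLAVE_ACK, I2C_SLAVE_NACK } i2c_slave_ack_t;
typedef struct rtos_i2c_slave_struct rtos_i2c_slave_t;
#ifndef RTOS_I2C_SLAVE_RX_BYTE_CHECK_CALLBACK_ATTR
#define RTOS_I2C_SLAVE_RX_BYTE_CHECK_CALLBACK_ATTR
#endif

/* Hypothetical application state shared through app_data */
typedef struct {
    unsigned received;   /* bytes accepted so far in this transaction */
    unsigned max_len;    /* longest payload the application expects */
} rx_state_t;

/* Called once per received byte: ACK until max_len bytes have been
 * accepted, then force a NACK. */
RTOS_I2C_SLAVE_RX_BYTE_CHECK_CALLBACK_ATTR
void rx_byte_check(rtos_i2c_slave_t *ctx, void *app_data, uint8_t data,
                   i2c_slave_ack_t *cur_status)
{
    rx_state_t *state = app_data;
    (void) ctx;
    (void) data;

    if (state->received >= state->max_len) {
        *cur_status = I2C_SLAVE_NACK;
    } else {
        state->received++;
        *cur_status = I2C_SLAVE_ACK;
    }
}
```

The callback always writes one of the two valid values, so the driver never has to fall back to its implicit NACK.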
Function pointer type for application provided function to alert the application that there is a write transaction incoming from the master.
This allows an application to NACK if it is not ready for handling write requests.
The user provided functions must be marked with RTOS_I2C_SLAVE_WRITE_ADDR_REQUEST_CALLBACK_ATTR.
Param ctx:
A pointer to the associated I2C slave driver instance.
Param app_data:
A pointer to application specific data provided by the application. Used to share data between this callback function and the application.
Param cur_status:
A pointer to the current ACK/NACK response for this byte. The application may change this to I2C_SLAVE_ACK or I2C_SLAVE_NACK. If cur_status is returned as an invalid value, the driver will implicitly NACK. By default the driver will implicitly ACK.
Starts an RTOS I2C slave driver instance. This must only be called by the tile that owns the driver instance. It must be called after starting the RTOS from an RTOS thread.
rtos_i2c_slave_init() must be called on this I2C slave driver instance prior to calling this.
Parameters:
i2c_slave_ctx – A pointer to the I2C slave driver instance to start.
app_data – A pointer to application specific data to pass to the callback functions.
start – The callback function that is called when the driver’s thread starts. This is optional and may be NULL.
rx – The callback function to receive data from the bus master.
tx_start – The callback function to transmit data to the bus master.
tx_done – The callback function that is notified when transmits are complete. This is optional and may be NULL.
rx_byte_check – The callback function to check received bytes individually.
write_addr_req – The callback function to alert the application to an incoming write request.
interrupt_core_id – The ID of the core on which to enable the I2C interrupt.
priority – The priority of the task that gets created by the driver to call the callback functions.
Initializes an RTOS I2C slave driver instance. This must only be called by the tile that owns the driver instance. It should be called before starting the RTOS, and must be called before calling rtos_i2c_slave_start().
Parameters:
i2c_slave_ctx – A pointer to the I2C slave driver instance to initialize.
io_core_mask – A bitmask representing the cores on which the low level I2C I/O thread created by the driver is allowed to run. Bit 0 is core 0, bit 1 is core 1, etc.
p_scl – The port containing SCL. This must be a 1-bit port and different than p_sda.
p_sda – The port containing SDA. This must be a 1-bit port and different than p_scl.
device_addr – The 7-bit address of the slave device.
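An io_core_mask value can be built from a list of core IDs, as in this sketch. The helper core_mask() is hypothetical and only applies the bit layout described above (bit 0 is core 0, bit 1 is core 1, and so on).

```c
#include <stddef.h>
#include <stdint.h>

/* Builds a bitmask with one bit set per listed core ID */
uint32_t core_mask(const unsigned *core_ids, size_t count)
{
    uint32_t mask = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        mask |= (uint32_t) 1 << core_ids[i];
    }
    return mask;
}
```

Allowing the I/O thread to run on cores 1 and 2 gives a mask of 0x6, while restricting it to core 0 gives 0x1.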
RTOS_I2C_SLAVE_BUF_LEN
The maximum number of bytes that the RTOS I2C slave driver can receive from a master in a single write transaction.
RTOS_I2C_SLAVE_CALLBACK_ATTR
This attribute must be specified on all RTOS I2C slave callback functions provided by the application.
RTOS_I2C_SLAVE_RX_BYTE_CHECK_CALLBACK_ATTR
This attribute must be specified on all RTOS I2C slave callback functions of type rtos_i2c_slave_rx_byte_check_cb_t provided by the application.
RTOS_I2C_SLAVE_WRITE_ADDR_REQUEST_CALLBACK_ATTR
This attribute must be specified on all RTOS I2C slave callback functions of type rtos_i2c_slave_write_addr_request_cb_t provided by the application.
struct rtos_i2c_slave_struct
#include <rtos_i2c_slave.h>
Struct representing an RTOS I2C slave driver instance.
The members in this struct should not be accessed directly.
Initializes an RTOS I2S driver instance in master mode. This must only be called by the tile that owns the driver instance. It should be called before starting the RTOS, and must be called before calling rtos_i2s_start() or any of the core I2S driver functions with this instance.
Parameters:
i2s_ctx – A pointer to the I2S driver instance to initialize.
io_core_mask – A bitmask representing the cores on which the low level I2S I/O thread created by the driver is allowed to run. Bit 0 is core 0, bit 1 is core 1, etc.
p_dout – An array of data output ports.
num_out – The number of output data ports.
p_din – An array of data input ports.
num_in – The number of input data ports.
p_bclk – The bit clock output port.
p_lrclk – The word clock output port.
p_mclk – Input port which supplies the master clock.
bclk – A clock that will get configured for use with the bit clock.
Initializes an RTOS I2S driver instance in master mode that uses an externally generated bit clock. This must only be called by the tile that owns the driver instance. It should be called before starting the RTOS, and must be called before calling rtos_i2s_start() or any of the core I2S driver functions with this instance.
Parameters:
i2s_ctx – A pointer to the I2S driver instance to initialize.
io_core_mask – A bitmask representing the cores on which the low level I2S I/O thread created by the driver is allowed to run. Bit 0 is core 0, bit 1 is core 1, etc.
p_dout – An array of data output ports.
num_out – The number of output data ports.
p_din – An array of data input ports.
num_in – The number of input data ports.
p_bclk – The bit clock output port.
p_lrclk – The word clock output port.
bclk – A clock that is configured externally to be used as the bit clock.
Initializes an RTOS I2S driver instance in slave mode. This must only be called by the tile that owns the driver instance. It should be called before starting the RTOS, and must be called before calling rtos_i2s_start() or any of the core I2S driver functions with this instance.
Parameters:
i2s_ctx – A pointer to the I2S driver instance to initialize.
io_core_mask – A bitmask representing the cores on which the low level I2S I/O thread created by the driver is allowed to run. Bit 0 is core 0, bit 1 is core 1, etc.
p_dout – An array of data output ports.
num_out – The number of output data ports.
p_din – An array of data input ports.
num_in – The number of input data ports.
p_bclk – The bit clock input port.
p_lrclk – The word clock input port.
bclk – A clock that will get configured for use with the bit clock.
Function pointer type for application provided RTOS I2S send filter callback functions.
These callback functions are called when an I2S driver instance needs to output the next audio frame to its interface. By default, audio frames in the driver’s send buffer are output directly to its interface. However, this gives the application an opportunity to override this and provide filtering.
These functions must not block.
Param ctx:
A pointer to the associated I2S driver instance.
Param app_data:
A pointer to application specific data provided by the application. Used to share data between this callback function and the application.
Param i2s_frame:
A pointer to the buffer where the callback should write the next frame to send.
Param i2s_frame_size:
The number of samples that should be written to i2s_frame.
Param send_buf:
A pointer to the next frame in the driver’s send buffer. The callback should use this as the input to its filter.
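The body of a send filter can be as simple as a per-sample gain. The helper below is a sketch of that inner step only: halve_frame() is a hypothetical name, its parameters reuse the names described above, and the exact callback signature (including any return value) should be taken from rtos_i2s.h.

```c
#include <stddef.h>
#include <stdint.h>

/* Writes i2s_frame_size samples to i2s_frame, each half the value of the
 * corresponding sample in send_buf. This is pure computation, so it does
 * not block. */
void halve_frame(int32_t *i2s_frame, size_t i2s_frame_size, const int32_t *send_buf)
{
    size_t i;

    for (i = 0; i < i2s_frame_size; i++) {
        i2s_frame[i] = send_buf[i] / 2;
    }
}
```

Division is used instead of a right shift so that negative samples are handled in a well-defined way.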
Function pointer type for application provided RTOS I2S receive filter callback functions.
These callback functions are called when an I2S driver instance has received the next audio frame from its interface. By default, audio frames received from the driver’s interface are put directly into its receive buffer. However, this gives the application an opportunity to override this and provide filtering.
These functions must not block.
Param ctx:
A pointer to the associated I2S driver instance.
Param app_data:
A pointer to application specific data provided by the application. Used to share data between this callback function and the application.
Param i2s_frame:
A pointer to the buffer from which the callback should read the next received frame. The callback should use this as the input to its filter.
Param i2s_frame_size:
The number of samples that should be read from i2s_frame.
Param receive_buf:
A pointer to the next frame in the driver’s receive buffer. The callback should write the output of its filter here.
Starts an RTOS I2S driver instance. This must only be called by the tile that owns the driver instance. It must be called after starting the RTOS from an RTOS thread, and must be called before any of the core I2S driver functions are called with this instance.
Parameters:
i2s_ctx – A pointer to the I2S driver instance to start.
mclk_bclk_ratio – The master clock to bit clock ratio. This may be computed by the helper function rtos_i2s_mclk_bclk_ratio(). This is only used if the I2S instance was initialized with rtos_i2s_master_init(). Otherwise it is ignored.
mode – The mode of the LR clock. See i2s_mode_t.
recv_buffer_size – The size in frames of the input buffer. Each frame is two samples (left and right channels) per input port. For example, a size of two here when num_in is three would create a buffer that holds up to 12 samples.
send_buffer_size – The size in frames of the output buffer. Each frame is two samples (left and right channels) per output port. For example, a size of two here when num_out is three would create a buffer that holds up to 12 samples. Frames transmitted by rtos_i2s_tx() are stored in this buffer before they are sent out to the I2S interface.
interrupt_core_id – The ID of the core on which to enable the I2S interrupt.
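The buffer sizing rule for recv_buffer_size and send_buffer_size can be checked with a small calculation. The helper names below are hypothetical, and the byte count assumes 32-bit samples, which is an assumption not stated in the text above.

```c
#include <stddef.h>
#include <stdint.h>

/* Each I2S frame carries a left and a right sample per data port. */
#define I2S_SAMPLES_PER_PORT_PER_FRAME 2u

/* Hypothetical helper: number of samples held by a buffer of
 * buffer_frames frames when num_ports data ports are in use. */
size_t i2s_buffer_samples(size_t buffer_frames, size_t num_ports)
{
    return buffer_frames * I2S_SAMPLES_PER_PORT_PER_FRAME * num_ports;
}

/* Hypothetical helper: memory footprint of the same buffer, assuming
 * each sample is stored as a 32-bit word. */
size_t i2s_buffer_bytes(size_t buffer_frames, size_t num_ports)
{
    return i2s_buffer_samples(buffer_frames, num_ports) * sizeof(int32_t);
}
```

With a buffer size of two frames and three ports, i2s_buffer_samples(2, 3) gives the 12 samples quoted in the parameter description.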
RTOS_I2S_APP_SEND_FILTER_CALLBACK_ATTR
This attribute must be specified on all RTOS I2S send filter callback functions provided by the application.
RTOS_I2S_APP_RECEIVE_FILTER_CALLBACK_ATTR
This attribute must be specified on all RTOS I2S receive filter callback functions provided by the application.
struct rtos_i2s_struct
#include <rtos_i2s.h>
Struct representing an RTOS I2S driver instance.
The members in this struct should not be accessed directly.
Receives sample frames from the I2S interface.
This function will block until new frames are available.
Parameters:
ctx – A pointer to the I2S driver instance to use.
i2s_sample_buf – A buffer to copy the received sample frames into.
frame_count – The number of frames to receive from the buffer. This must be less than or equal to the size of the input buffer specified to rtos_i2s_start().
timeout – The amount of time to wait for the requested number of frames to become available.
Returns:
The number of frames actually received into i2s_sample_buf.
Transmits sample frames out to the I2S interface.
The samples are stored into a buffer and are not necessarily sent out to the I2S interface before this function returns.
Parameters:
ctx – A pointer to the I2S driver instance to use.
i2s_sample_buf – A buffer containing the sample frames to transmit out to the I2S interface.
frame_count – The number of frames to transmit out from the buffer. This must be less than or equal to the size of the output buffer specified to rtos_i2s_start().
timeout – The amount of time to wait for enough space to become available in the send buffer to accept the frames to be transmitted.
Returns:
The number of frames actually stored into the buffer.
The following functions may be used to share an I2S driver instance with other xcore tiles. Tiles that the driver instance is shared with may call any of the core functions listed above.
Initializes an RTOS I2S driver instance on a client tile. This allows a tile that does not own the actual driver instance to use a driver instance on another tile. This will be called instead of one of the RTOS I2S init functions. The host tile that owns the actual instance must simultaneously call rtos_i2s_rpc_host_init().
Parameters:
i2s_ctx – A pointer to the I2S driver instance to initialize.
rpc_config – A pointer to an RPC config struct. This must have the same scope as i2s_ctx.
host_intertile_ctx – A pointer to the intertile driver instance to use for performing the communication between the client and host tiles. This must have the same scope as i2s_ctx.
Performs additional initialization on an I2S driver instance to allow client tiles to use the I2S driver instance. Each client tile that will use this instance must simultaneously call rtos_i2s_rpc_client_init().
Parameters:
i2s_ctx – A pointer to the I2S driver instance to share with clients.
rpc_config – A pointer to an RPC config struct. This must have the same scope as i2s_ctx.
client_intertile_ctx – An array of pointers to the intertile driver instances to use for performing the communication between the host tile and each client tile. This must have the same scope as i2s_ctx.
remote_client_count – The number of client tiles to share this driver instance with.
Configures the RPC for an I2S driver instance. This must be called by both the host tile and all client tiles.
On the client tiles this must be called after calling rtos_i2s_rpc_client_init(). After calling this, the client tile may immediately begin to call the core I2S functions on this driver instance. It does not need to wait for the host to call rtos_i2s_start().
Parameters:
i2s_ctx – A pointer to the I2S driver instance to configure the RPC for.
intertile_port – The port number on the intertile channel to use for transferring the RPC requests and responses for this driver instance. This port must not be shared by any other functions. The port must be the same for the host and all its clients.
host_task_priority – The priority to use for the task on the host tile that handles RPC requests from the clients.
Starts an RTOS mic array driver instance. This must only be called by the tile that owns the driver instance. It must be called after starting the RTOS from an RTOS thread, and must be called before any of the core mic array driver functions are called with this instance.
rtos_mic_array_init() must be called on this mic array driver instance prior to calling this.
Parameters:
mic_array_ctx – A pointer to the mic array driver instance to start.
buffer_size – The size in frames of the input buffer. Each frame is two samples (one for each microphone) plus one sample per reference channel. This must be at least MIC_ARRAY_CONFIG_SAMPLES_PER_FRAME. Samples are pulled out of this buffer by the application by calling rtos_mic_array_rx().
interrupt_core_id – The ID of the core on which to enable the mic array interrupt.
Initializes an RTOS mic array driver instance. This must only be called by the tile that owns the driver instance. It should be called before starting the RTOS, and must be called before calling rtos_mic_array_start() or any of the core mic array driver functions with this instance.
Parameters:
mic_array_ctx – A pointer to the mic array driver instance to initialize.
io_core_mask – A bitmask representing the cores on which the low level mic array I/O thread created by the driver is allowed to run. Bit 0 is core 0, bit 1 is core 1, etc.
format – Format of the output data
struct rtos_mic_array_struct
#include <rtos_mic_array.h>
Struct representing an RTOS mic array driver instance.
The members in this struct should not be accessed directly.
Receives sample frames from the PDM mic array interface.
This function will block until new frames are available.
Parameters:
ctx – A pointer to the mic array driver instance to use.
sample_buf – A buffer to copy the received sample frames into.
frame_count – The number of frames to receive from the buffer. This must be less than or equal to the size of the buffer specified to rtos_mic_array_start() if in RTOS_MIC_ARRAY_SAMPLE_CHANNEL mode. This must be equal to MIC_ARRAY_CONFIG_SAMPLES_PER_FRAME if in RTOS_MIC_ARRAY_CHANNEL_SAMPLE mode.
timeout – The amount of time to wait for the requested number of frames to become available.
Returns:
The number of frames actually received into sample_buf.
The following functions may be used to share a microphone array driver instance with other xcore tiles. Tiles that the driver instance is shared with may call any of the core functions listed above.
Initializes an RTOS mic array driver instance on a client tile. This allows a tile that does not own the actual driver instance to use a driver instance on another tile. This will be called instead of rtos_mic_array_init(). The host tile that owns the actual instance must simultaneously call rtos_mic_array_rpc_host_init().
Parameters:
mic_array_ctx – A pointer to the mic array driver instance to initialize.
rpc_config – A pointer to an RPC config struct. This must have the same scope as mic_array_ctx.
host_intertile_ctx – A pointer to the intertile driver instance to use for performing the communication between the client and host tiles. This must have the same scope as mic_array_ctx.
Performs additional initialization on a mic array driver instance to allow client tiles to use the mic array driver instance. Each client tile that will use this instance must simultaneously call rtos_mic_array_rpc_client_init().
Parameters:
mic_array_ctx – A pointer to the mic array driver instance to share with clients.
rpc_config – A pointer to an RPC config struct. This must have the same scope as mic_array_ctx.
client_intertile_ctx – An array of pointers to the intertile driver instances to use for performing the communication between the host tile and each client tile. This must have the same scope as mic_array_ctx.
remote_client_count – The number of client tiles to share this driver instance with.
Configures the RPC for a mic array driver instance. This must be called by both the host tile and all client tiles.
On the client tiles this must be called after calling rtos_mic_array_rpc_client_init(). After calling this, the client tile may immediately begin to call the core mic array functions on this driver instance. It does not need to wait for the host to call rtos_mic_array_start().
Parameters:
mic_array_ctx – A pointer to the mic array driver instance to configure the RPC for.
intertile_port – The port number on the intertile channel to use for transferring the RPC requests and responses for this driver instance. This port must not be shared by any other functions. The port must be the same for the host and all its clients.
host_task_priority – The priority to use for the task on the host tile that handles RPC requests from the clients.
Starts an RTOS QSPI flash driver instance. This must only be called by the tile that owns the driver instance. It may be called either before or after starting the RTOS, but must be called before any of the core QSPI flash driver functions are called with this instance.
rtos_qspi_flash_init() must be called on this QSPI flash driver instance prior to calling this.
Parameters:
ctx – A pointer to the QSPI flash driver instance to start.
priority – The priority of the task that gets created by the driver to handle the QSPI flash interface.
Sets the core affinity for an RTOS QSPI flash driver instance. This must only be called by the tile that owns the driver instance. It may be called either before or after starting the RTOS, and should be called before any of the core QSPI flash driver functions are called with this instance.
Since interrupts are disabled during the QSPI transaction on the op thread, a core mask is provided to allow users to avoid collisions with application ISRs.
rtos_qspi_flash_start() must be called on this QSPI flash driver instance prior to calling this.
Parameters:
ctx – A pointer to the QSPI flash driver instance to configure.
op_core_mask – A bitmask representing the cores on which the QSPI I/O thread created by the driver is allowed to run. Bit 0 is core 0, bit 1 is core 1, etc.
Initializes an RTOS QSPI flash driver instance. This must only be called by the tile that owns the driver instance. It may be called either before or after starting the RTOS, but must be called before calling rtos_qspi_flash_start() or any of the core QSPI flash driver functions with this instance.
This function will initialize a flash driver using lib_quadflash for all operations.
Parameters:
ctx – A pointer to the QSPI flash driver instance to initialize.
clock_block – The clock block to use for the qspi_io interface.
cs_port – The chip select port. MUST be a 1-bit port.
sclk_port – The SCLK port. MUST be a 1-bit port.
sio_port – The SIO port. MUST be a 4-bit port.
spec – A pointer to the flash part specification. This may be set to NULL to use the XTC default.
Initializes an RTOS QSPI flash driver instance. This must only be called by the tile that owns the driver instance. It may be called either before or after starting the RTOS, but must be called before calling rtos_qspi_flash_start() or any of the core QSPI flash driver functions with this instance.
This function will initialize a flash driver using lib_quadflash for erase and writes, and lib_qspi_fast_read for reads. If calibration fails the driver will enable lib_quadflash for reads and allow the application to decide what to do about the failed calibration. The status of the calibration can be checked at runtime by calling rtos_qspi_flash_calibration_valid_get().
Parameters:
ctx – A pointer to the QSPI flash driver instance to initialize.
clock_block – The clock block to use for the qspi_io interface.
cs_port – The chip select port. MUST be a 1-bit port.
sclk_port – The SCLK port. MUST be a 1-bit port.
sio_port – The SIO port. MUST be a 4-bit port.
spec – A pointer to the flash part specification. This may be set to NULL to use the XTC default.
read_mode – The transfer mode to use for port reads. Invalid values will default to qspi_fast_flash_read_transfer_raw.
read_divide – The divisor to use for QSPI SCLK.
calibration_pattern_addr – The address of the default calibration pattern. This driver requires the default calibration pattern supplied with lib_qspi_fast_read and does not support custom patterns.
RTOS_QSPI_FLASH_READ_CHUNK_SIZE
struct rtos_qspi_flash_struct
#include <rtos_qspi_flash.h>
Struct representing an RTOS QSPI flash driver instance.
The members in this struct should not be accessed directly.
Obtains a lock for exclusive access to the QSPI flash. This allows a thread to perform a sequence of operations (such as read, modify, erase, write) without the risk of another thread issuing a command in the middle of the sequence and corrupting the data in the flash.
If only a single atomic operation needs to be performed, such as a read, it is not necessary to call this to obtain the lock first. Each individual operation obtains and releases the lock automatically so that they cannot run while another thread has the lock.
This is a lower level version of rtos_qspi_flash_read() that is safe to call from within ISRs. If a task currently owns the flash lock, or if another core is actively doing a read with this function, then the read will not be performed and an error is returned. It is up to the application to determine what it should do in this situation and to avoid a potential deadlock.
This function may only be called on the same tile as the underlying peripheral.
This function uses the lib_quadflash API to perform the read. It is up to the application to ensure that XCORE resources are properly configured.
Note
It is not possible to call this from a task that currently owns the flash lock taken with rtos_qspi_flash_lock(). In general it is not advisable to call this from an RTOS task unless the small amount of overhead time that is introduced by rtos_qspi_flash_read() is unacceptable.
Parameters:
ctx – A pointer to the QSPI flash driver instance to use.
data – Pointer to the buffer to save the read data to.
address – The byte address in the flash to begin reading at. Only bits 23:0 contain the address. Bits 31:24 are ignored.
len – The number of bytes to read and save to data.
Return values:
0 – if the flash was available and the read operation was performed.
-1 – if the flash was unavailable and the read could not be performed.
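The address handling described above (only bits 23:0 are used, bits 31:24 are ignored) can be sketched as a simple mask. The helper names are hypothetical; the range check is an extra illustration of a bounds test an application might perform before a read, not documented driver behavior.

```c
#include <stdint.h>

/* Bits 23:0 carry the flash byte address; bits 31:24 are ignored. */
#define FLASH_ADDRESS_MASK 0x00FFFFFFu

/* Hypothetical helper: the address that is effectively used for a read. */
uint32_t flash_effective_address(uint32_t address)
{
    return address & FLASH_ADDRESS_MASK;
}

/* Hypothetical helper: returns nonzero if a read of len bytes starting at
 * address stays inside the 24-bit (16 MiB) address range. */
int flash_range_is_addressable(uint32_t address, uint32_t len)
{
    uint32_t start = flash_effective_address(address);

    return len <= (FLASH_ADDRESS_MASK + 1u) - start;
}
```

For example, an address of 0x12345678 is treated the same as 0x00345678.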
This is a lower level version of rtos_qspi_flash_read() that is safe to call from within ISRs. If a task currently owns the flash lock, or if another core is actively doing a read with this function, then the read will not be performed and an error is returned. It is up to the application to determine what it should do in this situation and to avoid a potential deadlock.
This function may only be called on the same tile as the underlying peripheral.
This function uses the lib_qspi_fast_read API to perform the read. It is up to the application to ensure that XCORE resources are properly configured.
Note
It is not possible to call this from a task that currently owns the flash lock taken with rtos_qspi_flash_lock(). In general it is not advisable to call this from an RTOS task unless the small amount of overhead time that is introduced by rtos_qspi_flash_read() is unacceptable.
Parameters:
ctx – A pointer to the QSPI flash driver instance to use.
data – Pointer to the buffer to save the read data to.
address – The byte address in the flash to begin reading at. Only bits 23:0 contain the address. Bits 31:24 are ignored.
len – The number of bytes to read and save to data.
Return values:
0 – if the flash was available and the read operation was performed.
-1 – if the flash was unavailable and the read could not be performed.
This is a lower level version of rtos_qspi_flash_read_mode() that is safe to call from within ISRs. If a task currently owns the flash lock, or if another core is actively doing a read with this function, then the read will not be performed and an error is returned. It is up to the application to determine what it should do in this situation and to avoid a potential deadlock.
This function may only be called on the same tile as the underlying peripheral.
This function uses the lib_qspi_fast_read API to perform the read. It is up to the application to ensure that XCORE resources are properly configured.
Note
It is not possible to call this from a task that currently owns the flash lock taken with rtos_qspi_flash_lock(). In general it is not advisable to call this from an RTOS task unless the small amount of overhead time that is introduced by rtos_qspi_flash_read_mode() is unacceptable.
Parameters:
ctx – A pointer to the QSPI flash driver instance to use.
data – Pointer to the buffer to save the read data to.
address – The byte address in the flash to begin reading at. Only bits 23:0 contain the address. Bits 31:24 are ignored.
len – The number of bytes to read and save to data.
mode – The transfer mode for this read operation.
Return values:
0 – if the flash was available and the read operation was performed.
-1 – if the flash was unavailable and the read could not be performed.
This writes data to the QSPI flash. The standard page program command is sent and only SIO0 (MOSI) is used to send the address and data.
The driver handles sending the write enable command, as well as waiting for the write to complete.
This function may return before the write operation is complete, as the actual write operation is queued and executed by a thread created by the driver.
Note
This function does NOT erase the flash first. Erase operations must be explicitly requested by the application.
Parameters:
ctx – A pointer to the QSPI flash driver instance to use.
data – Pointer to the data to write to the flash.
address – The byte address in the flash to begin writing at. Only bits 23:0 contain the address. The byte in bits 31:24 is not sent.
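The reason an explicit erase matters can be illustrated with a small model. It assumes the usual behavior of NOR flash parts, which is not stated in the text above: an erased byte reads 0xFF, and a page program can only clear bits (1 to 0), never set them. The function names are hypothetical and the model does not call the driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed erased state of a NOR flash byte. */
#define FLASH_ERASED_BYTE 0xFFu

/* Model of a page program on typical NOR flash: every stored bit becomes
 * the AND of its old value and the written value. */
void nor_model_program(uint8_t *cells, const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        cells[i] &= data[i];
    }
}

/* Returns 1 if every byte in the range is still in the erased state,
 * meaning a write would store exactly the data given. */
int nor_model_is_erased(const uint8_t *cells, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (cells[i] != FLASH_ERASED_BYTE) {
            return 0;
        }
    }
    return 1;
}
```

In this model, writing 0xF0 over a byte that already holds 0x0F yields 0x00 rather than 0xF0, which is the kind of corruption that erasing first avoids.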
This erases data from the QSPI flash. If the address range to erase spans multiple sectors, then all of these sectors will be erased by issuing multiple erase commands.
The driver handles sending the write enable command, as well as waiting for the erase to complete.
This function may return before the erase operation is complete, as the actual erase operation is queued and executed by a thread created by the driver.
Note
The smallest amount of data that can be erased is a 4k sector. This means that data outside the address range specified by address and len will be erased if the address range does not both begin and end at 4k sector boundaries.
Parameters:
ctx – A pointer to the QSPI flash driver instance to use.
address – The byte address to begin erasing. This does not need to begin at a sector boundary, but if it does not, note that the entire sector that contains this address will still be erased.
len – The minimum number of bytes to erase. If address + len - 1 does not correspond to the last address within a sector, note that the entire sector that contains this address will still be erased.
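Because erases are rounded out to whole 4k sectors, it can be useful to compute the range that will really be erased before issuing the call. The sketch below does this with a hypothetical helper, flash_erase_span(); it only performs the arithmetic and does not call the driver. Treating a len of zero as "nothing erased" is an assumption of this sketch.

```c
#include <stdint.h>

/* The smallest erasable unit is a 4k sector. */
#define FLASH_SECTOR_SIZE 4096u

typedef struct {
    uint32_t start;    /* first byte address that will be erased */
    uint32_t length;   /* total number of bytes that will be erased */
} erase_span_t;

/* Hypothetical helper: the sector-aligned span covering the requested
 * range [address, address + len). */
erase_span_t flash_erase_span(uint32_t address, uint32_t len)
{
    erase_span_t span = { 0u, 0u };

    if (len == 0u) {
        return span;   /* assumption: an empty request erases nothing */
    }

    /* Round the first and last byte addresses down to their sector starts. */
    uint32_t first_sector = address & ~(FLASH_SECTOR_SIZE - 1u);
    uint32_t last_sector = (address + len - 1u) & ~(FLASH_SECTOR_SIZE - 1u);

    span.start = first_sector;
    span.length = (last_sector - first_sector) + FLASH_SECTOR_SIZE;
    return span;
}
```

For example, erasing 2 bytes at address 0x0FFF touches two sectors, so 8192 bytes starting at address 0 are erased.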
The following functions may be used to share a QSPI flash driver instance with other xcore tiles. Tiles that the driver instance is shared with may call any of the core functions listed above.
Initializes an RTOS QSPI flash driver instance on a client tile. This allows a tile that does not own the actual driver instance to use a driver instance on another tile. This will be called instead of rtos_qspi_flash_init(). The host tile that owns the actual instance must simultaneously call rtos_qspi_flash_rpc_host_init().
Parameters:
qspi_flash_ctx – A pointer to the QSPI flash driver instance to initialize.
rpc_config – A pointer to an RPC config struct. This must have the same scope as qspi_flash_ctx.
host_intertile_ctx – A pointer to the intertile driver instance to use for performing the communication between the client and host tiles. This must have the same scope as qspi_flash_ctx.
Performs additional initialization on a QSPI flash driver instance to allow client tiles to use the QSPI flash driver instance. Each client tile that will use this instance must simultaneously call rtos_qspi_flash_rpc_client_init().
Parameters:
qspi_flash_ctx – A pointer to the QSPI flash driver instance to share with clients.
rpc_config – A pointer to an RPC config struct. This must have the same scope as qspi_flash_ctx.
client_intertile_ctx – An array of pointers to the intertile driver instances to use for performing the communication between the host tile and each client tile. This must have the same scope as qspi_flash_ctx.
remote_client_count – The number of client tiles to share this driver instance with.
Configures the RPC for a QSPI flash driver instance. This must be called by both the host tile and all client tiles.
On the client tiles this must be called after calling rtos_qspi_flash_rpc_client_init(). After calling this, the client tile may immediately begin to call the core QSPI flash functions on this driver instance. It does not need to wait for the host to call rtos_qspi_flash_start().
Parameters:
qspi_flash_ctx – A pointer to the QSPI flash driver instance to configure the RPC for.
intertile_port – The port number on the intertile channel to use for transferring the RPC requests and responses for this driver instance. This port must not be shared by any other functions. The port must be the same for the host and all its clients.
host_task_priority – The priority to use for the task on the host tile that handles RPC requests from the clients.
Starts an RTOS SPI master driver instance. This must only be called by the tile that owns the driver instance. It may be called either before or after starting the RTOS, but must be called before any of the core SPI master driver functions are called with this instance.
rtos_spi_master_init() must be called on this SPI master driver instance prior to calling this.
Parameters:
spi_master_ctx – A pointer to the SPI master driver instance to start.
priority – The priority of the task that gets created by the driver to handle the SPI master interface.
Initializes an RTOS SPI master driver instance. This must only be called by the tile that owns the driver instance. It may be called either before or after starting the RTOS, but must be called before calling rtos_spi_master_start() or any of the core SPI master driver functions with this instance.
Parameters:
bus_ctx – A pointer to the SPI master driver instance to initialize.
clock_block – The clock block to use for the SPI master interface.
cs_port – The SPI interface’s chip select port. This may be a multi-bit port.
sclk_port – The SPI interface’s SCLK port. Must be a 1-bit port.
mosi_port – The SPI interface’s MOSI port. Must be a 1-bit port.
miso_port – The SPI interface’s MISO port. Must be a 1-bit port.
Initialize a SPI device. Multiple SPI devices may be initialized per RTOS SPI master driver instance. Each must be on a unique pin of the interface’s chip select port. This must only be called by the tile that owns the driver instance. It may be called either before or after starting the RTOS, but must be called before calling rtos_spi_master_start() or any of the core SPI master driver functions with this instance.
Parameters:
dev_ctx – A pointer to the SPI device instance to initialize.
bus_ctx – A pointer to the SPI master driver instance to attach the device to.
cs_pin – The bit number of the chip select port that is connected to the device’s chip select pin.
cpol – The clock polarity required by the device.
cpha – The clock phase required by the device.
source_clock – The source clock to derive SCLK from. See spi_master_source_clock_t.
clock_divisor – The value to divide the source clock by. The frequency of SCLK will be set to:
(F_src) / (4 * clock_divisor) when clock_divisor > 0
(F_src) / (2) when clock_divisor = 0
where F_src is the frequency of the source clock.
miso_sample_delay – When to sample MISO. See spi_master_sample_delay_t.
miso_pad_delay – The number of core clock cycles to delay sampling the MISO pad during a transaction. This allows for more fine grained adjustment of sampling time. The value may be between 0 and 5.
cs_to_clk_delay_ticks – The minimum number of reference clock ticks between assertion of chip select and the first clock edge.
clk_to_cs_delay_ticks – The minimum number of reference clock ticks between the last clock edge and de-assertion of chip select.
cs_to_cs_delay_ticks – The minimum number of reference clock ticks between transactions, that is, between de-assertion of chip select at the end of one transaction and its re-assertion at the beginning of the next.
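The SCLK formula given for clock_divisor can be turned into a pair of small helpers: one that evaluates the resulting frequency, and one that picks the smallest divisor that keeps SCLK at or below a device's maximum. Both function names are hypothetical, and any upper limit the driver places on clock_divisor is not checked here.

```c
#include <assert.h>
#include <stdint.h>

/* SCLK frequency per the documented formula:
 *   F_src / (4 * clock_divisor) when clock_divisor > 0
 *   F_src / 2                   when clock_divisor = 0
 * Integer division truncates the result. */
uint32_t spi_sclk_hz(uint32_t f_src_hz, uint32_t clock_divisor)
{
    if (clock_divisor == 0u) {
        return f_src_hz / 2u;
    }
    return (uint32_t)((uint64_t)f_src_hz / (4ull * clock_divisor));
}

/* Hypothetical helper: smallest clock_divisor whose SCLK does not exceed
 * max_hz. 64-bit intermediates avoid overflow in 4 * max_hz. */
uint32_t spi_divisor_for_max_hz(uint32_t f_src_hz, uint32_t max_hz)
{
    assert(max_hz > 0u);

    if ((uint64_t)f_src_hz <= 2ull * max_hz) {
        return 0u;                     /* F_src / 2 is already slow enough */
    }

    uint64_t step = 4ull * max_hz;
    return (uint32_t)(((uint64_t)f_src_hz + step - 1u) / step);  /* ceiling */
}
```

With a 100 MHz source clock and a device limited to 10 MHz, the helper selects a divisor of 3, which gives an SCLK of roughly 8.33 MHz.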
struct rtos_spi_master_struct
#include <rtos_spi_master.h>
Struct representing an RTOS SPI master driver instance.
The members in this struct should not be accessed directly.
struct rtos_spi_master_device_struct
#include <rtos_spi_master.h>
Struct representing an RTOS SPI device instance.
The members in this struct should not be accessed directly.
Starts a transaction with the specified SPI device on a SPI bus. This leaves chip select asserted.
Note: When this is called, the servicer thread will be locked to the core that it executed on until rtos_spi_master_transaction_end() is called. This is because the underlying I/O software utilizes fast mode and high priority.
Transfers data to and from the specified SPI device on a SPI bus. The transaction must already have been started by calling rtos_spi_master_transaction_start() on the same device instance. This may be called multiple times during a single transaction.
This function may return before the transfer is complete when data_in is NULL, as the actual transfer operation is queued and executed by a thread created by the driver.
Parameters:
ctx – A pointer to the SPI device instance.
data_out – Pointer to the data to transfer to the device. This may be NULL if there is no data to send.
data_in – Pointer to the buffer to save the received data to. This may be NULL if the received data is not needed.
len – The number of bytes to transfer in each direction. This number of bytes must be available in both the data_out and data_in buffers if they are not NULL.
If there is a minimum amount of idle time that is required by the device between transfers within a single transaction, then this may be called between each transfer where a delay is required.
This function will return immediately. If the call for the next transfer happens before the minimum time specified has elapsed, the delay will occur then before the transfer begins.
Note
This must be called during a transaction, otherwise the behavior is unspecified.
Note
Technically the next transfer will occur no earlier than delay_ticks after this function is called, so this should be called immediately following a transfer, rather than immediately before the next.
Parameters:
ctx – A pointer to the SPI device instance.
delay_ticks – The number of reference clock ticks to delay.
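The timing rule described above (the next transfer starts no earlier than delay_ticks after the delay call, and any time that has already passed counts toward the delay) can be modeled as follows. This is a model of the documented behavior only, not driver code; the function name and the use of a free-running 32-bit tick counter are assumptions.

```c
#include <stdint.h>

/*
 * Hypothetical model: ticks that still have to elapse when the next
 * transfer is requested at time `now`, given that the delay was requested
 * at time `delay_set_at`. All values are reference clock ticks from an
 * assumed free-running 32-bit counter; unsigned subtraction keeps the
 * result correct across counter wrap-around.
 */
uint32_t spi_delay_remaining(uint32_t delay_set_at, uint32_t delay_ticks,
                             uint32_t now)
{
    uint32_t elapsed = now - delay_set_at;

    return (elapsed >= delay_ticks) ? 0u : (delay_ticks - elapsed);
}
```

This shows why the delay call is best placed immediately after a transfer: work done by the application between the two transfers then overlaps with the required idle time instead of adding to it.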
The following functions may be used to share a SPI master driver instance with other xcore tiles. Tiles that the driver instance is shared with may call any of the core functions listed above.
Initializes an RTOS SPI master driver instance on a client tile, as well as any number of SPI device instances. This allows a tile that does not own the actual driver instance to use a driver instance on another tile. This will be called instead of rtos_spi_master_init() and rtos_spi_master_device_init(). The host tile that owns the actual instances must simultaneously call rtos_spi_master_rpc_host_init().
Parameters:
spi_master_ctx – A pointer to the SPI master driver instance to initialize.
spi_device_ctx – An array of pointers to SPI device instances to initialize.
spi_device_count – The number of SPI device instances to initialize.
rpc_config – A pointer to an RPC config struct. This must have the same scope as spi_master_ctx.
host_intertile_ctx – A pointer to the intertile driver instance to use for performing the communication between the client and host tiles. This must have the same scope as spi_master_ctx.
Performs additional initialization on a SPI master driver instance to allow client tiles to use the SPI master driver instance. Each client tile that will use this instance must simultaneously call rtos_spi_master_rpc_client_init().
Parameters:
spi_master_ctx – A pointer to the SPI master driver instance to share with clients.
spi_device_ctx – An array of pointers to SPI device instances to share with clients.
spi_device_count – The number of SPI device instances to share.
rpc_config – A pointer to an RPC config struct. This must have the same scope as spi_master_ctx.
client_intertile_ctx – An array of pointers to the intertile driver instances to use for performing the communication between the host tile and each client tile. This must have the same scope as spi_master_ctx.
remote_client_count – The number of client tiles to share this driver instance with.
Configures the RPC for a SPI master driver instance. This must be called by both the host tile and all client tiles.
On the client tiles this must be called after calling rtos_spi_master_rpc_client_init(). After calling this, the client tile may immediately begin to call the core SPI master functions on this driver instance. It does not need to wait for the host to call rtos_spi_master_start().
Parameters:
spi_master_ctx – A pointer to the SPI master driver instance to configure the RPC for.
intertile_port – The port number on the intertile channel to use for transferring the RPC requests and responses for this driver instance. This port must not be shared by any other functions. The port must be the same for the host and all its clients.
host_task_priority – The priority to use for the task on the host tile that handles RPC requests from the clients.
Function pointer type for application provided RTOS SPI slave start callback functions.
These callback functions are optionally called by a SPI slave driver’s thread when it is first started. This gives the application a chance to perform startup initialization from within the driver’s thread. It is a good place for the first call to spi_slave_xfer_prepare().
Param ctx:
A pointer to the associated SPI slave driver instance.
Param app_data:
A pointer to application specific data provided by the application. Used to share data between this callback function and the application.
Function pointer type for application provided RTOS SPI slave transfer done callback functions.
These callback functions are optionally called when a SPI slave driver instance is done transferring data with a master device.
An application can use this to be notified immediately when a transfer has completed. It can then call spi_slave_xfer_complete() with a timeout of 0 from within this callback to get the transfer results.
Param ctx:
A pointer to the associated SPI slave driver instance.
Param app_data:
A pointer to application specific data provided by the application. Used to share data between this callback function and the application.
Prepares an RTOS SPI slave driver instance with buffers for subsequent transfers. Before this is called for the first time, any transfers initiated by a master device will result in all received data over MOSI being dropped, and all data sent over MISO being zeros.
This only needs to be called when the buffers need to be changed. If all transfers will use the same buffers, then this only needs to be called once during initialization.
If the application has not processed the previous transaction, the buffers will be held, and default buffers set by spi_slave_xfer_prepare_default_buffers() will be used if a new transaction starts.
Parameters:
ctx – A pointer to the SPI slave driver instance to use.
rx_buf – The buffer to receive data into for any subsequent transfers.
rx_buf_len – The length in bytes of rx_buf. If the master transfers more than this during a single transfer, then the bytes that do not fit within rx_buf will be lost.
tx_buf – The buffer to send data from for any subsequent transfers.
tx_buf_len – The length in bytes of tx_buf. If the master transfers more than this during a single transfer, zeros will be sent following the last byte of tx_buf.
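The length rules for rx_buf and tx_buf can be illustrated with a small host-side model. This is only a sketch of the behavior described above, not driver code: the function model_spi_slave_xfer() and all of its arguments are hypothetical and exist purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical host-side model of one SPI slave transfer (not part of the SDK).
 * The master clocks xfer_len bytes: mosi[] holds what it sends and miso[]
 * receives what the slave would drive. Returns the number of bytes stored
 * in rx_buf.
 */
size_t model_spi_slave_xfer(const uint8_t *mosi, uint8_t *miso, size_t xfer_len,
                            uint8_t *rx_buf, size_t rx_buf_len,
                            const uint8_t *tx_buf, size_t tx_buf_len)
{
    assert(mosi != NULL && miso != NULL);

    /* Bytes that do not fit within rx_buf are lost. */
    size_t rx_len = xfer_len < rx_buf_len ? xfer_len : rx_buf_len;
    if (rx_len > 0) {
        memcpy(rx_buf, mosi, rx_len);
    }

    /* Zeros are sent following the last byte of tx_buf. */
    size_t tx_len = xfer_len < tx_buf_len ? xfer_len : tx_buf_len;
    if (tx_len > 0) {
        memcpy(miso, tx_buf, tx_len);
    }
    memset(miso + tx_len, 0, xfer_len - tx_len);

    return rx_len;
}
```

In this model, a 6-byte transfer against a 4-byte rx_buf and a 2-byte tx_buf stores the first 4 received bytes and drives 2 data bytes followed by 4 zero bytes.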
Prepares an RTOS SPI slave driver instance with default buffers for subsequent transfers. Before this is called for the first time, any transfers initiated by a master device will result in all received data over MOSI being dropped, and all data sent over MISO being zeros.
This only needs to be called when the buffers need to be changed.
The default buffer will be used in the event that the application has not yet processed the previous transfer. This enables the application to have a default buffer to implement a sort of NACK over SPI in the event that the device was busy and had not yet finished handling the previous transaction before a new one started.
Parameters:
ctx – A pointer to the SPI slave driver instance to use.
rx_buf – The buffer to receive data into for any subsequent transfers.
rx_buf_len – The length in bytes of rx_buf. If the master transfers more than this during a single transfer, then the bytes that do not fit within rx_buf will be lost.
tx_buf – The buffer to send data from for any subsequent transfers.
tx_buf_len – The length in bytes of tx_buf. If the master transfers more than this during a single transfer, zeros will be sent following the last byte of tx_buf.
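The fallback to the default buffers can be pictured with a minimal model of the buffer choice made when a new transaction starts. The struct and function below are hypothetical and only mirror the rule stated above; they do not reflect how the driver stores its state internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the driver state that matters for buffer selection. */
typedef struct {
    uint8_t *user_rx_buf;     /* buffer from the regular prepare call */
    uint8_t *default_rx_buf;  /* buffer from the default-buffer prepare call */
    bool user_buf_held;       /* true until the application processes the last transfer */
} ex_spi_slave_model_t;

/*
 * Returns the receive buffer a new transaction would use in this model.
 * NULL means no buffer is available, so received data is dropped.
 */
uint8_t *ex_select_rx_buf(const ex_spi_slave_model_t *m)
{
    assert(m != NULL);
    if (m->user_buf_held) {
        /* Previous transaction not processed yet: fall back to the default buffer. */
        return m->default_rx_buf;
    }
    return m->user_rx_buf;
}
```

A device could, for example, place a fixed "busy" response in the default transmit buffer so that the master can detect that its request arrived before the previous one had been handled.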
Waits for a SPI transfer to complete. Returns either when the timeout is reached, or when a transfer completes, whichever comes first. If a transfer does complete, then the buffers and the number of bytes read from or written to them are returned via the parameters.
Note
The duration of this callback will affect the minimum duration between SPI transactions.
Parameters:
ctx – A pointer to the SPI slave driver instance to use.
rx_buf – The receive buffer used for the completed transfer. This is set by the function upon completion of a transfer.
rx_len – The number of bytes written to rx_buf. This is set by the function upon completion of a transfer.
tx_buf – The transmit buffer used for the completed transfer. This is set by the function upon completion of a transfer.
tx_len – The number of bytes sent from tx_buf. This is set by the function upon completion of a transfer.
timeout – The number of RTOS ticks to wait for the next transfer to complete. When called from within the “xfer_done” callback, this should be 0.
Return values:
0 – if a transfer completed. All buffers and lengths are set in this case.
-1 – if no transfer completed before the timeout expired. No buffers or lengths are returned in this case.
Sets the driver to use callbacks for all default transactions. This will result in transfers done with the default buffer generating xfer_done callbacks to the application. Default buffer transaction items will then need to be processed with spi_slave_xfer_complete().
Note
This is the default setting.
Parameters:
ctx – A pointer to the SPI slave driver instance to use.
Sets the driver to drop all default transactions. This will result in transfers done with the default buffer not generating xfer_done callbacks to the application. Default buffer transaction items will also no longer need to be processed with spi_slave_xfer_complete().
Parameters:
ctx – A pointer to the SPI slave driver instance to use.
Starts an RTOS SPI slave driver instance. This must only be called by the tile that owns the driver instance. It must be called after starting the RTOS from an RTOS thread.
rtos_spi_slave_init() must be called on this SPI slave driver instance prior to calling this.
Parameters:
spi_slave_ctx – A pointer to the SPI slave driver instance to start.
app_data – A pointer to application specific data to pass to the callback functions.
start – The callback function that is called when the driver’s thread starts. This is optional and may be NULL.
xfer_done – The callback function that is notified when transfers are complete. This is optional and may be NULL.
interrupt_core_id – The ID of the core on which to enable the SPI interrupt. This core should not be shared with threads that disable interrupts for long periods of time, nor enable other interrupts.
priority – The priority of the task that gets created by the driver to call the callback functions. If both callback functions are NULL, then this is unused.
Initializes an RTOS SPI slave driver instance. This must only be called by the tile that owns the driver instance. It should be called before starting the RTOS, and must be called before calling rtos_spi_slave_start().
For timing parameters and maximum clock rate, refer to the underlying HIL IO API.
Parameters:
spi_slave_ctx – A pointer to the SPI slave driver instance to initialize.
io_core_mask – A bitmask representing the cores on which the low level SPI I/O thread created by the driver is allowed to run. Bit 0 is core 0, bit 1 is core 1, etc.
clock_block – The clock block to use for the SPI slave.
cpol – The clock polarity to use.
cpha – The clock phase to use.
p_sclk – The SPI slave’s SCLK port. Must be a 1-bit port.
p_mosi – The SPI slave’s MOSI port. Must be a 1-bit port.
p_miso – The SPI slave’s MISO port. Must be a 1-bit port.
p_cs – The SPI slave’s CS port. Must be a 1-bit port.
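The io_core_mask bit layout (bit 0 is core 0, bit 1 is core 1, and so on) can be sketched with two small helpers. These helpers are hypothetical and not part of the SDK, and the use of a 32-bit mask type is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of the SDK): bit n of the mask selects core n. */
uint32_t ex_core_bit(unsigned core_id)
{
    assert(core_id < 32);
    return (uint32_t)1 << core_id;
}

/* Returns nonzero if the mask allows the I/O thread to run on core_id. */
int ex_core_allowed(uint32_t io_core_mask, unsigned core_id)
{
    return (io_core_mask & ex_core_bit(core_id)) != 0;
}
```

For instance, ex_core_bit(1) | ex_core_bit(2) yields 0x06, a mask that allows the low level I/O thread to run on core 1 or core 2 but not on core 0.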
RTOS_SPI_SLAVE_CALLBACK_ATTR
This attribute must be specified on all RTOS SPI slave callback functions provided by the application.
HIL_IO_SPI_SLAVE_HIGH_PRIO
Set SPI Slave thread to high priority
HIL_IO_SPI_SLAVE_FAST_MODE
Set SPI Slave thread to run in fast mode
struct xfer_done_queue_item
#include <rtos_spi_slave.h>
Internally used struct representing a received data packet.
The members in this struct should not be accessed directly.
struct rtos_spi_slave_struct
#include <rtos_spi_slave.h>
Struct representing an RTOS SPI slave driver instance.
The members in this struct should not be accessed directly.
Writes data to an initialized and started UART instance. Unlike the UART rx, an xcore logical core is not reserved. The UART transmission is a function call, and the function blocks until the stop bit of the last byte to be transmitted has completed. Interrupts are masked during this time to avoid stretching of the waveform. Consequently, the tx consumes cycles from the caller thread.
Parameters:
ctx – A pointer to the UART Tx driver instance to use.
Initialises an RTOS UART tx driver instance. This must only be called by the tile that owns the driver instance. It may be called either before or after starting the RTOS, but must be called before calling rtos_uart_tx_start() or any of the core UART tx driver functions with this instance.
Parameters:
ctx – A pointer to the UART tx driver instance to initialise.
tx_port – The port containing the transmit pin.
baud_rate – The baud rate of the UART in bits per second.
num_data_bits – The number of data bits per frame sent.
parity – The type of parity used. See uart_parity_t above.
stop_bits – The number of stop bits asserted at the end of the frame.
tmr – The resource id of the timer to be used by the UART tx.
Starts an RTOS UART tx driver instance. This must only be called by the tile that owns the driver instance. It may be called either before or after starting the RTOS, but must be called before any of the core UART tx driver functions are called with this instance.
rtos_uart_tx_init() must be called on this UART tx driver instance prior to calling this.
Parameters:
ctx – A pointer to the UART tx driver instance to start.
struct rtos_uart_tx_struct
#include <rtos_uart_tx.h>
Struct representing an RTOS UART tx driver instance.
The members in this struct should not be accessed directly.
The following functions may be used to share a UART Tx driver instance with other xCORE tiles. Tiles that the
driver instance is shared with may call any of the core functions listed above.
Initializes an RTOS UART tx driver instance on a client tile. This allows a tile that does not own the actual driver instance to use a driver instance on another tile. This will be called instead of rtos_uart_tx_init(). The host tile that owns the actual instance must simultaneously call rtos_uart_tx_rpc_host_init().
Parameters:
uart_tx_ctx – A pointer to the UART tx driver instance to initialize.
rpc_config – A pointer to an RPC config struct. This must have the same scope as uart_tx_ctx.
host_intertile_ctx – A pointer to the intertile driver instance to use for performing the communication between the client and host tiles. This must have the same scope as uart_tx_ctx.
Performs additional initialization on a UART tx driver instance to allow client tiles to use the UART tx driver instance. Each client tile that will use this instance must simultaneously call rtos_uart_tx_rpc_client_init().
Parameters:
uart_tx_ctx – A pointer to the UART tx driver instance to share with clients.
rpc_config – A pointer to an RPC config struct. This must have the same scope as uart_tx_ctx.
client_intertile_ctx – An array of pointers to the intertile driver instances to use for performing the communication between the host tile and each client tile. This must have the same scope as uart_tx_ctx.
remote_client_count – The number of client tiles to share this driver instance with.
Configures the RPC for a UART tx driver instance. This must be called by both the host tile and all client tiles.
On the client tiles this must be called after calling rtos_uart_tx_rpc_client_init(). After calling this, the client tile may immediately begin to call the core UART tx functions on this driver instance. It does not need to wait for the host to call rtos_uart_tx_start().
Parameters:
uart_tx_ctx – A pointer to the UART tx driver instance to configure the RPC for.
intertile_port – The port number on the intertile channel to use for transferring the RPC requests and responses for this driver instance. This port must not be shared by any other functions. The port must be the same for the host and all its clients.
host_task_priority – The priority to use for the task on the host tile that handles RPC requests from the clients.
Function pointer type for application provided RTOS UART rx start callback functions.
This callback function is optionally (may be NULL) called by a UART rx driver’s thread when it is first started. This gives the application a chance to perform startup initialization from within the driver’s thread.
Param ctx:
A pointer to the associated UART rx driver instance.
Function pointer type for application provided RTOS UART rx receive callback function.
This callback function is called when a UART rx driver instance has received data to a specified depth. Use xStreamBufferReceive(rtos_uart_rx_ctx->isr_byte_buffer, …) to read the bytes.
Param ctx:
A pointer to the associated UART rx driver instance.
Function pointer type for application provided RTOS UART rx error callback functions.
This callback function is optionally (may be NULL) called when a UART rx driver instance experiences an error in reception. These error types are defined in uart.h of the underlying HIL driver but can be of the following types for the RTOS rx: UART_START_BIT_ERROR, UART_PARITY_ERROR, UART_FRAMING_ERROR, UART_OVERRUN_ERROR.
Param ctx:
A pointer to the associated UART rx driver instance.
Param err_flags:
An 8-bit word containing error flags set during reception of the last frame. See rtos_uart_rx.h for the bit field definitions.
Initializes an RTOS UART rx driver instance. This must only be called by the tile that owns the driver instance. It should be called before starting the RTOS, and must be called before calling rtos_uart_rx_start(). Note that UART rx requires a whole logical core for the underlying HIL UART Rx instance.
Parameters:
uart_rx_ctx – A pointer to the UART rx driver instance to initialize.
io_core_mask – A bitmask representing the cores on which the low level UART Rx thread created by the driver is allowed to run. Bit 0 is core 0, bit 1 is core 1, etc.
rx_port – The port containing the receive pin.
baud_rate – The baud rate of the UART in bits per second.
data_bits – The number of data bits per frame sent.
parity – The type of parity used. See uart_parity_t above.
stop_bits – The number of stop bits asserted at the end of the frame.
tmr – The resource id of the timer to be used by the UART Rx.
Starts an RTOS UART rx driver instance. This must only be called by the tile that owns the driver instance. It must be called after starting the RTOS and from an RTOS thread.
rtos_uart_rx_init() must be called on this UART rx driver instance prior to calling this.
Parameters:
uart_rx_ctx – A pointer to the UART rx driver instance to start.
app_data – A pointer to application specific data to pass to the callback functions available in rtos_uart_rx_struct.
start – The callback function that is called when the driver’s thread starts. This is optional and may be NULL.
rx_complete – The callback function to indicate data received by the UART.
error – The callback function called when a reception error has occurred.
interrupt_core_id – The ID of the core on which to enable the UART rx interrupt.
priority – The priority of the task that gets created by the driver to call the callback functions.
app_rx_buff_size – The size in bytes of the RTOS xstreambuffer used to buffer received words for the application.
UR_COMPLETE_CB_CODE
The callback code bit positions available for RTOS UART Rx.
UR_STARTED_CB_CODE
UR_START_BIT_ERR_CB_CODE
UR_PARITY_ERR_CB_CODE
UR_FRAMING_ERR_CB_CODE
UR_OVERRUN_ERR_CB_CODE
UR_COMPLETE_CB_FLAG
The callback code flag masks available for RTOS UART Rx.
UR_STARTED_CB_FLAG
UR_START_BIT_ERR_CB_FLAG
UR_PARITY_ERR_CB_FLAG
UR_FRAMING_ERR_CB_FLAG
UR_OVERRUN_ERR_CB_FLAG
RX_ERROR_FLAGS
RX_ALL_FLAGS
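The relationship between a callback code (a bit position) and its flag (a mask) can be sketched as follows. The bit positions and every EX_ or ex_ name below are hypothetical values chosen for illustration, and the assumption that each flag is the single bit at its code's position is a reading of the descriptions above; the real definitions are in rtos_uart_rx.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for illustration. */
enum {
    EX_COMPLETE_CB_CODE = 0,
    EX_STARTED_CB_CODE = 1,
    EX_PARITY_ERR_CB_CODE = 2,
    EX_FRAMING_ERR_CB_CODE = 3
};

/* Assumption: a flag mask is the single bit at the corresponding bit position. */
uint32_t ex_flag(unsigned code)
{
    assert(code < 32);
    return (uint32_t)1 << code;
}

/* Combined mask covering the example error conditions. */
uint32_t ex_error_flags(void)
{
    return ex_flag(EX_PARITY_ERR_CB_CODE) | ex_flag(EX_FRAMING_ERR_CB_CODE);
}

/* Returns nonzero if any example error flag is set in a notification value. */
int ex_has_error(uint32_t notification)
{
    return (notification & ex_error_flags()) != 0;
}
```

Grouping masks this way lets a task test for "any error" with a single bitwise AND, which is the kind of role a combined mask such as RX_ERROR_FLAGS would play.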
RTOS_UART_RX_BUF_LEN
The size of the byte buffer between the ISR and the app thread. It needs to be able to hold the bytes received until the app thread is able to service it. This is not the same as app_byte_buffer_size, which can be of any size and is specified by the user at device start. At 1Mbps we get a byte every 10us, so 64B allows 640us for the app thread to respond. Note that the buffer is of size n+1 as required by lib_uart.
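The timing figures above can be reproduced with a short calculation. The helper below is hypothetical and assumes back-to-back frames of a fixed length, for example 10 bits for one start bit, 8 data bits, and one stop bit with no parity.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Time in microseconds to receive n_bytes back to back, assuming each frame
 * is bits_per_frame bits long (start + data + parity + stop bits).
 * The result is rounded down to a whole microsecond.
 */
uint32_t ex_uart_fill_time_us(uint32_t baud_rate, uint32_t bits_per_frame, uint32_t n_bytes)
{
    assert(baud_rate > 0);
    uint64_t total_bits = (uint64_t)bits_per_frame * n_bytes;
    return (uint32_t)((total_bits * 1000000u) / baud_rate);
}
```

With a baud rate of 1000000 and 10-bit frames, one byte takes 10us and 64 bytes take 640us, matching the figures quoted above. At lower baud rates the same buffer gives the app thread proportionally longer to respond.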
RTOS_UART_RX_CALLBACK_ATTR
This attribute must be specified on all RTOS UART rx callback functions provided by the application to allow compiler stack calculation.
RTOS_UART_RX_CALL_ATTR
This attribute must be specified on all RTOS UART functions provided by the application to allow compiler stack calculation.
struct rtos_uart_rx_struct
#include <rtos_uart_rx.h>
Struct representing an RTOS UART rx driver instance.
The members in this struct should not be accessed directly.
This driver can be used to instantiate and control a USB device interface on xcore in an RTOS application.
Unlike most other xcore I/O interface RTOS drivers, only a single USB driver instance may be started. It also does not require an initialization step prior to starting the driver. This is due to an implementation detail in lib_xud, which is what the RTOS USB driver uses at its core.
Requests a transfer on a USB endpoint. This function returns immediately. When the transfer is complete, the application’s ISR callback provided to rtos_usb_start() will be called.
Parameters:
ctx – A pointer to the USB driver instance to use.
endpoint_addr – The address of the endpoint to perform the transfer on.
buffer – A pointer to the buffer to transfer data into for OUT endpoints, or from for IN endpoints. For OUT endpoints, the buffer needs an additional 4 bytes of space. This additional space should not be reflected in the len parameter.
len – The maximum number of bytes to receive for OUT endpoints, or the actual number of bytes to send for IN endpoints.
is_setup – To be set when preparing for the transfer of a setup packet.
Return values:
XUD_RES_OKAY – if the transfer was requested successfully.
XUD_RES_RST – if the transfer was not requested and the USB bus needs to be reset. In this case, the application should reset the USB bus.
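The buffer sizing rule for OUT endpoints can be captured in a small helper. The macro and function names are hypothetical; only the 4 extra bytes and the rule that they are excluded from len come from the parameter description above.

```c
#include <assert.h>
#include <stddef.h>

/* Extra space needed in an OUT endpoint buffer, per the buffer description. */
#define EX_OUT_BUFFER_EXTRA_BYTES 4

/*
 * Hypothetical helper: bytes to allocate for a transfer buffer when len is
 * the value passed as the len parameter. Only OUT endpoints need the extra space.
 */
size_t ex_usb_buffer_alloc_size(size_t len, int is_out_endpoint)
{
    assert(len <= (size_t)-1 - EX_OUT_BUFFER_EXTRA_BYTES);
    return is_out_endpoint ? len + EX_OUT_BUFFER_EXTRA_BYTES : len;
}
```

A 64-byte OUT transfer would therefore use a 68-byte buffer while still passing 64 as len.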
This function will complete a reset on an endpoint. The address of the endpoint to reset must be provided, and may be that of an endpoint of either direction (IN or OUT). If there is an associated endpoint of the opposite direction, however, it will also be reset.
The return value should be inspected to find the new bus-speed.
Parameters:
endpoint_addr – IN or OUT endpoint address to reset.
Return values:
XUD_SPEED_HS – the host has accepted that this device can execute at high speed.
XUD_SPEED_FS – the device is running at full speed.
Sets the USB device’s bus address. This function must be called after a setDeviceAddress request is made by the host, and after the ZLP status is sent.
Parameters:
ctx – A pointer to the USB driver instance to use.
Stalls a USB endpoint. The stall is cleared automatically when a setup packet is received on the endpoint. Otherwise it can be cleared manually with rtos_usb_endpoint_stall_clear().
Parameters:
ctx – A pointer to the USB driver instance to use.
endpoint_addr – The address of the endpoint to stall.
Starts the USB driver instance’s low level USB I/O thread and enables its interrupts on the requested core. This must only be called by the tile that owns the driver instance. It must be called after starting the RTOS from an RTOS thread.
rtos_usb_init() must be called on this USB driver instance prior to calling this.
Parameters:
ctx – A pointer to the USB driver instance to start.
endpoint_count – The number of endpoints that will be used by the application. A single endpoint here includes both its IN and OUT endpoints. For example, if the application uses EP0_IN, EP0_OUT, EP1_IN, EP2_IN, EP2_OUT, EP3_OUT, then the endpoint count specified here should be 4 (endpoint 0 through endpoint 3) regardless of the lack of EP1_OUT and EP3_IN. If these two endpoints were used, the count would still be 4.
If for whatever reason, the application needs to use a particular endpoint number, say only EP6 in addition to EP0, then the count here needs to be 7, even though endpoints 1 through 5 are unused. All unused endpoints must be marked as disabled in the two endpoint type lists endpoint_out_type and endpoint_in_type.
endpoint_out_type – A list of the endpoint types for each output endpoint. Index 0 represents the type for EP0_OUT, and so on. See XUD_EpType in lib_xud. If the endpoint is unused, it must be set to XUD_EPTYPE_DIS.
endpoint_in_type – A list of the endpoint types for each input endpoint. Index 0 represents the type for EP0_IN, and so on. See XUD_EpType in lib_xud. If the endpoint is unused, it must be set to XUD_EPTYPE_DIS.
speed – The speed at which the bus should operate. Either XUD_SPEED_FS or XUD_SPEED_HS. See XUD_BusSpeed_t in lib_xud.
power_source – The source of the device’s power. Either bus powered (XUD_PWR_BUS) or self powered (XUD_PWR_SELF). See XUD_PwrConfig in lib_xud.
interrupt_core_id – The ID of the core on which to enable the USB interrupts.
sof_interrupt_core_id – The ID of the core on which to enable the SOF interrupt. Set to < 0 to disable the SOF interrupt if it is not needed.
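The endpoint_count rule can be expressed as "one more than the highest endpoint number enabled in either direction". The sketch below uses a hypothetical stand-in enum rather than XUD_EpType from lib_xud, so the numeric values are not those of the real library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the endpoint types in lib_xud; only "disabled" matters here. */
typedef enum { EX_EPTYPE_DIS = 0, EX_EPTYPE_USED = 1 } ex_ep_type_t;

/*
 * endpoint_count is one more than the highest endpoint number that is enabled
 * in either direction. Returns 0 if no endpoint is enabled.
 */
size_t ex_endpoint_count(const ex_ep_type_t *out_type, const ex_ep_type_t *in_type,
                         size_t max_eps)
{
    assert(out_type != NULL && in_type != NULL);
    size_t count = 0;
    for (size_t ep = 0; ep < max_eps; ep++) {
        if (out_type[ep] != EX_EPTYPE_DIS || in_type[ep] != EX_EPTYPE_DIS) {
            count = ep + 1;
        }
    }
    return count;
}
```

For the two cases described above, this gives 4 for EP0 through EP3 and 7 when only EP0 and EP6 are used. With the real lib_xud types, unused entries must be set explicitly to XUD_EPTYPE_DIS rather than relying on zero initialization as this stand-in does.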
Initializes an RTOS USB driver instance. This must only be called by the tile that owns the driver instance. It should be called prior to starting the RTOS, and must be called before any of the core USB driver functions are called with this instance.
This will create an RTOS thread that runs lib_xud’s main loop. This thread is created with the highest priority and with preemption disabled.
Note
Due to implementation details of lib_xud, it is only possible to have one USB instance per application. Functionally this is not an issue, as no xcore chips have more than one USB interface.
Note
If using the Tiny USB stack, then this function should not be called directly by the application. The xcore device port for Tiny USB takes care of calling this, as well as all other USB driver functions.
Parameters:
ctx – A pointer to the USB driver instance to initialize.
io_core_mask – A bitmask representing the cores on which the low level USB I/O thread created by the driver is allowed to run. Bit 0 is core 0, bit 1 is core 1, etc.
isr_cb – The callback function for the driver to call when transfers are completed.
isr_app_data – A pointer to application specific data to pass to the application’s ISR callback function isr_cb.
This function may be called to wait for a transfer on a particular endpoint to complete. This requires that the USB instance was initialized with rtos_usb_simple_init().
Parameters:
ctx – A pointer to the USB driver instance to use.
endpoint_addr – The address of the endpoint to wait for.
len – The actual number of bytes transferred. For IN endpoints, this will be the same as the length requested by rtos_usb_endpoint_transfer_start(). For OUT endpoints, it may be less.
timeout – The maximum amount of time to wait for the transfer to complete before returning.
Return values:
XUD_RES_OKAY – if the transfer was completed successfully.
XUD_RES_RST – if the transfer was not able to complete and the USB bus needs to be reset. In this case, the application should reset the USB bus.
XUD_RES_ERR – if there was an unexpected error transferring the data.
Initializes an RTOS USB driver instance. This must only be called by the tile that owns the driver instance. It should be called prior to starting the RTOS, and must be called before any of the core USB driver functions are called with this instance.
This initialization function may be used instead of rtos_usb_init() if the application is not using a USB stack. This allows application threads to wait for transfers to complete with the rtos_usb_simple_transfer_complete() function. The application cannot provide its own ISR callback when initialized with this function. This provides a similar programming interface as a traditional bare metal xcore application using lib_xud.
This will create an RTOS thread that runs lib_xud’s main loop. This thread is created with the highest priority and with preemption disabled.
Note
Due to implementation details of lib_xud, it is only possible to have one USB instance per application. Functionally this is not an issue, as no xcore chips have more than one USB interface.
Parameters:
ctx – A pointer to the USB driver instance to initialize.
io_core_mask – A bitmask representing the cores on which the low level USB I/O thread created by the driver is allowed to run. Bit 0 is core 0, bit 1 is core 1, etc.
RTOS_USB_ENDPOINT_COUNT_MAX
The maximum number of USB endpoint numbers supported by the RTOS USB driver.
RTOS_USB_ISR_CALLBACK_ATTR
This attribute must be specified on the RTOS USB interrupt callback function provided by the application.
struct rtos_usb_ep_xfer_info_t
#include <rtos_usb.h>
Struct to hold USB transfer state data per endpoint, used as the argument to the ISR.
The members in this struct should not be accessed directly.
struct rtos_usb_struct
#include <rtos_usb.h>
Struct representing an RTOS USB driver instance.
The members in this struct should not be accessed directly.
This driver can be used to instantiate an xscope-based trace module in an RTOS
application. The trace module currently supports both a demonstrative ASCII-mode
and Percepio’s Tracealyzer on FreeRTOS. Both modes are dependent on
RTOS-specific hooks/macros to handle the majority of RTOS event recording and
integration.
For general usage of the FreeRTOS trace functionality please refer to FreeRTOS’
documentation here:
RTOS Trace Macros
For basic information on printf debugging using xscope please refer to the tools
guide here:
XSCOPE debugging
The trace driver supports Percepio’s Tracealyzer, a feature rich tool for
working with trace files. This implementation supports Tracealyzer’s
streaming mode; currently, snapshot mode is not supported. The current
underlying trace recording implementation interfaces with the
xscope_core_bytes API function (on Probe 0).
To select Tracealyzer as the trace module’s event recorder, the following must
be set. This can be applied at the CMake project level:
In addition to the configuration steps outlined above, Percepio’s Tracealyzer
streaming mode needs additional function calls to start recording trace data. In
the most basic use-case, the following functions should be called on the XCORE
tile that is to record trace data:
xTraceInitialize();
xTraceEnable(TRC_START);
Note
xTraceInitialize must be called before any RTOS interaction
(before any traced objects are being interacted with). It is advisable to
call it as soon as possible in the application.
Percepio’s Tracealyzer C-unit outputs a streamable file format called
Percepio Streaming Format (PSF). The xscope2psf utility aids in the extraction
of the PSF file from the underlying xscope communication (making it readily
available on the host’s filesystem). This tool can be configured to read from a
VCD (value change dump) file that is generated when specifying the xgdb option
--xscope-file, or it can be configured as an xscope-endpoint when
specifying the --xscope-port <ip:port> option. Both options can be processed
by the Tracealyzer graphical tool either as a post processing step or live.
Note
xscope2psf currently resides in a Tracealyzer example application here:
example.
This is likely to change in the future. Refer to either the README or the
application’s help documentation for usage details.
Note
Currently, the only supported PSF Streaming target connection type is
File System. Ensure this connection type is specified under Tracealyzer’s
Recording Settings.
For general usage of Tracealyzer please refer to Percepio’s documentation here:
Manual
The trace driver supports a basic ASCII mode that is primarily meant as an
example for expanding support to other tracing tools/frameworks. In this mode,
only the following FreeRTOS trace hooks are supported:
traceTASK_SWITCHED_IN
traceTASK_SWITCHED_OUT
This implementation will produce xscope logs for the RTOS task switching. The
underlying xscope API xscope_core_bytes is used for communicating this
information.
To select ASCII mode as the trace module’s event recorder, the following must
be set. This can be applied at the CMake project level:
#define USE_TRACE_MODE TRACE_MODE_XSCOPE_ASCII
Note
xcore_trace.h contains the definition for these modes.
To begin capturing ASCII mode traces, run xgdb with the --xscope-file
option. Task switching events will be recorded to the specified VCD (value
change dump) file.
Starts an RTOS clock control driver instance. This must only be called by the tile that owns the driver instance. It may be called either before or after starting the RTOS, but must be called before any of the core clock control driver functions are called with this instance.
rtos_clock_control_init() must be called on this clock control driver instance prior to calling this.
Parameters:
ctx – A pointer to the clock control driver instance to start.
Initializes an RTOS clock control driver instance. There should only be one per tile. This must only be called by the tile that owns the driver instance. It may be called either before or after starting the RTOS, but must be called before calling rtos_clock_control_start() or any of the core clock control driver functions with this instance.
Parameters:
ctx – A pointer to the clock control driver instance to initialize.
struct rtos_clock_control_struct
#include <rtos_clock_control.h>
Struct representing an RTOS clock control driver instance.
The members in this struct should not be accessed directly.
Sets the tile clock PLL control register value on the tile that owns this driver instance. The value set is calculated from the divider stage 1, multiplier stage, and divider stage 2 values provided.
VCO freq = fosc * (F + 1) / (2 * (R + 1))
VCO must be between 260MHz and 1.3GHz for XS2
Core freq = VCO / (OD + 1)
Refer to the xcore Clock Frequency Control document for more details.
Note: This function does not reset the chip. Instead, it waits for the PLL to settle before re-enabling the chip, which allows for large frequency jumps. This causes a delay while the new setting is applied.
Note: It is up to the application to ensure that it is safe to change the clock.
Parameters:
ctx – A pointer to the clock control driver instance to use.
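The PLL formula can be checked numerically with a short sketch. The helper names are hypothetical, and the 24 MHz oscillator and the R, F, and OD values used in the usage note are assumptions picked for illustration rather than recommended settings.

```c
#include <assert.h>
#include <stdint.h>

/* VCO frequency in Hz: fosc * (F + 1) / (2 * (R + 1)). */
uint64_t ex_pll_vco_hz(uint64_t fosc_hz, unsigned pre_div_r, unsigned mul_f)
{
    return fosc_hz * (mul_f + 1u) / (2u * (pre_div_r + 1u));
}

/* Core frequency in Hz: VCO / (OD + 1). */
uint64_t ex_pll_core_hz(uint64_t fosc_hz, unsigned pre_div_r, unsigned mul_f,
                        unsigned post_div_od)
{
    uint64_t vco = ex_pll_vco_hz(fosc_hz, pre_div_r, mul_f);
    /* The documented VCO range is 260 MHz to 1.3 GHz. */
    assert(vco >= 260000000u && vco <= 1300000000u);
    return vco / (post_div_od + 1u);
}
```

With fosc = 24 MHz, R = 0, F = 99, and OD = 1, the VCO runs at 24 MHz * 100 / 2 = 1200 MHz and the core clock is 1200 MHz / 2 = 600 MHz. Checking the VCO range before writing new values is one way for an application to reject invalid combinations early.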
Gets the divider stage 1, multiplier stage, and divider stage 2 values from the tile clock PLL control register values on the tile that owns this driver instance.
Parameters:
ctx – A pointer to the clock control driver instance to use.
pre_div – A pointer to be populated with the value of R
mul – A pointer to be populated with the value of F
post_div – A pointer to be populated with the value of OD
Gets the local lock for clock control on the tile that owns this driver instance. This is intended for applications to use to prevent clock changes around critical sections.
Parameters:
ctx – A pointer to the clock control driver instance to use.
The following functions may be used to share a clock control driver instance with other xcore tiles. Tiles that the
driver instance is shared with may call any of the core functions listed above.
Initializes an RTOS clock control driver instance on a client tile. This allows a tile that does not own the actual driver instance to use a driver instance on another tile. This will be called instead of rtos_clock_control_init(). The host tile that owns the actual instance must simultaneously call rtos_clock_control_rpc_host_init().
Parameters:
cc_ctx – A pointer to the clock control driver instance to initialize.
rpc_config – A pointer to an RPC config struct. This must have the same scope as cc_ctx.
host_intertile_ctx – A pointer to the intertile driver instance to use for performing the communication between the client and host tiles. This must have the same scope as cc_ctx.
Performs additional initialization on a clock control driver instance to allow client tiles to use the clock control driver instance. Each client tile that will use this instance must simultaneously call rtos_clock_control_rpc_client_init().
Parameters:
cc_ctx – A pointer to the clock control driver instance to share with clients.
rpc_config – A pointer to an RPC config struct. This must have the same scope as cc_ctx.
client_intertile_ctx – An array of pointers to the intertile driver instances to use for performing the communication between the host tile and each client tile. This must have the same scope as cc_ctx.
remote_client_count – The number of client tiles to share this driver instance with.
Configures the RPC for a clock control driver instance. This must be called by both the host tile and all client tiles.
On the client tiles this must be called after calling rtos_clock_control_rpc_client_init(). After calling this, the client tile may immediately begin to call the core clock control functions on this driver instance. It does not need to wait for the host to call rtos_clock_control_start().
cc_ctx – A pointer to the clock control driver instance to configure the RPC for.
intertile_port – The port number on the intertile channel to use for transferring the RPC requests and responses for this driver instance. This port must not be shared by any other functions. The port must be the same for the host and all its clients.
host_task_priority – The priority to use for the task on the host tile that handles RPC requests from the clients.
Starts an RTOS intertile driver instance. It may be called either before or after starting the RTOS, but must be called before any of the core intertile driver functions are called with this instance.
rtos_intertile_init() must be called on this intertile driver instance prior to calling this.
Parameters:
intertile_ctx – A pointer to the intertile driver instance to start.
Initializes an RTOS intertile driver instance. This must be called simultaneously on the two tiles establishing an intertile link. It may be called either before or after starting the RTOS, but must be called before calling rtos_intertile_start() or any of the core RTOS intertile functions with this instance.
This establishes a new streaming channel between the two tiles, using the provided non-streaming channel to bootstrap this.
Parameters:
intertile_ctx – A pointer to the intertile driver instance to initialize.
c – A channel end that is already allocated and connected to channel end on the tile with which to establish an intertile link. After this function returns, this channel end is no longer needed and may be deallocated or used for other purposes.
struct rtos_intertile_t
#include <rtos_intertile.h>
Struct representing an RTOS intertile driver instance.
The members in this struct should not be accessed directly.
struct rtos_intertile_address_t
#include <rtos_intertile.h>
Struct to hold an address to a remote function, consisting of both an intertile instance and a port number. Primarily used by the RPC mechanism in the RTOS drivers.
The buffer returned via msg must be freed by the application using rtos_osal_free().
Note
It is important that no other thread listen on this port simultaneously. If this happens, it is undefined which one will receive the data, and it is possible for a resource exception to occur.
Parameters:
ctx – A pointer to the intertile driver instance to use.
port – The number of the port to listen for data on. Only data sent to this port by the remote tile will be received.
msg – A pointer to the received data is written to this pointer variable. This buffer is obtained from the heap and must be freed by the application using rtos_osal_free().
timeout – The amount of time to wait for data to become available.
Initializes the l2 cache for use by the RTOS l2 cache memory driver.
Cache buffer must be dword aligned
RTOS_L2_CACHE_DIRECT_MAP
Convenience macro that may be used to specify the direct map cache to rtos_l2_cache_init() in place of setup_fn and thread_fn.
RTOS_L2_CACHE_TWO_WAY_ASSOCIATIVE
Convenience macro that may be used to specify the two way associative cache to rtos_l2_cache_init() in place of setup_fn and thread_fn.
RTOS_L2_CACHE_BUFFER_WORDS_DIRECT_MAP
Convenience macro that may be used to specify the size of the cache buffer for a direct map cache. A pointer to the buffer of size RTOS_L2_CACHE_BUFFER_WORDS_DIRECT_MAP should be passed to the cache_buffer argument of rtos_l2_cache_init().
RTOS_L2_CACHE_BUFFER_WORDS_TWO_WAY
Convenience macro that may be used to specify the size of the cache buffer for a two way associative cache. A pointer to the buffer of size RTOS_L2_CACHE_BUFFER_WORDS_TWO_WAY should be passed to the cache_buffer argument of rtos_l2_cache_init().
struct rtos_l2_cache_struct
#include <rtos_l2_cache.h>
Struct representing an RTOS l2 cache driver instance.
The members in this struct should not be accessed directly.
Services a software memory read request from within the software memory fill interrupt handler. This function may be provided by the application when the software memory driver is initialized with the RTOS_SWMEM_READ_FLAG flag. If the application code to satisfy a fill request requires being run from within an RTOS thread, then rtos_swmem_read_request() should be used instead. Both this handler and rtos_swmem_read_request() may be used together. If the ISR handler is able to satisfy the request it should return true. If it is not, but the request can be satisfied from within rtos_swmem_read_request(), then it should return false.
Parameters:
offset – The byte offset into the software memory of the cache line that has had a cache miss.
buf – This function must fill this with SWMEM_EVICT_SIZE_WORDS words of data. Where this data comes from is up to the application. One example is from a flash memory.
Return values:
true – if the fill request was satisfied.
false – if the fill request was not satisfied. This requires that rtos_swmem_read_request() also be provided.
Services a software memory write request from within the software memory fill interrupt handler. This function may be provided by the application when the software memory driver is initialized with the RTOS_SWMEM_WRITE_FLAG flag. If the application code to satisfy an evict request requires being run from within an RTOS thread, then rtos_swmem_write_request() should be used instead. Both this handler and rtos_swmem_write_request() may be used together. If the ISR handler is able to satisfy the request it should return true. If it is not, but the request can be satisfied from within rtos_swmem_write_request(), then it should return false.
Parameters:
offset – The byte offset into the software memory of the cache line that is being evicted.
dirty_mask – A bytewise dirty mask for the data in buf. The least significant bit corresponds to the lowest byte address in buf and each subsequent byte address corresponds to the next least significant bit.
buf – A pointer to a buffer containing SWMEM_EVICT_SIZE_WORDS words of data from the cache line being evicted. It is up to the application what it does with this data. One example is to write it to flash memory.
Return values:
true – if the evict request was satisfied.
false – if the evict request was not satisfied. This requires that rtos_swmem_write_request() also be provided.
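As a rough illustration of how a write handler might honor dirty_mask, the sketch below copies only the dirty bytes of an evicted line into a backing store. The function name apply_dirty_bytes, the backing pointer, and the explicit line_bytes argument are hypothetical and not part of the driver API. It also assumes the mask is no wider than 32 bits, so a line of at most 32 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the RTOS software memory driver API.
 * Copies only the bytes flagged as dirty from an evicted cache line (src)
 * into a backing store. Bit 0 of dirty_mask corresponds to the lowest byte
 * address in src, and each higher bit to the next byte address.
 * Assumes line_bytes <= 32 so that every byte has a mask bit.
 * Returns the number of bytes written.
 */
size_t apply_dirty_bytes(uint8_t *backing, const uint8_t *src,
                         uint32_t dirty_mask, size_t line_bytes)
{
    size_t written = 0;

    for (size_t i = 0; i < line_bytes && i < 32; i++) {
        if (dirty_mask & (UINT32_C(1) << i)) {
            backing[i] = src[i];
            written++;
        }
    }
    return written;
}
```

A real handler would replace the plain byte copy with whatever write the backing store needs, for example a flash page program.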
Services a software memory read request from within the software memory RTOS thread. This function may be provided by the application when the software memory driver is initialized with the RTOS_SWMEM_READ_FLAG flag. If rtos_swmem_read_request_isr() is also implemented, then it will be called first. If it is unable to satisfy the request, then this handler will be called. See the description for rtos_swmem_read_request_isr().
Parameters:
offset – The byte offset into the software memory of the cache line that has had a cache miss.
buf – This function must fill this with SWMEM_EVICT_SIZE_WORDS words of data. Where this data comes from is up to the application. One example is from a flash memory.
Services a software memory write request from within the software memory RTOS thread. This function may be provided by the application when the software memory driver is initialized with the RTOS_SWMEM_WRITE_FLAG flag. If rtos_swmem_write_request_isr() is also implemented, then it will be called first. If it is unable to satisfy the request, then this handler will be called. See the description for rtos_swmem_write_request_isr().
Parameters:
offset – The byte offset into the software memory of the cache line that is being evicted.
dirty_mask – A bytewise dirty mask for the data in buf. The least significant bit corresponds to the lowest byte address in buf and each subsequent byte address corresponds to the next least significant bit.
buf – A pointer to a buffer containing SWMEM_EVICT_SIZE_WORDS words of data from the cache line being evicted. It is up to the application what it does with this data. One example is to write it to flash memory.
void rtos_swmem_start(unsigned priority)
Starts the RTOS software memory driver.
Parameters:
priority – The priority of the task that gets created by the driver to service the software memory.
void rtos_swmem_init(uint32_t init_flags)
Initializes the software memory for use by the RTOS software memory driver.
Parameters:
init_flags – A bitfield consisting of initialization flags.
RTOS_SWMEM_READ_FLAG enables swmem reads.
RTOS_SWMEM_WRITE_FLAG enables swmem writes.
unsigned int rtos_swmem_offset_get()
Return the offset from XS1_SWMEM_BASE to the start of the software memory.
RTOS_SWMEM_READ_FLAG
Flag indicating that software memory reads should be enabled. This should probably always be set when using software memory.
RTOS_SWMEM_WRITE_FLAG
Flag indicating that software memory writes should be enabled. This will not always need to be set, especially if flash is backing the software memory and intended to be read only.
The Device Control Service provides the ability to configure and control an XMOS device from a host over a number of transport layers.
Features of the service include:
Simple read/write API
Fully acknowledged protocol
Supports multiple transports, including I2C and USB.
The table below shows combinations of host and transport mechanisms that are currently supported.
Adding new transport layers and/or hosts is straightforward where the hardware supports it.
This type is used to inform the control library the direction of a control transfer from the transport layer.
Values:
enumerator CONTROL_HOST_TO_DEVICE
enumerator CONTROL_DEVICE_TO_HOST
CONTROL_VERSION
This is the version of the control protocol. It is used to check compatibility.
IS_CONTROL_CMD_READ(c)
Checks if the read bit is set in a command code.
Parameters:
c – [in] The command code to check
Return values:
true – if the read bit in the command is set.
false – if the read bit is not set.
CONTROL_CMD_SET_READ(c)
Sets the read bit on a command code
Parameters:
c – [inout] The command code to set the read bit on.
CONTROL_CMD_SET_WRITE(c)
Clears the read bit on a command code
Parameters:
c – [inout] The command code to clear the read bit on.
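The read bit manipulation described by these three macros can be sketched as below. The names are hypothetical stand-ins rather than the library's own definitions. The sketch assumes that bit 7 of the 8-bit command code is the read bit, as the read command handler description in this reference indicates.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: bit 7 of the command code is the read bit. */
#define EXAMPLE_CMD_READ_BIT 0x80u

/* Hypothetical equivalent of IS_CONTROL_CMD_READ(c). */
static inline bool example_cmd_is_read(uint8_t c)
{
    return (c & EXAMPLE_CMD_READ_BIT) != 0;
}

/* Hypothetical equivalent of CONTROL_CMD_SET_READ(c): sets the read bit. */
static inline uint8_t example_cmd_set_read(uint8_t c)
{
    return (uint8_t)(c | EXAMPLE_CMD_READ_BIT);
}

/* Hypothetical equivalent of CONTROL_CMD_SET_WRITE(c): clears the read bit. */
static inline uint8_t example_cmd_set_write(uint8_t c)
{
    return (uint8_t)(c & (uint8_t)~EXAMPLE_CMD_READ_BIT);
}
```

Under that assumption, command 0x05 becomes 0x85 when marked as a read, and clearing the bit restores 0x05.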
CONTROL_SPECIAL_RESID
This is the special resource ID owned by the control library. It can be used to check the version of the control protocol. Servicers may not register this resource ID.
CONTROL_MAX_RESOURCE_ID
The maximum resource ID. IDs greater than this cannot be registered.
CONTROL_GET_VERSION
The command to read the version of the control protocol. It must be sent to resource ID CONTROL_SPECIAL_RESID.
CONTROL_GET_LAST_COMMAND_STATUS
The command to read the return status of the last command. It must be sent to resource ID CONTROL_SPECIAL_RESID.
DEVICE_CONTROL_HOST_MODE
The mode value to use when initializing a device control instance that is on the same tile as its associated transport layer. These may be connected to device control instances on other tiles that have been initialized with DEVICE_CONTROL_CLIENT_MODE.
DEVICE_CONTROL_CLIENT_MODE
The mode value to use when initializing a device control instance that is not on the same tile as its associated transport layer. These must be connected to a device control instance on another tile that has been initialized with DEVICE_CONTROL_HOST_MODE.
DEVICE_CONTROL_CALLBACK_ATTR
This attribute must be specified on all device control command handler callback functions provided by the application.
Function pointer type for application provided device control read command handler callback functions.
Called by device_control_servicer_cmd_recv() when a read command is received from the transport layer. The command consists of a resource ID, command value, and a payload_len. This handler must respond with a payload of the requested length.
Param resid:
[in] Resource ID. Indicates which resource the command is intended for.
Param cmd:
[in] Command code. Note that this will be in the range 0x80 to 0xFF because bit 7 set indicates a read command.
Param payload:
[out] Payload bytes of length payload_len that will be sent back over the transport layer in response to this read command.
Param payload_len:
[in] Requested size of the payload in bytes.
Param app_data:
[inout] A pointer to application specific data provided to device_control_servicer_cmd_recv(). How and if this is used is entirely up to the application.
Return:
CONTROL_SUCCESS if the handling of the read data by the device was successful. An error code otherwise.
Function pointer type for application provided device control write command handler callback functions.
Called by device_control_servicer_cmd_recv() when a write command is received from the transport layer. The command consists of a resource ID, command value, payload, and the payload’s length.
Param resid:
[in] Resource ID. Indicates which resource the command is intended for.
Param cmd:
[in] Command code. Note that this will be in the range 0x00 to 0x7F because bit 7 clear indicates a write command.
Param payload:
[in] Payload bytes of length payload_len.
Param payload_len:
[in] The number of bytes in payload.
Param app_data:
[inout] A pointer to application specific data provided to device_control_servicer_cmd_recv(). How and if this is used is entirely up to the application.
Return:
CONTROL_SUCCESS if the handling of the write data by the device was successful. An error code otherwise.
Must be called by the transport layer when a new request is received.
Precisely how each of the three command parameters resid, cmd, and payload_len is received is specific to the transport layer and not defined by this library.
Parameters:
ctx – A pointer to the associated device control instance.
resid – The received resource ID.
cmd – The received command value.
payload_len – The length in bytes of the payload that will follow.
Return values:
CONTROL_SUCCESS – if resid has been registered by a servicer.
CONTROL_BAD_COMMAND – if resid has not been registered by a servicer.
Must be called by the transport layer either when it receives a payload, or when it requires a payload to transmit.
Parameters:
ctx – A pointer to the associated device control instance.
payload_buf – A pointer to the payload buffer.
buf_size – A pointer to a variable containing the size of payload_buf. When direction is CONTROL_HOST_TO_DEVICE, no more than this number of bytes will be read from it. When direction is CONTROL_DEVICE_TO_HOST, this will be updated to the number of bytes actually written to payload_buf.
direction – The direction of the payload transfer.
Must be called by the transport layer when it receives a payload and requires a payload to transmit, for example, in a SPI transfer. The error status returned by the servicer handling the command is updated in the first byte of the tx_buf.
Parameters:
ctx – A pointer to the associated device control instance.
rx_buf – A pointer to the receive payload buffer.
rx_size – A variable containing the size of rx_buf. No more than this number of bytes will be read from it.
tx_buf – A pointer to the transmit payload buffer.
tx_size – A pointer to a variable containing the size of tx_buf.
This is called by servicers to wait for and receive any commands received by the transport layer that contain one of the resource IDs registered by the servicer. This is also responsible for responding to read commands.
Parameters:
ctx – A pointer to the device control servicer context to receive commands for.
read_cmd_cb – The callback function to handle read commands for all resource IDs associated with the given servicer.
write_cmd_cb – The callback function to handle write commands for all resource IDs associated with the given servicer.
app_data – A pointer to application specific data to pass along to the provided callback functions. How and if this is used is entirely up to the application.
timeout – The number of RTOS ticks to wait before returning if no command is received.
Return values:
CONTROL_SUCCESS – if a command was successfully received and responded to.
CONTROL_ERROR – if no command is received before the function times out, or if there was a problem communicating back to the transport layer thread.
This must be called on the tile that runs the transport layer for the device control instance, and has initialized it with DEVICE_CONTROL_HOST_MODE. This must be called after calling device_control_start() and before the transport layer is started. It is to be run simultaneously with device_control_servicer_register() from other threads on any tiles associated with the device control instance. The number of servicers that must register is specified by the servicer_count parameter of device_control_init().
Parameters:
ctx – A pointer to the device control instance to register resources for.
timeout – The amount of time in RTOS ticks to wait for all servicers to register their resource IDs with device_control_servicer_register().
Return values:
CONTROL_SUCCESS – if all servicers successfully register their resource IDs before the timeout.
Registers a servicer for a device control instance. Each servicer is responsible for handling any number of resource IDs. All commands received from the transport layer will be forwarded to the servicer that has registered the resource ID that is found in the command.
Servicers may be registered on any tile that has initialized a device control instance. This must be called after calling device_control_start().
Parameters:
ctx – A pointer to the device control servicer context to initialize.
device_control_ctx – An array of pointers to the device control instances to register the servicer with.
device_control_ctx_count – The number of device control instances to register the servicer with.
resources – Array of resource IDs to associate with this servicer.
num_resources – The number of resource IDs within resources.
Starts a device control instance. This must be called by all tiles that have called device_control_init(). It may be called either before or after starting the RTOS, but must be called before registering the resources and servicers for this instance.
device_control_init() must be called on this device control instance prior to calling this.
Parameters:
ctx – A pointer to the device control instance to start.
intertile_port – The port to use with any and all associated intertile instances associated with this device control instance. If this device control instance is only used by one tile then this is unused.
priority – The priority of the task that will be created if the device control instance was initialized with DEVICE_CONTROL_CLIENT_MODE. This is unused on the tiles where this has been initialized with DEVICE_CONTROL_HOST_MODE. This task is used to listen for commands for a resource ID registered by a servicer running on this tile, but received by the transport layer that is running on another.
This must be called by the tile that runs the transport layer (I2C, USB, etc) for the device control instance, as well as all tiles that will register device control servicers for it. It may be called either before or after starting the RTOS, but must be called before calling device_control_start().
Parameters:
ctx – A pointer to the device control context to initialize.
mode – Set to DEVICE_CONTROL_HOST_MODE if the command transport layer is on the same tile. Set to DEVICE_CONTROL_CLIENT_MODE if the command transport layer is on another tile.
servicer_count – The number of servicers that will be associated with this device control instance.
intertile_ctx – An array of intertile contexts used to communicate with other tiles.
intertile_count – The number of intertile contexts in the intertile_ctx array. When mode is DEVICE_CONTROL_HOST_MODE, this may be 0 if there are no servicers on other tiles, or up to one per device control instance that has been initialized with DEVICE_CONTROL_CLIENT_MODE on other tiles. When mode is DEVICE_CONTROL_CLIENT_MODE then this must be 1, and the intertile context must connect to a device control instance on another tile that has been initialized with DEVICE_CONTROL_HOST_MODE.
Returns:
CONTROL_SUCCESS if the initialization was successful. An error status otherwise.
struct device_control_t
#include <device_control.h>
Struct representing a device control instance.
The members in this struct should not be accessed directly.
struct device_control_client_t
#include <device_control.h>
A device_control_t pointer may be cast to a pointer to this structure type and used with the device control API, provided it is initialized with DEVICE_CONTROL_CLIENT_MODE. This is not necessary to do, but will save a small amount of memory.
struct device_control_servicer_t
#include <device_control.h>
Struct representing a device control servicer instance.
The members in this struct should not be accessed directly.
XCORE ® -VOICE Solutions$$$RTOS Programming Guide$$$API Reference$$$RTOS Services$$$Device Control$$$Transport protocol for control parameters£££modules/rtos/doc/programming_guide/reference/rtos_services/device_control/device_control_protocol.html#transport-protocol-for-control-parameters
Control parameters are converted to an array of bytes in network byte
order (big endian) before they’re sent over the transport protocol. For
example, to set a control parameter to integer value 305419896 which
corresponds to hex 0x12345678, the array of bytes sent over the
transport protocol would be {0x12, 0x34, 0x56, 0x78}. Similarly, a 4
byte payload {0x00, 0x01, 0x23, 0x22} read over the transport protocol
is interpreted as an integer value 0x00012322.
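The byte ordering described above can be sketched as follows. The helper names pack_u32_be and unpack_u32_be are hypothetical and are not part of the device control library.

```c
#include <stdint.h>

/* Hypothetical helper: writes a 32-bit value as 4 bytes in network
 * (big endian) byte order, most significant byte first. */
void pack_u32_be(uint32_t value, uint8_t out[4])
{
    out[0] = (uint8_t)(value >> 24);
    out[1] = (uint8_t)(value >> 16);
    out[2] = (uint8_t)(value >> 8);
    out[3] = (uint8_t)(value);
}

/* Hypothetical helper: reassembles a 32-bit value from 4 big endian bytes. */
uint32_t unpack_u32_be(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) |
           ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |
           ((uint32_t)in[3]);
}
```

Shifting rather than casting the buffer to an integer pointer keeps the result independent of the host's own endianness and alignment rules.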
In addition to the control parameters values, commands include Resource
ID, the Command ID and Payload Length fields that must be communicated
from the host to the device. The Resource ID is an 8-bit identifier that
identifies the resource within the device that the command is for. The
Command ID is an 8-bit identifier used to identify a command for a
resource in the device. Payload length is the length of the data in
bytes that the host wants to write to the device or read from the
device.
The payload length is interpreted differently for GET_ and SET_
commands. For SET_ commands, the payload length is simply the number of
bytes worth of control parameters to write to the device. For example,
the payload length for a SET_ command to set a control parameter of
type int32 to a certain value, would be set to 4. For GET_ commands the
payload length is 1 more than the number of bytes of control parameters
to read from the device. For example, a GET_ command to read a
parameter of type int32, payload length would be set to 5. The one extra
byte is used for status and is the first byte (payload[0]) of the
payload received from the device. In the example above, payload[0] would
be the status byte and payload[1]..payload[4] would be the 4 bytes that
make up the value of the control parameter.
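A minimal host-side sketch of these payload length rules is shown below, together with splitting a 5-byte GET_ response into its status byte and int32 value. All function names are hypothetical. The value bytes are assumed to be in network byte order, as described for control parameters in general.

```c
#include <stddef.h>
#include <stdint.h>

/* Payload length for a SET_ command: just the parameter bytes. */
size_t set_payload_len(size_t param_bytes)
{
    return param_bytes;
}

/* Payload length for a GET_ command: parameter bytes plus one status byte. */
size_t get_payload_len(size_t param_bytes)
{
    return param_bytes + 1;
}

/*
 * Splits a GET_ response for an int32 parameter.
 * payload[0] is the status byte; payload[1]..payload[4] hold the value,
 * most significant byte first. Returns the status byte and writes the
 * decoded value to *value. The caller decides whether the value is
 * meaningful based on the returned status.
 */
uint8_t decode_get_int32(const uint8_t payload[5], int32_t *value)
{
    uint32_t raw = ((uint32_t)payload[1] << 24) |
                   ((uint32_t)payload[2] << 16) |
                   ((uint32_t)payload[3] << 8)  |
                   ((uint32_t)payload[4]);

    *value = (int32_t)raw;
    return payload[0];
}
```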
The table below lists the different values of the status byte and the
action the user is expected to take for each status:
Values for returned status byte

| Return code  | Value | Description |
|--------------|-------|-------------|
| ctrl_done    | 0     | Read command successful. The payload bytes contain valid payload returned from the device |
| ctrl_wait    | 1     | Read command not serviced. Retry until ctrl_done status returned |
| ctrl_invalid | 3     | Error in read command. Abort and debug |
GET_ commands need the extra status byte because the device might not
return the control parameter value immediately due to timing
constraints. In that case the status byte is ctrl_wait, and the user is
expected to retry the GET_ command until the status is returned as
ctrl_done. The first GET_ command is placed in a queue and will be
serviced by the end of the current 15 ms audio frame. Once the status
byte indicates ctrl_done, the rest of the bytes in the payload hold the
control parameter value.
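The retry behaviour can be sketched on the host side as below. The get_once_fn hook stands in for whichever transport issues the GET_ command, and fake_get_once is a stand-in device used only to exercise the loop. Both are hypothetical, as is the max_attempts bound, which is added here so that the loop cannot spin forever.

```c
#include <stddef.h>
#include <stdint.h>

/* Status byte values from the table of returned status bytes. */
enum { CTRL_DONE = 0, CTRL_WAIT = 1, CTRL_INVALID = 3 };

/* Hypothetical transport hook: issues one GET_ command and fills
 * payload, with the status byte in payload[0]. */
typedef void (*get_once_fn)(uint8_t *payload, size_t payload_len, void *user);

/*
 * Repeats the GET_ command while the device answers ctrl_wait.
 * Returns the last status seen: CTRL_DONE on success, CTRL_INVALID on
 * error, or CTRL_WAIT if max_attempts was exhausted.
 */
int get_with_retry(get_once_fn get_once, void *user,
                   uint8_t *payload, size_t payload_len,
                   unsigned max_attempts)
{
    int status = CTRL_WAIT;

    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        get_once(payload, payload_len, user);
        status = payload[0];
        if (status != CTRL_WAIT) {
            break;
        }
    }
    return status;
}

/* Stand-in device for demonstration only: answers ctrl_wait until the
 * counter pointed to by user reaches zero, then ctrl_done with dummy data. */
void fake_get_once(uint8_t *payload, size_t payload_len, void *user)
{
    unsigned *remaining_waits = user;

    if (*remaining_waits > 0) {
        (*remaining_waits)--;
        payload[0] = CTRL_WAIT;
        return;
    }
    payload[0] = CTRL_DONE;
    for (size_t i = 1; i < payload_len; i++) {
        payload[i] = (uint8_t)i;
    }
}
```

A real host would usually also pause between attempts, since the value only becomes available once the queued command has been serviced.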
XCORE ® -VOICE Solutions$$$RTOS Programming Guide$$$API Reference$$$RTOS Services$$$Device Control$$$Transporting control parameters over I2C£££modules/rtos/doc/programming_guide/reference/rtos_services/device_control/device_control_protocol.html#transporting-control-parameters-over-i2c
This section describes the I2C command sequence when issuing read and
write commands to the device.
The first byte sent over I2C after start contains the device address and
information about whether this is an I2C read transaction or a write
transaction. This byte is 0x58 for a write command or 0x59 for a read
command. These values are derived by left shifting the device address
(0x2c) by 1 and doing a bitwise OR of the resulting value with 0 for an
I2C write and 1 for an I2C read.
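The derivation of the first I2C byte can be written as a short sketch. The function and macro names are hypothetical. The device address 0x2c is the one given above.

```c
#include <stdint.h>

/* 7-bit device address from the text above. */
#define EXAMPLE_I2C_DEVICE_ADDR 0x2Cu

/*
 * Builds the first byte of an I2C transaction: the 7-bit device address
 * shifted left by one, with the least significant bit set to 1 for a
 * read transaction and 0 for a write transaction.
 */
uint8_t i2c_address_byte(uint8_t device_addr, int is_read)
{
    return (uint8_t)((device_addr << 1) | (is_read ? 1u : 0u));
}
```

With the address above this gives 0x58 for a write and 0x59 for a read. Many I2C host APIs take the unshifted 7-bit address and build this byte themselves, so check which form the host driver expects.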
The byte sequence sent between I2C start and stop for SET_ commands is
shown in the figure below.
For GET_ commands, the I2C command sequence consists of a write
command followed by a read command, with a repeated start between the two
commands. The write command writes the resource ID, command ID and the
expected data length to the device, and the read command reads the status
byte followed by the rest of the payload that makes up the control
parameter value. The figure below shows the I2C byte sequence sent and
received for a GET_ command.
XCORE ® -VOICE Solutions$$$RTOS Programming Guide$$$API Reference$$$RTOS Services$$$Device Control$$$Transporting control parameters over USB£££modules/rtos/doc/programming_guide/reference/rtos_services/device_control/device_control_protocol.html#transporting-control-parameters-over-usb
Use the vendor_id 0x20B1, product_id 0x0020 and interface number 0 to
initialize for USB.
XCORE ® -VOICE Solutions$$$RTOS Programming Guide$$$API Reference$$$RTOS Services$$$Device Control$$$Floating point to fixed point (Q format) conversion£££modules/rtos/doc/programming_guide/reference/rtos_services/device_control/device_control_protocol.html#floating-point-to-fixed-point-q-format-conversion
Numbers with fractional parts can be represented as floating-point or
fixed-point numbers. Floating point formats are widely used but carry
performance overheads. Fixed point formats can improve system efficiency
and are used extensively within the XVF3610. Fixed point numbers have
the position of the decimal point fixed and this is indicated as a part
of the format description.
In this document, Q format is used to describe fixed point number
formats, with the representation given as Qm.n format where m
is the number of bits reserved for the sign and integer part of the
number and n is the number of bits reserved for the fractional part of
the number. The position of the decimal point is a trade-off between the
range of values supported and the resolution provided by the fractional
bits.
The dynamic range of Qm.n format is $-2^{m-1}$ to
$2^{m-1} - 2^{-n}$, with a resolution of $2^{-n}$.
To convert a floating-point format number to Qm.n format
fixed-point number:
Multiply the floating-point number by $2^{n}$
Round the result to the nearest integer
The resulting integer number is the Qm.n fixed-point
representation of the initial floating-point number
To convert a Qm.n fixed-point number to floating-point:
Divide the fixed-point number by $2^{n}$
The resulting decimal number is a floating-point representation of
the fixed-point number.
Converting a number into fixed point format and then back to a floating
point number may introduce an error of up to $\pm 2^{-(n+1)}$.
Example:
To represent a floating-point number 14.765467 in Q8.24 format, the
equivalent fixed-point number would be $14.765467 \times 2^{24} = 247723429.2$,
which rounds to 247723429.
To get back the floating-point number given the Q8.24 number 247723429,
calculate $247723429 \div 2^{24}$, which gives the floating-point
number 14.76546699. The difference of 0.00000001 is within
the error bound of $\pm 2^{-25}$, which is approximately ±0.00000003.
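The conversion steps can be sketched in C as below. The function names float_to_q and q_to_float are hypothetical. The sketch assumes a 32-bit container, scales by 2 to the power n (the number of fractional bits, which is what the Q8.24 example uses), and performs no range check on the input.

```c
#include <stdint.h>

/*
 * Converts a floating-point value to Qm.n fixed point with n fractional
 * bits: multiply by 2^n, then round to the nearest integer.
 * No saturation is applied; the caller must keep x within the range of
 * the chosen format.
 */
int32_t float_to_q(double x, unsigned n)
{
    double scaled = x * (double)(INT64_C(1) << n);

    /* Round half away from zero without depending on libm. */
    return (int32_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Converts a Qm.n fixed-point value back to floating point: divide by 2^n. */
double q_to_float(int32_t q, unsigned n)
{
    return (double)q / (double)(INT64_C(1) << n);
}
```

Calling float_to_q(14.765467, 24) reproduces the Q8.24 value 247723429 from the example, and q_to_float on that value returns a number within the stated error bound of the original.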
The concurrency support sw_service contains a multiple-reader, single-writer lock to support multithreaded applications that need to safely share access to a single hardware or software resource. This implementation supports either reader-preferred or writer-preferred locks.
The generic pipeline service provides a generic construct to create multithreaded pipelines. This can be used to create a variety of sequential operations on data, such as an audio processing pipeline.
generic_pipeline_init() creates stage_count tasks. In the first stage, the application-provided input function pointer is called. The data is then passed to the first stage function. After the first stage function, the data is passed by an RTOS queue to the subsequent stage function. Middle stage functions receive from the previous stage queue, call the stage function, and output to the next stage queue. The last stage function receives from the previous stage queue, calls the stage function, and then calls the output function pointer.
This code snippet is an example of creating a pipeline to consume a buffer.
Example generic pipeline use
static void *input_func(void *input_app_data)
{
    uint32_t *data = pvPortMalloc(100 * sizeof(uint32_t));

    /* Populate some dummy data */
    for (int i = 0; i < 100; i++) {
        data[i] = i;
    }

    return data;
}

static int output_func(void *data, void *output_app_data)
{
    /* Use data here */
    for (int i = 0; i < 100; i++) {
        rtos_printf("val[%d] = %d\n", i, ((uint32_t *)data)[i]);
    }

    return 1; /* Return nonzero value for generic pipeline to implicitly free the packet */
}

static void stage0(void *data)
{
    /* Perform operation on data here */
}

static void stage1(void *data)
{
    /* Perform operation on data here */
}

static void stage2(void *data)
{
    /* Perform operation on data here */
}
The following structures and functions are used to initialize and start a generic pipeline instance.
typedef void *(*pipeline_input_t)(void *input_data)
Function pointer type for application provided generic pipeline input callback functions.
Called by the first generic_pipeline_stage() when the stage wants input data. This data pointer is provided to the first stage function to be processed.
This function will create a multistage pipeline, creating a task per stage and connecting them via queues. Each stage task follows the convention:
Get input data
Process data
Push output data
For the first stage, the input data is provided by the input callback. For the final stage, the output data is provided to the output callback.
Parameters:
input – A function pointer called to get input data
output – A function pointer called to give output data
input_data – A pointer to application specific data to pass to the input callback function
output_data – A pointer to application specific data to pass to the output callback function
stage_functions – An array of stage function pointers
stage_stack_word_sizes – The stack size of each stage. Note: The first stage must have enough stack for the stage function plus the input function. Likewise, the last stage must have enough stack for the stage function plus the output function.
pipeline_priority – The priority of all pipeline tasks
stage_count – The number of stages. The limit is 10 stages.
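The data flow that the pipeline sets up (get input, process in each stage, push output) can be modelled on a host with the single-threaded sketch below. This is not the RTOS implementation: there are no tasks or queues, and every type and function name here is hypothetical. The output callback's nonzero-to-free convention follows the comment in the example snippet for this service.

```c
#include <stddef.h>

/* Hypothetical callback types mirroring the roles of the input,
 * output, and stage function pointers. */
typedef void *(*example_input_t)(void *input_data);
typedef int (*example_output_t)(void *data, void *output_data);
typedef void (*example_stage_t)(void *data);

/*
 * Runs one frame through the stages in order, on the calling thread.
 * Returns the output callback's return value, or -1 if the input
 * callback produced no frame.
 */
int example_pipeline_run_once(example_input_t input, example_output_t output,
                              void *input_data, void *output_data,
                              const example_stage_t *stages, size_t stage_count)
{
    void *frame = input(input_data);

    if (frame == NULL) {
        return -1;
    }
    for (size_t i = 0; i < stage_count; i++) {
        stages[i](frame);
    }
    return output(frame, output_data);
}

/* Demonstration callbacks: the frame is a single int in static storage. */
static int example_frame;

static void *demo_input(void *input_data)
{
    example_frame = *(int *)input_data;
    return &example_frame;
}

static void demo_add_one(void *data) { *(int *)data += 1; }
static void demo_double(void *data)  { *(int *)data *= 2; }

static int demo_output(void *data, void *output_data)
{
    *(int *)output_data = *(int *)data;
    return 0; /* Zero: the frame is static, so nothing should free it. */
}
```

In the real service each stage runs in its own task, so several frames can be in flight at once. The model above only shows the order in which a single frame visits the callbacks.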
XCORE ® -VOICE Solutions$$$RTOS Programming Guide$$$FAQs$$$What is the memory overhead of the FreeRTOS kernel?£££modules/rtos/doc/programming_guide/faq.html#what-is-the-memory-overhead-of-the-freertos-kernel
The FreeRTOS kernel can be configured to require as little as 9kB of RAM (per tile). In a typical application, expect the requirement to be closer to 16kB of RAM (per tile).
XCORE ® -VOICE Solutions$$$RTOS Programming Guide$$$FAQs$$$How do I determine the number of words to allocate for use as a task’s stack?£££modules/rtos/doc/programming_guide/faq.html#how-do-i-determine-the-number-of-words-to-allocate-for-use-as-a-task-s-stack
Since tasks run within FreeRTOS, the RTOS stack requirement must be known at compile time. In FreeRTOS applications on most other microcontrollers, the general practice is to create a task with a large amount of stack, use the FreeRTOS stack debug functions to determine the worst-case runtime stack usage, and then adjust the stack memory value accordingly. The problem with this method is that the stack usage of any given thread varies greatly based on the functions that are called within it, so a code change or compiler optimization change means that the optimal task stack size has to be redetermined. As a result, many FreeRTOS applications waste memory by giving tasks far more stack than they need. Additionally, stack overflow bugs can remain hidden for a long time, and even when they do manifest, the source can be difficult to pinpoint.
The XTC Tools address this issue by creating a symbol that represents the maximum stack requirement of any function at compile time. By using the RTOS_THREAD_STACK_SIZE() macro, for the stack words argument for creating a FreeRTOS task, it is guaranteed that the optimal stack requirement is used, provided that the function does not call function pointers nor can infinitely recurse.
If function pointers are used within a thread, then the application programmer must annotate the code with the appropriate function pointer group attribute. For recursive functions, the only option is to specify the stack manually. See Appendix A - Guiding Stack Size Calculation in the XTC Tools documentation for more information.
Can I use xcore resources like channels, timers and hw_locks?
You are free to use channels, ports, timers, etc… in your FreeRTOS applications. However, some considerations need to be made. The RTOS kernel knows about RTOS primitives. For example, if RTOS thread A attempts to take a semaphore, the kernel is free to schedule other tasks in thread A’s place while thread A is waiting for some other task to give the semaphore. The RTOS kernel does not know anything about xcore resources. For example, if RTOS thread A attempts to recv on a channel, the kernel is not free to schedule other tasks in its place while thread A is waiting for some other task to send to the other end of the channel. A developer should be aware that blocking calls on xcore resources will block a FreeRTOS thread. This may be OK as long as it is carefully considered in the application design. There are a variety of methods to handle the decoupling of xcore and RTOS resources. These can be best seen in the various RTOS drivers, which wrap the realtime IO hardware imitation layer.
One easy mistake to make in FreeRTOS is not providing enough stack space for a created task. A vast number of questions exist online about how to select the FreeRTOS stack size, with the most common answer being to create the task with more than enough stack, force the worst-case stack condition (not always trivial), and then use the FreeRTOS debug function uxTaskGetStackHighWaterMark() to determine how much you can decrease the stack. This method leaves plenty of room for error and must be done at runtime, and therefore on a build-by-build basis. The static analysis tools provided by the XTC Tools greatly simplify this process since they calculate the exact stack required for a given function call. The macro RTOS_THREAD_STACK_SIZE returns the nstackwords symbol for a given thread plus the additional space required for the kernel ISRs. Using this macro for every task creation ensures that there is appropriate stack space for each thread, and thus no stack overflow.
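The runtime high-water-mark method described above works by pre-filling a stack with a known byte and later counting how much of that fill survives. The following host-side sketch imitates the idea on a plain array so the mechanism is visible. It does not call FreeRTOS; the fill value and the function names (stack_paint, stack_high_water_mark) are illustrative assumptions, and it assumes a stack that grows towards lower addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL_BYTE 0xA5u /* illustrative fill pattern */

/* Paint the whole stack region with the fill pattern before the task runs. */
void stack_paint(uint8_t *stack, size_t size_bytes)
{
    assert(stack != NULL);
    memset(stack, STACK_FILL_BYTE, size_bytes);
}

/* Count untouched bytes from the low end of the region. For a stack that
 * grows downwards this is the smallest amount of free stack seen so far. */
size_t stack_high_water_mark(const uint8_t *stack, size_t size_bytes)
{
    assert(stack != NULL);
    size_t unused = 0;
    while (unused < size_bytes && stack[unused] == STACK_FILL_BYTE) {
        unused++;
    }
    return unused;
}
```

The result only reflects the code paths that actually ran, which is why the compile-time calculation is preferable where it is available.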
This library provides a software-defined UART (universal asynchronous receiver transmitter), allowing you to communicate with other UART-enabled devices in your system. A UART is a single-wire-per-direction communications interface allowing either half- or full-duplex communication. The components in this library are controlled via C and behave as a UART transmitter and/or receiver peripheral.
Various configuration options are available including baud rate (individually settable per direction), number of data bits (between 5 and 8), parity (EVEN, ODD or NONE) and number of stop bits (1 or 2). The UART does not support flow control signals. Only a single 1b IO port per UART direction is needed.
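To see how these options affect throughput, the frame length can be worked out from the configuration. This is a minimal host-side sketch, not part of the library: example_parity_t and both function names are hypothetical, and the frame is assumed to consist of one start bit, the data bits, an optional parity bit and the stop bits.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative parity type; the library's own type is uart_parity_t. */
typedef enum {
    EXAMPLE_PARITY_NONE,
    EXAMPLE_PARITY_EVEN,
    EXAMPLE_PARITY_ODD
} example_parity_t;

/* Bits on the wire per frame: start + data + optional parity + stop. */
unsigned uart_frame_bits(unsigned data_bits, example_parity_t parity,
                         unsigned stop_bits)
{
    assert(data_bits >= 5 && data_bits <= 8);
    assert(stop_bits == 1 || stop_bits == 2);
    unsigned parity_bits = (parity == EXAMPLE_PARITY_NONE) ? 0u : 1u;
    return 1u + data_bits + parity_bits + stop_bits;
}

/* Upper bound on whole frames per second for back-to-back frames. */
uint32_t uart_max_frames_per_second(uint32_t baud_rate, unsigned frame_bits)
{
    assert(frame_bits > 0);
    return baud_rate / frame_bits;
}
```

For example, 8 data bits, no parity and 1 stop bit give a 10-bit frame, so 115200 baud carries at most 11520 frames per second.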
The Tx UART supports up to 1152000 baud unbuffered and 576000 baud buffered with a 75MHz logical core. The Rx UART supports up to 700000 baud unbuffered and 422400 baud buffered with a 75MHz logical core. Proportionally higher rates are achievable with a faster logical core clock.
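As a rough planning aid, the proportional statement above can be read as linear scaling from the 75MHz figures. The helper below is a hypothetical sketch under that assumption only; actual limits should be confirmed on hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate a maximum baud rate by scaling a 75MHz figure linearly with the
 * logical core clock. Linear scaling is an assumption, not a measured limit. */
uint32_t uart_scaled_max_baud(uint32_t baud_at_75mhz, uint32_t core_mhz)
{
    assert(core_mhz > 0);
    return (uint32_t)(((uint64_t)baud_at_75mhz * core_mhz) / 75u);
}
```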
The UART receive supports standard error detection including START, PARITY and FRAMING errors. A callback mechanism is included to notify the user of these conditions.
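A PARITY error means the received parity bit does not match the one implied by the data bits. The sketch below shows the standard even/odd parity rule on the host; the function names are illustrative and are not the library's API.

```c
#include <assert.h>
#include <stdint.h>

/* Parity bit for the low data_bits bits of data. Even parity makes the total
 * number of ones (data plus parity bit) even; odd parity makes it odd. */
unsigned uart_example_parity_bit(uint8_t data, unsigned data_bits, int odd)
{
    assert(data_bits >= 5 && data_bits <= 8);
    unsigned ones = 0;
    for (unsigned i = 0; i < data_bits; i++) {
        ones += (data >> i) & 1u;
    }
    unsigned even_bit = ones & 1u;
    return odd ? (even_bit ^ 1u) : even_bit;
}

/* Returns 1 if the received parity bit matches the data, 0 on a parity error. */
int uart_example_parity_ok(uint8_t data, unsigned data_bits, int odd,
                           unsigned received_parity)
{
    return uart_example_parity_bit(data, data_bits, odd) == (received_parity & 1u);
}
```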
The UART may be used in blocking mode, where the call to Tx/Rx does not return until the stop bit is complete. It may also be used in ISR/buffered mode, where the UART Rx and/or Tx operates in the background using a FIFO and callbacks to manage data flow and error conditions. Cycles are stolen from the logical core that set up the interrupt. In ISR/buffered mode, additional callbacks are supported to indicate the UNDERRUN condition when the Tx buffer is empty and the OVERRUN condition when the Rx buffer is full.
UART data wires
Tx – Transmit line controlled by UART Tx
Rx – Receive line controlled by UART Rx
All UART functions can be accessed via the uart.h header:
#include "uart.h"
The following code snippet demonstrates the basic blocking usage of a UART Tx device.
#include <xs1.h>
#include "uart.h"

uart_tx_t uart;
port_t p_uart_tx = XS1_PORT_1A;
hwtimer_t tmr = hwtimer_alloc();
uint8_t tx_data[4] = {0x01, 0x02, 0x04, 0x08};

// Initialize the UART Tx
uart_tx_blocking_init(&uart, p_uart_tx, 115200, 8, UART_PARITY_NONE, 1, tmr);

// Transfer some data
for (int i = 0; i < sizeof(tx_data); i++) {
    uart_tx(&uart, tx_data[i]);
}
The following code snippet demonstrates the usage of a UART Tx device in ISR/buffered mode:
#include <xs1.h>
#include "uart.h"

HIL_UART_TX_CALLBACK_ATTR void tx_empty_callback(void *app_data) {
    volatile int *tx_empty = (volatile int *)app_data;
    *tx_empty = 1;
}

// Named uart_tx_example so that it does not clash with the library's uart_tx()
void uart_tx_example(void) {
    uart_tx_t uart;
    port_t p_uart_tx = XS1_PORT_1A;
    hwtimer_t tmr = hwtimer_alloc();
    uint8_t buffer[64 + 1] = {0}; // Note buffer size plus one
    uint8_t tx_data[4] = {0x01, 0x02, 0x04, 0x08};
    volatile int tx_empty = 0;

    // Initialize the UART Tx
    uart_tx_init(&uart, p_uart_tx, 115200, 8, UART_PARITY_NONE, 1, tmr,
                 buffer, sizeof(buffer), tx_empty_callback, (void *)&tx_empty);

    // Transfer some data
    for (int i = 0; i < sizeof(tx_data); i++) {
        uart_tx(&uart, tx_data[i]);
    }

    // Wait for it to complete
    while (!tx_empty);
}
baud_rate – The baud rate of the UART in bits per second.
data_bits – The number of data bits per frame sent.
parity – The type of parity used. See uart_parity_t above.
stop_bits – The number of stop bits asserted at the end of the frame.
tmr – The resource id of the timer to be used. Polling mode will be used if set to 0.
tx_buff – Pointer to a buffer. Optional. If set to zero the UART will run in blocking mode. If initialised to a valid buffer, the UART will be interrupt driven.
buffer_size_plus_one – Size of the buffer if enabled in tx_buff. Note that the buffer allocation and size argument must be one greater than needed, e.g. buff[65] for a 64 byte buffer.
uart_tx_empty_callback_fptr – Callback function pointer for UART buffer empty in buffered mode.
app_data – A pointer to application specific data provided by the application. Used to share data between this callback function and the application.
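The "size plus one" requirement is typical of ring buffers that keep one slot empty so that a full buffer can be told apart from an empty one. Whether the library's FIFO works exactly this way is an assumption; the sketch below is a generic host-side illustration with hypothetical names (example_fifo_t, fifo_init, fifo_push, fifo_pop).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *buf;
    size_t size_plus_one; /* number of slots in buf (capacity + 1) */
    size_t head;          /* next slot to write */
    size_t tail;          /* next slot to read */
} example_fifo_t;

void fifo_init(example_fifo_t *f, uint8_t *buf, size_t size_plus_one)
{
    assert(f != NULL && buf != NULL && size_plus_one >= 2);
    f->buf = buf;
    f->size_plus_one = size_plus_one;
    f->head = 0;
    f->tail = 0;
}

/* Returns 1 on success, 0 if the FIFO is full. */
int fifo_push(example_fifo_t *f, uint8_t byte)
{
    size_t next = (f->head + 1) % f->size_plus_one;
    if (next == f->tail) {
        return 0; /* writing would make head == tail, which means "empty" */
    }
    f->buf[f->head] = byte;
    f->head = next;
    return 1;
}

/* Returns 1 on success, 0 if the FIFO is empty. */
int fifo_pop(example_fifo_t *f, uint8_t *byte)
{
    if (f->head == f->tail) {
        return 0;
    }
    *byte = f->buf[f->tail];
    f->tail = (f->tail + 1) % f->size_plus_one;
    return 1;
}
```

With a 5-slot array this FIFO holds 4 bytes, which mirrors the buff[65] for a 64 byte buffer rule in the parameter description.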
Define which sets the enum start point of RX errors. This is relied upon by the RTOS drivers and allows optimisation of error handling.
HIL_UART_TX_CALLBACK_ATTR
This attribute must be specified on the UART TX UNDERRUN callback function provided by the application. It ensures the correct stack usage is calculated.
HIL_UART_RX_CALLBACK_ATTR
This attribute must be specified on the UART Rx callback functions (both ERROR and Rx complete callbacks) provided by the application. It ensures the stack usage is calculated correctly.
struct uart_tx_t
#include <uart.h>
Struct to hold a UART Tx context.
The members in this struct should not be accessed directly. Use the API provided instead.
The following code snippet demonstrates the basic usage of a UART Rx device where the function call to Rx returns after the stop bit has been sampled. The function blocks until a complete byte has been received.
#include <xs1.h>
#include <print.h>
#include "uart.h"

// Set by the error callback when a framing error is seen
static volatile int test_abort = 0;

HIL_UART_RX_CALLBACK_ATTR void rx_error_callback(uart_callback_code_t callback_code, void *app_data) {
    switch (callback_code) {
        case UART_START_BIT_ERROR:
            printstrln("UART_START_BIT_ERROR");
            break;
        case UART_PARITY_ERROR:
            printstrln("UART_PARITY_ERROR");
            break;
        case UART_FRAMING_ERROR:
            printstrln("UART_FRAMING_ERROR");
            test_abort = 1;
            break;
        case UART_OVERRUN_ERROR:
            printstrln("UART_OVERRUN_ERROR");
            break;
        case UART_UNDERRUN_ERROR:
            printstrln("UART_UNDERRUN_ERROR");
            break;
        default:
            printstr("Unexpected callback code: ");
            printintln(callback_code);
    }
}

// Named uart_rx_example so that it does not clash with the library's uart_rx()
void uart_rx_example(void) {
    uart_rx_t uart;
    port_t p_uart_rx = XS1_PORT_1B;
    hwtimer_t tmr = hwtimer_alloc();
    char test_rx[16];

    // Initialize the UART Rx
    uart_rx_blocking_init(&uart, p_uart_rx, 115200, 8, UART_PARITY_NONE, 1, tmr,
                          rx_error_callback, &uart);

    // Receive some data
    for (int i = 0; i < sizeof(test_rx); i++) {
        test_rx[i] = uart_rx(&uart);
    }
}
The following code snippet demonstrates the usage of a UART Rx device in ISR/buffered mode:
#include <xs1.h>
#include <print.h>
#include "uart.h"

// Set by the error callback when a framing error is seen
static volatile int test_abort = 0;

HIL_UART_RX_CALLBACK_ATTR void rx_error_callback(uart_callback_code_t callback_code, void *app_data) {
    switch (callback_code) {
        case UART_START_BIT_ERROR:
            printstrln("UART_START_BIT_ERROR");
            break;
        case UART_PARITY_ERROR:
            printstrln("UART_PARITY_ERROR");
            break;
        case UART_FRAMING_ERROR:
            printstrln("UART_FRAMING_ERROR");
            test_abort = 1;
            break;
        case UART_OVERRUN_ERROR:
            printstrln("UART_OVERRUN_ERROR");
            break;
        case UART_UNDERRUN_ERROR:
            printstrln("UART_UNDERRUN_ERROR");
            break;
        default:
            printstr("Unexpected callback code: ");
            printintln(callback_code);
    }
}

HIL_UART_RX_CALLBACK_ATTR void rx_callback(void *app_data) {
    volatile unsigned *bytes_received = (volatile unsigned *)app_data;
    *bytes_received += 1;
}

// Named uart_rx_example so that it does not clash with the library's uart_rx()
void uart_rx_example(void) {
    uart_rx_t uart;
    port_t p_uart_rx = XS1_PORT_1A;
    hwtimer_t tmr = hwtimer_alloc();
    uint8_t buffer[64 + 1] = {0}; // Note buffer size plus one
    volatile unsigned bytes_received = 0;

    // Initialize the UART Rx (complete callback, then error callback)
    uart_rx_init(&uart, p_uart_rx, 115200, 8, UART_PARITY_NONE, 1, tmr,
                 buffer, sizeof(buffer), rx_callback, rx_error_callback,
                 (void *)&bytes_received);

    // Wait for 16 bytes of data
    while (bytes_received < 16);

    // Get the data
    uint8_t test_rx[16];
    for (int i = 0; i < 16; i++) {
        test_rx[i] = uart_rx(&uart);
    }
}
baud_rate – The baud rate of the UART in bits per second.
data_bits – The number of data bits per frame sent.
parity – The type of parity used. See uart_parity_t above.
stop_bits – The number of stop bits asserted at the end of the frame.
tmr – The resource id of the timer to be used. Polling mode will be used if set to 0.
rx_buff – Pointer to a buffer. Optional. If set to zero the UART will run in blocking mode. If initialised to a valid buffer, the UART will be interrupt driven.
buffer_size_plus_one – Size of the buffer if enabled in rx_buff. Note that the buffer allocation and size argument must be one greater than needed, e.g. buff[65] for a 64 byte buffer.
uart_rx_complete_callback_fptr – Callback function pointer for UART rx complete (one word) in buffered mode only. Optionally NULL.
uart_rx_error_callback_fptr – Callback function pointer for UART Rx errors. The error is contained in cb_code in the uart_rx_t struct.
app_data – A pointer to application specific data provided by the application. Used to share data between this callback function and the application.
Initializes a UART Rx I/O interface. This API is fixed to blocking mode, in which the call to uart_rx returns as soon as the stop bit has been sampled.
baud_rate – The baud rate of the UART in bits per second.
data_bits – The number of data bits per frame sent.
parity – The type of parity used. See uart_parity_t above.
stop_bits – The number of stop bits asserted at the end of the frame.
tmr – The resource id of the timer to be used. Polling mode will be used if set to 0.
uart_rx_error_callback_fptr – Callback function pointer for UART Rx errors. The error is contained in cb_code in the uart_rx_t struct.
app_data – A pointer to application specific data provided by the application. Used to share data between the error callback function and the application.
A software defined I2C library that allows you to control an I2C bus via xcore ports. I2C is a two-wire hardware serial interface, first developed by Philips. The components in the library are controlled via C and can either act as I2C master or slave.
The library is compatible with multiple slave devices existing on the same bus. The I2C master component can be used by multiple tasks within the xcore device (each addressing the same or different slave devices).
The library can also be used to implement multiple I2C physical interfaces on a single xcore device simultaneously.
All signals are designed to comply with the timings in the I2C specification.
Note that the following optional parts of the I2C specification are not supported:
Multi-master arbitration
10-bit slave addressing
General call addressing
Software reset
START byte
Device ID
Fast-mode Plus, High-speed mode, Ultra Fast-mode
I2C consists of two signals: a clock line (SCL) and a data line (SDA). Both these signals are open-drain and require external resistors to pull the line up if no device is driving the signal down. The correct value for the resistors can be found in the I2C specification.
All I2C functions can be accessed via the i2c.h header:
#include "i2c.h"
The following code snippet demonstrates the basic usage of an I2C master device.
#include <xs1.h>
#include "i2c.h"

i2c_master_t i2c_ctx;
port_t p_scl = XS1_PORT_1A;
port_t p_sda = XS1_PORT_1B;
uint8_t data[1] = {0x99};

// Initialize the master
i2c_master_init(&i2c_ctx, p_scl, 0, 0, p_sda, 0, 0, 100);

// Write some data
i2c_master_write(&i2c_ctx, 0x33, data, 1, NULL, 1);

// Shutdown
i2c_master_shutdown(&i2c_ctx);
device_addr – The address of the device to write to.
buf – The buffer containing data to write.
n – The number of bytes to write.
num_bytes_sent – The function will set this value to the number of bytes actually sent. On success, this will be equal to n but it will be less if the slave sends an early NACK on the bus and the transaction fails.
send_stop_bit – If this is non-zero then a stop bit will be sent on the bus after the transaction. This is usually required for normal operation. If this parameter is zero then no stop bit will be sent. In this case, no other task can use the component until a stop bit has been sent.
Returns:
I2C_ACK if the write was acknowledged by the device, I2C_NACK otherwise.
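On the bus, a 7-bit device address is sent as the upper seven bits of the first byte, with the read/write flag in bit 0, as defined by the I2C specification. The helper below is a host-side illustration of that framing with a hypothetical name; the library performs this step internally, so applications pass the plain 7-bit address.

```c
#include <assert.h>
#include <stdint.h>

/* Build the first byte of an I2C transaction from a 7-bit address.
 * Bit 0 is 0 for a write and 1 for a read. */
uint8_t i2c_example_address_byte(uint8_t device_addr, int is_read)
{
    assert(device_addr <= 0x7F);
    return (uint8_t)((device_addr << 1) | (is_read ? 1u : 0u));
}
```

For the address 0x33 used in the master snippet, a write starts with 0x66 on the wire and a read with 0x67.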
device_addr – The address of the device to read from.
buf – The buffer to fill with data.
n – The number of bytes to read.
send_stop_bit – If this is non-zero then a stop bit will be sent on the bus after the transaction. This is usually required for normal operation. If this parameter is zero then no stop bit will be sent. In this case, no other task can use the component until a stop bit has been sent.
Returns:
I2C_ACK if the read was acknowledged by the device, I2C_NACK otherwise.
This function will cause a stop bit to be sent on the bus. It should be used to complete/abort a transaction if the send_stop_bit argument was not set when calling the i2c_master_read() or i2c_master_write() functions.
Implements an I2C master device on one or two single or multi-bit ports.
Parameters:
ctx – A pointer to the I2C master context to initialize.
p_scl – The port containing SCL. This may be either the same as or different than p_sda.
scl_bit_position – The bit number of the SCL line on the port p_scl.
scl_other_bits_mask – A value that is ORed into the port value driven to p_scl both when SCL is high and low. The bit representing SCL (as well as SDA if they share the same port) must be set to 0.
p_sda – The port containing SDA. This may be either the same as or different than p_scl.
sda_bit_position – The bit number of the SDA line on the port p_sda.
sda_other_bits_mask – A value that is ORed into the port value driven to p_sda both when SDA is high and low. The bit representing SDA (as well as SCL if they share the same port) must be set to 0.
kbits_per_second – The speed of the I2C bus. The maximum value allowed is 400.
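The bit position and "other bits" mask parameters describe how a single I2C line is placed on a multi-bit port. The following host-side sketch shows how such a port value could be composed from them; it illustrates the parameter description only, and the function name is hypothetical rather than the library's implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Compose the value for a multi-bit port: the line level goes at
 * bit_position and other_bits_mask is ORed in unchanged. The mask must have
 * the line's own bit cleared, as the parameter description requires. */
uint32_t i2c_example_port_value(unsigned line_level, unsigned bit_position,
                                uint32_t other_bits_mask)
{
    assert(bit_position < 32);
    assert((other_bits_mask & (1u << bit_position)) == 0);
    return ((uint32_t)(line_level & 1u) << bit_position) | other_bits_mask;
}
```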
The following code snippet demonstrates the basic usage of an I2C slave device.
#include <xs1.h>
#include "i2c.h"

port_t p_scl = XS1_PORT_1A;
port_t p_sda = XS1_PORT_1B;

// Setup callbacks
// NOTE: See API or SDK examples for more on using the callbacks
i2c_callback_group_t i_i2c = {
    .ack_read_request = (ack_read_request_t) i2c_ack_read_req,
    .ack_write_request = (ack_write_request_t) i2c_ack_write_req,
    .master_requires_data = (master_requires_data_t) i2c_master_req_data,
    .master_sent_data = (master_sent_data_t) i2c_master_sent_data,
    .stop_bit = (stop_bit_t) i2c_stop_bit,
    .shutdown = (shutdown_t) i2c_shutdown,
    .app_data = NULL,
};

// Start the slave device in this thread
// NOTE: You may wish to launch the slave device in a different thread.
// See the XTC Tools documentation reference for lib_xcore.
i2c_slave(&i_i2c, p_scl, p_sda, 0x3c);
i2c_cbg – The I2C callback group pointing to the application’s functions to use for initialization and getting and receiving frames. Also points to application specific data which will be shared between the callbacks.
p_scl – The SCL port of the I2C bus. This should be a 1 bit port. If not, the SCL pin must be at bit 0 and the other bits unused.
p_sda – The SDA port of the I2C bus. This should be a 1 bit port. If not, the SDA pin must be at bit 0 and the other bits unused.
device_addr – The address of the slave device.
I2C_CALLBACK_ATTR
This attribute must be specified on all I2C callback functions provided by the application.
struct i2c_callback_group_t
#include <i2c.h>
Callback group representing callback events that can occur during the operation of the I2C slave task. Must be initialized by the application prior to passing it to one of the I2C tasks.
This function reads from an 8-bit addressed, 8-bit register in an I2C device. The function reads the data by sending the register address followed by reading the register data from the device at the specified device address.
Note
No stop bit is transmitted between the write and the read. The operation is performed as one transaction using a repeated start.
Parameters:
ctx – A pointer to the I2C master context to use.
device_addr – The address of the device to read from.
reg – The address of the register to read from.
result – Indicates whether the read completed successfully. Will be set to I2C_REGOP_DEVICE_NACK if the slave NACKed, and I2C_REGOP_SUCCESS on successful completion of the read.
This function reads from a 16-bit addressed, 8-bit register in an I2C device. The function reads the data by sending the register address followed by reading the register data from the device at the specified device address.
Note
No stop bit is transmitted between the write and the read. The operation is performed as one transaction using a repeated start.
Parameters:
ctx – A pointer to the I2C master context to use.
device_addr – The address of the device to read from.
reg – The address of the register to read from.
result – Indicates whether the read completed successfully. Will be set to I2C_REGOP_DEVICE_NACK if the slave NACKed, and I2C_REGOP_SUCCESS on successful completion of the read.
This function reads from an 8-bit addressed, 16-bit register in an I2C device. The function reads the data by sending the register address followed by reading the register data from the device at the specified device address.
Note
No stop bit is transmitted between the write and the read. The operation is performed as one transaction using a repeated start.
Parameters:
ctx – A pointer to the I2C master context to use.
device_addr – The address of the device to read from.
reg – The address of the register to read from.
result – Indicates whether the read completed successfully. Will be set to I2C_REGOP_DEVICE_NACK if the slave NACKed, and I2C_REGOP_SUCCESS on successful completion of the read.
This function reads from a 16-bit addressed, 16-bit register in an I2C device. The function reads the data by sending the register address followed by reading the register data from the device at the specified device address.
Note
No stop bit is transmitted between the write and the read. The operation is performed as one transaction using a repeated start.
Parameters:
ctx – A pointer to the I2C master context to use.
device_addr – The address of the device to read from.
reg – The address of the register to read from.
result – Indicates whether the read completed successfully. Will be set to I2C_REGOP_DEVICE_NACK if the slave NACKed, and I2C_REGOP_SUCCESS on successful completion of the read.
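For 16-bit register addresses and 16-bit register values, the bytes have to go onto the bus in some order. The sketch below assumes most-significant byte first, which is common for I2C devices but is an assumption here; check the target device's datasheet. The helper names are hypothetical and run on the host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a 16-bit register address into two bus bytes, MSB first (assumed). */
void i2c_example_reg16_to_bytes(uint16_t reg, uint8_t out[2])
{
    assert(out != NULL);
    out[0] = (uint8_t)(reg >> 8);
    out[1] = (uint8_t)(reg & 0xFFu);
}

/* Rebuild a 16-bit register value from two bus bytes, MSB first (assumed). */
uint16_t i2c_example_bytes_to_u16(const uint8_t in[2])
{
    assert(in != NULL);
    return (uint16_t)(((uint16_t)in[0] << 8) | in[1]);
}
```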
This function writes to an 8-bit addressed, 8-bit register in an I2C device. The function writes the data by sending the register address followed by the register data to the device at the specified device address.
Parameters:
ctx – A pointer to the I2C master context to use.
device_addr – The address of the device to write to.
This function writes to a 16-bit addressed, 8-bit register in an I2C device. The function writes the data by sending the register address followed by the register data to the device at the specified device address.
Parameters:
ctx – A pointer to the I2C master context to use.
device_addr – The address of the device to write to.
This function writes to an 8-bit addressed, 16-bit register in an I2C device. The function writes the data by sending the register address followed by the register data to the device at the specified device address.
Parameters:
ctx – A pointer to the I2C master context to use.
device_addr – The address of the device to write to.
This function writes to a 16-bit addressed, 16-bit register in an I2C device. The function writes the data by sending the register address followed by the register data to the device at the specified device address.
Parameters:
ctx – A pointer to the I2C master context to use.
device_addr – The address of the device to write to.
A software defined library that allows you to control an I2S (Inter-IC Sound) bus via xcore ports. I2S is a digital data streaming interface particularly appropriate for transmission of audio data. TDM is a special case of I2S which supports transport of more than two audio channels and is partially included in the library at this time. The components in the library are controlled via C and can act as I2S master, I2S slave or TDM slave.
Note
TDM is only currently supported as a TDM16 slave Tx component. Expansion of this library to support master or slave Rx is possible and can be done on request.
I2S is a protocol between two devices where one is the master and one is the slave which determines who drives the clock lines. The protocol is made up of four signals shown in I2S data wires.
I2S data wires
MCLK – Clock line, driven by external oscillator. This signal is optional.
BCLK – Bit clock. This is a fixed divide of the MCLK and is driven by the master.
LRCLK (or WCLK) – Word clock (or word select). This is driven by the master.
DATA – Data line, driven by one of the slave or master depending on the data direction. There may be several data lines in differing directions.
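Because BCLK is a fixed divide of MCLK, it helps to check that a chosen sample rate and word size give an integer divider. The sketch below is a host-side calculation with hypothetical function names, assuming two channels per data line and no padding bits between words.

```c
#include <assert.h>
#include <stdint.h>

/* Bit clock for stereo I2S: sample rate x 2 channels x bits per word. */
uint32_t i2s_example_bclk_hz(uint32_t sample_rate_hz, unsigned bits_per_word)
{
    assert(bits_per_word >= 1 && bits_per_word <= 32);
    return sample_rate_hz * 2u * bits_per_word;
}

/* MCLK-to-BCLK divider, or 0 if MCLK is not an integer multiple of BCLK. */
uint32_t i2s_example_mclk_bclk_ratio(uint32_t mclk_hz, uint32_t bclk_hz)
{
    assert(bclk_hz > 0);
    return (mclk_hz % bclk_hz == 0) ? (mclk_hz / bclk_hz) : 0u;
}
```

For example, 48 kHz with 32-bit words needs a 3.072 MHz bit clock, which is a divide by 8 from a 24.576 MHz master clock.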
All I2S functions can be accessed via the i2s.h header:
#include "i2s.h"
TDM is a protocol between two devices similar to I2S where one is the master and one is the slave which determines who drives the clock lines. The protocol is made up of four signals shown in TDM data wires.
TDM data wires
MCLK – Clock line, driven by external oscillator. This signal is optional.
BCLK – Bit clock. This is a fixed divide of the MCLK and is driven by the master.
FSYNC – Frame synchronization. Toggles at the start of the TDM data frame. This is driven by the master.
DATA – Data line, driven by one of the slave or master depending on the data direction. There may be several data lines in differing directions.
Currently supported TDM functions can be accessed via the i2s_tdm_slave.h header:
#include "i2s_tdm_slave.h"
The macro I2S_DATA_WIDTH may be set as a compile flag (e.g. -DI2S_DATA_WIDTH=16) to alter the number of bits per word for both the I2S Master and I2S Slave components; this defaults to 32 bits per word. This value may be set to any value between 1 and 32. Correct operation of the I2S components has only currently been verified at 16 and 32 bits per word.
The following structures and functions are used by an I2S master or slave instance.
enum i2s_mode
I2S mode.
This type is used to describe the I2S mode.
Values:
enumerator I2S_MODE_I2S
The LR clock transitions ahead of the data by one bit clock.
enumerator I2S_MODE_LEFT_JUSTIFIED
The LR clock and data are phase aligned.
enum i2s_slave_bclk_polarity
I2S slave bit clock polarity.
Standard I2S is positive, that is toggle data and LR clock on falling edge of bit clock and sample them on rising edge of bit clock. Some masters have it the other way around.
Values:
enumerator I2S_SLAVE_SAMPLE_ON_BCLK_RISING
Toggle falling, sample rising (default if not set)
enumerator I2S_SLAVE_SAMPLE_ON_BCLK_FALLING
Toggle rising, sample falling
enum i2s_restart
Restart command type.
Restart commands that can be signalled to the I2S or TDM component.
Values:
enumerator I2S_NO_RESTART
Do not restart.
enumerator I2S_RESTART
Restart the bus (causes the I2S/TDM to stop and a new init callback to occur allowing reconfiguration of the BUS).
enumerator I2S_SHUTDOWN
Shutdown. This will cause the I2S/TDM component to exit.
The I2S component will call this when it first initializes, either on first run or after a restart.
This will contain the TDM context when in TDM mode.
Param app_data:
Points to application specific data supplied by the application. May be used for context data specific to each I2S task instance.
Param i2s_config:
This structure is provided if the connected component drives an I2S bus. The members of the structure should be set to the required configuration. This is ignored when used in TDM mode.
This callback will be called when a new frame of samples is read in by the I2S task.
Param app_data:
Points to application specific data supplied by the application. May be used for context data specific to each I2S task instance.
Param num_in:
The number of input channels contained within the array.
Param samples:
The samples data array as signed 32-bit values. The component may not have 32-bits of accuracy (for example, many I2S codecs are 24-bit), in which case the bottom bits will be arbitrary values.
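Since the bottom bits of a received sample may be arbitrary when the codec has fewer than 32 valid bits, an application may want to clear them before processing. The helper below is a hypothetical host-side sketch that assumes the valid bits occupy the top of the 32-bit word, as the description above implies.

```c
#include <assert.h>
#include <stdint.h>

/* Keep the top valid_bits of a 32-bit sample and zero the rest.
 * The cast back to int32_t relies on two's-complement conversion, which is
 * what common compilers provide. */
int32_t i2s_example_mask_sample(int32_t sample, unsigned valid_bits)
{
    assert(valid_bits >= 1 && valid_bits <= 32);
    if (valid_bits == 32) {
        return sample;
    }
    uint32_t mask = ~((1u << (32u - valid_bits)) - 1u);
    return (int32_t)((uint32_t)sample & mask);
}
```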
This callback will be called when the I2S task needs a new frame of samples.
Param app_data:
Points to application specific data supplied by the application. May be used for context data specific to each I2S task instance.
Param num_out:
The number of output channels contained within the array.
Param samples:
The samples data array as signed 32-bit values. The component may not have 32-bits of accuracy (for example, many I2S codecs are 24-bit), in which case the bottom bits will be arbitrary values.
I2S_MAX_DATALINES
I2S_CHANS_PER_FRAME
I2S_CALLBACK_ATTR
This attribute must be specified on all I2S callback functions provided by the application.
struct i2s_config
#include <i2s.h>
I2S configuration structure.
This structure describes the configuration of an I2S bus.
struct i2s_callback_group_t
#include <i2s.h>
Callback group representing callback events that can occur during the operation of the I2S task. Must be initialized by the application prior to passing it to one of the I2S tasks.
The TDM component will call this after it first initializes the ports. This gives the app the chance to make adjustments to port timing which are often needed when clocking above 15MHz.
Param i2s_tdm_ctx:
Points to i2s_tdm_ctx_t struct allowing the resources to be modified after they have been enabled and initialised.
I2S_TDM_MAX_POUT_CNT
I2S_TDM_MAX_PIN_CNT
I2S_TDM_MAX_CH_PER_FRAME
TDM_CALLBACK_ATTR
This attribute must be specified on the TDM callback function provided by the application.
struct i2s_tdm_ctx_t
#include <i2s_tdm_slave.h>
Struct to hold an I2S TDM context.
The members in this struct should not be accessed directly.
The following code snippet demonstrates the basic usage of an I2S master device.
#include <xs1.h>
#include "i2s.h"

port_t p_i2s_dout[1];
port_t p_bclk;
port_t p_lrclk;
port_t p_mclk;
xclock_t bclk;
i2s_callback_group_t i2s_cb_group;

// Setup ports and clocks
p_i2s_dout[0] = PORT_I2S_DAC_DATA;
p_bclk = PORT_I2S_BCLK;
p_lrclk = PORT_I2S_LRCLK;
p_mclk = PORT_MCLK_IN;
bclk = I2S_CLKBLK;

port_enable(p_mclk);
port_enable(p_bclk);
// NOTE: p_lrclk does not need to be enabled by the caller

// Setup callbacks
// NOTE: See API or SDK examples for more on using the callbacks
i2s_cb_group.init = (i2s_init_t) i2s_init;
i2s_cb_group.restart_check = (i2s_restart_check_t) i2s_restart_check;
i2s_cb_group.receive = (i2s_receive_t) i2s_receive;
i2s_cb_group.send = (i2s_send_t) i2s_send;
i2s_cb_group.app_data = NULL;

// Start the master device in this thread
// NOTE: You may wish to launch the master device in a different thread.
// See the XTC Tools documentation reference for lib_xcore.
i2s_master(&i2s_cb_group, p_i2s_dout, 1, NULL, 0, p_bclk, p_lrclk, p_mclk, bclk);
This task performs I2S on the provided pins. It will perform callbacks over the i2s_callback_group_t callback group to get/receive frames of data from the application using this component.
The task performs I2S master so will drive the word clock and bit clock lines.
Parameters:
i2s_cbg – The I2S callback group pointing to the application’s functions to use for initialization and getting and receiving frames. Also points to application specific data which will be shared between the callbacks.
p_dout – An array of data output ports
num_out – The number of output data ports
p_din – An array of data input ports
num_in – The number of input data ports
p_bclk – The bit clock output port
p_lrclk – The word clock output port
p_mclk – Input port which supplies the master clock
bclk – A clock that will get configured for use with the bit clock
This task differs from i2s_master() in that bclk must already be configured to the BCLK frequency. Other than that, it is identical.
This task performs I2S on the provided pins. It will perform callbacks over the i2s_callback_group_t callback group to get/receive frames of data from the application using this component.
The task performs I2S master so will drive the word clock and bit clock lines.
Parameters:
i2s_cbg – The I2S callback group pointing to the application’s functions to use for initialization and getting and receiving frames. Also points to application specific data which will be shared between the callbacks.
p_dout – An array of data output ports
num_out – The number of output data ports
p_din – An array of data input ports
num_in – The number of input data ports
p_bclk – The bit clock output port
p_lrclk – The word clock output port
bclk – A clock that is configured externally to be used as the bit clock
The following code snippet demonstrates the basic usage of an I2S slave device.
#include <xs1.h>
#include "i2s.h"

// Setup ports and clocks
port_t p_bclk = XS1_PORT_1B;
port_t p_lrclk = XS1_PORT_1C;
port_t p_din[4] = {XS1_PORT_1D, XS1_PORT_1E, XS1_PORT_1F, XS1_PORT_1G};
port_t p_dout[4] = {XS1_PORT_1H, XS1_PORT_1I, XS1_PORT_1J, XS1_PORT_1K};
xclock_t bclk = XS1_CLKBLK_1;

port_enable(p_bclk);
// NOTE: p_lrclk does not need to be enabled by the caller

// Setup callbacks
// NOTE: See API or SDK examples for more on using the callbacks
i2s_callback_group_t i_i2s = {
    .init = (i2s_init_t) i2s_init,
    .restart_check = (i2s_restart_check_t) i2s_restart_check,
    .receive = (i2s_receive_t) i2s_receive,
    .send = (i2s_send_t) i2s_send,
    .app_data = NULL,
};

// Start the slave device in this thread
// NOTE: You may wish to launch the slave device in a different thread.
// See the XTC Tools documentation reference for lib_xcore.
i2s_slave(&i_i2s, p_dout, 4, p_din, 4, p_bclk, p_lrclk, bclk);
This task performs I2S on the provided pins. It will perform callbacks over the i2s_callback_group_t callback group to get/receive data from the application using this component.
The component performs I2S slave so will expect the word clock and bit clock to be driven externally.
Parameters:
i2s_cbg – The I2S callback group pointing to the application’s functions to use for initialization and getting and receiving frames. Also points to application specific data which will be shared between the callbacks.
p_dout – An array of data output ports
num_out – The number of output data ports
p_din – An array of data input ports
num_in – The number of input data ports
p_bclk – The bit clock input port
p_lrclk – The word clock input port
bclk – A clock that will get configured for use with the bit clock
The following code snippet demonstrates the basic usage of a TDM slave Tx device.
    #include <xs1.h>
    #include "i2s_tdm_slave.h"

    // Setup ports and clocks
    port_t p_bclk = XS1_PORT_1A;
    port_t p_fsync = XS1_PORT_1B;
    port_t p_dout = XS1_PORT_1C;
    xclock_t clk_bclk = XS1_CLKBLK_1;

    // Setup callbacks
    // NOTE: See API or sln_voice examples for more on using the callbacks
    i2s_tdm_ctx_t ctx;
    i2s_callback_group_t i_i2s = {
        .init = (i2s_init_t) i2s_init,
        .restart_check = (i2s_restart_check_t) i2s_restart_check,
        .receive = NULL,
        .send = (i2s_send_t) i2s_send,
        .app_data = NULL,
    };

    // Initialize the TDM slave
    i2s_tdm_slave_tx_16_init(
            &ctx,
            &i_i2s,
            p_dout,
            p_fsync,
            p_bclk,
            clk_bclk,
            0,
            I2S_SLAVE_SAMPLE_ON_BCLK_FALLING,
            NULL);

    // Start the slave device in this thread
    // NOTE: You may wish to launch the slave device in a different thread.
    // See the XTC Tools documentation reference for lib_xcore.
    i2s_tdm_slave_tx_16_thread(&ctx);
i2s_cbg – The I2S callback group pointing to the application’s functions to use for initialization and getting and receiving frames. For TDM the app_data variable within this struct is NOT used.
p_dout – The data output port. MUST be a 1b port
p_fsync – The fsync input port. MUST be a 1b port
p_bclk – The bit clock input port. MUST be a 1b port
bclk – A clock that will get configured for use with the bit clock
tx_offset – The number of bclks from FSYNC transition to the MSB of Slot 0
slave_bclk_pol – The polarity of bclk
tdm_post_port_init – Callback to be called just after resource init. Allows for modification of port timing for >15MHz clocks. Set to NULL if not needed.
This task performs I2S TDM slave on the provided context which was initialized with i2s_tdm_slave_tx_16_init(). It will perform callbacks over the i2s_callback_group_t callback group to get data from the application using this component.
This thread assumes 1 data output port, 32b word length, 32b channel length, and 16 channels per frame.
The component performs I2S TDM slave so will expect the fsync and bit clock to be driven externally.
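The frame layout assumed by this thread (32-bit channels, 16 channels per frame) fixes the ratio between the frame rate and the bit clock that the external master must supply. The sketch below works that arithmetic out; `tdm_bclk_hz` is a hypothetical helper written for illustration and is not part of the I2S library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): bit clock rate implied by a
 * TDM frame layout. Each frame carries channels_per_frame slots of
 * bits_per_channel bits, and one frame is sent per audio sample period. */
uint32_t tdm_bclk_hz(uint32_t frame_rate_hz,
                     uint32_t channels_per_frame,
                     uint32_t bits_per_channel)
{
    assert(channels_per_frame > 0 && bits_per_channel > 0);
    return frame_rate_hz * channels_per_frame * bits_per_channel;
}
```

With the 16 x 32-bit layout described above, a 48 kHz frame rate corresponds to a 24.576 MHz bit clock, which is in the range where the tdm_post_port_init callback may be worth considering.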
A software defined SPI (serial peripheral interface) library that allows you to control a SPI bus via the xcore GPIO hardware-response ports. SPI is a four-wire hardware bi-directional serial interface. The components in the library are controlled via C and can either act as SPI master or slave.
The SPI bus can be used by multiple tasks within the xcore device (each addressing the same or different slaves) and is compatible with other slave devices on the same bus.
The SPI protocol requires a clock, one or more slave selects and either one or two data wires.
SPI data wires:
SCLK – Clock line, driven by the master
MOSI – Master Output, Slave Input data line, driven by the master
MISO – Master Input, Slave Output data line, driven by the slave
SS – Slave select line, driven by the master
All SPI functions can be accessed via the spi.h header:
The following code snippet demonstrates the basic usage of an SPI master device.
    #include <xs1.h>
    #include "spi.h"

    spi_master_t spi_ctx;
    spi_master_device_t spi_dev;
    port_t p_miso = XS1_PORT_1A;
    port_t p_ss[1] = {XS1_PORT_1B};
    port_t p_sclk = XS1_PORT_1C;
    port_t p_mosi = XS1_PORT_1D;
    xclock_t cb = XS1_CLKBLK_1;

    uint8_t tx[4] = {0x01, 0x02, 0x04, 0x08};
    uint8_t rx[4];

    // Initialize the master device
    spi_master_init(&spi_ctx, cb, p_ss[0], p_sclk, p_mosi, p_miso);
    spi_master_device_init(&spi_dev, &spi_ctx,
        1,
        SPI_MODE_0,
        spi_master_source_clock_ref,
        0,
        spi_master_sample_delay_0,
        0, 0, 0, 0);

    // Transfer some data
    spi_master_start_transaction(&spi_ctx);
    spi_master_transfer(&spi_ctx, (uint8_t *)tx, (uint8_t *)rx, 4);
    spi_master_end_transaction(&spi_ctx);
This callback function will be called when the SPI master has asserted this slave’s chip select.
The input and output buffer may be the same; however, partial byte/incomplete reads will result in out_buf bits being masked off due to a partial bit output.
app_data – A pointer to application specific data provided by the application. Used to share data between callbacks.
This callback function will be called when the SPI master has de-asserted this slave’s chip select.
The value of bytes_read contains the number of full bytes that are in in_buf. When read_bits is greater than 0, the byte after the last full byte contains the partial bits read.
app_data – A pointer to application specific data provided by the application. Used to share data between callbacks.
out_buf – The buffer that had been provided to be sent to the master
bytes_written – The length in bytes of out_buf that had been written
in_buf – The buffer that had been provided to be received into from the master
bytes_read – The length in bytes of in_buf that has been read in to
Note: To guarantee timing in all situations, the SPI I/O interface implicitly sets the fast mode and high priority status register bits for the duration of SPI operations. This may reduce the MIPS of other threads based on overall system setup.
Initialize a SPI device. Multiple SPI devices may be initialized per SPI interface. Each must be on a unique pin of the interface’s chip select port.
Parameters:
dev – The context representing the device to initialize.
spi – The context representing the SPI master interface that the device is connected to.
cs_pin – The bit number of the chip select port that is connected to the device’s chip select pin.
cpol – The clock polarity required by the device.
cpha – The clock phase required by the device.
source_clock – The source clock to derive SCLK from. See spi_master_source_clock_t.
clock_divisor – The value to divide the source clock by. The frequency of SCLK will be set to F_src / (4 * clock_divisor) when clock_divisor > 0, and to F_src / 2 when clock_divisor = 0, where F_src is the frequency of the source clock.
miso_sample_delay – When to sample MISO. See spi_master_sample_delay_t.
miso_pad_delay – The number of core clock cycles to delay sampling the MISO pad during a transaction. This allows for more fine grained adjustment of sampling time. The value may be between 0 and 5.
cs_to_clk_delay_ticks – The minimum number of reference clock ticks between assertion of chip select and the first clock edge.
clk_to_cs_delay_ticks – The minimum number of reference clock ticks between the last clock edge and de-assertion of chip select.
cs_to_cs_delay_ticks – The minimum number of reference clock ticks between transactions, which is between de-assertion of chip select and the end of one transaction, and its re-assertion at the beginning of the next.
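The clock_divisor rule above can be checked with a few lines of arithmetic. The following is a minimal sketch. `spi_sclk_hz` is a hypothetical helper, not part of the SPI library, and the 100 MHz source clock used in the usage note is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): SCLK frequency in Hz implied
 * by the clock_divisor rule described for the SPI master device. */
uint32_t spi_sclk_hz(uint32_t f_src_hz, uint32_t clock_divisor)
{
    assert(f_src_hz > 0);
    if (clock_divisor == 0) {
        return f_src_hz / 2u;                /* F_src / 2 */
    }
    return f_src_hz / (4u * clock_divisor);  /* F_src / (4 * clock_divisor) */
}
```

Assuming a 100 MHz source clock, a divisor of 0 gives a 50 MHz SCLK and a divisor of 25 gives 1 MHz. Note that a divisor of 1 yields F_src / 4, which is slower than the divisor-0 case.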
The following code snippet demonstrates the basic usage of an SPI slave device.
    #include <xs1.h>
    #include "spi.h"

    // Setup callbacks
    // NOTE: See API or SDK examples for more on using the callbacks
    spi_slave_callback_group_t spi_cbg = {
        .slave_transaction_started = (slave_transaction_started_t) start,
        .slave_transaction_ended = (slave_transaction_ended_t) end,
        .app_data = NULL
    };

    port_t p_miso = XS1_PORT_1A;
    port_t p_cs = XS1_PORT_1B;
    port_t p_sclk = XS1_PORT_1C;
    port_t p_mosi = XS1_PORT_1D;
    xclock_t cb = XS1_CLKBLK_1;

    // Start the slave device in this thread
    // NOTE: You may wish to launch the slave device in a different thread.
    // See the XTC Tools documentation reference for lib_xcore.
    spi_slave(&spi_cbg, p_sclk, p_mosi, p_miso, p_cs, cb, SPI_MODE_0);
The CS to first clock minimum delay, sometimes referred to as setup time, will vary based on the duration of the slave_transaction_started callback. This parameter will be application specific. To determine the typical value, time the duration of the slave_transaction_started callback, and add 2000ns as a safety factor. If slave_transaction_started has a non-deterministic runtime, perhaps due to waiting on an XCORE resource, then the application developer must decide on an appropriate CS to first SCLK specification.
The minimum delay between consecutive transactions varies based on SPI mode, and if MISO is used.
p_sclk – The SPI slave’s SCLK port. Must be a 1-bit port.
p_mosi – The SPI slave’s MOSI port. Must be a 1-bit port.
p_miso – The SPI slave’s MISO port. Must be a 1-bit port.
p_cs – The SPI slave’s CS port. Must be a 1-bit port.
clock_block – The clock block to use for the SPI slave.
cpol – The clock polarity to use.
cpha – The clock phase to use.
struct spi_slave_callback_group_t
#include <spi.h>
Callback group representing callback events that can occur during the operation of the SPI slave task. Must be initialized by the application prior to passing it to one of the SPI slaves.
lib_mic_array is a library for capturing and processing PDM microphone data
on xcore.ai devices.
PDM microphones are a kind of ‘digital microphone’ which captures audio data as
a stream of 1-bit samples at a very high sample rate. The high sample rate PDM
stream is captured by the device, filtered and decimated to a 32-bit PCM audio
stream.
First stage has fixed tap count of 256 and decimation factor of 32
Second stage has fully configurable tap count and decimation factor
Custom filter coefficients can be used for either stage
Reference filter with total decimation factor of 192 is provided (16 kHz
output sample rate w/ 3.072 MHz PDM clock).
Filter generation scripts and examples are included to support 32 kHz and 48 kHz.
Supports 1-, 4- and 8-bit ports.
Supports 1 to 16 microphones
Includes ability to capture samples on a subset of a port’s pins (e.g. 3 PDM
microphones may be used with a 4- or 8-bit port)
Also supports microphone channel index remapping
Optional DC offset elimination filter
Sample framing with user selectable frame size (down to single samples)
Most configurations require only a single hardware thread
This section gives a brief overview of the steps to process a PDM audio stream
into a PCM audio stream. This section is concerned with the steady state
behavior and does not describe any necessary initialization steps. The high level
process view is depicted in the figure Mic Array High Level Process.
The mic array unit uses two different execution contexts. The first is the PDM
rx service (“PDM rx”), which is responsible for reading PDM samples from the
physical port, and has relatively little work to do, but also has a strict
real-time constraint on reading port data in a timely manner. The second is the
decimation thread, which is where all processing other than PDM capture is
performed.
This two-context model relaxes the need for tight coupling and synchronization
between PDM rx and the decimation thread, allowing significant flexibility in
how samples are processed in the decimation thread.
PDM rx is typically run within an interrupt on the same hardware core as the
decimation thread, but it can also be run as a separate thread in cases where
many channels result in a high processing load.
Likewise, the decimators may be split into multiple parallel hardware threads
in the case where the processing load exceeds the MIPS available in a single
thread.
The PDM data signal is captured by the xcore.ai device’s port hardware. The port
receiving the PDM signals buffers the received samples. Each time the port
buffer is filled, PDM rx reads the received samples.
Samples are collected word-by-word and assembled into blocks. Each time a block
has been filled, the block is transferred to the decimation thread where all
remaining mic array processing takes place.
The size of PDM data blocks varies depending upon the configured number of
microphone channels and the configured second stage decimator’s decimation
factor. Each PDM data block will contain exactly enough PDM samples to produce
one new mic array (multi-channel) output sample.
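The block size follows from the decimation factors: one output sample per channel needs 32 PDM samples for each of the K first-stage outputs consumed by the second stage. A minimal sketch of that arithmetic is shown below. `pdm_block_words` is a hypothetical helper for illustration only, and it assumes the block is counted in 32-bit words.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): number of 32-bit words of
 * PDM data needed to produce one multi-channel output sample. */
uint32_t pdm_block_words(uint32_t mic_count, uint32_t s2_dec_factor)
{
    assert(mic_count > 0 && s2_dec_factor > 0);
    /* Each channel needs 32 PDM samples (1 bit each) per first-stage output,
     * and s2_dec_factor first-stage outputs per final output sample. */
    uint32_t pdm_bits_per_channel = 32u * s2_dec_factor;
    uint32_t total_bits = mic_count * pdm_bits_per_channel;
    return total_bits / 32u;
}
```

For example, 2 microphones with a second stage decimation factor of 6 need 12 words (384 PDM samples) per block.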
The conversion from the high-sample-rate PDM stream to lower-sample-rate PCM
stream involves two stages of decimating filters. After the decimation thread
receives a block of PDM samples, the samples are filtered by the first stage
decimator.
The first stage decimator has a fixed decimation factor of 32 and a fixed
tap count of 256. An application is free to supply its own filter
coefficients for the first stage decimator (using the fixed decimation factor
and tap count), however this library also provides a reference filter for the
first stage decimator that is recommended for most applications.
The first stage decimating filter is an FIR filter with 16-bit coefficients, and
where each input sample corresponds to a +1 or a -1 (typical for PDM
signals). The output of the first stage decimator is a block of 32-bit PCM
samples with a sample time 32 times longer than the PDM sample time.
The second stage decimator is a decimating FIR filter with a configurable
decimation factor and tap count. Like the first stage decimator, this library
provides a reference filter suitable for the second stage decimator. The
supplied filter has a tap count of 65 and a decimation factor of 6.
The output of the first stage decimator is a block of N*K PCM values,
where N is the number of microphones and K is the second stage
decimation factor. This is just enough samples to produce one output sample from
the second stage decimator.
The resulting sample is vector-valued (one element per channel) and has a sample
time corresponding to 32*K PDM clock periods. Using the reference filters
and a 3.072 MHz PDM clock, the output sample rate is 16 kHz.
After second stage decimation, the resulting sample goes to post-processing
where two (optional) post-processing steps are available.
The first is a simple IIR filter, called DC Offset Elimination, which seeks to
ensure each output channel tends to approach zero mean. DC Offset Elimination
can be disabled if not desired. See Sample Filters for further details.
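To illustrate how a simple IIR filter can drive a channel toward zero mean, the sketch below implements a generic one-pole DC blocking filter, y[n] = x[n] - x[n-1] + r * y[n-1]. This is a textbook form chosen for illustration and is not claimed to be the filter used by the library. The type `dc_blocker_t`, the function `dc_blocker_step` and the pole value r are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* State for a generic one-pole DC blocker (illustrative, not the library's). */
typedef struct {
    double prev_x;  /* previous input sample  */
    double prev_y;  /* previous output sample */
} dc_blocker_t;

/* Process one sample: y[n] = x[n] - x[n-1] + r * y[n-1], with 0 < r < 1.
 * Values of r closer to 1 give a lower cutoff and slower settling. */
double dc_blocker_step(dc_blocker_t *s, double x, double r)
{
    assert(s != NULL);
    assert(r > 0.0 && r < 1.0);
    double y = x - s->prev_x + r * s->prev_y;
    s->prev_x = x;
    s->prev_y = y;
    return y;
}
```

Feeding a constant (pure DC) input into this filter produces an output that decays geometrically toward zero, which is the behavior described above.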
The second post-processing step is framing, where instead of signaling each
sample of audio to subsequent processing stages one at a time, samples can be
aggregated and transferred to subsequent processing stages as non-overlapping
blocks. The size of each frame is configurable (down to 1 sample per frame,
where framing is functionally disabled).
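Framing as described above amounts to collecting multi-channel samples into a fixed-size buffer and signaling only when the buffer is full. The sketch below shows the idea with a hypothetical `framer_t` type. The names, the 2-channel and 4-sample sizes, and the (channel, sample) buffer layout are assumptions for the example, not the library's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MIC_COUNT          2  /* example value */
#define SAMPLES_PER_FRAME  4  /* example value; 1 would disable framing */

/* Hypothetical framing state: one frame buffer plus a fill counter. */
typedef struct {
    int32_t frame[MIC_COUNT][SAMPLES_PER_FRAME];
    unsigned fill;
} framer_t;

/* Store one multi-channel sample. Returns true when the frame has just been
 * completed (the caller would transmit it); the next call starts a new,
 * non-overlapping frame. */
bool framer_push(framer_t *f, const int32_t sample[MIC_COUNT])
{
    assert(f->fill < SAMPLES_PER_FRAME);
    for (unsigned ch = 0; ch < MIC_COUNT; ch++) {
        f->frame[ch][f->fill] = sample[ch];
    }
    f->fill++;
    if (f->fill == SAMPLES_PER_FRAME) {
        f->fill = 0;
        return true;
    }
    return false;
}
```

Pushing 10 samples through this framer completes two frames and leaves two samples waiting in a partially filled third frame.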
Finally, the sample or frame is transmitted over a channel from the mic array
module to the next stage of the processing pipeline.
At the core of lib_mic_array are several C++ class templates which are
loosely coupled and intended to be easily overridden for modified behavior. The
mic array unit itself is an object made by the composition of several smaller
components which perform well-defined roles.
For example, modifying the mic array unit to use some mechanism other than a
channel to move the audio frames out of the mic array is a matter of defining a
small new class encapsulating just the modified transfer behavior, and then
instantiating the mic array class template with the new class as the appropriate
template parameter.
With that in mind, while most applications will have no need to modify the mic
array behavior, it is nevertheless designed to be easy to do so.
There are three models for how the mic array unit can be included in an
application. The details of how to allocate, initialize and start the mic array
will depend on the chosen model.
In order of increasing complexity, these are:
Vanilla Model - The simplest way to include the mic array. It is usually
sufficient but offers comparatively little flexibility with respect to
configuration and run-time control. Using this model (mostly) means modifying
an application’s build scripts.
Prefab Model - This model involves a little more effort from the application
developer, including writing a couple of C++ wrapper functions, but gives the
application access to any of the defined prefab mic array components.
General Model - Any other case. This is necessary if an application wishes to
use a customized mic array component.
The vanilla and prefab models for integrating the mic array into your
application will be discussed in more detail below. The general model may
involve customizing or extending the classes in lib_mic_array and is beyond
the scope of this introduction.
Whichever model is chosen, the first step to integrate a mic array unit into an
application is to identify the required hardware resources.
The key hardware resources to be identified are the ports and clock blocks
that will be used by the mic array unit. The ports correspond to the physical
pins on which clocks and sample data will be signaled. Clock blocks are a type
of hardware resource which can be attached to ports to coordinate the
presentation and capture of signals on physical pins.
While clock blocks may be more abstract than ports, their implications for this
library are actually simpler. First, the mic array unit will need a way of
taking the audio master clock and dividing it to produce a PDM sample clock.
This can be accomplished with a clock block. This will be the clock block which
the API documentation refers to as “Clock A”.
Second, if (and only if) the PDM microphones are being used in a Dual Data Rate
(DDR) configuration a second clock block will be required. In a DDR
configuration 2 microphones share a physical pin for output sample data, where
one signals on the rising edge of the PDM clock and the other signals on the
falling edge. The second clock block required in a DDR configuration is referred
to as “Clock B” in the API documentation.
Each tile on an xcore.ai device has 5 clock blocks available. In code, a clock
block is identified by its resource ID, which are given as the preprocessor
macros XS1_CLKBLK_1 through XS1_CLKBLK_5.
Unlike ports, which are tied to specific physical pins, clock blocks are
fungible. Your application is free to use any clock block that has not already
been allocated for another purpose. The vanilla component model defaults to
using XS1_CLKBLK_1 and XS1_CLKBLK_2.
Three ports are needed for the mic array component. As mentioned above, ports
are physically tied to specific device pins, and so the correct ports must be
identified for correct behavior.
Note that while ports are physically tied to specific pins, this is not a
1-to-1 mapping. Each port has a port width (measured in bits) which is the
number of pins which comprise the port. Further, the pin mappings for different
ports overlap, with a single pin potentially belonging to multiple ports. When
identifying the needed ports, take care that both the pin map (see the
documentation for your xcore.ai package) and port width are correct.
The first port needed is a 1-bit port on which the audio master clock is
received. In the documentation, this is usually referred to as p_mclk.
The second port needed is a 1-bit port on which the PDM clock will be signaled
to the PDM mics. This port is referred to as p_pdm_clk.
The third port is that on which the PDM data is received. In an SDR
configuration, the width of this port must be greater than or equal to the
number of microphones. In a DDR configuration, twice this port width must be
greater than or equal to the number of microphones. This port is referred to as
p_pdm_mics.
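The port width rule for p_pdm_mics can be expressed as a one-line check. The sketch below uses a hypothetical helper, `pdm_port_wide_enough`, which is not part of lib_mic_array.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper (not a library function): checks whether a data port
 * of the given width can carry mic_count PDM microphones.
 * SDR: one microphone per pin. DDR: two microphones share each pin. */
bool pdm_port_wide_enough(unsigned port_width_bits, unsigned mic_count, bool ddr)
{
    assert(port_width_bits > 0 && mic_count > 0);
    unsigned capacity = ddr ? 2u * port_width_bits : port_width_bits;
    return capacity >= mic_count;
}
```

For instance, a 1-bit port is sufficient for 2 microphones in a DDR configuration but not in an SDR configuration, while a 4-bit port can serve 3 microphones in SDR.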
XCore applications are typically compiled with an “XN” file (with a “.xn” file
extension). An XN file is an XML document which describes some information about
the device package as well as some other helpful board-related information. The
identification of your ports may have already been done for you in your XN file.
Following is a snippet from an XN file with mappings for the three ports
described above:
...
<Tile Number="1" Reference="tile[1]">
  <!-- MIC related ports -->
  <Port Location="XS1_PORT_1G" Name="PORT_PDM_CLK"/>
  <Port Location="XS1_PORT_1F" Name="PORT_PDM_DATA"/>
  <!-- Audio ports -->
  <Port Location="XS1_PORT_1D" Name="PORT_MCLK_IN_OUT"/>
  <Port Location="XS1_PORT_1C" Name="PORT_I2S_BCLK"/>
  <Port Location="XS1_PORT_1B" Name="PORT_I2S_LRCLK"/>
  <!-- Used for looping back clocks -->
  <Port Location="XS1_PORT_1N" Name="PORT_NOT_IN_PACKAGE_1"/>
</Tile>
...
The first 3 ports listed, PORT_PDM_CLK, PORT_PDM_DATA and
PORT_MCLK_IN_OUT are respectively p_pdm_clk, p_pdm_mics and
p_mclk. The value in the Location attribute (e.g. XS1_PORT_1G) is
the port name as you will find it in your package documentation.
In this case, either PORT_PDM_CLK or XS1_PORT_1G can be used in code to
identify this port.
Once the ports and clock blocks to be used have been identified, these
resources can be represented in code using a pdm_rx_resources_t struct. The
following is an example of declaring resources in a DDR configuration. See
pdm_rx_resources_t, PDM_RX_RESOURCES_SDR() and
PDM_RX_RESOURCES_DDR() for more details.
In addition to ports and clock blocks, there are also several other hardware
resource types used by lib_mic_array which are worth considering. Running
out of any of these will preclude the mic array from running correctly (if at
all).
Threads - At least one hardware thread is required to run the mic array
component.
Compute - The mic array unit will require a fixed number of MIPS (millions of
instructions per second) to perform the required processing. The exact
requirement will depend on the configuration used.
Memory - The mic array requires a modest amount of memory for code and data
(see Mic Array Resource Usage).
Chanends - At least 4 chanends must be available for signaling between
threads/sub-components.
Mic array configuration with the vanilla model is achieved mostly through the
application’s build system configuration.
In the /etc/vanilla directory of the lib_mic_array repository are a
source file and a header file which are neither compiled with the library nor
on its include path. Configuring the mic array using the vanilla model means
adding those files to your application’s build (not the library target), and
defining several compile options which tell it how to behave.
To simplify this further, a CMake macro called mic_array_vanilla_add() has
been included with the build system.
mic_array_vanilla_add() takes several arguments:
TARGET_NAME - The name of the CMake application target that the vanilla
mode source should be added to.
MCLK_FREQ - The frequency of the master audio clock, in Hz.
PDM_FREQ - The desired frequency of the PDM clock, in Hz.
MIC_COUNT - The number of microphone channels to be captured.
SAMPLES_PER_FRAME - The size of the audio frames produced by the mic array
unit (frames will be 2 dimensional arrays with shape
(MIC_COUNT,SAMPLES_PER_FRAME)).
Though not exposed by the mic_array_vanilla_add() macro, several additional
configuration options are available when using the vanilla model. These are all
configured by adding defines to the application target.
Once the configuration options have been chosen, initializing and starting the
mic array at run-time is easily achieved. Two function calls are necessary, both
are included through mic_array_vanilla.h (which was added to your include
path through your build configuration).
First, during application initialization, the function
ma_vanilla_init(), which takes no arguments, must be called. This will
configure the hardware resources and install the PDM rx service as an ISR, but
will not actually start any threads or PDM capture.
Once any remaining application initialization is complete, PDM capture and
processing is started by calling ma_vanilla_task().
ma_vanilla_task() is a blocking call which takes a single argument: the
chanend that will be used to transmit audio frames to subsequent stages of
the processing pipeline. Usually the call to ma_vanilla_task() will be
placed directly in a par {...} block along with other threads to be started
on the tile.
Note
Both ma_vanilla_init() and ma_vanilla_task() must be called from the
core which will host the decimation thread.
The lib_mic_array library has a C++ namespace mic_array::prefab which
contains class templates for typical mic array setups using common
sub-components. The templates in the mic_array::prefab namespace hide most
of the complexity (and unneeded flexibility) from the application author, so
they can focus only on pieces they care about.
Note
As of version 5.0.1, only one prefab class template,
BasicMicArray, has been
defined.
To configure the mic array using a prefab, you will need to add a C++ source
file to your application. NB: This will end up looking a lot like the contents
of mic_array_vanilla.cpp when you are through.
The example in this section will use 2 microphones in a DDR configuration
with DC offset elimination enabled, and using 128-sample frames. The resource
IDs used may differ from those required for your application.
pdm_res will be used to identify the ports and clocks which will be
configured for PDM capture.
The C++ class template MicArray is central to
the mic array unit in this library. The class templates defined in the
mic_array::prefab namespace each derive from mic_array::MicArray.
Define and allocate the specific implementation of MicArray to be used.
    ...
    // Using the full name of the class could become cumbersome. Using an alias.
    using TMicArray = mic_array::prefab::BasicMicArray<MIC_COUNT, FRAME_SIZE, DCOE_ENABLED>;

    // Allocate mic array
    TMicArray mics = TMicArray();
    ...
Now the mic array unit has been defined and allocated. The template parameters
supplied (e.g. MIC_COUNT and FRAME_SIZE) are used to calculate the size of
any data buffers required by the mic array, and so the mics object is
self-contained, with all required buffers being statically allocated.
Additionally, class templates will ultimately allow unused features to be
optimized out at build time. For example, if DCOE is disabled, it will be
optimized out at build time so that at run time it won’t even need to check
whether DCOE is enabled.
Now a couple of functions need to be implemented in your C++ file. In most cases
these functions will need to be callable from C or XC, and so they should not be
static, and they should be decorated with extern "C" (or the MA_C_API
preprocessor macro provided by the library).
First, a function which initializes the MicArray object and configures the
port and clock block resources. The documentation for
BasicMicArray indicates any
parts of the MicArray object that need to be initialized.
    #define MCLK_FREQ 24576000
    #define PDM_FREQ 3072000
    ...
    MA_C_API
    void app_init() {
      // Configure clocks and ports
      const unsigned mclk_div = mic_array_mclk_divider(MCLK_FREQ, PDM_FREQ);
      mic_array_resources_configure(&pdm_res, mclk_div);

      // Initialize the PDM rx service
      mics.PdmRx.Init(pdm_res.p_pdm_mics);
    }
    ...
app_init() can be called from an XC main() during initialization.
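The divider passed to mic_array_resources_configure() relates the master audio clock to the PDM clock. The sketch below assumes the divider is simply the integer ratio of the two frequencies. `mclk_divider_sketch` is a hypothetical stand-in written for illustration, and the library's mic_array_mclk_divider() should be consulted for the actual behavior.

```c
#include <assert.h>

/* Hypothetical stand-in (not the library function): integer ratio between the
 * master audio clock and the desired PDM clock. Assumes the master clock is
 * an exact multiple of the PDM clock. */
unsigned mclk_divider_sketch(unsigned mclk_freq_hz, unsigned pdm_freq_hz)
{
    assert(pdm_freq_hz > 0);
    assert(mclk_freq_hz % pdm_freq_hz == 0);
    return mclk_freq_hz / pdm_freq_hz;
}
```

With the values used above (MCLK_FREQ of 24576000 and PDM_FREQ of 3072000), this ratio is 8.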
Assuming the PDM rx service is to be run as an ISR, a second function is used to
actually start the mic array unit. This starts the PDM clock, installs the ISR
and enters the decimator thread’s main loop.
    MA_C_API
    void app_mic_array_task(chanend_t c_audio_frames) {
      mics.SetOutputChannel(c_audio_frames);

      // Start the PDM clock
      mic_array_pdm_clock_start(&pdm_res);

      mics.InstallPdmRxISR();
      mics.UnmaskPdmRxISR();

      mics.ThreadEntry();
    }
Now a call to app_mic_array_task() with the channel to send frames on can be
placed inside a par {...} block to spawn the thread.
The mic array unit provided by this library uses a two-stage decimation process
to convert a high sample rate stream of (1-bit) PDM samples into a lower sample
rate stream of (32-bit) PCM samples.
For the first stage decimating FIR filter, the actual filter coefficients used
are configurable, so an application is free to use a custom first stage filter,
as long as the tap count is 256. This library also provides coefficients for
the first stage filter, whose filter characteristics are adequate for most
applications.
The input to the first stage decimator (here called “Stream A”) is a stream of
1-bit PDM samples with a sample rate of PDM_FREQ. Rather than each PDM
sample representing a value of 0 or 1, each PDM sample represents a
value of either +1 or -1. Specifically, on-chip and in-memory, a bit
value of 0 represents +1 and a bit value of 1 represents -1.
The output from the first stage decimator, Stream B, is a stream of 32-bit PCM
samples with a sample rate of PDM_FREQ / S1_DEC_FACTOR = PDM_FREQ / 32. For
example, if PDM_FREQ is 3.072 MHz, then Stream B’s sample rate is 96.0 kHz.
The first stage filter is structured to make optimal use of the XCore XS3 vector
processing unit (VPU), which can compute the dot product of a pair of
256-element 1-bit vectors in a single cycle. The first stage uses 256 16-bit
coefficients for its filter taps.
Each time 32 PDM samples (1 word) become available for an audio channel, those
samples are shifted into the 8-word (256-bit) filter state, and a call to
fir_1x16_bit results in 1 Stream B sample element for that channel.
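As a plain-C point of reference, the first stage can be modelled as a 256-tap dot product in which each state bit selects the sign of its coefficient (bit 0 is +1, bit 1 is -1). The sketch below is a scalar model only: it does not use the VPU, it takes coefficients in ordinary array order rather than the rearranged block form the library requires, and the word and bit ordering of the state is an assumption. `s1_shift_in` and `s1_dot` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define S1_TAP_COUNT 256
#define S1_WORDS     (S1_TAP_COUNT / 32)  /* 8-word (256-bit) filter state */

/* Shift one new word of 32 PDM samples into the state, dropping the oldest
 * word. Placing the newest word at index 0 is an assumption of this sketch. */
void s1_shift_in(uint32_t state[S1_WORDS], uint32_t new_word)
{
    for (int w = S1_WORDS - 1; w > 0; w--) {
        state[w] = state[w - 1];
    }
    state[0] = new_word;
}

/* Scalar reference dot product: a state bit of 0 contributes +coef[k],
 * a state bit of 1 contributes -coef[k]. */
int32_t s1_dot(const uint32_t state[S1_WORDS], const int16_t coef[S1_TAP_COUNT])
{
    int32_t acc = 0;
    for (int k = 0; k < S1_TAP_COUNT; k++) {
        uint32_t bit = (state[k / 32] >> (k % 32)) & 1u;
        acc += bit ? -(int32_t)coef[k] : (int32_t)coef[k];
    }
    return acc;
}
```

With all coefficients set to 1, an all-zero state sums to 256, and shifting in one word of all ones flips 32 of the terms, giving 224 - 32 = 192.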
The actual implementation for the first stage filter can be found in
src/fir_1x16_bit.S. Additional usage details can be found in
api/etc/fir_1x16_bit.h.
Note that the 256 16-bit filter coefficients are not stored in memory as a
standard coefficient array (i.e. int16_t filter[256] = {b[0], b[1], ...};).
Rather, in order to take advantage of the VPU, the coefficients must be
rearranged bit-by-bit into a block form suitable for VPU processing. See the
section below on filter conversion if supplying a custom filter for stage 1.
This library provides filter coefficients that may be used with the first stage
decimator. These coefficients are available in your application through the
header mic_array/etc/filters_default.h as stage1_coef.
Taking a set of floating-point coefficients, quantizing them into 16-bit
coefficients and ‘boggling’ them into the correct memory layout can be a tricky
business. To simplify this process, this library provides a Python (3) script
which does this process for you.
The script can be found in this repository at python/stage1.py.
An application is free to supply its own second stage filter. This library also
provides a second stage filter whose characteristics are adequate for many or
most applications.
The input to the second stage decimator (here called “Stream B”) is the stream
of 32-bit PCM samples emitted from the first stage decimator with a sample rate
of PDM_FREQ/32.
The output from the second stage decimator, Stream C, is a stream of 32-bit PCM
samples with a sample rate of PDM_FREQ/(32*S2_DEC_FACTOR). For example, if
PDM_FREQ is 3.072 MHz, and S2_DEC_FACTOR is 6, then Stream C’s
sample rate (the sample rate received by the main application code) is
3.072 MHz / (32*6) = 16 kHz.
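The sample rates of the three streams follow directly from the PDM clock and the two decimation factors. The sketch below computes them. `mic_array_rates_t` and `mic_array_rates` are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the three stream rates, in Hz. */
typedef struct {
    uint32_t stream_a_hz;  /* 1-bit PDM input to stage 1            */
    uint32_t stream_b_hz;  /* 32-bit PCM between stage 1 and stage 2 */
    uint32_t stream_c_hz;  /* 32-bit PCM output of stage 2           */
} mic_array_rates_t;

/* Stage 1 always decimates by 32; stage 2 decimates by s2_dec_factor. */
mic_array_rates_t mic_array_rates(uint32_t pdm_freq_hz, uint32_t s2_dec_factor)
{
    assert(s2_dec_factor > 0);
    mic_array_rates_t r;
    r.stream_a_hz = pdm_freq_hz;
    r.stream_b_hz = pdm_freq_hz / 32u;
    r.stream_c_hz = r.stream_b_hz / s2_dec_factor;
    return r;
}
```

For a 3.072 MHz PDM clock, a second stage factor of 6 gives 16 kHz output, while factors of 3 and 2 give 32 kHz and 48 kHz respectively.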
The second stage filter uses the 32-bit FIR filter implementation from
lib_xcore_math. See
xs3_filter_fir_s32() in that library for more implementation details.
This library provides a filter suitable for the second stage decimator. It is
available in your application through the header
mic_array/etc/filters_default.h.
For the provided filter, S2_TAP_COUNT = 65 and S2_DEC_FACTOR = 6.
Without writing a custom decimator implementation, the tap count and decimation
factor for the first stage decimator are fixed to 256 and 32
respectively. These can be modified for the second stage, and the filter
coefficients for both stages can be modified.
When using the C++ API to construct your application’s mic array component, the
decimator’s metaparameters (tap count, decimation factor) are given as C++
template parameters for the decimator class template. Pointers to the
coefficients are provided to the decimator when it is initialized.
To keep things simple, when using the vanilla API or when constructing the mic
array component using BasicMicArray, it is assumed that the filter parameters
will be those from stage1_fir_coef.c, stage2_fir_coef.c and
filters_default.h. In this case it is recommended to simply change those
files directly with the updated coefficients. Otherwise you may need to use the
C++ API directly.
Note that both the first and second stage filters are implemented using
fixed-point arithmetic which requires the coefficients to be presented in a
particular format. The Python scripts stage1.py and stage2.py, provided with
this library, can be used to help with this formatting. See the associated README for usage details.
XCORE ® -VOICE Solutions$$$lib_mic_array: PDM microphone array library$$$Decimator Stages$$$Custom Filters$$$Configuring for 32 kHz or 48 kHz output£££modules/io/modules/mic_array/doc/rst/src/decimator_stages.html#configuring-for-32-khz-or-48-khz-output
Filter design scripts are provided to support higher output sampling rates than the default 16 kHz.
Both stage 1 and stage 2 need to be updated because the first stage needs a higher
cutoff frequency before samples are passed to the second stage decimator, which
downsamples by three (32 kHz) or two (48 kHz).
From the command line, follow these instructions:
python filter_design/design_filter.py      # generate the filter .pkl files
python stage1.py good_32k_filter_int.pkl   # convert the .pkl file to a C style array for stage 1
python stage2.py good_32k_filter_int.pkl   # convert the .pkl file to a C style array for stage 2
Note
Use good_48k_filter_int.pkl instead of good_32k_filter_int.pkl to support 48 kHz.
Next, copy the output from the last two scripts into a source file. This could be your mic_array.cpp
file which launches the mic array tasks. It may look something like this:
A new decimator object that references your new filter coefficients must then be declared.
Again, this example is for 32 kHz output since the decimation factor is 3:
Next you need to change how you initialise and run the mic array task to reference your new
mic array custom object. Normally the following code would be used in ma_init():
The increased sample rate will place a higher MIPS burden on the processor. The typical
MIPS usage (see section Mic Array Resource Usage) is in the order of 11 MIPS per channel
using a 16 kHz output decimator.
Increasing the output sample rate to 32 kHz using the same length filters will increase
processor usage per channel to approximately 13 MIPS rising to 15.6 MIPS for 48 kHz.
Increasing the filter lengths to 148 and 96 for stages 1 and 2 respectively at 48 kHz
will increase processor usage per channel to around 20 MIPS.
XCORE ® -VOICE Solutions$$$lib_mic_array: PDM microphone array library$$$Decimator Stages$$$Custom Filters$$$Configuring for 32 kHz or 48 kHz output$$$Filter Characteristics for good_32k_filter_int.pkl£££modules/io/modules/mic_array/doc/rst/src/decimator_stages.html#filter-characteristics-for-good-32k-filter-int-pkl
The plot below indicates the frequency response of the first and second stages of the
provided 32 kHz filters as well as the cascaded overall response. Note that the
overall combined response provides a nice flat passband as shown in the good_32k_filter_int.pkl frequency response.
good_32k_filter_int.pkl frequency response
XCORE ® -VOICE Solutions$$$lib_mic_array: PDM microphone array library$$$Decimator Stages$$$Custom Filters$$$Configuring for 32 kHz or 48 kHz output$$$Filter Characteristics for good_48k_filter_int.pkl£££modules/io/modules/mic_array/doc/rst/src/decimator_stages.html#filter-characteristics-for-good-48k-filter-int-pkl
The plot below indicates the frequency response of the first and second stages of the
provided 48 kHz filters as well as the cascaded overall response. Note that the
overall combined response provides a nice flat passband as shown in the good_48k_filter_int.pkl frequency response.
Following the two-stage decimation procedure is an optional post-processing
stage called the sample filter. This stage operates on each sample emitted by
the second stage decimator, one at a time, before the samples are handed off for
framing or transfer to the rest of the application’s audio pipeline.
Note
This is represented by the SampleFilter sub-component of the
MicArray class template.
An application may implement its own sample filter in the form of a C++ class
which implements the Filter() function as required by MicArray. See the
implementation of DcoeSampleFilter
for a simple example.
The current version of this library provides a simple IIR filter called DC
Offset Elimination (DCOE) that can be used as the sample filter. This is a
high-pass filter meant to ensure that each audio channel will tend towards a
mean sample value of zero.
Whether the DCOE filter is enabled by default and how to enable or disable it
depends on which approach your project uses to include the mic array component
in the application.
If your project uses the vanilla model (see Vanilla API) to include the
mic array unit in your application, then DCOE is enabled by default. To
disable DCOE your build script must add a compiler option to your application
target that sets the MIC_ARRAY_CONFIG_USE_DC_ELIMINATION preprocessor macro
to the value 0.
For example, in a typical application’s CMakeLists.txt, that may look like
the following.
# Gather sources and create application target
# ...
# Add vanilla source to application build
mic_array_vanilla_add(my_app ${MCLK_FREQ} ${PDM_FREQ} ${MIC_COUNT} ${FRAME_SIZE})
# ...
# Disable DCOE
target_compile_definitions(my_app PRIVATE MIC_ARRAY_CONFIG_USE_DC_ELIMINATION=0)
If your project instantiates the
BasicMicArray class template to
include the mic array unit, DC offset elimination is enabled or disabled with
the USE_DCOE boolean template parameter (there is no default).
The sample filter chosen is based on the USE_DCOE template parameter when
the class template gets instantiated. If true,
DcoeSampleFilter will be selected as
the MicArray SampleFilter sub-component. Otherwise
NopSampleFilter will be used.
Note
NopSampleFilter is a no-op filter – it does not modify the samples given
to it and ultimately will be completely optimized out at compile time.
For example, in your application source:
#include "mic_array/mic_array.h"
...
// Controls whether DCOE is enabled
static constexpr bool enable_dcoe = true;
auto mics = mic_array::prefab::BasicMicArray<MICS, FRAME_SIZE, enable_dcoe>();
...
If your project does not use either the vanilla or prefab models to include the
mic array unit in your application, then precisely how the DCOE filter is
included may depend on the specifics of your application. In general, however,
the DCOE filter will be enabled by using
DcoeSampleFilter as the
TSampleFilter template parameter for the
MicArray class template.
For example, sub-classing mic_array::MicArray as follows will enable DCOE
for any MicArray implementation deriving from that sub-class.
As mentioned above, the DCOE filter is a simple IIR filter given by the
following equation, where x[t] and x[t-1] are the current and previous
input sample values respectively, and y[t] and y[t-1] are the current
and previous output sample values respectively.
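As an illustration only, a common first-order DC-blocking filter has the form y[t] = R * y[t-1] + x[t] - x[t-1] with 0 < R < 1. The sketch below assumes that form, uses floating point rather than the library's fixed-point arithmetic, and picks R = 252/256 as an assumed pole value. The names DcBlocker and settle are hypothetical and are not library identifiers.

```cpp
// Hypothetical floating-point DC blocker (not the library's fixed-point code).
// Assumed form: y[t] = R * y[t-1] + x[t] - x[t-1], with 0 < R < 1.
struct DcBlocker {
  double r;             // pole location (assumed value supplied by the caller)
  double prev_x = 0.0;  // x[t-1]
  double prev_y = 0.0;  // y[t-1]

  constexpr explicit DcBlocker(double pole) : r(pole) {}

  // Consume one input sample and return the filtered output sample.
  constexpr double step(double x) {
    const double y = r * prev_y + x - prev_x;
    prev_x = x;
    prev_y = y;
    return y;
  }
};

// Feed n copies of a constant input and return the last output.
constexpr double settle(double dc_level, int n) {
  DcBlocker f(252.0 / 256.0);  // assumed pole value
  double y = 0.0;
  for (int i = 0; i < n; ++i) y = f.step(dc_level);
  return y;
}

// The initial step passes through, then a constant input decays towards zero.
static_assert(settle(1000.0, 1) == 1000.0, "step passes on the first sample");
static_assert(settle(1000.0, 2000) < 1e-6, "constant offset is removed");
```

The closer R is to 1, the lower the cutoff frequency and the longer a DC offset takes to decay.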
The core of lib_mic_array is a set of C++ class templates representing the
mic array unit and its sub-components.
The template parameters of these class templates are (mainly) used for two
different purposes. Non-type template parameters are used to specify certain
quantitative configuration values, such as the number of microphone channels or
the second stage decimator tap count. Type template parameters, on the other
hand, are used for configuring the behavior of sub-components.
At the heart of the mic array API is the
MicArray class template.
Note
All classes and class templates mentioned are in the mic_array C++
namespace unless otherwise specified. Additionally, this documentation may
refer to class templates (e.g. MicArray) with unbound template
parameters as “classes” when doing so is unlikely to lead to confusion.
The MicArray class template looks like the
following:
Here the non-type template parameter MIC_COUNT indicates the number of
microphone channels to be captured and processed by the mic array unit. Most of
the class templates have this as a parameter.
Transferring audio data to
subsequent pipeline stages.
Each of the MicArray sub-components has a type that is specified as a
template parameter when the class template is instantiated. MicArray
requires the class of each of its sub-components to implement a certain minimal
interface. The MicArray object interacts with its sub-components using this
interface.
Note
Abstract classes are not used to enforce this interface contract. Instead,
the contract is enforced (at compile time) solely in how the MicArray
object makes use of the sub-component.
The following diagram, Mic Array High Level Process, conceptually captures the flow of information through the
MicArray sub-components.
Mic Array High Level Process
Note
MicArray does not enforce the use of an XCore port for collecting PDM
samples or an XCore channel for transferring processed data. This is just the
typical usage.
Aside from aggregating its sub-components into a single logical entity, the
MicArray class template also holds the high-level logic for capturing,
processing and coordinating movement of the audio stream data.
The following code snippet is the implementation for the main mic array thread
(or “decimation thread”; not to be confused with the optional PDM capture thread).
Requests a block of PDM sample data from the PDM rx service. This is a
blocking call which only returns once a complete block becomes
available.
Passes the block of PDM sample data to the decimator to produce a single
output sample.
Applies a post-processing filter to the sample data.
Passes the processed sample to the output handler to be transferred to the
next stage of the processing pipeline. This may also be a blocking call, only
returning once the data has been
transferred.
Note that the MicArray object doesn’t care how these steps are actually
implemented. For example, one output handler implementation may send samples
one at a time over a channel. Another output handler implementation may collect
samples into frames, and use a FreeRTOS queue to transfer the data to another
thread.
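The four steps above can be sketched with stand-in sub-components. This is not the library's implementation: every class below is hypothetical, the argument order of ProcessBlock() is an assumption, and the loop is bounded (the real decimation thread never returns) so that the result can be checked.

```cpp
#include <cstdint>

// Stand-in sub-components (hypothetical; not the lib_mic_array classes).
struct FakePdmRx {
  std::uint32_t block[1] = {0};
  // A real PDM rx service blocks here until a full block is available.
  constexpr std::uint32_t* GetPdmBlock() { ++block[0]; return block; }
};

template <unsigned MIC_COUNT>
struct FakeDecimator {
  // Copies the block counter to every channel instead of filtering.
  constexpr void ProcessBlock(std::int32_t sample_out[MIC_COUNT],
                              const std::uint32_t* pdm_block) {
    for (unsigned m = 0; m < MIC_COUNT; ++m)
      sample_out[m] = static_cast<std::int32_t>(pdm_block[0]);
  }
};

template <unsigned MIC_COUNT>
struct NegateFilter {
  constexpr void Filter(std::int32_t sample[MIC_COUNT]) {
    for (unsigned m = 0; m < MIC_COUNT; ++m) sample[m] = -sample[m];
  }
};

template <unsigned MIC_COUNT>
struct SumOutput {
  std::int64_t total = 0;  // stands in for a channel or queue transfer
  constexpr void OutputSample(std::int32_t sample[MIC_COUNT]) {
    for (unsigned m = 0; m < MIC_COUNT; ++m) total += sample[m];
  }
};

// The concrete sub-component types are template parameters, so each call in
// the loop is resolved at compile time.
template <unsigned MIC_COUNT, class TDecimator, class TPdmRx,
          class TSampleFilter, class TOutputHandler>
struct MiniMicArray {
  TPdmRx PdmRx;
  TDecimator Decimator;
  TSampleFilter SampleFilter;
  TOutputHandler OutputHandler;

  constexpr void Run(unsigned iterations) {
    std::int32_t sample_out[MIC_COUNT] = {0};
    for (unsigned i = 0; i < iterations; ++i) {
      std::uint32_t* pdm_block = PdmRx.GetPdmBlock();  // 1. get a PDM block
      Decimator.ProcessBlock(sample_out, pdm_block);   // 2. decimate
      SampleFilter.Filter(sample_out);                 // 3. post-process
      OutputHandler.OutputSample(sample_out);          // 4. hand off
    }
  }
};

constexpr std::int64_t demo_total() {
  MiniMicArray<2, FakeDecimator<2>, FakePdmRx, NegateFilter<2>, SumOutput<2>>
      mics{};
  mics.Run(3);  // samples (1,1), (2,2), (3,3) are negated, then summed
  return mics.OutputHandler.total;
}
static_assert(demo_total() == -12, "-(1+1) - (2+2) - (3+3)");
```

Swapping any stand-in for a different type with the same member function changes the behavior without touching the loop.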
Instead of providing flexibility through abstract classes or polymorphism, CRTP
achieves flexibility through the use of class templates with type template
parameters. As with derived classes and virtual methods, the CRTP template
parameter must follow a contract with the class template where it implements
one or more methods with specific names and signatures that the class template
directly calls.
There are a couple of notable advantages of using CRTP over polymorphic behavior.
With CRTP, flexibility does not generally come with the same run-time costs (in
terms of both compute and memory) as polymorphic solutions. This is because the
CRTP class template always knows the concrete type of any objects it uses at
compile time. This avoids the need for run-time type information or virtual
function tables, and it allows compile-time optimizations that may not
otherwise be available. This in turn allows many function calls to be
inlined, or in some cases, entirely eliminated.
Additionally, while not strictly an example of CRTP, integer template parameters
are also heavily used in class templates. The two main advantages of this are
that it allows objects to encapsulate their own (statically allocated) memory,
and that it allows the compiler to make compile time loop optimizations that it
may not otherwise be able to make.
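A small sketch of the integer-parameter point, using a hypothetical FrameBuffer type that is not part of lib_mic_array: because the sizes are template parameters, the storage is an ordinary member array and its footprint is a compile-time constant.

```cpp
#include <cstdint>

// Hypothetical frame buffer (not a lib_mic_array type). No heap allocation is
// needed, and both loop bounds are constants the compiler may unroll.
template <unsigned MIC_COUNT, unsigned SAMPLE_COUNT>
struct FrameBuffer {
  std::int32_t data[MIC_COUNT][SAMPLE_COUNT] = {};

  constexpr void Fill(std::int32_t value) {
    for (unsigned m = 0; m < MIC_COUNT; ++m)
      for (unsigned s = 0; s < SAMPLE_COUNT; ++s) data[m][s] = value;
  }
};

// The memory footprint is known before the program runs.
static_assert(sizeof(FrameBuffer<2, 16>) == 2 * 16 * sizeof(std::int32_t),
              "buffer holds exactly MIC_COUNT * SAMPLE_COUNT samples");
```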
The downside to CRTP is that it tends to lead to highly verbose class type
names, where the type parameters of templated classes are themselves
templated classes with their own template parameters.
Each of MicArray’s sub-components may have implementation-specific
configuration or initialization requirements. Each sub-component is a public
member of MicArray (see table above). An application can access a
sub-component directly to perform any type-specific initialization or other
manipulation.
For example, the
ChannelFrameTransmitter output
handler class needs to know the chanend to be used for sending samples. This
can be initialized on a MicArray object mics with
mics.OutputHandler.SetChannel(c_sample_out).
PdmRx, or the PDM rx service, is the
MicArray sub-component responsible for capturing PDM sample data, assembling
it into blocks, and passing it along so that it can be decimated.
The MicArray class requires only that PdmRx implement GetPdmBlock(),
a blocking call that returns a pointer to a block of PDM data which is ready for
further processing.
Generally speaking, PdmRx will derive from the
PdmRxService
class template. PdmRxService encapsulates the logic of using an xCore
port for capturing PDM samples one word (32 bits) at a time, and managing
two buffers where blocks of samples are collected. It also simplifies the logic
of running PDM rx as either an interrupt or as a stand-alone thread.
PdmRxService has 2 template parameters. The first is the BLOCK_SIZE,
which specifies the size of a PDM sample block (in words). The second,
SubType, is the type of the sub-class being derived from PdmRxService.
This is the CRTP (Curiously Recurring Template Pattern), which allows a base
class to use polymorphic-like behaviors while ensuring that all types are known
at compile-time, avoiding the drawbacks of using virtual functions.
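The pattern can be shown in isolation. The sketch below is not PdmRxService; BlockSource, CountingSource and ReadWord() are hypothetical names chosen for illustration. The base class template calls a method of the derived class through a static_cast, so no virtual function table is involved.

```cpp
#include <cstdint>

// Hypothetical CRTP base (not the library's PdmRxService).
template <unsigned BLOCK_SIZE, class SubType>
struct BlockSource {
  std::uint32_t block[BLOCK_SIZE] = {};

  // Fill one block by asking the derived class for each word. The concrete
  // type is known at compile time, so the call can be inlined.
  constexpr const std::uint32_t* FillBlock() {
    for (unsigned i = 0; i < BLOCK_SIZE; ++i)
      block[i] = static_cast<SubType*>(this)->ReadWord();
    return block;
  }
};

// The derived class passes itself as SubType and supplies ReadWord().
struct CountingSource : BlockSource<4, CountingSource> {
  std::uint32_t next = 0;
  constexpr std::uint32_t ReadWord() { return next++; }
};

constexpr std::uint32_t last_word_after_two_blocks() {
  CountingSource src{};
  src.FillBlock();            // words 0..3
  return src.FillBlock()[3];  // words 4..7, return the last one
}
static_assert(last_word_after_two_blocks() == 7u, "second block ends at 7");
```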
There is currently one class template which derives from PdmRxService,
called StandardPdmRxService.
StandardPdmRxService uses a streaming channel to transfer PDM blocks to the
decimator. It also provides methods for installing an optimized ISR for PDM
capture.
The Decimator sub-component
encapsulates the logic of converting blocks of PDM samples into PCM samples. The
TwoStageDecimator class is a
decimator implementation that uses a pair of decimating FIR filters to
accomplish this.
The first stage has a fixed tap count of 256 and a fixed decimation factor
of 32. The second stage has a configurable tap count and decimation factor.
The SampleFilter sub-component
is used for post-processing samples emitted by the decimator. Two
implementations for the sample filter sub-component are provided by this
library.
The NopSampleFilter class can be used
to effectively disable per-sample filtering on the output of the decimator. It
does nothing to the samples presented to it, and so calls to it can be optimized
out during compilation.
The DcoeSampleFilter class is used
for applying the DC offset elimination filter to the decimator’s output. The DC
offset elimination filter is meant to ensure the sample mean for each channel
tends toward zero.
The OutputHandler
sub-component is responsible for transferring processed sample data to
subsequent processing stages.
There are two main considerations for output handlers. The first is whether
audio data should be transferred sample-by-sample or as frames containing
many samples. The second is the method of actually transferring the audio data.
The class
ChannelSampleTransmitter
sends samples one at a time to subsequent processing stages using an xCore
channel.
The FrameOutputHandler class
collects samples into frames, and uses a frame transmitter to send the frames
once they’re ready.
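The framing idea can be sketched as follows. FrameCollector is a hypothetical stand-in, not the library's FrameOutputHandler, and it counts completed frames where a real handler would transmit them.

```cpp
#include <cstdint>

// Hypothetical frame-collecting output handler. Samples arrive one at a time;
// a frame is "sent" once SAMPLE_COUNT samples have been stored.
template <unsigned MIC_COUNT, unsigned SAMPLE_COUNT>
struct FrameCollector {
  std::int32_t frame[SAMPLE_COUNT][MIC_COUNT] = {};
  unsigned fill = 0;         // samples stored in the current frame
  unsigned frames_sent = 0;  // stands in for a channel or queue transfer

  constexpr void OutputSample(std::int32_t sample[MIC_COUNT]) {
    for (unsigned m = 0; m < MIC_COUNT; ++m) frame[fill][m] = sample[m];
    if (++fill == SAMPLE_COUNT) {  // frame complete: hand off and start anew
      ++frames_sent;
      fill = 0;
    }
  }
};

constexpr unsigned frames_after(unsigned samples) {
  FrameCollector<2, 16> out{};
  std::int32_t sample[2] = {1, -1};
  for (unsigned i = 0; i < samples; ++i) out.OutputSample(sample);
  return out.frames_sent;
}
static_assert(frames_after(15) == 0, "frame not yet full");
static_assert(frames_after(40) == 2, "two full frames, 8 samples pending");
```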
One of the drawbacks to broad use of class templates is that concrete class
names can unfortunately become excessively verbose and confusing. For example,
the following is the fully qualified name of a (particular) concrete
MicArray implementation:
This library also provides a C++ namespace mic_array::prefab which is
intended to simplify construction of MicArray objects where common
configurations are needed.
The BasicMicArray class template
uses the most typical component implementations, where PDM rx can be run as an
interrupt or as a stand-alone thread, and where audio frames are transmitted to
subsequent processing stages using a channel.
To demonstrate how BasicMicArray simplifies this process, observe that the
following MicArray type is behaviorally identical to the above:
The mic array unit requires several kinds of hardware resources, including
ports, clock blocks, chanends, hardware threads, compute time (MIPS) and memory.
Compared to previous versions of this library, the biggest advantage to the
current version with respect to hardware resources is a greatly reduced compute
requirement. This was made possible by the introduction of the VPU in the XMOS
XS3 architecture. The VPU can do certain operations in a single instruction
which would take many, many instructions on previous architectures.
This page attempts to capture the requirements for each hardware type with
relevant configurations.
Warning
The usage information below applies when the Vanilla API or prefab APIs are
used. Resource usage in an application which uses custom mic array
sub-components will depend crucially on the specifics of the customization.
In all configurations, the mic array unit requires 3 of the xcore.ai device’s
hardware ports. Two of these ports (for the master audio clock and PDM clock)
must be 1-bit ports. The third (PDM capture port) can be 1-, 4- or 8-bit,
depending on the microphone count and SDR/DDR configuration.
In applications which use an SDR microphone configuration, the mic array unit
requires 1 of the xcore.ai device’s 5 clock blocks. This clock block is used
both to generate the PDM clock from the master audio clock and as the PDM
capture clock.
In applications which use a DDR microphone configuration, the mic array unit
requires 2 of the xcore.ai device’s 5 clock blocks. One clock is used to
generate the PDM clock from the master audio clock, and the other is used as the
PDM capture clock (which must operate at different rates in a DDR
configuration).
Chanends are a hardware resource which allow threads (possibly running on
different tiles) to communicate over channels. The mic array unit requires 4
chanends. Two are used for communication between the PDM rx service and the
decimation thread. Two more are needed for transferring completed frames from the
mic array unit to other application components.
The prefab API can run the PDM rx service either as a stand-alone thread or as
an interrupt in another thread. The Vanilla API only supports running it as an
interrupt. The Vanilla API requires only one hardware thread. The prefab API
requires 1 thread if PDM rx is used in interrupt mode, and 2 if PDM rx is a
stand-alone thread.
Running PDM rx as a stand-alone thread modestly reduces the mic array unit’s
MIPS consumption by eliminating the context switch overhead of an interrupt. The
cost of that is one hardware thread.
Note
When configured as an interrupt, PDM rx ISR is typically configured on the
decimation thread, but this is not a strict requirement. The PDM rx interrupt
can be configured for any thread on the same tile as the decimation thread.
They must be on the same tile because shared memory is used between the two
contexts.
The compute requirement of the mic array unit depends strongly on the actual
configuration being used. The compute requirement is expressed in millions of
instructions per second (MIPS) and is approximately linearly related to many
of the configuration parameters.
Each tile of an xcore.ai device has 8 hardware threads and a 5 stage pipeline.
The exact calculation of how many MIPS are available to a thread is complicated,
and is, in general, affected by both the number of threads being used, as well
as the work being done by each thread.
As a rule of thumb, however, the core scheduler will offer each thread a minimum
of CORE_CLOCK_MHZ/8 millions of instruction issue slots per second (~MIPS),
and no more than CORE_CLOCK_MHZ/5 millions of issue slots per second, where
CORE_CLOCK_MHZ is the core CPU clock rate. With a core clock rate of 600
MHz, that means that each thread should expect at least 75 MIPS.
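The rule of thumb can be written out as a short calculation. The function names are hypothetical, and the result is only the approximate bound described above, not an exact scheduler model.

```cpp
// Rule-of-thumb bounds on per-thread issue slots (in MIPS) for one tile.
// 8 is the number of hardware threads; 5 is the pipeline depth.
constexpr unsigned min_thread_mips(unsigned core_clock_mhz) {
  return core_clock_mhz / 8u;
}
constexpr unsigned max_thread_mips(unsigned core_clock_mhz) {
  return core_clock_mhz / 5u;
}

// With a 600 MHz core clock each thread gets between 75 and 120 MIPS.
static_assert(min_thread_mips(600u) == 75u, "lower bound");
static_assert(max_thread_mips(600u) == 120u, "upper bound");
```

Comparing a configuration's MIPS estimate with the lower bound gives a conservative check on whether it fits in a single thread.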
The MIPS values in the table below are estimates obtained using the demo
applications in demo/measure_mips.
PDM Freq    S2DF  S2TC  PdmRx   1 mic MIPS  2 mic MIPS  4 mic MIPS  8 mic MIPS
3.072 MHz   6     65    ISR     10.65       22.00       43.70       N/A
3.072 MHz   6     65    Thread  9.33        19.37       38.48       75.90
6.144 MHz   6     65    ISR     21.26       43.89       TBD         TBD
6.144 MHz   6     65    Thread  18.66       38.73       TBD         TBD
3.072 MHz   3     65    ISR     12.90       26.44       TBD         TBD
3.072 MHz   3     65    Thread  11.62       23.85       TBD         TBD
3.072 MHz   6     130   ISR     11.17       23.04       TBD         TBD
3.072 MHz   6     130   Thread  9.86        20.42       TBD         TBD
PDM Freq
Frequency of the PDM clock.
S2DF
Stage 2 decimation factor. Output sample rate is (PDMFreq/(32*S2DF)).
S2TC
Stage 2 tap count.
PdmRx
Whether PDM capture is done in a stand-alone thread or in an ISR.
Measurements indicate that enabling or disabling the DC offset removal filter
has little effect on the MIPS usage. The selected frame size has only a slight
negative correlation with MIPS usage.
The memory cost of the mic array unit has three parts: code, stack and data.
Code is the memory needed to store compiled instructions in RAM. Stack is the
memory required to store intermediate results during function calls, and data is
the memory used to store persistent objects, variables and constants.
The stack memory requirement is minimal. The code memory requirement depends on
the particular configuration, but ranges from about 1600 bytes in a 1 mic
configuration to about 2000 bytes in an 8 mic configuration.
Not included in the table is the space allocated for the first and second stage
filter coefficients. The first stage filter coefficients take a constant 523
bytes. The second stage filter coefficients use 4*S2TC bytes, where S2TC
is the stage 2 decimator tap count. The value shown in the ‘Data Memory’ column of the
table is the sizeof() of the BasicMicArray that is instantiated. The table below
indicates the data size for various configurations.
Mics  S2DF  S2TC  SPF  DCOE  Data Memory
1     6     65    16   On    504 B
2     6     65    16   On    968 B
4     6     65    16   On    1888 B
8     6     65    16   On    3728 B
1     6     65    16   On    768 B
2     6     130   16   On    1488 B
1     6     130   16   On    576 B
2     12    65    16   On    1112 B
1     12    65    160  On    1080 B
2     6     65    160  On    2120 B
1     6     65    16   Off   496 B
2     6     65    16   Off   948 B
S2DF
Stage 2 decimator’s decimation factor.
S2TC
Stage 2 decimator’s tap count.
SPF
Samples per frame in frames delivered by the mic array unit.
The Vanilla API is a small optional API which greatly simplifies the process of
including a mic array unit in an xcore.ai application. Most applications that
make use of a PDM mic array will not have complicated needs from the mic array
software component beyond delivery of frames of audio data from a configurable
set of microphones at a configurable rate. This API targets that majority of
applications.
The prefab API requires the application developer to have at least some
minimal understanding of the objects and classes associated with the mic array
unit, and requires the developer to write some application-specific code to
configure and start the mic array. The Vanilla API (which builds on top of the
prefab model), by contrast, requires as few as two standard function calls,
and instead moves the majority of the application logic into the application’s
build project.
Note
Why “Vanilla”? “Vanilla” was originally meant as a generic placeholder
name, but no better name was ever suggested.
The Vanilla API comprises two code files, etc/vanilla/mic_array_vanilla.cpp
and etc/vanilla/mic_array_vanilla.h which are not compiled as part of this
library. Instead, if used, these are added to the application target’s build. To
control configuration, the source file relies on a set of pre-processor macros
(added via compile flags) which determine how the mic array unit will be
instantiated.
The API is included in an application by using a CMake macro
(mic_array_vanilla_add()) provided in this library. The macro updates the
application’s sources, includes and compile definitions to include the API.
In the application code, two function calls are needed. First,
ma_vanilla_init() is called to initialize the various mic array
sub-components, preparing for capture of PDM data. Then, to start capture, the
decimation thread is started with ma_vanilla_task() as its entry point.
ma_vanilla_task() takes an XCore chanend as a parameter, which
tells it where completed audio frames should be routed.
Note
The Vanilla API runs the PDM rx service as an interrupt in the decimation
thread. To run it as a separate thread (for reduced total MIPS consumption)
one of the lower-level APIs must be used.
As with the prefab API, audio frames are extracted from the mic array unit over
a (non-streaming) channel using the ma_frame_rx() or
ma_frame_rx_transpose() functions.
Note
The Vanilla API uses the default filters provided with this library,
and does not currently provide a way to override this. To use custom filters,
you must either use a lower-level API or modify the vanilla API.
Configuration with the Vanilla API is achieved through compile definitions. The
required definitions are provided through the mic_array_vanilla_add() macro.
There are several additional optional definitions.
The name of the application’s CMake target. It is the target the Vanilla API
is added to.
MCLK_FREQ
The known frequency, in Hz, of the application’s master audio clock. A typical
frequency is 24576000 Hz. Note that this parameter is not configuring the
master audio clock. (Equivalent compile definition:
MIC_ARRAY_CONFIG_MCLK_FREQ)
PDM_FREQ
The desired frequency, in Hz, of the PDM clock. MCLK_FREQ should be an integer
multiple of this value, with the ratio MCLK_FREQ/PDM_FREQ between 1 and 510. (Equivalent compile
definition: MIC_ARRAY_CONFIG_PDM_FREQ)
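One reading of this constraint is that PDM_FREQ must divide MCLK_FREQ exactly and that the resulting divider must lie between 1 and 510. The sketch below encodes that reading as a compile-time check; the function name pdm_freq_is_valid is hypothetical and is not provided by the Vanilla API.

```cpp
#include <cstdint>

// Hypothetical check (an interpretation of the text, not library code):
// PDM_FREQ divides MCLK_FREQ exactly and the divider is within 1..510.
constexpr bool pdm_freq_is_valid(std::uint32_t mclk_hz, std::uint32_t pdm_hz) {
  return pdm_hz != 0u && mclk_hz % pdm_hz == 0u &&
         mclk_hz / pdm_hz >= 1u && mclk_hz / pdm_hz <= 510u;
}

// A 24.576 MHz master clock divided by 8 gives a 3.072 MHz PDM clock.
static_assert(pdm_freq_is_valid(24576000u, 3072000u), "divider of 8");
static_assert(!pdm_freq_is_valid(24576000u, 5000000u), "not an exact divisor");
```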
MIC_COUNT
The number of PDM microphone channels to be captured. This API supports values
of 1 (SDR), 2 (DDR), 4 (SDR) and 8 (SDR/DDR). This value must
match the configuration (SDR/DDR) and port width of the PDM capture port. That
is, in an SDR port configuration, MIC_COUNT must equal the capture port
width, and in DDR port configuration, MIC_COUNT must be twice the port
width. (Equivalent compile definition: MIC_ARRAY_CONFIG_MIC_COUNT)
Note
This API does not support capturing only a subset of the capture port’s
channels, e.g. capturing only 3 channels on a 4-bit port. To accomplish this
the prefab API should be used.
Note
Though listed under Optional Configuration below, if the microphones are in
a DDR configuration and MIC_COUNT is not 2, the application must
also define MIC_ARRAY_CONFIG_USE_DDR.
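The relationship between MIC_COUNT, the SDR/DDR setting and the capture port width can be summarized in a small helper. This is a sketch only: pdm_port_width is a hypothetical name, and it returns 0 for combinations that the list above does not name as supported.

```cpp
// Hypothetical helper: capture port width implied by MIC_COUNT and SDR/DDR.
constexpr unsigned pdm_port_width(unsigned mic_count, bool use_ddr) {
  // Each port pin carries one microphone in SDR and two in DDR.
  const unsigned width = use_ddr ? mic_count / 2u : mic_count;
  // Supported values per the text: 1 (SDR), 2 (DDR), 4 (SDR), 8 (SDR/DDR).
  const bool listed = use_ddr ? (mic_count == 2u || mic_count == 8u)
                              : (mic_count == 1u || mic_count == 4u ||
                                 mic_count == 8u);
  return listed ? width : 0u;
}

static_assert(pdm_port_width(2u, true) == 1u, "2 mics, DDR: 1-bit port");
static_assert(pdm_port_width(8u, true) == 4u, "8 mics, DDR: 4-bit port");
static_assert(pdm_port_width(8u, false) == 8u, "8 mics, SDR: 8-bit port");
```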
SAMPLES_PER_FRAME is the number of samples (for each microphone channel)
that will be delivered in each (non-overlapping) frame retrieved by
ma_frame_rx(). A minimum value of 1 is supported, to deliver
samples one at a time. The larger this value, the looser the real-time
constraint on the thread receiving the mic array unit’s output (while also
increasing the amount of audio data to be processed).
These are configuration parameters that receive default values but can be
optionally overridden by an application. These can be defined in your
application’s CMakeLists.txt using CMake’s built-in
target_compile_definitions() command.
MIC_ARRAY_CONFIG_USE_DDR
Indicates whether the microphones are arranged in an SDR (0) or DDR
(1) configuration. An SDR configuration is one in which each port pin is
connected to a single PDM microphone. A DDR configuration is one in which each
port pin is connected to two PDM microphones. Defaults to 0 (SDR), unless
MIC_ARRAY_CONFIG_MIC_COUNT is 2 in which case it defaults to 1
(DDR).
MIC_ARRAY_CONFIG_USE_DC_ELIMINATION
Indicates whether the DC offset elimination filter
should be applied to the output of the decimator. Set to 0 to disable or
1 to enable. Defaults to 1 (filter on).
The next three parameters are the identifiers for hardware port resources used
by the mic array unit. They can be specified as either the identifier listed in
your device’s datasheet (e.g. XS1_PORT_1D) or as an alias for the port
listed in your application’s XN file (e.g. PORT_MCLK_IN_OUT). For example:
Identifier of the 1-bit port on which the device is receiving the master audio
clock. Defaults to PORT_MCLK_IN_OUT.
MIC_ARRAY_CONFIG_PORT_PDM_CLK
Identifier of the 1-bit port on which the device will signal the PDM clock to
the microphones. Defaults to PORT_PDM_CLK.
MIC_ARRAY_CONFIG_PORT_PDM_DATA
Identifier of the port on which the device will capture PDM sample data. The
port width of this port must match the MIC_COUNT parameter given to
mic_array_vanilla_add() and the value of MIC_ARRAY_CONFIG_USE_DDR.
Defaults to PORT_PDM_DATA.
The final two parameters indicate which clock block resource(s) should be used
to generate the PDM clock and the capture clock. An xcore.ai device provides 5
hardware clock blocks for application use, identified as XS1_CLKBLK_1
through XS1_CLKBLK_5. The device’s clock blocks are interchangeable, but if
another component of your application uses one of these defaults, you may need
to change these parameters.
MIC_ARRAY_CONFIG_CLOCK_BLOCK_A
Clock block used as ‘clock A’ (see Getting Started). This clock block
is used in both SDR and DDR configurations.
MIC_ARRAY_CONFIG_CLOCK_BLOCK_B
Clock block used as ‘clock B’ (see Getting Started). This clock block
is only needed in DDR configurations and is ignored (not configured) in SDR
configurations.
XCORE ® -VOICE Solutions$$$lib_mic_array: PDM microphone array library$$$Vanilla API$$$Configuration$$$Vanilla API with other Build Systems£££modules/io/modules/mic_array/doc/rst/src/vanilla_api.html#vanilla-api-with-other-build-systems
Using the Vanilla API with build systems other than CMake is simple.
Add the file etc/vanilla/mic_array_vanilla.cpp to the application’s
source files.
Add etc/vanilla/ (relative to repository root) to the application include
paths.
Add the compile definitions for the parameters listed in the previous sections
(each parameter beginning with MIC_ARRAY_CONFIG_) to the compile options
for mic_array_vanilla.cpp.
The number of microphones to be captured by the MicArray’s PdmRx component. For example, if using a 4-bit port to capture 6 microphone channels in a DDR configuration (because there are no 3 or 6 pin ports) MIC_COUNT should be 8, because that’s how many must be captured, even if two of them are stripped out before passing audio frames to subsequent application stages.
TDecimator – Type for the decimator. See Decimator.
TPdmRx – Type for the PDM rx service used. See PdmRx.
TSampleFilter – Type for the output filter used. See SampleFilter.
TOutputHandler – Type for the output handler used. See OutputHandler.
This constructor uses the default constructor for its Decimator and SampleFilter components.
The remaining components are initialized with the supplied objects.
Parameters:
pdm_rx – The PDM rx object.
output_handler – The OutputHandler object.
void ThreadEntry()
Entry point for the decimation thread.
This function does not return. It loops indefinitely, collecting blocks of PDM data from PdmRx (which must have already been started), uses Decimator to filter and decimate the sample stream to the output sample rate, applies any post-processing with SampleFilter, and then delivers the stream of output samples through OutputHandler.
The template parameter TPdmRx is the concrete class implementing the microphone array’s PDM rx service, which is responsible for collecting PDM samples from a port and delivering them to the decimation thread.
TPdmRx is only required to implement one function, GetPdmBlock():
uint32_t* GetPdmBlock();
GetPdmBlock() returns a pointer to a block of PDM data, formatted as expected by the decimator. GetPdmBlock() is called from the decimator thread and is expected to block until a new full block of PDM data is available to be decimated.
For example, StandardPdmRxService::GetPdmBlock() waits to receive a pointer to a block of PDM data from a streaming channel. The pointer is sent from the PdmRx interrupt (or thread) when the block has been completed. This is used for capturing PDM data from a port.
The template parameter TDecimator is the concrete class implementing the microphone array’s decimation procedure. TDecimator is only required to implement one function, ProcessBlock():
ProcessBlock() takes a block of PDM samples via its pdm_block parameter, applies the appropriate decimation logic, and outputs a single (multi-channel) sample via its sample_out parameter. The size and formatting of the PDM block expected by the decimator depends on its particular implementation.
The template parameter TSampleFilter is the concrete class implementing the microphone array’s sample filter component. This component can be used to apply additional non-decimating, non-interpolating filtering of samples. TSampleFilter() is only required to implement one function, Filter():
voidFilter(int32_tsample[MIC_COUNT]);
Filter() takes a single (multi-channel) sample from the decimator component’s output and may update the sample in-place.
For example, a sample filter based on the DcoeSampleFilter class template applies a simple first-order IIR filter to the output of the decimator in order to eliminate the DC component of the audio signals.
If no additional filtering is required, the NopSampleFilter class template can be used for TSampleFilter, which leaves the sample unmodified. In this case, it is expected that the call to NopSampleFilter::Filter() will ultimately be eliminated completely at build time. That way, the additional flexibility introduces no additional run-time compute or memory cost.
Even though TDecimator and TSampleFilter both (possibly) apply filtering, they are separate components of the MicArray because they are conceptually independent.
A concrete class based on either the DcoeSampleFilter class template or the NopSampleFilter class template is used in the prefab::BasicMicArray prefab, depending on the USE_DCOE parameter of that class template.
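To make the idea of a DC offset elimination filter concrete, the following is a minimal sketch of one common first-order DC-blocking IIR, y[n] = a*y[n-1] + x[n] - x[n-1]. It is not the library's DcoeSampleFilter: the class name DcBlockerSketch, the coefficient a = 255/256, and the 16 fractional state bits are all assumptions chosen for illustration. Only the shape of Filter() follows the requirement described above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in, NOT the library's DcoeSampleFilter.
// Per channel: y[n] = (255/256) * y[n-1] + x[n] - x[n-1]
// (the coefficient is an assumption made for this sketch).
template <unsigned MIC_COUNT>
class DcBlockerSketch {
 public:
  // Same shape as the Filter() requirement: the sample is updated in-place.
  void Filter(int32_t sample[MIC_COUNT]) {
    assert(sample != nullptr);
    for (unsigned ch = 0; ch < MIC_COUNT; ++ch) {
      const int64_t diff = static_cast<int64_t>(sample[ch]) - prev_x_[ch];
      prev_x_[ch] = sample[ch];
      // The state carries 16 fractional bits so that small residual
      // values keep decaying instead of stalling.
      state_q16_[ch] += diff * 65536 - state_q16_[ch] / 256;
      sample[ch] = Saturate(state_q16_[ch] / 65536);
    }
  }

 private:
  // Clamp a 64-bit value into the int32_t range.
  static int32_t Saturate(int64_t v) {
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return static_cast<int32_t>(v);
  }

  int32_t prev_x_[MIC_COUNT] = {};     // x[n-1] for each channel
  int64_t state_q16_[MIC_COUNT] = {};  // y[n-1] for each channel, Q16
};
```

Feeding a constant input through such a filter produces an output that starts at the input value and decays towards zero, which is the zero-mean behavior a DC offset elimination stage is meant to provide.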
The template parameter TOutputHandler is the concrete class implementing the microphone array’s output handler component. After the PDM input stream has been decimated to the appropriate output sample rate, and after any post-processing of that output stream by the sample filter, the output samples must be delivered to another thread for any additional processing. It is the responsibility of this component to package and deliver audio samples to subsequent processing stages.
TOutputHandler is only required to implement one function, OutputSample():
void OutputSample(int32_t sample[MIC_COUNT]);
OutputSample() is called exactly once for each mic array output sample. OutputSample() may block if necessary until the subsequent processing stage is ready to receive new data. However, the decimator thread (in which OutputSample() is called) as a whole has a real-time constraint: it must be ready to pull the next block of PDM data as soon as that block becomes available.
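The way the four components cooperate can be sketched as a single loop iteration. The code below is a simplified illustration, not the library's MicArray: all type names are hypothetical stand-ins, the channel count is a plain constant rather than a template parameter, and the argument order of ProcessBlock() is an assumption. Only the method names GetPdmBlock(), ProcessBlock(), Filter() and OutputSample() come from the descriptions above.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

constexpr unsigned kMics = 2;  // hypothetical channel count for this sketch

// Stand-in PDM rx: hands out one fixed block (one 32-bit word per channel).
struct FakePdmRx {
  uint32_t block[kMics] = {0xFFFFFFFFu, 0x00000000u};
  uint32_t* GetPdmBlock() { return block; }
};

// Stand-in decimator: a crude boxcar mapping 32 PDM bits to [-32, 32].
struct BoxcarDecimator {
  void ProcessBlock(int32_t sample_out[kMics], uint32_t pdm_block[kMics]) {
    for (unsigned ch = 0; ch < kMics; ++ch) {
      const int ones = static_cast<int>(std::bitset<32>(pdm_block[ch]).count());
      sample_out[ch] = 2 * ones - 32;
    }
  }
};

// Stand-in sample filter that leaves the sample unmodified.
struct NopFilter {
  void Filter(int32_t[kMics]) {}
};

// Stand-in output handler that records the most recent sample.
struct RecordingOutput {
  int32_t last[kMics] = {};
  unsigned count = 0;
  void OutputSample(int32_t sample[kMics]) {
    assert(sample != nullptr);
    for (unsigned ch = 0; ch < kMics; ++ch) last[ch] = sample[ch];
    ++count;
  }
};

template <class TPdmRx, class TDecimator, class TSampleFilter, class TOutputHandler>
struct MicArraySketch {
  TPdmRx PdmRx;
  TDecimator Decimator;
  TSampleFilter SampleFilter;
  TOutputHandler OutputHandler;

  // One iteration of the loop that ThreadEntry() is described as running forever.
  void StepOnce() {
    int32_t sample_out[kMics] = {};
    uint32_t* pdm_block = PdmRx.GetPdmBlock();      // blocks in the real service
    Decimator.ProcessBlock(sample_out, pdm_block);  // PDM block -> one sample
    SampleFilter.Filter(sample_out);                // optional post-processing
    OutputHandler.OutputSample(sample_out);         // deliver downstream
  }
};
```

Because each component is a template parameter, swapping in a different implementation (for example a no-op filter) is resolved at compile time rather than through virtual dispatch.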
Class template for a typical bare-metal mic array unit.
This prefab is likely the right starting point for most applications.
With this prefab, the decimator will consume one device core, and the PDM rx service can be run either as an interrupt, or as an additional thread. Normally running as an interrupt is recommended.
For the first and second stage decimation filters, this prefab uses the coefficients provided with this library. The first stage uses a decimation factor of 32, and the second stage is configured to use a decimation factor of 6.
To get 16 kHz audio output from the BasicMicArray prefab, then, the PDM clock must be configured to 3.072 MHz (3.072 MHz / (32 * 6) = 16 kHz).
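The relationship between the PDM clock and the output sample rate can be checked at compile time. This is a minimal sketch; OutputSampleRateHz is a hypothetical helper written for illustration and is not part of the library.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: output rate = PDM clock / (stage 1 factor * stage 2 factor).
constexpr uint32_t OutputSampleRateHz(uint32_t pdm_clock_hz,
                                      uint32_t s1_dec_factor,
                                      uint32_t s2_dec_factor) {
  return pdm_clock_hz / (s1_dec_factor * s2_dec_factor);
}

// The prefab's default factors (32 and 6) with a 3.072 MHz PDM clock.
static_assert(OutputSampleRateHz(3072000, 32, 6) == 16000,
              "3.072 MHz / (32 * 6) should give 16 kHz");
```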
Sub-Components
Being derived from mic_array::MicArray, an instance of BasicMicArray has 4 sub-components responsible for different portions of the work being done. These sub-components are PdmRx, Decimator, SampleFilter and OutputHandler. See the documentation for MicArray for more details about these.
Template Parameters Details
The template parameter MIC_COUNT is the number of microphone channels to be processed and output.
The template parameter FRAME_SIZE is the number of samples in each output frame produced by the mic array. Frame data is communicated using the API found in mic_array/frame_transfer.h.
Typically ma_frame_rx() will be the right function to use in a receiving thread to retrieve audio frames. ma_frame_rx() receives audio frames with shape (MIC_COUNT,FRAME_SIZE), meaning that all samples corresponding to a given channel will end up in a contiguous block of memory. Instead of ma_frame_rx(), ma_frame_rx_transpose() can be used to swap the dimensions, resulting in the shape (FRAME_SIZE,MIC_COUNT).
Note that calls to ma_frame_rx() or ma_frame_rx_transpose() will block until a frame becomes available on the specified chanend.
If the receiving thread is not waiting to retrieve the audio frame from the mic array when it becomes available, the pipeline may back up and cause samples to be dropped. It is the responsibility of the application developer to ensure this does not happen.
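The difference between the two frame shapes can be illustrated with plain arrays. The sketch below does not use the frame transfer API; TransposeFrameSketch is a hypothetical helper that only shows how a (MIC_COUNT,FRAME_SIZE) frame relates to its (FRAME_SIZE,MIC_COUNT) counterpart.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: converts a channel-major frame, where each channel's
// samples are contiguous, into a time-major frame, where each time step's
// channels are contiguous.
template <unsigned MIC_COUNT, unsigned FRAME_SIZE>
void TransposeFrameSketch(int32_t (&out)[FRAME_SIZE][MIC_COUNT],
                          const int32_t (&in)[MIC_COUNT][FRAME_SIZE]) {
  for (unsigned ch = 0; ch < MIC_COUNT; ++ch) {
    for (unsigned s = 0; s < FRAME_SIZE; ++s) {
      out[s][ch] = in[ch][s];
    }
  }
}
```

The channel-major shape tends to suit per-channel processing such as filtering, while the time-major shape matches interleaved audio interfaces.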
The boolean template parameter USE_DCOE indicates whether the DC offset elimination filter should be applied to the output of the second stage decimator. DC offset elimination is an IIR filter intended to ensure audio samples on each channel tend towards zero-mean.
For more information about DC offset elimination, see Sample Filters.
If USE_DCOE is false, no further filtering of the second stage decimator’s output will occur.
The template parameter MICS_IN indicates the number of microphone channels to be captured by the PdmRx component of the mic array unit. This will often be the same as MIC_COUNT, but in some applications, MIC_COUNT microphones must be physically connected to an XCore port which is not MIC_COUNT (SDR) or MIC_COUNT/2 (DDR) bits wide.
In these cases, capturing the additional channels (likely not even physically connected to PDM microphones) is unavoidable, but further processing of the additional (junk) channels can be avoided by using MIC_COUNT<MICS_IN. The mapping which tells the mic array unit how to derive output channels from input channels can be configured during initialization by calling StandardPdmRxService::MapChannels() on the PdmRx sub-component of the BasicMicArray.
If the application uses an SDR microphone configuration (i.e. 1 microphone per port pin), then MICS_IN must be the same as the port width. If the application is running in a DDR microphone configuration, MICS_IN must be twice the port width. MICS_IN defaults to MIC_COUNT.
Allocation
Before a mic array unit can be started or initialized, it must be allocated.
Instances of BasicMicArray are self-contained with respect to memory, needing no external buffers to be supplied by the application. Allocating an instance is most easily accomplished by simply declaring the mic array unit. An example follows.
Here, mics is an allocated mic array unit. The example (and all that follow) assumes the macros used for template parameters are defined elsewhere.
Initialization
Before a mic array unit can be started, it must be initialized.
BasicMicArray reads PDM samples from an XCore port, and delivers frames of audio data over an XCore channel. To this end, an instance of BasicMicArray needs to be given the resource IDs of the port to be read and the chanend to transmit frames over. This can be accomplished in either of two ways.
If the resource IDs for the port and chanend are available as the mic array unit is being allocated, one option is to explicitly construct the BasicMicArray instance with the required resource IDs using the two-argument constructor:
Next, the ports and clock block(s) used by the PDM rx service need to be configured appropriately. This is not accomplished directly through the BasicMicArray object. Instead, a pdm_rx_resources_t struct representing these hardware resources is constructed and passed to mic_array_resources_configure(). See the documentation for pdm_rx_resources_t and mic_array_resources_configure() for more details.
Finally, if running BasicMicArray’s PDM rx service within an ISR, before the mic array unit can be started, the ISR must be installed. This is accomplished with a call to BasicMicArray::InstallPdmRxISR(). Installing the ISR will not unmask it.
Begin Processing (PDM rx ISR)
Once the mic array unit has been initialized, starting it with the PDM rx service running as an ISR requires three steps.
First, the PDM clock must be started. This is accomplished with a call to mic_array_pdm_clock_start(). The same pdm_rx_resources_t that was passed to mic_array_resources_configure() is given as an argument here.
Second, the PDM rx ISR that was installed during initialization must be unmasked. This is accomplished by calling BasicMicArray::UnmaskPdmRxISR() on the mic array unit.
Finally, the mic array processing thread must be started. The entry point for the mic array thread is BasicMicArray::ThreadEntry().
A typical pattern will include all three of these steps in a single function which wraps the mic array thread entry point.
AppMicArray mics;
pdm_rx_resources_t pdm_res;
...
MA_C_API  // alias for 'extern "C"'
void app_mic_array_task()
{
  mic_array_pdm_clock_start(&pdm_res);
  mics.UnmaskPdmRxISR();
  mics.ThreadEntry();
}
Using this pattern, app_mic_array_task() is a C-compatible function which can be called from a multi-tile main() in an XC file. Then, app_mic_array_task() is called directly from a par{...} block. For example,
main(){
  ...
  par {
    on tile[1]: {
      ... // Do initialization stuff
      par {
        app_mic_array_task();
        ...
        other_thread_on_tile1(); // other threads
      }
    }
  }
}
Begin Processing (PDM Rx Thread)
The procedure for running the mic array unit with the PDM rx component running as a stand-alone thread is much the same, with just a couple of key differences.
mic_array_pdm_clock_start() must still be called, but here the requirement is that it be called from the hardware thread on which the PDM rx component is running (which, of course, cannot be the mic array thread).
A typical application with a multi-tile XC main() will provide two C-compatible functions - one for each thread:
Notice that app_mic_array_task() above is a thin wrapper for mics.ThreadEntry(). Unfortunately, because the type of mics is a C++ class, mics.ThreadEntry() cannot be called directly from an XC file (including the one containing main()). Further, because a C++ class template was used, this library cannot provide a generic C-compatible call wrapper for the methods on a MicArray object. This unfortunately means it is necessary in some cases to create a thin wrapper such as app_mic_array_task().
The threads are spawned from XC main using a par{...} block:
main(){
  ...
  par {
    on tile[1]: {
      ... // Do initialization stuff
      par {
        app_mic_array_task();
        app_pdm_rx_task();
        ...
        other_thread_on_tile1(); // other threads
      }
    }
  }
}
Real-Time Constraint
Once the PDM rx thread is launched or the PDM rx interrupt has been unmasked, PDM data will start being collected and reported to the decimator thread. The application then must start the decimator thread within one output sample time (i.e. sample time for the output of the second stage decimator) to avoid issues.
Once the mic array processing thread is running, the real-time constraint is active for the thread consuming the mic array unit’s output, and it must be waiting to receive an audio frame within one frame time.
Examples
This library comes with examples which demonstrate how a mic array unit is used in an actual application. If you are encountering difficulties getting BasicMicArray to work, studying the provided examples may help.
Note
BasicMicArray::InstallPdmRxISR() installs the ISR on the hardware thread that calls the method. In most cases, installing it in the same thread as the decimator is the right choice.
Template Parameters:
MIC_COUNT – Number of microphone channels.
FRAME_SIZE – Number of samples in each output audio frame.
USE_DCOE – Whether DC offset elimination should be used.
If the communication resources required by BasicMicArray are known at construction time, this constructor can be used to avoid further initialization steps.
This constructor does not install the ISR for PDM rx, and so that must be done separately if PDM rx is to be run in interrupt mode.
Parameters:
p_pdm_mics – Port with PDM microphones
c_frames_out – (non-streaming) chanend used to transmit frames.
void SetPort(port_t p_pdm_mics)
Set the PDM data port.
This function calls this->PdmRx.Init(p_pdm_mics).
This should be called during initialization.
Parameters:
p_pdm_mics – The port to receive PDM data on.
void SetOutputChannel(chanend_t c_frames_out)
Set the audio frame output channel.
This function calls this->OutputHandler.FrameTx.SetChannel(c_frames_out).
This must be set prior to entering the decimator task.
Parameters:
c_frames_out – The channel to send audio frames on.
Derivatives of this class template are intended to be used for the TPdmRx template parameter of MicArray, where it represents the MicArray::PdmRx component of the mic array.
An object derived from PdmRxService collects blocks of PDM samples from a port and makes them available to the decimation thread as the blocks are completed.
PdmRxService is a base class using CRTP. Subclasses extend PdmRxService providing themselves as the template parameter SubType.
This base class provides the logic for aggregating PDM data taken from a port into blocks, and a subclass is required to provide methods SubType::ReadPort(), SubType::SendBlock() and SubType::GetPdmBlock().
SubType::ReadPort() is responsible for reading 1 word of data from p_pdm_mics. See StandardPdmRxService::ReadPort() as an example.
SubType::SendBlock() is provided a block of PDM data as a pointer and is responsible for signaling that to the subsequent processing stage. See StandardPdmRxService::SendBlock() as an example.
ReadPort() and SendBlock() are used by PdmRxService itself (when running as a thread, rather than ISR).
SubType::GetPdmBlock() is responsible for receiving a block of PDM data from SubType::SendBlock() as a pointer, deinterleaving the buffer contents, and returning a pointer to the PDM data in the format expected by the mic array unit’s decimator component. See StandardPdmRxService::GetPdmBlock() as an example.
GetPdmBlock() is called by the decimation thread. The pair of functions, SendBlock() and GetPdmBlock() facilitate inter-thread communication, SendBlock() being called by the transmitting end of the communication channel, and GetPdmBlock() being called by the receiving end.
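The CRTP arrangement described above can be sketched as follows. This is an illustration under stated assumptions, not the library's PdmRxService: PdmRxSketch and CountingRx are hypothetical names, the "port" is just a counter, SendBlock() stores a pointer instead of using a streaming channel, and GetPdmBlock() skips the deinterleaving step. The base class aggregates words into one of two buffers and calls into the subclass for ReadPort() and SendBlock().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the CRTP base class described above.
template <unsigned BLOCK_SIZE, class SubType>
class PdmRxSketch {
 public:
  // Perform one port read; if that completes a block, signal it.
  void ProcessNext() {
    SubType& self = static_cast<SubType&>(*this);
    --phase_;
    // Filling from the top down stores words in reverse order of arrival.
    buffers_[filling_][phase_] = self.ReadPort();
    if (phase_ == 0) {
      self.SendBlock(buffers_[filling_]);
      filling_ ^= 1u;       // swap to the other buffer
      phase_ = BLOCK_SIZE;  // reset the countdown
    }
  }

 private:
  uint32_t buffers_[2][BLOCK_SIZE] = {};
  unsigned filling_ = 0;         // index of the buffer being filled
  unsigned phase_ = BLOCK_SIZE;  // port reads remaining for this buffer
};

// Hypothetical subclass: supplies itself as SubType.
struct CountingRx : PdmRxSketch<4, CountingRx> {
  uint32_t next_word = 0;
  uint32_t* last_block = nullptr;
  unsigned blocks_sent = 0;

  uint32_t ReadPort() { return next_word++; }  // pretend port read
  void SendBlock(uint32_t* block) {
    assert(block != nullptr);
    last_block = block;
    ++blocks_sent;
  }
  uint32_t* GetPdmBlock() { return last_block; }  // no blocking, no deinterleave
};
```

With BLOCK_SIZE set to 4, every fourth call to ProcessNext() hands over a completed block, and consecutive blocks alternate between the two internal buffers.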
Template Parameters:
BLOCK_SIZE – Number of words of PDM data per block.
SubType – Subclass of PdmRxService actually being used.
Public Functions
void SetPort(port_t p_pdm_mics)
Set the port from which to collect PDM samples.
void ProcessNext()
Perform a port read and if a new block has completed, signal.
void ThreadEntry()
Entry point for PDM processing thread.
This function loops forever, calling ProcessNext() with each iteration.
Typically (e.g. TwoStageDecimator) BLOCK_SIZE will be exactly the number of words of PDM samples required to produce exactly one new output sample for the mic array unit’s output stream.
Once BLOCK_SIZE words have been read into one of the block_data buffers, PDM rx will signal to the decimator thread that new PDM data is available for processing.
Pointers to a pair of buffers used for storing captured PDM samples.
The buffers themselves are allocated by an instance of mic_array::PdmRxService. The idea is that while the PDM rx ISR is filling one buffer, the decimation thread is busy processing the contents of the other buffer. If the real-time constraint is maintained, the decimation thread will be finished with the contents of its buffer before the PDM rx ISR fills the other buffer. Once full, the PDM rx ISR does a double buffer pointer swap and hands the newly-filled buffer to the decimation thread.
unsigned phase
Tracks the completeness of the buffer currently being filled.
Each read of samples from p_pdm_mics gives one word of data. This variable tracks how many more port reads are required before the current buffer has been filled.
unsigned phase_reset
The number of words to read from p_pdm_mics to fill a buffer.
chanend_t c_pdm_data
Streaming chanend the PDM rx ISR uses to signal the decimation thread that another buffer is full and ready to be processed.
Used for detecting when the real-time constraint is violated by the decimation thread.
Each time the decimation thread is given a block of PDM data to process, credit is reset to 2. Each time the PDM rx ISR hands a block of PDM data to the decimation thread, this is decremented.
Deadlock Condition
mic_array::StandardPdmRxService uses a streaming channel to facilitate communication between the two execution contexts used by the mic array, the decimation thread and the PDM rx ISR. A streaming channel is used because it allows the contexts to operate asynchronously.
A channel has a 2 word buffer, and as long as there is room in the buffer, an OUT instruction putting a word (in this case, a pointer) into the channel is guaranteed not to block. This is important because the PDM rx ISR is typically configured on the same hardware thread as the decimation thread.
If a thread is blocked on an OUT instruction to a channel, in order to unblock the thread, an IN must be issued on the other end of that channel. But because the PDM rx ISR is blocked, it cannot hand control back to the decimation thread, which means the decimation thread can never issue an IN instruction to unblock the ISR. The result is a deadlock.
Unfortunately, there is no way for a thread to query a chanend to determine whether it will block if an OUT instruction is issued. That is why credit is used. Before issuing an OUT to c_pdm_data, the PDM rx ISR checks whether credit is non-zero. If so, the ISR issues the OUT instruction as normal and decrements credit.
If credit is zero, the default behavior of PDM rx ISR is to raise an exception (ET_ECALL). This reflects the idea that it is generally better if system-breaking errors loudly announce themselves (at least by default). If using mic_array::StandardPdmRxService, this behavior can be changed by passing false in a call to mic_array::StandardPdmRxService::AssertOnDroppedBlock(), which will allow blocks of PDM data to be silently dropped (while still avoiding a permanent deadlock).
unsigned missed_blocks
Controls and records anti-deadlock behavior.
If the PDM rx ISR finds that credit is 0 when it’s time to send a filled buffer to the decimation thread, it uses missed_blocks to control whether the PDM rx ISR should raise an exception or silently drop the block of PDM data.
If missed_blocks is -1 (its default value) an exception is raised. Otherwise missed_blocks is used to record the number of blocks that have been quietly dropped.
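The credit and missed_blocks logic described above can be modeled without any hardware. The sketch below is an assumption-laden illustration: IsrContextSketch and the free functions are hypothetical names, the exception is represented by a return value instead of a trap, and the initial credit of 2 is assumed. Only the field names credit and missed_blocks and their described behavior come from the text.

```cpp
#include <cassert>

// missed_blocks holds this value (-1 as unsigned) while "raise an exception" is selected.
constexpr unsigned kAssertOnDrop = static_cast<unsigned>(-1);

// Hypothetical model of the state shared by the ISR and the decimation thread.
struct IsrContextSketch {
  unsigned credit = 2;                     // assumed initial value
  unsigned missed_blocks = kAssertOnDrop;  // default: raise an exception
};

enum class SendOutcome { Sent, Dropped, Exception };

// What the PDM rx ISR does when a buffer has been filled.
inline SendOutcome IsrSendBlock(IsrContextSketch& ctx) {
  assert(ctx.credit <= 2);
  if (ctx.credit > 0) {
    --ctx.credit;  // room in the channel buffer: the OUT cannot block
    return SendOutcome::Sent;
  }
  if (ctx.missed_blocks == kAssertOnDrop) {
    return SendOutcome::Exception;  // stands in for the raised exception
  }
  ++ctx.missed_blocks;  // quietly drop the block and count it
  return SendOutcome::Dropped;
}

// What happens when the decimation thread is given a block.
inline void ThreadReceivedBlock(IsrContextSketch& ctx) { ctx.credit = 2; }

// Models the effect of AssertOnDroppedBlock() on missed_blocks.
inline void SetAssertOnDroppedBlock(IsrContextSketch& ctx, bool do_assert) {
  ctx.missed_blocks = do_assert ? kAssertOnDrop : 0u;
}
```

In this model, two sends succeed without the thread receiving anything, which mirrors the 2 word channel buffer; the third either faults or is counted as a dropped block.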
PDM rx service which uses a streaming channel to send a block of data by pointer.
This class can run the PDM rx service either as a stand-alone thread or through an interrupt.
Inter-context Transfer
A streaming channel is used to transfer control of the PDM data block between execution contexts (i.e. thread->thread or ISR->thread).
The mic array unit receives blocks of PDM data from an instance of this class by calling GetPdmBlock(), which blocks until a new PDM block is available.
Layouts
The buffer transferred by SendBlock() contains CHANNELS_IN*SUBBLOCKS words of PDM data for CHANNELS_IN microphone channels. The words are stored in reverse order of arrival.
Within GetPdmBlock() (i.e. mic array thread) the PDM data block is deinterleaved and copied to another buffer in the format required by the decimator component, which is returned by GetPdmBlock(). This buffer contains samples for CHANNELS_OUT microphone channels.
Channel Filtering
In some cases an application may be required to capture more microphone channels than should actually be processed by subsequent processing stages (including the decimator component). For example, this may be the case if 4 microphone channels are desired but only an 8 bit wide port is physically available to capture the samples.
This class template has a parameter both for the number of channels to be captured by the port (CHANNELS_IN), as well as for the number of channels that are to be output for consumption by the MicArray’s decimator component (CHANNELS_OUT).
When the PDM microphones are in an SDR configuration, CHANNELS_IN must be the width (in bits) of the XCore port to which the microphones are physically connected. When in a DDR configuration, CHANNELS_IN must be twice the width (in bits) of the XCore port to which the microphones are physically connected.
CHANNELS_OUT is the number of microphone channels to be consumed by the mic array’s decimator component (i.e. must be the same as the MIC_COUNT template parameter of the decimator component). If all port pins are connected to microphones, this parameter will generally be the same as CHANNELS_IN.
Channel Index (Re-)Mapping
The input channel index of a microphone depends on the pin to which it is connected. Each pin connected to a port has a bit index for that port, given in the ‘Signal Description and GPIO’ section of your package’s datasheet.
Suppose an N-bit port is used to capture microphone data, and a microphone is connected to bit B of that port. In an SDR microphone configuration, the input channel index of that microphone is B, the same as the port bit index.
In a DDR configuration, that microphone will be on either input channel index B or B+N, depending on whether that microphone is configured for in-phase capture or out-of-phase capture.
Sometimes it may be desirable to re-order the microphone channel indices. This is likely the case, for example, when CHANNELS_IN>CHANNELS_OUT.
By default output channels are mapped from the input channels with the same index. If CHANNELS_IN>CHANNELS_OUT, this means that the input channels with the highest CHANNELS_IN-CHANNELS_OUT indices are dropped by default.
The MapChannel() and MapChannels() methods can be used to specify a non-default mapping from input channel indices to output channel indices. MapChannel() re-maps a single output channel, while MapChannels() takes a pointer to a CHANNELS_OUT-element array specifying the input channel index for each output channel.
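The default and re-mapped channel selection can be sketched with a small lookup table. ChannelMapSketch and its Select() method are hypothetical and operate on one word per channel purely for illustration; only the names MapChannel() and MapChannels(), and the default of output channel k reading input channel k, follow the description above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the input-to-output channel mapping.
template <unsigned CHANNELS_IN, unsigned CHANNELS_OUT>
class ChannelMapSketch {
  static_assert(CHANNELS_OUT <= CHANNELS_IN,
                "cannot output more channels than are captured");

 public:
  // By default, output channel k is derived from input channel k.
  ChannelMapSketch() {
    for (unsigned k = 0; k < CHANNELS_OUT; ++k) map_[k] = k;
  }

  // Re-map a single output channel.
  void MapChannel(unsigned out_channel, unsigned in_channel) {
    assert(out_channel < CHANNELS_OUT && in_channel < CHANNELS_IN);
    map_[out_channel] = in_channel;
  }

  // Re-map every output channel from a CHANNELS_OUT-element array.
  void MapChannels(const unsigned map[CHANNELS_OUT]) {
    for (unsigned k = 0; k < CHANNELS_OUT; ++k) MapChannel(k, map[k]);
  }

  // Copy the selected input channels to the output, one word per channel.
  void Select(uint32_t out[CHANNELS_OUT], const uint32_t in[CHANNELS_IN]) const {
    for (unsigned k = 0; k < CHANNELS_OUT; ++k) out[k] = in[map_[k]];
  }

 private:
  unsigned map_[CHANNELS_OUT];
};
```

For instance, with 8 captured channels and 4 output channels, the default keeps inputs 0 to 3, and passing {4, 5, 6, 7} to MapChannels() keeps the upper four instead.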
Template Parameters:
CHANNELS_IN – The number of microphone channels to be captured by the port.
CHANNELS_OUT – The number of microphone channels to be delivered by this StandardPdmRxService instance.
SUBBLOCKS – The number of 32-sample sub-blocks to be captured for each microphone channel.
Public Functions
uint32_t ReadPort()
Read a word of PDM data from the port.
Returns:
A uint32_t containing 32 PDM samples. If MIC_COUNT>=2 the samples from each port will be interleaved together.
Set the input-output mapping for a single output channel.
By default, input channel index k maps to output channel index k.
This method overrides that behavior for a single output channel, configuring output channel out_channel to be derived from input channel in_channel.
Note
Changing the channel mapping while the mic array unit is running is not recommended.
Parameters:
out_channel – Output channel index to be re-mapped.
in_channel – New source channel index for out_channel.
void InstallISR()
Install ISR for PDM reception on the current core.
Note
This does not unmask interrupts.
void UnmaskISR()
Unmask interrupts on the current core.
uint32_t* GetPdmBlock()
Get a block of PDM data.
Because blocks of PDM samples are delivered by pointer, the caller must either copy the samples or finish processing them before the next block of samples is ready, or the data will be clobbered.
Note
This is a blocking call.
Returns:
Pointer to block of PDM data.
void AssertOnDroppedBlock(bool doAssert)
Set whether dropped PDM samples should cause an assertion.
If doAssert is set to true (default), the PDM rx ISR will raise an exception (ET_ECALL) if it is ready to deliver a PDM block to the mic array thread when the mic array thread is not ready to receive it. If false, dropped blocks can be tracked through pdm_rx_isr_context.missed_blocks.
Sets the stage 1 and 2 filter coefficients. The decimator must be initialized before any calls to ProcessBlock().
s1_filter_coef points to a block of coefficients for the first stage decimator. This library provides coefficients for the first stage decimator; see mic_array/etc/filters_default.h.
s2_filter_coef points to an array of coefficients for the second stage decimator. This library provides coefficients for the second stage decimator where the second stage decimation factor is 6; see mic_array/etc/filters_default.h.
s2_filter_shr is the final right-shift applied to the stage 2 filter’s accumulator prior to output. See lib_xcore_math’s documentation of filter_fir_s32_t for more details.
Parameters:
s1_filter_coef –
Stage 1 filter coefficients.
This points to a block of coefficients for the first stage decimator. This library provides coefficients for the first stage decimator.
Processes a block of PDM data to produce an output sample from the second stage decimator.
pdm_block contains exactly enough PDM samples to produce a single output sample from the second stage decimator. The layout of pdm_block should (effectively) be:
struct {
  struct {
    // lower word indices are older samples.
    // less significant bits in a word are older samples.
    uint32_t samples[S2_DEC_FACTOR];
  } microphone[MIC_COUNT]; // mic channels are in ascending order
} pdm_block;
A single output sample from the second stage decimator is computed and written to sample_out[].
An OutputHandler is a class which meets the requirements to be used as the TOutputHandler template parameter of the MicArray class template. The basic requirement is that it have a method:
This method is how the mic array communicates its output with the rest of the application’s audio processing pipeline. MicArray calls this method once for each mic array output sample.
OutputHandler implementation which groups samples into non-overlapping multi-sample audio frames and sends entire frames to subsequent processing stages.
Classes derived from this template collect samples into frames. A frame is a 2 dimensional array with one index corresponding to the audio channel and the other index corresponding to time step, e.g.:
int32_t frame[MIC_COUNT][SAMPLE_COUNT];
Each call to OutputSample() adds the sample to the current frame, and then iff the frame is full, uses its FrameTx component to transfer the frame of audio to subsequent processing stages. Only one of every SAMPLE_COUNT calls to OutputSample() results in an actual transmission to subsequent stages.
With FrameOutputHandler, the thread receiving the audio will generally need to know how many microphone channels and how many samples to expect per frame (although, strictly speaking, that depends upon the chosen FrameTransmitter implementation).
Template Parameters:
MIC_COUNT –
The number of audio channels in each sample and each frame.
SAMPLE_COUNT – Number of samples per frame.
The SAMPLE_COUNT template parameter is the number of samples assembled into each audio frame. Only completed frames are transmitted to subsequent processing stages. A SAMPLE_COUNT value of 1 effectively disables framing, transmitting one sample for each call made to OutputSample.
FrameTransmitter –
The concrete type of the FrameTx component of this class.
The number of frame buffers an instance of FrameOutputHandler should cycle through. Unless audio frames are communicated with subsequent processing stages through shared memory, the default value of 1 is usually ideal.
FrameTransmitter used to transmit frames to the next stage for processing.
FrameTransmitter is the CRTP type template parameter used in this class to control how frames of audio data are communicated with subsequent pipeline stages.
The type supplied for FrameTransmitter must be a class template with two integer template parameters, corresponding to this class’s MIC_COUNT and SAMPLE_COUNT template parameters respectively, indicating the shape of the frame object to be transmitted.
The FrameTransmitter type is required to implement a single method:
OutputFrame() is called once for each completed audio frame and is responsible for the details of how the frame’s data gets communicated to subsequent stages. For example, the ChannelFrameTransmitter class template uses an XCore channel to send samples to another thread (by value).
Alternative implementations might use shared memory or an RTOS queue to transmit the frame data, or might even use a port to signal the samples directly to an external DAC.
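The framing behavior and the template-template FrameTransmitter parameter can be sketched as below. This is not the library's FrameOutputHandler: both class names are hypothetical, only one frame buffer is used, and the signature of OutputFrame() is an assumption. The sketch shows samples being collected into a frame and the transmitter being called once per completed frame.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical transmitter: records the last frame instead of sending it.
// The OutputFrame() signature is an assumption made for this sketch.
template <unsigned MIC_COUNT, unsigned SAMPLE_COUNT>
struct CountingTransmitter {
  unsigned frames_sent = 0;
  int32_t last[MIC_COUNT][SAMPLE_COUNT] = {};

  void OutputFrame(int32_t frame[MIC_COUNT][SAMPLE_COUNT]) {
    for (unsigned ch = 0; ch < MIC_COUNT; ++ch)
      for (unsigned s = 0; s < SAMPLE_COUNT; ++s) last[ch][s] = frame[ch][s];
    ++frames_sent;
  }
};

// Hypothetical handler: groups samples into non-overlapping frames.
template <unsigned MIC_COUNT, unsigned SAMPLE_COUNT,
          template <unsigned, unsigned> class FrameTransmitter>
class FrameOutputHandlerSketch {
 public:
  FrameTransmitter<MIC_COUNT, SAMPLE_COUNT> FrameTx;

  // Add one multi-channel sample; transmit only when the frame is full.
  void OutputSample(int32_t sample[MIC_COUNT]) {
    assert(sample != nullptr);
    for (unsigned ch = 0; ch < MIC_COUNT; ++ch) frame_[ch][index_] = sample[ch];
    if (++index_ == SAMPLE_COUNT) {
      index_ = 0;
      FrameTx.OutputFrame(frame_);
    }
  }

 private:
  int32_t frame_[MIC_COUNT][SAMPLE_COUNT] = {};
  unsigned index_ = 0;  // time step within the current frame
};
```

Replacing CountingTransmitter with a transmitter built on a channel, shared memory, or an RTOS queue changes how frames leave the handler without touching the framing logic.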
Frame transmitter which transmits frame over a channel.
This class template is meant for use as the FrameTransmitter template parameter of FrameOutputHandler.
When using this frame transmitter, frames are transmitted over a channel using the frame transfer API in mic_array/frame_transfer.h.
Usually, a call to ma_frame_rx() (with the other end of c_frame_out as argument) should be used to receive the frame on another thread.
If the receiving thread is not waiting to receive the frame when OutputFrame() is called, that method will block until the frame has been transmitted. In order to ensure there are no violations of the mic array’s real-time constraints, the receiver should be ready to receive a frame as soon as it becomes available.
Frames can be transmitted between tiles using this class.
Note
While OutputFrame() is blocking, it will not prevent the PDM rx interrupt from firing.
Template Parameters:
MIC_COUNT – Number of audio channels in each frame.
If this constructor is used, SetChannel() must be called to configure the channel over which frames are transmitted prior to any calls to OutputFrame().
PDM samples received on a port are shifted into a 32-bit buffer in such a way that the samples for each microphone channel are all interleaved with one another. The first stage decimator, however, requires these to be separated.
samples must point to a buffer containing (MIC_COUNT*s2_dec_factor) words of PDM data. Because the decimation factor for the first stage decimator is a fixed value of 32, 32 PDM samples from each microphone are enough to produce one output sample (a MIC_COUNT-element vector) from the first stage decimator. 32*s2_dec_factor PDM samples for each of the MIC_COUNT microphone channels are then exactly what is required to produce a single output sample from the second stage decimator.
The PDM data will be deinterleaved in-place.
On input, the format of the buffer to which samples points is assumed to be such that the following function will extract (only) the kth sample for microphone channel n (where k is a time index, not a memory index):
Here, the words of samples are stored in reverse order (older samples are at higher word indices), and within a word the oldest samples are the least significant bits. The LSb of a word is always microphone channel 0, and the MSb of a word is always microphone channel MIC_COUNT-1.
Upon return, the format of the buffer to which samples points will be such that the following function will extract (only) the kth sample for microphone channel n:
Here, each word contains samples from only a single channel, with words at higher addresses containing older samples. samples[0] contains the newest samples for microphone channel 0, and samples[MIC_COUNT-1] contains the newest samples for microphone channel MIC_COUNT-1. samples[MIC_COUNT] contains the next-oldest set of samples for channel 0, and so on.
The filters described below are the first and second stage filters provided by this library which are used with the TwoStageDecimator class template by default.
Stage 1 - PDM-to-PCM Decimating FIR Filter
Decimation Factor: 32
Tap Count: 256
The first stage decimation FIR filter converts 1-bit PDM samples into 32-bit PCM samples and simultaneously decimates by a factor of 32.
A typical input PDM sample rate will be 3.072M samples/sec, thus the corresponding output sample rate will be 96k samples/sec.
The first stage filter uses 16-bit coefficients for its taps. Because this is a highly optimized filter targeting the VPU hardware, the first stage filter is presently restricted to using exactly 256 filter taps.
For more information about the example first stage filter supplied with the library, including frequency response and steps for using a custom first stage filter, see Decimator Stages.
STAGE1_DEC_FACTOR
Macro indicating Stage 1 Decimation Factor.
This is the ratio of input sample rate to output sample rate for the first filter stage.
Note
In version 5.0 of lib_mic_array, this value is fixed (even if you choose not to use the default filter coefficients).
STAGE1_TAP_COUNT
Macro indicating Stage 1 Filter Tap Count.
This is the number of filter taps in the first stage filter.
Note
In version 5.0 of lib_mic_array, this value is fixed (even if you choose not to use the default filter coefficients).
STAGE1_WORDS
Macro indicating Stage 1 Filter Word Count.
This is a helper macro to indicate the number of 32-bit words required to store the filter coefficients.
Note
Even though the coefficients are 16-bit, the related lib_mic_array structs and functions expect them to be contained in an array of uint32_t, rather than an array of int16_t. There are two reasons for this. The first is that the VPU instructions require loaded data to start at a word-aligned (0 mod 4) address. uint32_t allocated on the heap or stack are guaranteed by the compiler to be at word-aligned addresses. The second reason is to mitigate possible confusion regarding the arrangement of the filter coefficients in memory. Not only are the 16-bit coefficients not stored in order (e.g. b[0],b[1],b[2],...), the bits of individual 16-bit coefficients are not stored together in memory. This is, again, due to the behavior of the VPU hardware.
The second stage decimation FIR filter filters and downsamples the
32-bit PCM output stream from the first stage filter into another
32-bit PCM stream with sample rate reduced by the stage 2 decimation
factor.
A typical first stage output sample rate will be 96k samples/sec, a
decimation factor of 6 (i.e. using the default stage 2 filter) will
mean a second stage output sample rate of 16k samples/sec.
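The rate arithmetic for the two stages can be checked with a short host-side sketch. The helper name decimated_rate_hz is hypothetical and not part of lib_mic_array; it assumes only that each stage divides the sample rate by its decimation factor.

```c
#include <assert.h>

/* Hypothetical helper (not part of lib_mic_array): output sample rate of a
 * decimating stage, given its input rate and its decimation factor. */
unsigned decimated_rate_hz(unsigned input_rate_hz, unsigned dec_factor)
{
    assert(dec_factor != 0);
    /* The documented rates divide exactly; reject ratios that do not. */
    assert(input_rate_hz % dec_factor == 0);
    return input_rate_hz / dec_factor;
}
```

With the typical figures, decimated_rate_hz(3072000, 32) gives 96000 for the first stage, and decimated_rate_hz(96000, 6) gives 16000 for the default second stage.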
The second stage filter uses 32-bit coefficients for its taps. A
complete description of the FIR implementation is outside the scope
of this documentation, but it can be found in the `xs3_filter_fir_s32_t`
documentation of lib_xcore_math.
In brief, the second stage filter coefficients are quantized to a Q1.30
fixed-point format with input samples treated as integers. The tap outputs
are added into a 40-bit accumulator, and an output sample is produced by
applying a rounding arithmetic right-shift to the accumulator and then
clipping the result to the interval (INT32_MIN,INT32_MAX].
For more information about the example second stage filter supplied with the
library, including frequency response and steps for using a custom filter,
see Decimator Stages.
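The arithmetic described above can be modelled on a host in portable C. This is a simplified sketch, not the library's implementation: the function names are hypothetical, a 64-bit integer stands in for the 40-bit accumulator (accumulator overflow is not modelled), and each tap product is assumed to be reduced by a rounding 30-bit right-shift to account for the Q1.30 coefficients.

```c
#include <assert.h>
#include <stdint.h>

/* Rounding arithmetic right-shift: floor((x + 2^(s-1)) / 2^s). Written with
 * division so it does not depend on how the compiler shifts negative values. */
static int64_t round_shr64(int64_t x, unsigned s)
{
    if (s == 0) return x;
    const int64_t d = (int64_t)1 << s;
    const int64_t y = x + (d >> 1);
    int64_t q = y / d;
    if ((y % d) != 0 && y < 0) q -= 1;  /* C division truncates; make it a floor */
    return q;
}

/* Clip to the interval (INT32_MIN, INT32_MAX]. */
static int32_t clip_s32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < -(int64_t)INT32_MAX) return -INT32_MAX;
    return (int32_t)x;
}

/* Hypothetical model of one second-stage output sample.
 * coef_q30[] holds Q1.30 coefficients, samples[] holds integer PCM samples,
 * and out_shr is the final rounding right-shift applied to the accumulator. */
int32_t model_stage2_output(const int32_t coef_q30[], const int32_t samples[],
                            unsigned tap_count, unsigned out_shr)
{
    assert(out_shr < 40);
    int64_t acc = 0;  /* stands in for the 40-bit accumulator */
    for (unsigned k = 0; k < tap_count; k++) {
        acc += round_shr64((int64_t)coef_q30[k] * samples[k], 30);
    }
    return clip_s32(round_shr64(acc, out_shr));
}
```

For example, two taps of 0.5 (the value 1<<29 in Q1.30) applied to the samples 100 and 300 produce 200 with out_shr set to 0, and 100 with out_shr set to 1.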
STAGE2_DEC_FACTOR
Stage 2 Decimation Factor for default filter.
This is the ratio of input sample rate to output sample rate for the second filter stage.
While the second stage filter can be configured with a different decimation factor, this is the one used for the filter supplied with this library.
STAGE2_TAP_COUNT
Stage 2 Filter tap count for default filter.
This is the number of filter taps associated with the second stage filter supplied with this library.
Collection of resource IDs required for PDM capture.
This struct is a container for the IDs of the XCore hardware resources used by the mic array unit’s PdmRx component for capturing PDM data from a port.
An object of this type will be used for initializing and starting the mic array unit.
Public Members
port_t p_mclk
Resource ID of the 1-bit port on which the master audio clock signal is received.
The master audio clock will be divided by a clock block to produce the PDM sample clock.
This port will be configured as an input.
port_t p_pdm_clk
Resource ID of the 1-bit port through which the PDM sample clock is signaled.
The PDM sample clock is used by the PDM microphones to trigger sample conversion.
This port will be configured as an output.
port_t p_pdm_mics
Resource ID of the port on which PDM samples are received.
In an SDR configuration, the number of microphone channels is the width of this port. In a DDR configuration, the number of microphone channels is twice the width of this port.
This port will be configured as an input.
clock_t clock_a
Resource ID of the clock block used to derive the PDM clock from the master audio clock.
In SDR configurations this is also the PDM data capture clock.
clock_t clock_b
Resource ID of the clock block used only in DDR configurations to trigger reads of the PDM data.
If operating in an SDR configuration, clock_b is 0. A value of 0 is what indicates an SDR configuration is being used.
Configure the hardware resources needed by the mic array.
Several hardware resources are needed to correctly run the mic array, including 3 ports and 1 or 2 clock blocks (depending on whether SDR or DDR mode is used). This function configures these resources for operation with the mic array.
The pdm_rx_resources_t struct is a container for identifying precisely these resources. All three ports are reset by this function; any existing port configuration will be clobbered.
The parameter divide is the ratio of the audio master clock to the desired PDM clock rate. For example, to generate a desired 3.072 MHz PDM clock from an audio master clock with frequency 24.576 MHz, a divide value of 8 is needed.
Divide can also be calculated from the master and PDM clock frequencies using
mic_array_mclk_divider().
pdm_res->p_mclk is the resource ID for the 1-bit port on which the audio master clock is received. This function will enable this port and configure it as the source port for pdm_res->clock_a and for pdm_res->clock_b if operating in a DDR configuration.
pdm_res->clock_a is the resource ID for the first (in SDR configuration, the only) clock block required by the mic array. Clock A divides the audio master clock (by a factor of divide) to generate the PDM clock. This function enables it with the audio master clock as its source.
pdm_res->p_pdm_clk is the resource ID for the 1-bit port from which the PDM clock will be signaled to the microphones. This function enables it and configures Clock A as its source clock.
pdm_res->clock_b is the resource ID for a second clock block, which is only required by the mic array in a DDR configuration. In DDR mode, this function enables Clock B with the audio master clock as its source. The divider for Clock B is half of that for Clock A (so it runs at twice the frequency). In a DDR configuration Clock B is used as the PDM capture clock. In an SDR configuration, this field must be set to 0 (this is how SDR/DDR is determined).
pdm_res->p_pdm_mics is the resource ID for the port on which PDM data is received. This function enables it and configures it as a 32-bit buffered input. If operating in an SDR configuration, Clock A is used as the capture clock. If operating in a DDR configuration, Clock B is used as its capture clock.
This function only configures and does not start either Clock A or Clock B. A call to mic_array_pdm_clock_start() with pdm_res as the argument can be used to start the clock(s).
This function should be called during initialization, before any PDM data can be captured or processed.
Parameters:
pdm_res – The hardware resources used by the mic array.
divide – The divider to generate the PDM clock from the master clock.
This function starts Clock A, and if using a DDR configuration, Clock B.
mic_array_resources_configure() must have been called already to configure the resources indicated in pdm_res.
Clock A is the PDM clock. Starting Clock A will cause pdm_res->p_pdm_clk to begin strobing the PDM clock to the PDM microphones.
In an SDR configuration, Clock A is also the capture clock. In a DDR configuration, Clock B is the capture clock. In either case, the capture clock is also started, causing pdm_res->p_pdm_mics to begin storing PDM samples received on each period of the capture clock.
In DDR configuration, this function starts Clock B, waits for a rising edge, and then starts Clock A, ensuring that the rising edges of the two clocks are not in phase.
This function must be called prior to launching the decimator or PDM rx threads.
Warning
Once this function has been called, the port receiving PDM data will begin capturing samples. If the mic array unit is not started by the time the port buffer fills ((32/mic_count) sample times), samples will begin to be dropped.
Parameters:
pdm_res – The hardware resources used by the mic array.
This is a convenience function which computes the required clock divider to derive a pdm_clock_freq Hz clock from a master_clock_freq Hz clock. This function is simple integer division.
Parameters:
master_clock_freq – The master audio clock frequency in Hz.
pdm_clock_freq – The desired PDM clock frequency in Hz.
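Since the divider is described as simple integer division, its behaviour can be sketched on a host as follows. The name example_mclk_divider is hypothetical and only stands in for mic_array_mclk_divider(), whose actual prototype is not reproduced here.

```c
#include <assert.h>

/* Hypothetical stand-in for mic_array_mclk_divider(): ratio of the master
 * audio clock to the desired PDM clock, using integer division. */
unsigned example_mclk_divider(unsigned master_clock_freq, unsigned pdm_clock_freq)
{
    assert(pdm_clock_freq != 0);
    /* Integer division truncates, so a non-integer ratio is rounded down. */
    return master_clock_freq / pdm_clock_freq;
}
```

For a 24.576 MHz master clock and a 3.072 MHz PDM clock, example_mclk_divider(24576000, 3072000) gives 8. Because the division truncates, a requested PDM frequency that does not divide the master clock exactly will not be reproduced exactly.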
This function transmits the 32-bit PCM frame frame[] over the channel c_frame_out.
This is a blocking call which will wait for a receiver to accept the data from the channel. Typically this will be accomplished with a call to ma_frame_rx() or ma_frame_rx_transpose().
The receiver is not required to be on the same tile as the sender.
Note
Internally, a channel transaction is established to reduce the overhead of channel communication. If any custom functions are used to receive this frame in an application, they must wrap the channel reads in a (slave) channel transaction. See xcore/channel_transaction.h.
Warning
No protocol is used to ensure consistency between the frame layout of the transmitter and receiver. Disagreement about frame size will likely cause one side to block indefinitely. It is the responsibility of the application author to ensure consistency between transmitter and receiver.
Parameters:
c_frame_out – Channel over which to send frame.
frame – Frame to be transmitted.
channel_count – Number of channels represented in the frame.
sample_count – Number of samples represented in the frame.
This function receives a PCM frame over c_frame_in. Normally, the frame will have been transmitted using ma_frame_tx(). The received frame is stored in frame[].
This is a blocking call which does not return until the frame has been fully received.
The sender is not required to be on the same tile as the receiver.
Note
Internally, a channel transaction is established to reduce the overhead of channel communication. This function may only be used to receive the frame if the transmitter has wrapped the channel writes in a (master) channel transaction. See xcore/channel_transaction.h.
Warning
No protocol is used to ensure consistency between the frame layout of the transmitter and receiver. Disagreement about frame size will likely cause one side to block indefinitely. It is the responsibility of the application author to ensure consistency between transmitter and receiver.
Parameters:
frame – Buffer to store received frame.
c_frame_in – Channel from which to receive frame.
channel_count – Number of channels represented in the frame.
sample_count – Number of samples represented in the frame.
Receive 32-bit PCM frame over a channel with transposed dimensions.
This function receives a PCM frame over c_frame_in. Normally, the frame will have been transmitted using ma_frame_tx(). The received frame is stored in frame[].
Unlike ma_frame_rx(), this function reorders the frame elements as they are received. ma_frame_tx() always transmits the frame elements in memory order. This function swaps the channel and sample axes so that if the transmitter frame has shape (CHANNEL,SAMPLE), the caller’s frame array will have shape (SAMPLE,CHANNEL).
This is a blocking call which does not return until the frame has been fully received.
The sender is not required to be on the same tile as the receiver.
Note
Internally, a channel transaction is established to reduce the overhead of channel communication. This function may only be used to receive the frame if the transmitter has wrapped the channel writes in a (master) channel transaction. See xcore/channel_transaction.h.
Warning
No protocol is used to ensure consistency between the frame layout of the transmitter and receiver. Disagreement about frame size will likely cause one side to block indefinitely. It is the responsibility of the application author to ensure consistency between transmitter and receiver.
Parameters:
frame – Buffer to store received frame.
c_frame_in – Channel from which to receive frame.
channel_count – Number of channels represented in the frame.
sample_count – Number of samples represented in the frame.
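The effect of the axis swap can be illustrated on a host with plain arrays. This sketch does not use channels and is not the library's code; transpose_frame_model is a hypothetical name, and both frames are treated as flat, row-major buffers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the reordering performed on reception.
 * src has shape (CHANNEL, SAMPLE); dst has shape (SAMPLE, CHANNEL). */
void transpose_frame_model(int32_t dst[], const int32_t src[],
                           unsigned channel_count, unsigned sample_count)
{
    assert(dst != src);  /* this simple model does not support in-place use */
    for (unsigned ch = 0; ch < channel_count; ch++) {
        for (unsigned s = 0; s < sample_count; s++) {
            dst[s * channel_count + ch] = src[ch * sample_count + s];
        }
    }
}
```

With two channels of three samples, a transmitted frame {0, 1, 2, 10, 11, 12} is stored by the receiver as {0, 10, 1, 11, 2, 12}, so that all channels of one sample time are adjacent in memory.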
This is the required state information for a single channel to which the DC offset elimination filter is to be applied.
To apply the DC offset elimination filter to multiple channels simultaneously, an array of dcoe_chan_state_t should be used.
dcoe_state_init() is used once to initialize an array of state objects, and dcoe_filter() is used on each consecutive sample to apply the filter and get the resulting output sample.
DC offset elimination is an IIR filter. The state must persist between time steps.
Use in lib_mic_array
Typical users of lib_mic_array will not need to directly use this type or any functions which take it as a parameter.
The C++ class template mic_array::DcoeSampleFilter, if used in an application’s mic array unit, will allocate, initialize and apply the DCOE filter automatically.
When using the ‘vanilla’ API, DCOE is enabled by default. To disable DCOE when using this API, add a preprocessor definition to the compiler flags, setting MIC_ARRAY_CONFIG_USE_DC_ELIMINATION to 0.
Applies the DC offset elimination filter to get a new output sample and updates the filter state.
For correct behavior, this function should be called once per sample of the stream (here “sample” refers to a vector-valued quantity containing one element for each audio channel).
The index of each array (state, new_input and new_output) corresponds to the audio channel. The update associated with each audio channel is independent of each other audio channel.
The equation used for each channel is:
y[t]=R*y[t-1]+x[t]-x[t-1]
where t is the current sample time index, y[] is the output signal, x[] is the input signal, and R is (252.0/256).
To filter a sample in-place use the same array for both the new_input and new_output arguments.
Parameters:
new_output – [out] Array into which the output sample will be placed.
state – [in] DC offset elimination state vector.
new_input – [in] New input sample.
chan_count – [in] Number of channels to be processed.
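The filter equation can be explored with a floating-point model. This is a sketch for illustration only: the type dc_model_state_t and the function dc_model_filter are hypothetical, and the sketch uses double arithmetic rather than whatever fixed-point representation the library uses internally.

```c
#include <assert.h>

/* Hypothetical per-channel state for this model (the library's
 * dcoe_chan_state_t is not reproduced here). */
typedef struct {
    double prev_x;  /* x[t-1] */
    double prev_y;  /* y[t-1] */
} dc_model_state_t;

/* Apply y[t] = R*y[t-1] + x[t] - x[t-1] to one multi-channel sample. */
void dc_model_filter(double new_output[], dc_model_state_t state[],
                     const double new_input[], unsigned chan_count)
{
    const double R = 252.0 / 256.0;
    for (unsigned ch = 0; ch < chan_count; ch++) {
        const double x = new_input[ch];  /* read first so in-place use is safe */
        const double y = R * state[ch].prev_y + x - state[ch].prev_x;
        state[ch].prev_x = x;
        state[ch].prev_y = y;
        new_output[ch] = y;
    }
}
```

Starting from zeroed state and feeding a constant input of 1000, the first output is 1000 and the second is 984.375. Each further output is R times the previous one, so a constant (DC) input decays towards zero.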
Initializes the mic array module. (Vanilla API only)
Initializes the contexts for the decimator thread and configures the clocks and ports for PDM reception.
After calling this, the PDM clock is active and signaling, but the PDM rx service (ISR) has not yet been activated, so received PDM samples are ignored. The real-time condition is not yet active.
Parameters:
pdm_res – Hardware resources required by the mic array module.
void ma_vanilla_task(chanend_t c_frames_out)
Entry point for decimator thread and PDM rx. (Vanilla API only)
This function sets up and activates the PDM rx service in ISR mode, and then immediately begins executing the decimator.
After calling this the real-time condition is active, meaning there must be another thread waiting to pull frames from the other end of c_frames_out as they become available.
Parameters:
c_frames_out – (Non-streaming) Channel over which to send processed frames of audio.
lib_xcore_math is a library of optimised math functions for taking advantage of the vector
processing unit (VPU) of the XMOS XS3 architecture (i.e. xcore.ai).
Included in the library are functions for block floating-point arithmetic, fast Fourier transforms,
linear algebra, discrete cosine transforms, linear filtering and more.
This library is organised around several sub-APIs. These APIs collect the provided operations into
coherent groups based on the kind of operation or the types of object being acted upon.
lib_xcore_math is intended to be used with XCommon CMake, the XMOS application build and
dependency management system.
lib_xcore_math can be compiled for both x86 platforms and XS3 based processors.
On x86 platforms you can develop DSP algorithms and test them for functional correctness;
this is an optional step before porting the library to an xcore device.
To use this module, include lib_xcore_math in the application’s APP_DEPENDENT_MODULES list and
include the xcore_math.h header file.
lib_xcore_math is a library containing efficient implementations of various mathematical
operations that may be required in an embedded application. In particular, this library is geared
towards operations which work on vectors or arrays of data, including vectorized arithmetic,
linear filtering, and fast Fourier transforms.
This library comprises several sub-APIs. Grouping of operations into sub-APIs is a matter of
conceptual convenience. In general, functions from a given API share a common prefix indicating
which API the function comes from, or the type of object on which it acts. Additionally, there is
some interdependence between these APIs.
These APIs are:
Block floating-point (BFP) API – High-level API providing operations on BFP
vectors. See Block Floating-Point background for an introduction to block floating-point. These functions
manage the exponents and headroom of input and output BFP vectors to avoid overflow and underflow
conditions.
Vector/Array API – Lower-level API which is used heavily by the BFP API.
As such, the operations available in this API are similar to those in the BFP API, but the user
will have to manage exponents and headroom on their own. Many of these routines are implemented
directly in optimized assembly to use the hardware as efficiently as possible.
Scalar API – Provides various operations on scalar objects. In particular,
these operations focus on simple arithmetic operations applied to non-IEEE 754 floating-point
objects, as well as optimized operations which are applied to IEEE 754 floats.
Filtering API – Provides access to linear filtering operations, including
16- and 32-bit FIR filters and 32-bit biquad filters.
Fast Fourier Transform (FFT) API – Provides both low-level and block
floating-point FFT implementations. Optimized FFT implementations are provided for real signals,
pairs of real signals, and for complex signals.
Discrete Cosine Transform (DCT) API – Provides functions which implement the
type-II (‘forward’) and
type-III (‘inverse’) DCT for
a variety of block lengths. Also provides a fast 8x8 two dimensional forward and inverse DCT.
All APIs are accessed by including the single header file xcore_math.h.
In the BFP API the BFP vectors are C structures such as bfp_s16_t, bfp_s32_t, or
bfp_complex_s32_t, backed by a memory buffer. These objects contain a pointer to the data
carrying the content (mantissas) of the vector, as well as information about the length, headroom
and exponent of the BFP vector.
Below is the definition of bfp_s32_t from xmath/types.h.
C_TYPE
typedef struct {
    /** Pointer to the underlying element buffer.*/
    int32_t* data;
    /** Exponent associated with the vector. */
    exponent_t exp;
    /** Current headroom in the ``data[]`` */
    headroom_t hr;
    /** Current size of ``data[]``, expressed in elements */
    unsigned length;
    /** BFP vector flags. Users should not normally modify these manually. */
    bfp_flags_e flags;
} bfp_s32_t;
Functions in the BFP API generally are prefixed with bfp_. More specifically, functions where
the ‘main’ operands are 32-bit BFP vectors are prefixed with bfp_s32_, whereas functions where
the ‘main’ operands are complex 16-bit BFP vectors are prefixed with bfp_complex_s16_, and so
on for the other BFP vector types.
Before calling these functions, the BFP vectors represented by the arguments must be initialized.
For bfp_s32_t this is accomplished with bfp_s32_init(). Initialization
requires that a buffer of sufficient size be provided to store the mantissa vector, as well as an
initial exponent. If the first usage of a BFP vector is as an output, then the exponent will not
matter, but the object must still be initialized before use. Additionally, the headroom of the
vector may be computed upon initialization; otherwise it is set to 0.
Here is an example of a 32-bit BFP vector being initialized.
#define LEN (20)

// The object representing the BFP vector
bfp_s32_t bfp_vect;

// buffer backing bfp_vect
int32_t data_buffer[LEN];
for(int i = 0; i < LEN; i++) data_buffer[i] = i;

// The initial exponent associated with bfp_vect
exponent_t initial_exponent = 0;

// If non-zero, `bfp_s32_init()` will compute headroom currently present in data_buffer.
// Otherwise, headroom is initialized to 0 (which is always safe but may not be optimal)
unsigned calculate_headroom = 1;

// Initialize the vector object
bfp_s32_init(&bfp_vect, data_buffer, initial_exponent, LEN, calculate_headroom);

// Go do stuff with bfp_vect
...
Once initialized, the exponent and mantissas of the vector can be accessed by bfp_vect.exp and
bfp_vect.data[] respectively, with the logical (floating-point) value of element k being
given by \(\mathtt{bfp\_vect.data[k]}\cdot2^{\mathtt{bfp\_vect.exp}}\).
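As a small illustration of that relationship, the logical value of one element can be computed on a host as below. The helper bfp_logical_value is hypothetical and takes the mantissa and exponent as plain integers instead of a bfp_s32_t, so that the sketch does not depend on the library headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: logical value mantissa * 2^exp as a double.
 * Scaling by repeated multiplication or division by 2 is exact in binary
 * floating point as long as the result stays within the range of a double. */
double bfp_logical_value(int32_t mantissa, int exp)
{
    assert(exp > -900 && exp < 900);
    double v = (double)mantissa;
    for (int i = 0; i < exp; i++) v *= 2.0;
    for (int i = 0; i > exp; i--) v /= 2.0;
    return v;
}
```

For instance, a mantissa of 3 with an exponent of -1 represents 1.5, and a mantissa of 5 with an exponent of 2 represents 20.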
The following snippet shows a function foo() which takes 3 BFP vectors, a, b and c,
as arguments. It multiplies together a and b element-wise, and then subtracts c from the
product. In this example both operations are performed in-place on a. (See
bfp_s32_mul() and bfp_s32_sub() for more information about those functions)
void foo(bfp_s32_t* a, const bfp_s32_t* b, const bfp_s32_t* c)
{
    // Multiply together a and b, updating a with the result.
    bfp_s32_mul(a, a, b);

    // Subtract c from the product, again updating a with the result.
    bfp_s32_sub(a, a, c);
}
The caller of foo() can then access the results through a. Note that the pointer a->data
was not modified during this call.
The functions in the lower-level vector API are optimized for performance. They do very little to
protect the user from mangling their data by arithmetic saturation/overflows or underflows (although
they do provide the means to prevent this).
Functions in the vector API are generally prefixed with vect_. For example, functions which
operate primarily on 16-bit vectors are prefixed with vect_s16_.
Some functions are prefixed with chunk_ instead of vect_. A “chunk” is just a vector with a
fixed memory footprint (currently 32 bytes, or 8 32-bit elements) meant to match the width of the
architecture’s vector registers.
As an example of a function from the vector API, see vect_s32_mul() (from
vect_s32.h), which multiplies together two int32_t vectors element by element.
This function takes two int32_t arrays, b and c, as inputs and one int32_t array,
a, as output (in the case of vect_s32_mul(), it is safe to have a point to the
same buffer as b or c, computing the result in-place). length indicates the number of
elements in each array. The final two parameters, b_shr and c_shr, are the arithmetic
right-shifts applied to each element of b and c before they are multiplied together.
Why the right-shifts? In the case of 32-bit multiplication, the largest possible product is
\(2^{62}\), which will not fit in the 32-bit output vector. Applying positive arithmetic
right-shifts to the input vectors reduces the largest possible product. So, the shifts are there to
manage the headroom/size of the resulting product in order to maximize precision while avoiding
overflow or saturation.
The parameters of vect_s16_mul() are similar, but instead of b_shr and c_shr, there’s only an
a_shr. Here the arithmetic right-shift a_shr is applied to the products of b and c. This
right-shift is also unsigned: it can only be used to reduce the size of the product.
Shifts like those in these two examples are very common in the vector API, as they are the main
mechanism for managing exponents and headroom. Whether the shifts are applied to inputs, outputs,
both, or only one input will depend on a number of factors. In the case of vect_s32_mul()
they are applied to inputs because the XS3 VPU includes a compulsory (hardware) right-shift of 30
bits on all products of 32-bit numbers, and so often inputs may need to be left-shifted (negative
shift) in order to avoid underflows. In the case of vect_s16_mul(), this is unnecessary
because no compulsory shift is included in 16-bit multiply-accumulates.
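The role of the input shifts can be illustrated with a scalar model of a single 32-bit element product. This is a sketch under stated assumptions, not the library's implementation: model_mul_s32 and its helpers are hypothetical names, the compulsory 30-bit shift is assumed to round, and saturation is assumed to be symmetric.

```c
#include <assert.h>
#include <stdint.h>

/* floor(x / 2^s), written with division so it is portable for negative x. */
static int64_t floor_shr64(int64_t x, unsigned s)
{
    const int64_t d = (int64_t)1 << s;
    int64_t q = x / d;
    if ((x % d) != 0 && x < 0) q -= 1;
    return q;
}

/* Symmetric saturation to 32 bits (an assumption of this model). */
static int32_t sat_s32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < -(int64_t)INT32_MAX) return -INT32_MAX;
    return (int32_t)x;
}

/* Signed shift of one element: positive shr shifts right, negative shr
 * shifts left (with saturation). */
static int32_t shr_s32(int32_t x, int shr)
{
    assert(shr > -32 && shr < 32);
    if (shr >= 0) return (int32_t)floor_shr64(x, (unsigned)shr);
    return sat_s32((int64_t)x * ((int64_t)1 << -shr));
}

/* Hypothetical model of one element of a 32-bit vector multiply: shift both
 * inputs, multiply, then apply the compulsory 30-bit right-shift. */
int32_t model_mul_s32(int32_t b, int32_t c, int b_shr, int c_shr)
{
    const int64_t p = (int64_t)shr_s32(b, b_shr) * shr_s32(c, c_shr);
    return sat_s32(floor_shr64(p + ((int64_t)1 << 29), 30));
}
```

In this model, multiplying 3 by 5 with both shifts set to 0 underflows to 0, because the product 15 is far smaller than 2^30. Using b_shr and c_shr of -15 (left-shifts) preserves the result as 15. Conversely, INT32_MIN multiplied by itself saturates with zero shifts but yields 2^30 when each input is first right-shifted by one bit.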
Functions in the vector API are in many cases closely tied to the instruction set architecture
for XS3. As such, if more efficient algorithms are found to perform an operation these low-level API
functions are more likely to change in future versions.
A standard (IEEE) floating-point object can exist either as a scalar, e.g.
// Single IEEE floating-point variable
float foo;
or as a vector, e.g.
// Array of IEEE floating-point variables
float foo[20];
Standard floating-point values carry both a mantissa \(m\) and an exponent \(p\), such that
the logical value represented by such a variable is \(m\cdot2^p\). When you have a vector of
standard floating-point values, each element of the vector carries its own mantissa and its own
exponent: \(m[k]\cdot2^{p[k]}\).
By contrast, block floating-point objects have a vector of mantissas \(\bar{m}\) which all share
the same exponent \(p\), such that the logical value of the element at index \(k\) is
\(m[k]\cdot2^p\).
struct {
    // Array of mantissas
    int32_t mant[20];
    // Shared exponent
    int32_t exp;
} bfp_vect;
With a given exponent, \(p\), the largest value that can be represented by a 32-bit BFP vector
is given by a maximal mantissa (\(2^{31}-1\)), for a logical value of
\((2^{31}-1)\cdot2^p\). The smallest non-zero value that an element can represent is
\(1\cdot2^p\).
Because all elements must share a single exponent, in order to avoid overflow or saturation of the
largest magnitude values, the exponent of a BFP vector is constrained by the element with the
largest (logical) value. The drawback to this is that when the elements of a BFP vector represent a
large dynamic range – that is, where the largest magnitude element is many, many times larger than
the smallest (non-zero) magnitude element – the smaller magnitude elements effectively have fewer
bits of precision.
Consider a 2-element BFP vector intended to carry the values \(2^{20}\) and \(255 \cdot
2^{-10}\). One way this vector can be represented is to use an exponent of \(0\).
In the diagram above, the fractional bits (shown in red text) are discarded, as the mantissa is only
32 bits. Then, with \(0\) as the exponent, mant[1] underflows to \(0\). Meanwhile, the
11 most significant bits of mant[0] are all zeros.
The headroom of a signed integer is the number of redundant leading sign bits. Equivalently, it is
the number of bits that a mantissa can be left-shifted without losing any information. In the
diagram, the bits corresponding to headroom are shown in green text. Here mant[0] has 10 bits of
headroom and mant[1] has a full 32 bits of headroom. (mant[0] does not have 11 bits of
headroom because in two’s complement the MSb serves as a sign bit). The headroom for a BFP vector is
the minimum of headroom amongst each of its elements; in this case, 10 bits.
If we remove headroom from one mantissa of a BFP vector, all other mantissas must shift by the same
number of bits, and the vector’s exponent must be adjusted accordingly. A left-shift of one bit
corresponds to reducing the exponent by 1, because a single bit left-shift corresponds to
multiplication by 2.
In this case, if we remove 10 bits of headroom and subtract 10 from the exponent we get the
following:
Now, no information is lost in either element. One of the main goals of BFP arithmetic is to keep
the headroom in BFP vectors to the minimum necessary (equivalently, keeping the exponent as small as
possible). That allows for maximum effective precision of the elements in the vector.
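The headroom bookkeeping described here can be sketched in portable C. The function names below are hypothetical and are not the library's API; the sketch follows the convention used in the text that a zero mantissa has 32 bits of headroom.

```c
#include <assert.h>
#include <stdint.h>

/* Headroom of one signed 32-bit mantissa: number of redundant sign bits. */
unsigned example_headroom_s32(int32_t x)
{
    if (x == 0) return 32;  /* convention taken from the text */
    const uint32_t u = (x < 0) ? ~(uint32_t)x : (uint32_t)x;
    unsigned leading = 0;
    for (uint32_t mask = 0x80000000u; mask != 0 && (u & mask) == 0; mask >>= 1) {
        leading++;
    }
    return leading - 1;  /* one of the leading bits is the sign bit itself */
}

/* Headroom of a vector: the minimum headroom over its elements. */
unsigned example_vect_headroom_s32(const int32_t mant[], unsigned length)
{
    unsigned hr = 32;
    for (unsigned k = 0; k < length; k++) {
        const unsigned h = example_headroom_s32(mant[k]);
        if (h < hr) hr = h;
    }
    return hr;
}

/* Remove all headroom: left-shift every mantissa and reduce the exponent. */
void example_remove_headroom_s32(int32_t mant[], unsigned length, int *exp)
{
    assert(exp != 0);
    const unsigned hr = example_vect_headroom_s32(mant, length);
    if (hr == 0 || hr >= 32) return;  /* nothing to remove, or all zeros */
    for (unsigned k = 0; k < length; k++) {
        /* Multiply instead of shifting: left-shifting negatives is undefined. */
        mant[k] = (int32_t)((int64_t)mant[k] * ((int64_t)1 << hr));
    }
    *exp -= (int)hr;
}
```

Applied to the mantissas {2^20, 0} with exponent 0, the vector headroom is 10, and removing it gives {2^30, 0} with exponent -10. The element that underflowed stays at zero, which is why the exponent should be chosen before quantizing: with an exponent of -10, the value 255·2^-10 is stored exactly as the mantissa 255.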
Note that the headroom of a vector also tells you something about the size of the largest magnitude
mantissa in the vector. That information (in conjunction with exponents) can be used to determine
the largest possible output of an operation without having to look at the mantissas.
For this reason, the BFP vectors in lib_xcore_math carry a field which tracks their current
headroom. The functions in the BFP API use this property to make determinations about how best to
preserve precision.
Each of the main operand types used in this library has a short-hand which is used as a prefix in
the naming of API operations. The following tables can be used for reference.
Complex 16-bit vectors are usually represented as a pair of 16-bit vectors. This is an optimization due to the word-alignment requirement when loading data into the VPU’s vector registers.
chunk_s32 (int32_t[8]) – A ‘chunk’ is a fixed size vector corresponding to the size of the VPU vector registers.
vect_qXX (int32_t[]) – When used in an API function name, the XX will be an actual number (e.g. vect_q30_exp_small()) indicating the fixed-point interpretation used by that function.
The logical quantity represented by each element of this vector is: data[i]*2^(exp) where the multiplication and exponentiation are using real (non-modular) arithmetic.
The BFP API keeps the hr field up-to-date with the current headroom of data[] so as to minimize precision loss as elements become small.
The logical quantity represented by each element of this vector is: data[i]*2^(exp) where the multiplication and exponentiation are using real (non-modular) arithmetic.
The BFP API keeps the hr field up-to-date with the current headroom of data[] so as to minimize precision loss as elements become small. [bfp_s16_t]
The logical quantity represented by each element of this vector is: data[k].re*2^(exp)+i*data[k].im*2^(exp) where the multiplication and exponentiation are using real (non-modular) arithmetic, and i is sqrt(-1)
The BFP API keeps the hr field up-to-date with the current headroom of data[] so as to minimize precision loss as elements become small. [bfp_complex_s32_t]
The logical quantity represented by each element of this vector is: data[k].re*2^(exp)+i*data[k].im*2^(exp) where the multiplication and exponentiation are using real (non-modular) arithmetic, and i is sqrt(-1)
The BFP API keeps the hr field up-to-date with the current headroom of data[] so as to minimize precision loss as elements become small. [bfp_complex_s16_t]
Many places in this API make use of integers representing the exponent associated with some floating-point value or block floating-point vector.
For a floating-point value \(x \cdot 2^p\), \(p\) is the exponent, and may usually be positive or negative.
typedef unsigned headroom_t
Headroom of some integer or integer array.
Represents the headroom of a signed or unsigned integer, complex integer or channel pair, or the headroom of the mantissa array of a block floating-point vector.
typedef int right_shift_t
A rightwards arithmetic bit-shift.
Represents a right bit-shift to be applied to an integer. May be signed or unsigned, depending on context. If signed, negative values represent leftward bit-shifts.
Represents a left bit-shift to be applied to an integer. May be signed or unsigned, depending on context. If signed, negative values represent rightward bit-shifts.
A complex floating-point scalar with a complex 16-bit mantissa.
Represents a (non-standard) complex floating-point value given by \( (A + j\cdot B) \cdot 2^{x} \), where \(A\) is mant.re, the 16-bit real part of the mantissa, \(B\) is mant.im, the 16-bit imaginary part of the mantissa, and \(x\) is the exponent exp.
A complex floating-point scalar with a complex 32-bit mantissa.
Represents a (non-standard) complex floating-point value given by \( (A + j\cdot B) \cdot 2^{x} \), where \(A\) is mant.re, the 32-bit real part of the mantissa, \(B\) is mant.im, the 32-bit imaginary part of the mantissa, and \(x\) is the exponent exp.
A complex floating-point scalar with a complex 64-bit mantissa.
Represents a (non-standard) complex floating-point value given by \( (A + j\cdot B) \cdot 2^{x} \), where \(A\) is mant.re, the 64-bit real part of the mantissa, \(B\) is mant.im, the 64-bit imaginary part of the mantissa, and \(x\) is the exponent exp.
Holds a set of sixteen 32-bit accumulators in the XS3 VPU’s internal format.
The XS3 VPU stores 32-bit accumulators with the most significant 16 bits stored in one 256-bit vector register (called vD), and the least significant 16 bits stored in another 256-bit register (called vR). This struct reflects that internal format, and is occasionally used to store intermediate results.
Note
vR is unsigned. This reflects the fact that a signed 32-bit integer 0xSTUVWXYZ is always exactly 0x0000WXYZ larger than 0xSTUV0000. To combine the upper and lower 16 bits of an accumulator, use (((int32_t)vD[k])<<16)+vR[k].
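The merge expression in the note above can be tried out on a host machine with the sketch below. The helper names `example_merge_split_acc` and `example_split_acc` are hypothetical and are not part of the library; they only model the arithmetic for a single accumulator.

```c
#include <stdint.h>

/* Hypothetical helper (not a library function): combine the upper half (vD)
 * and the unsigned lower half (vR) of one split accumulator. */
static int32_t example_merge_split_acc(int16_t vd, uint16_t vr)
{
    /* Multiplying by 65536 gives the same result as the <<16 in the note,
     * but avoids left-shifting a negative value, which is undefined in ISO C. */
    return (int32_t)vd * 65536 + (int32_t)vr;
}

/* Hypothetical inverse: split an int32_t into its vD and vR halves. */
static void example_split_acc(int32_t acc, int16_t *vd, uint16_t *vr)
{
    *vr = (uint16_t)((uint32_t)acc & 0xFFFFu);
    /* (acc - vr) is an exact multiple of 65536, so this division is exact. */
    *vd = (int16_t)((acc - (int32_t)*vr) / 65536);
}
```

Because vR is treated as unsigned, the value -1 is stored as vD = -1 and vR = 0xFFFF, and the merge yields -65536 + 65535 = -1.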
The tables below list the functions of the block floating-point API. The “EW” column indicates
whether the operation acts element-wise.
The “Signature” column is intended as a hint which quickly conveys the kind of the conceptual inputs
to and outputs from the operation. The signatures are only intended to convey how many (conceptual)
inputs and outputs there are, and their dimensionality.
The functions themselves will typically take more arguments than these signatures indicate. Check
the function’s full documentation to get more detailed information.
The following symbols are used in the signatures:
Symbol – Description
\(\mathbb{S}\) – A scalar input or output value.
\(\mathbb{V}\) – A vector-valued input or output.
\(\mathbb{M}\) – A matrix-valued input or output.
\(\varnothing\) – Placeholder indicating no input or output.
For example, the operation signature \((\mathbb{V \times V \times S}) \to \mathbb{V}\) indicates
the operation takes two vector inputs and a scalar input, and the output is a vector.
This function initializes each of the fields of BFP vector a.
data points to the memory buffer used to store elements of the vector, so it must be at least length*2 bytes long, and must begin at a word-aligned address.
exp is the exponent assigned to the BFP vector. The logical value associated with the kth element of the vector after initialization is \( data_k \cdot 2^{exp} \).
If calc_hr is false, a->hr is initialized to 0. Otherwise, the headroom of the BFP vector is calculated and used to initialize a->hr.
Parameters:
a – [out] BFP vector to initialize
data – [in] int16_t buffer used to back a
exp – [in] Exponent of BFP vector
length – [in] Number of elements in the BFP vector
calc_hr – [in] Boolean indicating whether the HR of the BFP vector should be calculated
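As a rough illustration of the relationship \( data_k \cdot 2^{exp} \) described above, the following host-side sketch converts mantissas to their logical values. `example_scale_pow2` and `example_logical_value` are hypothetical names used only here; the first stands in for `ldexp()` so that the snippet has no dependency on the math library.

```c
#include <stdint.h>

/* Scale m by 2^e using exact doublings/halvings (stands in for ldexp()). */
static double example_scale_pow2(double m, int e)
{
    while (e > 0) { m *= 2.0; e--; }
    while (e < 0) { m *= 0.5; e++; }
    return m;
}

/* Hypothetical helper: logical value of element k of a 16-bit BFP vector
 * backed by mantissa buffer `data` with shared exponent `exp`. */
static double example_logical_value(const int16_t *data, int exp, unsigned k)
{
    return example_scale_pow2((double)data[k], exp);
}
```

With an exponent of -2, for instance, the mantissa 3 represents the logical value 0.75.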
Dynamically allocate a 16-bit BFP vector from the heap.
If allocation was unsuccessful, the data field of the returned vector will be NULL, and the length field will be zero. Otherwise, data will point to the allocated memory and the length field will be the user-specified length. The length argument must not be zero.
Neither the BFP exponent, headroom, nor the elements of the allocated mantissa vector are set by this function. To set the BFP vector elements to a known value, use bfp_s16_set() on the returned BFP vector.
BFP vectors allocated using this function must be deallocated using bfp_s16_dealloc() to avoid a memory leak.
To initialize a BFP vector using static memory allocation, use bfp_s16_init() instead.
Dynamic allocation of BFP vectors relies on allocation from the heap, and offers no guarantees about the execution time. Use of this function in any time-critical section of code is highly discouraged.
Parameters:
length – [in] The length of the BFP vector to be allocated (in elements)
Deallocate a 16-bit BFP vector allocated by bfp_s16_alloc().
Use this function to free the heap memory allocated by bfp_s16_alloc().
BFP vectors whose mantissa buffer was (successfully) dynamically allocated have a flag set which indicates as much. This function can safely be called on any bfp_s16_t which has not had its flags or data manually manipulated, including:
The headroom of a vector is the number of bits its elements can be left-shifted without losing any information. It conveys information about the range of values that vector may contain, which is useful for determining how best to preserve precision in potentially lossy block floating-point operations.
In a BFP context, headroom applies to mantissas only, not exponents.
In particular, if the 16-bit mantissa vector \(\bar x\) has \(N\) bits of headroom, then for any element \(x_k\) of \(\bar x\)
\(-2^{15-N} \le x_k < 2^{15-N}\)
And for any element \(X_k = x_k \cdot 2^{x\_exp}\) of a BFP vector \(\bar X\)
\(-2^{15 + x\_exp - N} \le X_k < 2^{15 + x\_exp - N}\)
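The mantissa bound \(-2^{15-N} \le x_k < 2^{15-N}\) can be turned directly into a portable headroom calculation. The sketch below is a reference model under that definition only; the function names are hypothetical, and the library itself is expected to use faster, hardware-specific methods.

```c
#include <stdint.h>

/* Hypothetical reference model: headroom of one 16-bit mantissa, i.e. the
 * largest N (at most 15) with -2^(15-N) <= x < 2^(15-N). */
static unsigned example_headroom_s16(int16_t x)
{
    unsigned n = 0;
    while (n < 15) {
        /* Bound that x must satisfy to have n+1 bits of headroom. */
        int32_t bound = (int32_t)1 << (14 - n);
        if (x < -bound || x >= bound) break;
        n++;
    }
    return n;
}

/* Headroom of a vector is the minimum headroom over its elements. */
static unsigned example_vect_headroom_s16(const int16_t *x, unsigned len)
{
    unsigned hr = 15;
    for (unsigned k = 0; k < len; k++) {
        unsigned h = example_headroom_s16(x[k]);
        if (h < hr) hr = h;
    }
    return hr;
}
```

For the vector {1000, -3, 12} the largest-magnitude element is 1000, which lies in \([-2^{10}, 2^{10})\), so this model reports 5 bits of headroom.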
Modify a 16-bit BFP vector to use a specified exponent.
This function forces BFP vector \(\bar A\) to use a specified exponent. The mantissa vector \(\bar a\) will be bit-shifted left or right to compensate for the changed exponent.
This function can be used, for example, before calling a fixed-point arithmetic function to ensure the underlying mantissa vector has the needed Q-format. As another example, this may be useful when communicating with peripheral devices (e.g. via I2S) that require sample data to be in a specified format.
Note that this sets the current encoding, and does not fix the exponent permanently (i.e. subsequent operations may change the exponent as usual).
If the required fixed-point Q-format is QX.Y, where Y is the number of fractional bits in the resulting mantissas, then the associated exponent (and value for parameter exp) is -Y.
a points to input BFP vector \(\bar A\), with mantissa vector \(\bar a\) and exponent \(a\_exp\). a is updated in place to produce resulting BFP vector \(\bar{\tilde{A}}\) with mantissa vector \(\bar{\tilde{a}}\) and exponent \(\tilde{a}\_exp\).
exp is \(\tilde{a}\_exp\), the required exponent. \(\Delta{}p = \tilde{a}\_exp - a\_exp\) is the required change in exponent.
If \(\Delta{}p = 0\), the BFP vector is left unmodified.
If \(\Delta{}p > 0\), the required exponent is larger than the current exponent and an arithmetic right-shift of \(\Delta{}p\) bits is applied to the mantissas \(\bar a\). When applying a right-shift, precision may be lost by discarding the \(\Delta{}p\) least significant bits.
If \(\Delta{}p < 0\), the required exponent is smaller than the current exponent and a left-shift of \(-\Delta{}p\) bits is applied to the mantissas \(\bar a\). When left-shifting, saturation logic will be applied such that any element that can’t be represented exactly with the new exponent will saturate to the 16-bit saturation bounds.
The exponent and headroom of a are updated by this function.
Operation Performed
\[\begin{split}\begin{aligned}
& \Delta{}p = \tilde{a}\_exp - a\_exp \\
& \tilde{a_k} \leftarrow sat_{16}( a_k \cdot 2^{-\Delta{}p} ) \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{A} \text{ (in elements) }
\end{aligned}\end{split}\]
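The three cases for \(\Delta{}p\) can be modeled on plain arrays as follows. This is a minimal host-side sketch, not the library's implementation: the function names are hypothetical, and saturation to the symmetric range \(\pm(2^{15}-1)\) is an assumption made here.

```c
#include <stdint.h>

/* Saturate to the symmetric 16-bit range (an assumption in this sketch). */
static int16_t example_sat16(int32_t v)
{
    if (v > INT16_MAX)  return INT16_MAX;
    if (v < -INT16_MAX) return -INT16_MAX;
    return (int16_t)v;
}

/* floor(v / 2^s), written without right-shifting negative values
 * (which is implementation-defined in C). */
static int32_t example_floor_shr(int32_t v, unsigned s)
{
    if (s > 16) s = 16;  /* a 16-bit input is fully shifted out by then */
    int32_t d = (int32_t)1 << s;
    return (v >= 0) ? v / d : -((-v + d - 1) / d);
}

/* Hypothetical model: re-encode mantissas a[0..n-1] from exponent *exp
 * to new_exp, updating *exp. */
static void example_use_exponent_s16(int16_t *a, unsigned n, int *exp, int new_exp)
{
    int delta_p = new_exp - *exp;
    for (unsigned k = 0; k < n; k++) {
        if (delta_p > 0) {
            /* Larger exponent: right-shift, discarding low bits. */
            a[k] = (int16_t)example_floor_shr(a[k], (unsigned)delta_p);
        } else if (delta_p < 0) {
            /* Smaller exponent: left-shift by -delta_p with saturation.
             * Any nonzero mantissa already saturates at 15 bits. */
            int shl = (-delta_p > 15) ? 15 : -delta_p;
            a[k] = example_sat16((int32_t)a[k] * ((int32_t)1 << shl));
        }
    }
    *exp = new_exp;
}
```

For example, requesting a Q1.15 encoding corresponds to `new_exp = -15`. A vector holding mantissa 1000 at exponent -10 that is moved to exponent -8 ends up with mantissa 250, which represents the same logical value.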
Apply a left-shift to the mantissas of a 16-bit BFP vector.
Each mantissa of input BFP vector \(\bar B\) is left-shifted b_shl bits and stored in the corresponding element of output BFP vector \(\bar A\).
This operation can be used to add or remove headroom from a BFP vector.
b_shl is the number of bits that each mantissa will be left-shifted. This shift is signed and arithmetic, so negative values for b_shl will right-shift the mantissas.
a and b must have been initialized (see bfp_s16_init()), and must be the same length.
This operation can be performed safely in-place on b.
Note that this operation bypasses the logic protecting the caller from saturation or underflows. Output values saturate to the symmetric 16-bit range (the open interval \((-2^{15}, 2^{15})\)). To avoid saturation, b_shl should be no greater than the headroom of b (b->hr).
Operation Performed
\[\begin{split}\begin{aligned}
& a_k \leftarrow sat_{16}( \lfloor b_k \cdot 2^{b\_shl} \rfloor ) \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B} \\
& \qquad\text{ and } b_k \text{ and } a_k \text{ are the } k\text{th mantissas from }
\bar{B}\text{ and } \bar{A}\text{ respectively}
\end{aligned}\end{split}\]
Parameters:
a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)
b_shl – [in] Signed arithmetic left-shift to be applied to mantissas of \(\bar B\).
Multiply one 16-bit BFP vector by another element-wise.
Multiply each element of input BFP vector \(\bar B\) by the corresponding element of input BFP vector \(\bar C\) and store the results in output BFP vector \(\bar A\).
a, b and c must have been initialized (see bfp_s16_init()), and must be the same length.
This operation can be performed safely in-place on b or c.
Operation Performed
\[\begin{split}\begin{aligned}
& A_k \leftarrow B_k \cdot C_k \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C}
\end{aligned}\end{split}\]
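Since \(B_k \cdot C_k = (b_k \cdot c_k) \cdot 2^{b\_exp + c\_exp}\), an element-wise BFP multiply amounts to multiplying mantissas and adding exponents, followed by a shift that fits the products back into 16 bits. The sketch below illustrates that idea on a host machine. The function name, the way the shift is chosen, and the truncating rounding are all assumptions of this example and may differ from what the library does.

```c
#include <stdint.h>

/* Hypothetical reference model of an element-wise 16-bit BFP multiply. */
static void example_bfp_mul_s16(int16_t *a, int *a_exp,
                                const int16_t *b, int b_exp,
                                const int16_t *c, int c_exp, unsigned n)
{
    /* Pass 1: find the largest product magnitude. */
    int64_t max_mag = 0;
    for (unsigned k = 0; k < n; k++) {
        int64_t p = (int64_t)b[k] * c[k];
        if (p < 0) p = -p;
        if (p > max_mag) max_mag = p;
    }

    /* Smallest right-shift that brings every product into int16 range. */
    unsigned shr = 0;
    while ((max_mag >> shr) > INT16_MAX) shr++;

    /* Pass 2: shift each product (truncating toward zero) and store it. */
    for (unsigned k = 0; k < n; k++) {
        int64_t p = (int64_t)b[k] * c[k];
        int64_t q = (p >= 0) ? (p >> shr) : -((-p) >> shr);
        a[k] = (int16_t)q;
    }

    /* The discarded bits are accounted for in the output exponent. */
    *a_exp = b_exp + c_exp + (int)shr;
}
```

The output exponent grows by exactly the number of bits shifted out, so each output element approximates \(B_k \cdot C_k\) to within the precision of a 16-bit mantissa.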
Sum the elements of input BFP vector \(\bar B\) to get a result \(A = a \cdot 2^{a\_exp}\), which is returned. The returned value has a 32-bit mantissa.
\[\begin{split}\begin{aligned}
& A \leftarrow \sum_{k=0}^{N-1} \left( B_k \right) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Compute the inner product of two 16-bit BFP vectors.
Adds together the element-wise products of input BFP vectors \(\bar B\) and \(\bar C\) for a result \(A = a \cdot 2^{a\_exp}\), where \(a\) is the 64-bit mantissa of the result and \(a\_exp\) is its associated exponent. \(A\) is returned.
b and c must have been initialized (see bfp_s16_init()), and must be the same length.
Operation Performed
\[\begin{split}\begin{aligned}
& a \cdot 2^{a\_exp} \leftarrow \sum_{k=0}^{N-1} \left( B_k \cdot C_k \right) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C}
\end{aligned}\end{split}\]
Parameters:
b – [in] Input BFP vector \(\bar B\)
c – [in] Input BFP vector \(\bar C\)
Returns:
\(A\), the inner product of vectors \(\bar B\) and \(\bar C\)
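The inner product can be modeled by summing mantissa products in a 64-bit accumulator and adding the two exponents. In the sketch below, `example_float_s64` is a hypothetical stand-in for the returned scalar type, and `example_dot_s16` is not a library function.

```c
#include <stdint.h>

/* Hypothetical stand-in for a scalar with a 64-bit mantissa. */
typedef struct {
    int64_t mant;
    int     exp;
} example_float_s64;

/* Hypothetical reference model: inner product of two 16-bit BFP vectors. */
static example_float_s64 example_dot_s16(const int16_t *b, int b_exp,
                                         const int16_t *c, int c_exp,
                                         unsigned n)
{
    example_float_s64 a = { 0, b_exp + c_exp };
    for (unsigned k = 0; k < n; k++) {
        /* Each product fits in 32 bits; the sum is kept in 64 bits. */
        a.mant += (int64_t)b[k] * c[k];
    }
    return a;
}
```

With mantissas {1, 2, 3} at exponent -2 and {4, 5, 6} at exponent 1, this gives mantissa 32 and exponent -1, i.e. a logical value of 16.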
Clamp the elements of a 16-bit BFP vector to a specified range.
Each element \(A_k\) of output BFP vector \(\bar A\) is set to the corresponding element \(B_k\) of input BFP vector \(\bar B\) if it is in the range \( [ L \cdot 2^{bound\_exp}, U \cdot 2^{bound\_exp} ] \), otherwise it is set to the nearest value inside that range.
a and b must have been initialized (see bfp_s16_init()), and must be the same length.
This operation can be performed safely in-place on b.
Operation Performed
\[\begin{split}\begin{aligned}
& A_k \leftarrow \begin{cases}
L \cdot 2^{bound\_exp} & B_k < L \cdot 2^{bound\_exp} \\
U \cdot 2^{bound\_exp} & B_k > U \cdot 2^{bound\_exp} \\
B_k & otherwise
\end{cases} \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Parameters:
a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)
lower_bound – [in] Mantissa of the lower clipping bound, \(L\)
upper_bound – [in] Mantissa of the upper clipping bound, \(U\)
bound_exp – [in] Shared exponent of the clipping bounds
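Note that the bounds share their own exponent, which need not equal the exponent of \(\bar B\). The sketch below works in the logical (double) domain to make that explicit. The names are hypothetical, and the library is expected to operate on mantissas rather than on doubles.

```c
#include <stdint.h>

/* Scale m by 2^e using exact doublings/halvings (stands in for ldexp()). */
static double example_scale_pow2(double m, int e)
{
    while (e > 0) { m *= 2.0; e--; }
    while (e < 0) { m *= 0.5; e++; }
    return m;
}

/* Hypothetical reference model: clamp the logical values of a 16-bit BFP
 * vector to [lower * 2^bound_exp, upper * 2^bound_exp]. */
static void example_clip_s16(double *out, const int16_t *b, int b_exp,
                             unsigned n, int16_t lower, int16_t upper,
                             int bound_exp)
{
    double lo = example_scale_pow2((double)lower, bound_exp);
    double hi = example_scale_pow2((double)upper, bound_exp);
    for (unsigned k = 0; k < n; k++) {
        double v = example_scale_pow2((double)b[k], b_exp);
        if (v < lo)      out[k] = lo;
        else if (v > hi) out[k] = hi;
        else             out[k] = v;
    }
}
```

With lower_bound = -4, upper_bound = 4 and bound_exp = -1 the clipping range is [-2, 2], so an input element of 6.25 is clamped to 2.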
Each element \(A_k\) of output BFP vector \(\bar A\) is set to the corresponding element \(B_k\) of input BFP vector \(\bar B\) if it is non-negative, otherwise it is set to \(0\).
a and b must have been initialized (see bfp_s16_init()), and must be the same length.
This operation can be performed safely in-place on b.
Operation Performed
\[\begin{split}\begin{aligned}
& A_k \leftarrow \begin{cases}
0 & B_k < 0 \\
B_k & otherwise
\end{cases} \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Convert a 16-bit BFP vector into a 32-bit BFP vector.
Increases the bit-depth of each 16-bit element \(B_k\) of input BFP vector \(\bar B\) to 32 bits, and stores the 32-bit result in the corresponding element \(A_k\) of output BFP vector \(\bar A\).
Sum the absolute values of elements of a 16-bit BFP vector.
Sum the absolute values of elements of input BFP vector \(\bar B\) for a result \(A = a \cdot 2^{a\_exp}\), where \(a\) is a 32-bit mantissa and \(a\_exp\) is its associated exponent. \(A\) is returned.
\[\begin{split}\begin{aligned}
& A \leftarrow \sum_{k=0}^{N-1} \left| B_k \right| \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Parameters:
b – [in] Input BFP vector \(\bar B\)
Returns:
\(A\), the sum of absolute values of elements of \(\bar B\)
Computes \(A = a \cdot 2^{a\_exp}\), the mean value of elements of input BFP vector \(\bar B\), where \(a\) is the 16-bit mantissa of the result, and \(a\_exp\) is its associated exponent. \(A\) is returned.
\[\begin{split}\begin{aligned}
& A \leftarrow \frac{1}{N} \sum_{k=0}^{N-1} \left( B_k \right) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Get the energy (sum of squared elements) of a 16-bit BFP vector.
Computes \(A = a \cdot 2^{a\_exp}\), the sum of squares of elements of input BFP vector \(\bar B\), where \(a\) is the 64-bit mantissa of the result, and \(a\_exp\) is its associated exponent. \(A\) is returned.
\[\begin{split}\begin{aligned}
& A \leftarrow \sum_{k=0}^{N-1} \left( B_k^2 \right) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
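Because \(B_k^2 = b_k^2 \cdot 2^{2 \cdot b\_exp}\), the energy can be modeled by summing squared mantissas in 64 bits and doubling the exponent. The following is only a sketch of that relationship; `example_energy` and `example_energy_s16` are hypothetical names, not library definitions.

```c
#include <stdint.h>

/* Hypothetical stand-in for a scalar with a 64-bit mantissa. */
typedef struct {
    int64_t mant;
    int     exp;
} example_energy;

/* Hypothetical reference model: sum of squares of a 16-bit BFP vector. */
static example_energy example_energy_s16(const int16_t *b, int b_exp, unsigned n)
{
    example_energy a = { 0, 2 * b_exp };  /* squaring doubles the exponent */
    for (unsigned k = 0; k < n; k++) {
        a.mant += (int64_t)b[k] * b[k];
    }
    return a;
}
```

For mantissas {3, -4} at exponent -1 (logical values 1.5 and -2), this gives mantissa 25 at exponent -2, i.e. 6.25 = 2.25 + 4.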
Get the RMS value of elements of a 16-bit BFP vector.
Computes \(A = a \cdot 2^{a\_exp}\), the RMS value of elements of input BFP vector \(\bar B\), where \(a\) is the 32-bit mantissa of the result, and \(a\_exp\) is its associated exponent. \(A\) is returned.
The RMS (root-mean-square) value of a vector is the square root of the mean of the squares of the vector’s elements.
\[\begin{split}\begin{aligned}
& A \leftarrow \sqrt{\frac{1}{N}\sum_{k=0}^{N-1} \left( B_k^2 \right) } \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
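As a small worked example of the formula above, take a vector with \(N = 2\) elements, \(B_0 = 3\) and \(B_1 = 4\):
\[\begin{split}\begin{aligned}
& A = \sqrt{\frac{1}{2}\left( 3^2 + 4^2 \right)} = \sqrt{12.5} \approx 3.536
\end{aligned}\end{split}\]
Note the division by \(N\) inside the square root: the square root of the plain sum of squares would instead give \(\sqrt{25} = 5\).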
\[\begin{split}\begin{aligned}
& A \leftarrow max\left(B_0, B_1, ..., B_{N-1} \right) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Get the element-wise maximum of two 16-bit BFP vectors.
Each element of output vector \(\bar A\) is set to the maximum of the corresponding elements in the input vectors \(\bar B\) and \(\bar C\).
a, b and c must have been initialized (see bfp_s16_init()), and must be the same length.
This operation can be performed safely in-place on b, but not on c.
Operation Performed
\[\begin{split}\begin{aligned}
& A_k \leftarrow max(B_k, C_k) \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C}
\end{aligned}\end{split}\]
\[\begin{split}\begin{aligned}
& A \leftarrow min\left(B_0, B_1, ..., B_{N-1} \right) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Get the element-wise minimum of two 16-bit BFP vectors.
Each element of output vector \(\bar A\) is set to the minimum of the corresponding elements in the input vectors \(\bar B\) and \(\bar C\).
a, b and c must have been initialized (see bfp_s16_init()), and must be the same length.
This operation can be performed safely in-place on b, but not on c.
Operation Performed
\[\begin{split}\begin{aligned}
& A_k \leftarrow min(B_k, C_k) \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C}
\end{aligned}\end{split}\]
Get the index of the maximum value of a 16-bit BFP vector.
Finds \(a\), the index of the maximum value among the elements of input BFP vector \(\bar B\). \(a\) is returned by this function.
If i is the value returned, then the maximum value in \(\bar B\) is ldexp(b->data[i],b->exp).
Operation Performed
\[\begin{split}\begin{aligned}
& a \leftarrow argmax_k\left(b_k\right) \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Notes
If there is a tie for maximum value, the lowest tying index is returned.
Parameters:
b – [in] Input vector
Returns:
\(a\), the index of the maximum value from \(\bar B\)
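Since every element shares the same exponent, the search can be done on mantissas alone. The sketch below models the tie-breaking rule from the notes above; `example_argmax_s16` is a hypothetical name, not a library function.

```c
#include <stdint.h>

/* Hypothetical reference model: index of the largest mantissa.
 * n must be at least 1. */
static unsigned example_argmax_s16(const int16_t *b, unsigned n)
{
    unsigned best = 0;
    for (unsigned k = 1; k < n; k++) {
        /* Strict '>' keeps the lowest index when values tie. */
        if (b[k] > b[best]) best = k;
    }
    return best;
}
```

For the mantissas {1, 5, 5, 2} the maximum appears at indices 1 and 2, and the model returns 1.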
Get the index of the minimum value of a 16-bit BFP vector.
Finds \(a\), the index of the minimum value among the elements of input BFP vector \(\bar B\). \(a\) is returned by this function.
If i is the value returned, then the minimum value in \(\bar B\) is ldexp(b->data[i],b->exp).
Operation Performed
\[\begin{split}\begin{aligned}
& a \leftarrow argmin_k\left(b_k\right) \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Notes
If there is a tie for minimum value, the lowest tying index is returned.
Parameters:
b – [in] Input vector
Returns:
\(a\), the index of the minimum value from \(\bar B\)
Accumulate a 16-bit BFP vector into a 32-bit accumulator vector.
This function is used for efficiently accumulating a series of 16-bit BFP vectors into a 32-bit vector. Each call to this function adds a BFP vector \(\bar B\) into the persistent 32-bit accumulator vector \(\bar A\).
Eventually the value of \(\bar A\) will be needed for something other than simple accumulation, which requires converting from the XS3-native split accumulator representation given by the split_acc_s32_t struct, into a standard vector of int32_t. This can be accomplished using vect_s32_merge_accs(). From there, the int32_t vector can be dropped to a 16-bit vector with vect_s32_to_vect_s16() if needed.
Note, in order for this operation to work, \(\mathtt{b\_exp} - \mathtt{a\_exp}\) must be no greater than \(14\).
Proper use of this function requires some book-keeping on the part of the caller. In particular, the caller is responsible for tracking the exponent and monitoring the headroom of the accumulator vector \(\bar A\).
Usage
To begin a sequence of accumulation, start by clearing the contents of \(\bar A\) to all zeros. Then, an appropriate exponent for \(\bar A\) must be chosen. The only hard constraint is that the accumulator exponent, \(\mathtt{a\_exp}\) must be within \(14\) of \(\bar B\)’s exponent, \(\mathtt{b\_exp}\). If \(\mathtt{b\_exp}\) is unknown, the caller may choose to wait until the first \(\bar B\) is available before initializing \(\mathtt{a\_exp}\).
As vectors are accumulated into \(\bar A\) with multiple calls to this function, it becomes possible for \(\bar A\) to saturate for some element. Each call to this function returns the headroom of \(\bar A\) (note: no more than 15 bits of headroom will be reported). If \(\bar A\) has at least 1 bit of headroom, then a call to this function is guaranteed not to saturate.
The larger \(\mathtt{a\_exp}\) is compared to each \(\mathtt{b\_exp}\), the more 16-bit vectors can be accumulated before saturation becomes possible (and by virtue of that, the more efficiently accumulation can take place). On the other hand, as long as \(\mathtt{a\_exp} \le \mathtt{b\_exp}\), there is no precision loss during accumulation. It is the responsibility of the caller to manage this trade-off.
If and when this function reports that \(\bar A\) has 0 headroom, if further accumulation is needed, the caller can handle this by increasing \(\mathtt{a\_exp}\). Increasing \(\mathtt{a\_exp}\) will require that the contents of the mantissa vector \(\bar a\) be right-shifted to avoid corrupting the value of \(\bar A\), making room for further accumulation in the process. Shifting the split accumulators can be accomplished with a call to vect_split_acc_s32_shr().
Finally, when accumulation is complete or the accumulator values must be used elsewhere, the split accumulator vector can be converted to a simple int32_t vector with a call to vect_s32_merge_accs().
Parameters:
a – [inout] Mantissas of accumulator vector \(\bar A\)
a_exp – [in] Exponent of accumulator vector \(\bar A\)
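The book-keeping described under Usage can be sketched on a host machine as shown below. To keep the example short, it uses plain `int32_t` accumulators instead of the split representation and only handles \(0 \le \mathtt{b\_exp} - \mathtt{a\_exp} \le 14\). All function names are hypothetical, and the truncating rescale is an assumption that may not match the library's shift.

```c
#include <assert.h>
#include <stdint.h>

/* Headroom of one int32_t, capped at 15 bits as in the description above. */
static unsigned example_hr_s32_capped(int32_t x)
{
    unsigned n = 0;
    while (n < 15) {
        int32_t bound = (int32_t)1 << (30 - n);
        if (x < -bound || x >= bound) break;
        n++;
    }
    return n;
}

/* Hypothetical model: add b (exponent b_exp) into acc (exponent a_exp) and
 * return the headroom of acc afterwards. The caller must ensure acc had at
 * least 1 bit of headroom beforehand, so that the sum cannot overflow. */
static unsigned example_accumulate(int32_t *acc, int a_exp,
                                   const int16_t *b, int b_exp, unsigned n)
{
    int shl = b_exp - a_exp;
    assert(shl >= 0 && shl <= 14);  /* restriction of this sketch */
    unsigned hr = 15;
    for (unsigned k = 0; k < n; k++) {
        acc[k] += (int32_t)b[k] * ((int32_t)1 << shl);
        unsigned h = example_hr_s32_capped(acc[k]);
        if (h < hr) hr = h;
    }
    return hr;
}

/* When headroom runs out: raise a_exp by shr (shr < 31) and scale the
 * mantissas down to match. Division truncates toward zero. */
static void example_rescale(int32_t *acc, int *a_exp, unsigned n, unsigned shr)
{
    for (unsigned k = 0; k < n; k++) {
        acc[k] /= ((int32_t)1 << shr);
    }
    *a_exp += (int)shr;
}
```

A caller would typically loop over incoming vectors, call the accumulate step, and invoke the rescale step whenever the reported headroom drops to 0.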
This function initializes each of the fields of BFP vector a.
data points to the memory buffer used to store elements of the vector, so it must be at least length*4 bytes long, and must begin at a word-aligned address.
exp is the exponent assigned to the BFP vector. The logical value associated with the kth element of the vector after initialization is \( data_k \cdot 2^{exp} \).
If calc_hr is false, a->hr is initialized to 0. Otherwise, the headroom of the BFP vector is calculated and used to initialize a->hr.
Parameters:
a – [out] BFP vector to initialize
data – [in] int32_t buffer used to back a
exp – [in] Exponent of BFP vector
length – [in] Number of elements in the BFP vector
calc_hr – [in] Boolean indicating whether the HR of the BFP vector should be calculated
Dynamically allocate a 32-bit BFP vector from the heap.
If allocation was unsuccessful, the data field of the returned vector will be NULL, and the length field will be zero. Otherwise, data will point to the allocated memory and the length field will be the user-specified length. The length argument must not be zero.
Neither the BFP exponent, headroom, nor the elements of the allocated mantissa vector are set by this function. To set the BFP vector elements to a known value, use bfp_s32_set() on the returned BFP vector.
BFP vectors allocated using this function must be deallocated using bfp_s32_dealloc() to avoid a memory leak.
To initialize a BFP vector using static memory allocation, use bfp_s32_init() instead.
This function always allocates an extra 2 elements so that bfp_fft_unpack_mono() can safely be used, but these two elements will NOT be reflected in the returned vector length.
Note
Dynamic allocation of BFP vectors relies on allocation from the heap, and offers no guarantees about the execution time. Use of this function in any time-critical section of code is highly discouraged.
Parameters:
length – [in] The length of the BFP vector to be allocated (in elements)
Deallocate a 32-bit BFP vector allocated by bfp_s32_alloc().
Use this function to free the heap memory allocated by bfp_s32_alloc().
BFP vectors whose mantissa buffer was (successfully) dynamically allocated have a flag set which indicates as much. This function can safely be called on any bfp_s32_t which has not had its flags or data manually manipulated, including:
Modify a 32-bit BFP vector to use a specified exponent.
This function forces BFP vector \(\bar A\) to use a specified exponent. The mantissa vector \(\bar a\) will be bit-shifted left or right to compensate for the changed exponent.
This function can be used, for example, before calling a fixed-point arithmetic function to ensure the underlying mantissa vector has the needed Q-format. As another example, this may be useful when communicating with peripheral devices (e.g. via I2S) that require sample data to be in a specified format.
Note that this sets the current encoding, and does not fix the exponent permanently (i.e. subsequent operations may change the exponent as usual).
If the required fixed-point Q-format is QX.Y, where Y is the number of fractional bits in the resulting mantissas, then the associated exponent (and value for parameter exp) is -Y.
a points to input BFP vector \(\bar A\), with mantissa vector \(\bar a\) and exponent \(a\_exp\). a is updated in place to produce resulting BFP vector \(\tilde{A}\) with mantissa vector \(\tilde{a}\) and exponent \(\tilde{a}\_exp\).
exp is \(\tilde{a}\_exp\), the required exponent. \(\Delta{}p = \tilde{a}\_exp - a\_exp\) is the required change in exponent.
If \(\Delta{}p = 0\), the BFP vector is left unmodified.
If \(\Delta{}p > 0\), the required exponent is larger than the current exponent and an arithmetic right-shift of \(\Delta{}p\) bits is applied to the mantissas \(\bar a\). When applying a right-shift, precision may be lost by discarding the \(\Delta{}p\) least significant bits.
If \(\Delta{}p < 0\), the required exponent is smaller than the current exponent and a left-shift of \(-\Delta{}p\) bits is applied to the mantissas \(\bar a\). When left-shifting, saturation logic will be applied such that any element that can’t be represented exactly with the new exponent will saturate to the 32-bit saturation bounds.
The exponent and headroom of a are updated by this function.
Operation Performed
\[\begin{split}\begin{aligned}
& \Delta{}p = \tilde{a}\_exp - a\_exp \\
& \tilde{a_k} \leftarrow sat_{32}( a_k \cdot 2^{-\Delta{}p} ) \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{A} \text{ (in elements) }
\end{aligned}\end{split}\]
The headroom of a vector is the number of bits its elements can be left-shifted without losing any information. It conveys information about the range of values that vector may contain, which is useful for determining how best to preserve precision in potentially lossy block floating-point operations.
In a BFP context, headroom applies to mantissas only, not exponents.
In particular, if the 32-bit mantissa vector \(\bar x\) has \(N\) bits of headroom, then for any element \(x_k\) of \(\bar x\)
\(-2^{31-N} \le x_k < 2^{31-N}\)
And for any element \(X_k = x_k \cdot 2^{x\_exp}\) of a BFP vector \(\bar X\)
\(-2^{31 + x\_exp - N} \le X_k < 2^{31 + x\_exp - N}\)
Apply a left-shift to the mantissas of a 32-bit BFP vector.
Each mantissa of input BFP vector \(\bar B\) is left-shifted b_shl bits and stored in the corresponding element of output BFP vector \(\bar A\).
This operation can be used to add or remove headroom from a BFP vector.
b_shl is the number of bits that each mantissa will be left-shifted. This shift is signed and arithmetic, so negative values for b_shl will right-shift the mantissas.
a and b must have been initialized (see bfp_s32_init()), and must be the same length.
This operation can be performed safely in-place on b.
Note that this operation bypasses the logic protecting the caller from saturation or underflows. Output values saturate to the symmetric 32-bit range (the open interval \((-2^{31}, 2^{31})\)). To avoid saturation, b_shl should be no greater than the headroom of b (b->hr).
Operation Performed
\[\begin{split}\begin{aligned}
& a_k \leftarrow sat_{32}( \lfloor b_k \cdot 2^{b\_shl} \rfloor ) \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B} \\
& \qquad\text{ and } b_k \text{ and } a_k \text{ are the } k\text{th mantissas from }
\bar{B}\text{ and } \bar{A}\text{ respectively}
\end{aligned}\end{split}\]
Parameters:
a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)
b_shl – [in] Signed arithmetic left-shift to be applied to mantissas of \(\bar B\).
Multiply one 32-bit BFP vector by another element-wise.
Multiply each element of input BFP vector \(\bar B\) by the corresponding element of input BFP vector \(\bar C\) and store the results in output BFP vector \(\bar A\).
a, b and c must have been initialized (see bfp_s32_init()), and must be the same length.
This operation can be performed safely in-place on b or c.
Operation Performed
\[\begin{split}\begin{aligned}
& A_k \leftarrow B_k \cdot C_k \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C}
\end{aligned}\end{split}\]
Sum the elements of input BFP vector \(\bar B\) to get a result \(A = a \cdot 2^{a\_exp}\), which is returned. The returned value has a 64-bit mantissa.
\[\begin{split}\begin{aligned}
& A \leftarrow \sum_{k=0}^{N-1} \left( B_k \right) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Compute the inner product of two 32-bit BFP vectors.
Adds together the element-wise products of input BFP vectors \(\bar B\) and \(\bar C\) for a result \(A = a \cdot 2^{a\_exp}\), where \(a\) is the 64-bit mantissa of the result and \(a\_exp\) is its associated exponent. \(A\) is returned.
b and c must have been initialized (see bfp_s32_init()), and must be the same length.
Operation Performed
\[\begin{split}\begin{aligned}
& a \cdot 2^{a\_exp} \leftarrow \sum_{k=0}^{N-1} \left( B_k \cdot C_k \right) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C}
\end{aligned}\end{split}\]
Parameters:
b – [in] Input BFP vector \(\bar B\)
c – [in] Input BFP vector \(\bar C\)
Returns:
\(A\), the inner product of vectors \(\bar B\) and \(\bar C\)
Clamp the elements of a 32-bit BFP vector to a specified range.
Each element \(A_k\) of output BFP vector \(\bar A\) is set to the corresponding element \(B_k\) of input BFP vector \(\bar B\) if it is in the range \( [ L \cdot 2^{bound\_exp}, U \cdot 2^{bound\_exp} ] \), otherwise it is set to the nearest value inside that range.
a and b must have been initialized (see bfp_s32_init()), and must be the same length.
This operation can be performed safely in-place on b.
Operation Performed
\[\begin{split}\begin{aligned}
& A_k \leftarrow \begin{cases}
L \cdot 2^{bound\_exp} & B_k < L \cdot 2^{bound\_exp} \\
U \cdot 2^{bound\_exp} & B_k > U \cdot 2^{bound\_exp} \\
B_k & otherwise
\end{cases} \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Parameters:
a – [out] Output BFP vector \(\bar A\)
b – [in] Input BFP vector \(\bar B\)
lower_bound – [in] Mantissa of the lower clipping bound, \(L\)
upper_bound – [in] Mantissa of the upper clipping bound, \(U\)
bound_exp – [in] Shared exponent of the clipping bounds
Each element \(A_k\) of output BFP vector \(\bar A\) is set to the corresponding element \(B_k\) of input BFP vector \(\bar B\) if it is non-negative, otherwise it is set to \(0\).
a and b must have been initialized (see bfp_s32_init()), and must be the same length.
This operation can be performed safely in-place on b.
Operation Performed
\[\begin{split}\begin{aligned}
& A_k \leftarrow \begin{cases}
0 & B_k < 0 \\
B_k & otherwise
\end{cases} \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Convert a 32-bit BFP vector into a 16-bit BFP vector.
Reduces the bit-depth of each 32-bit element \(B_k\) of input BFP vector \(\bar B\) to 16 bits, and stores the 16-bit result in the corresponding element \(A_k\) of output BFP vector \(\bar A\).
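One plausible way to reduce bit-depth while keeping as much precision as possible is to right-shift every mantissa by \(16 - hr\) bits, where \(hr\) is the headroom of the 32-bit vector, and add the same amount to the exponent. The sketch below models that approach. It is an assumption about the technique rather than a description of the library's implementation, and all names are hypothetical.

```c
#include <stdint.h>

/* Headroom of one int32_t: largest N (at most 31) such that
 * -2^(31-N) <= x < 2^(31-N). */
static unsigned example_headroom_s32(int32_t x)
{
    unsigned n = 0;
    while (n < 31) {
        int32_t bound = (int32_t)1 << (30 - n);
        if (x < -bound || x >= bound) break;
        n++;
    }
    return n;
}

/* Hypothetical reference model: convert 32-bit mantissas b[] (exponent
 * b_exp) to 16-bit mantissas a[], writing the new exponent to *a_exp. */
static void example_s32_to_s16(int16_t *a, int *a_exp,
                               const int32_t *b, int b_exp, unsigned n)
{
    /* Vector headroom is the minimum over all elements. */
    unsigned hr = 31;
    for (unsigned k = 0; k < n; k++) {
        unsigned h = example_headroom_s32(b[k]);
        if (h < hr) hr = h;
    }

    /* With hr >= 16 every element already fits in 16 bits. */
    unsigned shr = (hr < 16) ? 16 - hr : 0;

    for (unsigned k = 0; k < n; k++) {
        /* Division truncates toward zero; the library may round differently. */
        a[k] = (int16_t)(b[k] / ((int32_t)1 << shr));
    }
    *a_exp = b_exp + (int)shr;
}
```

For instance, the mantissa 65536 has 14 bits of headroom as a 32-bit value, so it is shifted right by 2 bits to 16384 and the exponent increases by 2.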
Sum the absolute values of elements of a 32-bit BFP vector.
Sum the absolute values of elements of input BFP vector \(\bar B\) for a result \(A = a \cdot 2^{a\_exp}\), where \(a\) is a 64-bit mantissa and \(a\_exp\) is its associated exponent. \(A\) is returned.
\[\begin{split}\begin{aligned}
& A \leftarrow \sum_{k=0}^{N-1} \left| B_k \right| \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Parameters:
b – [in] Input BFP vector \(\bar B\)
Returns:
\(A\), the sum of absolute values of elements of \(\bar B\)
Computes \(A = a \cdot 2^{a\_exp}\), the mean value of elements of input BFP vector \(\bar B\), where \(a\) is the 32-bit mantissa of the result, and \(a\_exp\) is its associated exponent. \(A\) is returned.
\[\begin{split}\begin{aligned}
& A \leftarrow \frac{1}{N} \sum_{k=0}^{N-1} \left( B_k \right) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Get the energy (sum of squared elements) of a 32-bit BFP vector.
Computes \(A = a \cdot 2^{a\_exp}\), the sum of squares of elements of input BFP vector \(\bar B\), where \(a\) is the 64-bit mantissa of the result, and \(a\_exp\) is its associated exponent. \(A\) is returned.
\[\begin{split}\begin{aligned}
& A \leftarrow \sum_{k=0}^{N-1} \left( B_k^2 \right) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Get the RMS value of elements of a 32-bit BFP vector.
Computes \(A = a \cdot 2^{a\_exp}\), the RMS value of elements of input BFP vector \(\bar B\), where \(a\) is the 32-bit mantissa of the result, and \(a\_exp\) is its associated exponent. \(A\) is returned.
The RMS (root-mean-square) value of a vector is the square root of the mean of the squares of the vector’s elements.
\[\begin{split}\begin{aligned}
& A \leftarrow \sqrt{\frac{1}{N}\sum_{k=0}^{N-1} \left( B_k^2 \right) } \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
\[\begin{split}\begin{aligned}
& A \leftarrow max\left(B_0, B_1, ..., B_{N-1} \right) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Get the element-wise maximum of two 32-bit BFP vectors.
Each element of output vector \(\bar A\) is set to the maximum of the corresponding elements in the input vectors \(\bar B\) and \(\bar C\).
a, b and c must have been initialized (see bfp_s32_init()), and must be the same length.
This operation can be performed safely in-place on b, but not on c.
Operation Performed
\[\begin{split}\begin{aligned}
& A_k \leftarrow max(B_k, C_k) \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C}
\end{aligned}\end{split}\]
\[\begin{split}\begin{aligned}
& A \leftarrow min\left(B_0, B_1, ..., B_{N-1} \right) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Get the element-wise minimum of two 32-bit BFP vectors.
Each element of output vector \(\bar A\) is set to the minimum of the corresponding elements in the input vectors \(\bar B\) and \(\bar C\).
a, b and c must have been initialized (see bfp_s32_init()), and must be the same length.
This operation can be performed safely in-place on b, but not on c.
Operation Performed
\[\begin{split}\begin{aligned}
& A_k \leftarrow min(B_k, C_k) \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C}
\end{aligned}\end{split}\]
Get the index of the maximum value of a 32-bit BFP vector.
Finds \(a\), the index of the maximum value among the elements of input BFP vector \(\bar B\). \(a\) is returned by this function.
If i is the value returned, then the maximum value in \(\bar B\) is ldexp(b->data[i],b->exp).
Operation Performed
\[\begin{split}\begin{aligned}
& a \leftarrow argmax_k\left(b_k\right) \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Notes
If there is a tie for maximum value, the lowest tying index is returned.
Parameters:
b – [in] Input vector
Returns:
\(a\), the index of the maximum value from \(\bar B\)
Get the index of the minimum value of a 32-bit BFP vector.
Finds \(a\), the index of the minimum value among the elements of input BFP vector \(\bar B\). \(a\) is returned by this function.
If i is the value returned, then the minimum value in \(\bar B\) is ldexp(b->data[i],b->exp).
Operation Performed
\[\begin{split}\begin{aligned}
& a \leftarrow argmin_k\left(b_k\right) \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}
\end{aligned}\end{split}\]
Notes
If there is a tie for minimum value, the lowest tying index is returned.
Parameters:
b – [in] Input vector
Returns:
\(a\), the index of the minimum value from \(\bar B\)
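Because every element of a BFP vector shares one exponent, the two index searches above can be modelled by comparing mantissas alone. The sketch below is a hypothetical reference model (ref_argmax and ref_argmin are not library functions) that reproduces the documented tie-breaking rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the largest mantissa; the strict '>' keeps the lowest index on ties. */
size_t ref_argmax(const int32_t *b, size_t n) {
    assert(n > 0);
    size_t best = 0;
    for (size_t k = 1; k < n; k++) {
        if (b[k] > b[best]) best = k;
    }
    return best;
}

/* Index of the smallest mantissa; the strict '<' keeps the lowest index on ties. */
size_t ref_argmin(const int32_t *b, size_t n) {
    assert(n > 0);
    size_t best = 0;
    for (size_t k = 1; k < n; k++) {
        if (b[k] < b[best]) best = k;
    }
    return best;
}
```

For the mantissas {5, 9, -2, 9, -2}, ref_argmax returns 1 and ref_argmin returns 2, the first of each tied pair.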
Convolve a 32-bit BFP vector with a short convolution kernel (“valid” mode).
Input BFP vector \(\bar X\) is convolved with a short fixed-point convolution kernel \(\bar b\) to produce output BFP vector \(\bar Y\). In other words, this function applies the \(K\)-tap FIR filter with coefficients given by \(\bar b\) to the input signal \(\bar X\). The convolution is “valid” in the sense that no output elements are emitted where the filter taps extend beyond the bounds of the input vector, resulting in an output vector \(\bar Y\) with fewer elements.
The maximum number of filter taps \(K\) supported by this function is \(7\).
y is the output vector \(\bar Y\). If input \(\bar X\) has \(N\) elements, and the filter has \(K\) coefficients, then \(\bar Y\) has \(N-2P\) elements, where \(P = \lfloor K / 2 \rfloor\).
x is the input vector \(\bar X\) with \(N\) elements.
b_q30[] is the vector \(\bar b\) of filter coefficients. The coefficients of \(\bar b\) are encoded in a Q2.30 fixed-point format. The effective value of the \(i\)th coefficient is then \(b_i \cdot 2^{-30}\).
b_length is the length \(K\) of \(\bar b\) in elements (i.e. the number of filter taps). b_length must be one of \( \{ 1, 3, 5, 7 \} \).
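As a rough illustration of the “valid” mode described above, the sketch below works directly on mantissas. It is a hypothetical reference model (ref_convolve_valid is not a library function) and makes several assumptions: the kernel is applied without reversal, the result is truncated toward zero, the output keeps the input's exponent, and the input has enough headroom that the 64-bit accumulator cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes N - 2P outputs to y and returns that count, where P = K / 2. */
size_t ref_convolve_valid(int32_t *y, const int32_t *x, size_t n,
                          const int32_t *b_q30, size_t b_length) {
    assert(b_length == 1 || b_length == 3 || b_length == 5 || b_length == 7);
    size_t p = b_length / 2;
    assert(n > 2 * p);
    size_t out_len = n - 2 * p;
    for (size_t k = 0; k < out_len; k++) {
        int64_t acc = 0;
        for (size_t l = 0; l < b_length; l++) {
            acc += (int64_t)x[k + l] * b_q30[l];   /* Q2.30 coefficient */
        }
        y[k] = (int32_t)(acc / ((int64_t)1 << 30)); /* remove the 2^30 scale */
    }
    return out_len;
}
```

With the 3-tap smoothing kernel {0.25, 0.5, 0.25}, encoded as {1 << 28, 1 << 29, 1 << 28}, an input of {4, 8, 12, 16, 20} yields the three outputs {8, 12, 16}.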
Convolve a 32-bit BFP vector with a short convolution kernel (“same” mode).
Input BFP vector \(\bar X\) is convolved with a short fixed-point convolution kernel \(\bar b\) to produce output BFP vector \(\bar Y\). In other words, this function applies the \(K\)-tap FIR filter with coefficients given by \(\bar b\) to the input signal \(\bar X\). The convolution mode is “same” in that the input vector is effectively padded such that the input and output vectors are the same length. The padding behavior is one of those given by pad_mode_e.
The maximum number of filter taps \(K\) supported by this function is \(7\).
y and x are the output and input BFP vectors \(\bar Y\) and \(\bar X\) respectively.
b_q30[] is the vector \(\bar b\) of filter coefficients. The coefficients of \(\bar b\) are encoded in a Q2.30 fixed-point format. The effective value of the \(i\)th coefficient is then \(b_i \cdot 2^{-30}\).
b_length is the length \(K\) of \(\bar b\) in elements (i.e. the number of filter taps). b_length must be one of \( \{ 1, 3, 5, 7 \} \).
padding_mode is one of the values from the pad_mode_e enumeration. The padding mode indicates the filter input values for filter taps that have extended beyond the bounds of the input vector \(\bar X\). See pad_mode_e for a list of supported padding modes and associated behaviors.
Operation Performed
\[\begin{split}\begin{aligned}
& \tilde{x}_i = \begin{cases}
\text{determined by padding mode} & i < 0 \\
\text{determined by padding mode} & i \ge N \\
x_i & otherwise \end{cases} \\
& y_k \leftarrow \sum_{l=0}^{K-1} (\tilde{x}_{(k+l-P)} \cdot b_l \cdot 2^{-30} ) \\
& \qquad\text{ for }k\in 0\ ...\ (N-1) \\
& \qquad\text{ where }P = \lfloor K/2 \rfloor
\end{aligned}\end{split}\]
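The “same” mode formula can be sketched in the same way. Zero padding is used here purely as an illustrative assumption; pad_mode_e defines the padding modes the library actually supports. ref_convolve_same_zero is a hypothetical name, results are truncated toward zero, and accumulator overflow is assumed not to occur.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes N outputs to y; taps that fall outside x read as zero. */
void ref_convolve_same_zero(int32_t *y, const int32_t *x, size_t n,
                            const int32_t *b_q30, size_t b_length) {
    assert(b_length == 1 || b_length == 3 || b_length == 5 || b_length == 7);
    ptrdiff_t p = (ptrdiff_t)(b_length / 2);
    for (size_t k = 0; k < n; k++) {
        int64_t acc = 0;
        for (size_t l = 0; l < b_length; l++) {
            ptrdiff_t i = (ptrdiff_t)k + (ptrdiff_t)l - p;  /* index k + l - P */
            int64_t xi = (i < 0 || i >= (ptrdiff_t)n) ? 0 : x[i];
            acc += xi * b_q30[l];
        }
        y[k] = (int32_t)(acc / ((int64_t)1 << 30));
    }
}
```

For the input {4, 8, 12, 16, 20} and kernel {1 << 28, 1 << 29, 1 << 28}, the interior outputs match the “valid” result and the two edge outputs are reduced by the zero padding, giving {4, 8, 12, 16, 14}.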
This function initializes each of the fields of BFP vector a.
Unlike complex 32-bit BFP vectors (bfp_complex_s32_t), complex 16-bit BFP vectors store the real and imaginary parts of elements’ mantissas in separate memory buffers for the sake of various optimizations.
real_data points to the memory buffer used to store the real part of each mantissa. It must be at least length*2 bytes long, and must begin at a word-aligned address.
imag_data points to the memory buffer used to store the imaginary part of each mantissa. It must be at least length*2 bytes long, and must begin at a word-aligned address.
exp is the exponent assigned to the BFP vector. The logical value associated with the kth element of the vector after initialization is \( data_k \cdot 2^{exp} \).
If calc_hr is false, a->hr is initialized to 0. Otherwise, the headroom of the BFP vector is calculated and used to initialize a->hr.
Parameters:
a – [out] BFP vector to initialize
real_data – [in] int16_t buffer used to back the real part of a
imag_data – [in] int16_t buffer used to back the imaginary part of a
exp – [in] Exponent of BFP vector
length – [in] Number of elements in BFP vector
calc_hr – [in] Boolean indicating whether the HR of the BFP vector should be calculated
Dynamically allocate a complex 16-bit BFP vector from the heap.
If allocation was unsuccessful, the real and imag fields of the returned vector will be NULL, and the length field will be zero. Otherwise, real and imag will point to the allocated memory and the length field will be the user-specified length. The length argument must not be zero.
This function allocates a single block of memory for both the real and imaginary parts of the BFP vector. Because all BFP functions require the mantissa buffers to begin at a word-aligned address, if length is odd, this function will allocate an extra int16_t element for the buffer.
Neither the BFP exponent, headroom, nor the elements of the allocated mantissa vector are set by this function. To set the BFP vector elements to a known value, use bfp_complex_s16_set() on the returned BFP vector.
BFP vectors allocated using this function must be deallocated using bfp_complex_s16_dealloc() to avoid a memory leak.
To initialize a BFP vector using static memory allocation, use bfp_complex_s16_init() instead.
Dynamic allocation of BFP vectors relies on allocation from the heap, and offers no guarantees about the execution time. Use of this function in any time-critical section of code is highly discouraged.
Parameters:
length – [in] The length of the BFP vector to be allocated (in elements)
BFP vectors whose mantissa buffer was (successfully) dynamically allocated have a flag set which indicates as much. This function can safely be called on any bfp_complex_s16_t which has not had its flags or real manually manipulated, including:
Modify a complex 16-bit BFP vector to use a specified exponent.
This function forces complex BFP vector \(\bar A\) to use a specified exponent. The mantissa vector \(\bar a\) will be bit-shifted left or right to compensate for the changed exponent.
This function can be used, for example, before calling a fixed-point arithmetic function to ensure the underlying mantissa vector has the needed Q-format. As another example, this may be useful when communicating with peripheral devices (e.g. via I2S) that require sample data to be in a specified format.
Note that this sets the current encoding, and does not fix the exponent permanently (i.e. subsequent operations may change the exponent as usual).
If the required fixed-point Q-format is QX.Y, where Y is the number of fractional bits in the resulting mantissas, then the associated exponent (and value for parameter exp) is -Y.
a points to input BFP vector \(\bar A\), with complex mantissa vector \(\bar a\) and exponent \(a\_exp\). a is updated in place to produce resulting BFP vector \(\bar{\tilde{A}}\) with complex mantissa vector \(\bar{\tilde{a}}\) and exponent \(\tilde{a}\_exp\).
exp is \(\tilde{a}\_exp\), the required exponent. \(\Delta{}p = \tilde{a}\_exp - a\_exp\) is the required change in exponent.
If \(\Delta{}p = 0\), the BFP vector is left unmodified.
If \(\Delta{}p > 0\), the required exponent is larger than the current exponent and an arithmetic right-shift of \(\Delta{}p\) bits is applied to the mantissas \(\bar a\). When applying a right-shift, precision may be lost by discarding the \(\Delta{}p\) least significant bits.
If \(\Delta{}p < 0\), the required exponent is smaller than the current exponent and a left-shift of \(-\Delta{}p\) bits is applied to the mantissas \(\bar a\). When left-shifting, saturation logic will be applied such that any element that can’t be represented exactly with the new exponent will saturate to the 16-bit saturation bounds.
The exponent and headroom of a are updated by this function.
Operation Performed
\[\begin{split}\begin{aligned}
& \Delta{}p = \tilde{a}\_exp - a\_exp \\
& \tilde{a_k} \leftarrow sat_{16}( a_k \cdot 2^{-\Delta{}p} ) \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{A} \text{ (in elements) }
\end{aligned}\end{split}\]
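The exponent change can be modelled on a single buffer of 16-bit mantissas; for a complex vector the same step would be applied to both the real and imaginary buffers. This is a hypothetical sketch (ref_use_exponent is not a library function), and the symmetric saturation bounds of ±32767 are an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp to the symmetric 16-bit range (assumed bounds). */
static int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < -INT16_MAX) return -INT16_MAX;
    return (int16_t)v;
}

/* floor(v / 2^s), written so that no negative value is ever right-shifted. */
static int32_t floor_shr(int32_t v, int s) {
    return (v >= 0) ? (v >> s) : -(((-v) + ((int32_t)1 << s) - 1) >> s);
}

/* Re-encode n mantissas from exponent *exp to new_exp. */
void ref_use_exponent(int16_t *a, size_t n, int *exp, int new_exp) {
    int dp = new_exp - *exp;            /* required change in exponent */
    assert(dp > -16 && dp < 16);
    for (size_t k = 0; k < n; k++) {
        int32_t v = a[k];
        if (dp >= 0) v = floor_shr(v, dp);        /* may discard low bits */
        else         v = v * ((int32_t)1 << -dp); /* may saturate below   */
        a[k] = sat16(v);
    }
    *exp = new_exp;
}
```

Starting from mantissas {100, -100, 20000, -3} at exponent -10, moving to exponent -8 gives {25, -25, 5000, -1}, which shows the precision lost on -3. Moving from exponent -10 to -11 instead would double each mantissa, and 20000 would saturate to 32767.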
The headroom of a complex vector is the number of bits that the real and imaginary parts of each of its elements can be left-shifted without losing any information. It conveys information about the range of values that vector may contain, which is useful for determining how best to preserve precision in potentially lossy block floating-point operations.
In a BFP context, headroom applies to mantissas only, not exponents.
In particular, if the complex 16-bit mantissa vector \(\bar x\) has \(N\) bits of headroom, then for any element \(x_k\) of \(\bar x\)
\(-2^{15-N} \le Re\{x_k\} < 2^{15-N}\)
and
\(-2^{15-N} \le Im\{x_k\} < 2^{15-N}\)
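And for any element \(X_k = x_k \cdot 2^{x\_exp}\) of a complex BFP vector \(\bar X\)
\(-2^{15+x\_exp-N} \le Re\{X_k\} < 2^{15+x\_exp-N}\)
and
\(-2^{15+x\_exp-N} \le Im\{X_k\} < 2^{15+x\_exp-N}\)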
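The mantissa bounds above translate directly into a search for the largest \(N\) that every real and imaginary part satisfies. The sketch below is a hypothetical reference model rather than the library's method, and it assumes headroom is capped at 15 bits for 16-bit mantissas.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Largest hr in 0..15 such that -2^(15-hr) <= v < 2^(15-hr). */
static unsigned ref_hr_s16(int16_t v) {
    unsigned hr = 0;
    while (hr < 15) {
        int32_t bound = (int32_t)1 << (15 - (hr + 1));
        if (v < -bound || v >= bound) break;  /* one more bit would not fit */
        hr++;
    }
    return hr;
}

/* Headroom of a complex vector: the minimum over all real and imaginary parts. */
unsigned ref_complex_headroom_s16(const int16_t *real, const int16_t *imag,
                                  size_t n) {
    assert(n > 0);
    unsigned hr = 15;
    for (size_t k = 0; k < n; k++) {
        unsigned r = ref_hr_s16(real[k]);
        unsigned i = ref_hr_s16(imag[k]);
        if (r < hr) hr = r;
        if (i < hr) hr = i;
    }
    return hr;
}
```

For real parts {100, -3} and imaginary parts {0, 1000}, the element 1000 dominates: it needs 10 magnitude bits, so the vector has 5 bits of headroom.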
Apply a left-shift to the mantissas of a complex 16-bit BFP vector.
Each complex mantissa of input BFP vector \(\bar B\) is left-shifted b_shl bits and stored in the corresponding element of output BFP vector \(\bar A\).
This operation can be used to add or remove headroom from a BFP vector.
b_shl is the number of bits that the real and imaginary parts of each mantissa will be left-shifted. This shift is signed and arithmetic, so negative values for b_shl will right-shift the mantissas.
a and b must have been initialized (see bfp_complex_s16_init()), and must be the same length.
This operation can be performed safely in-place on b.
Note that this operation bypasses the logic protecting the caller from saturation or underflows. Output values saturate to the symmetric 16-bit range (the open interval \((-2^{15}, 2^{15})\)). To avoid saturation, b_shl should be no greater than the headroom of b (b->hr).
Operation Performed
\[\begin{split}\begin{aligned}
& Re\{a_k\} \leftarrow sat_{16}( \lfloor Re\{b_k\} \cdot 2^{b\_shl} \rfloor ) \\
& Im\{a_k\} \leftarrow sat_{16}( \lfloor Im\{b_k\} \cdot 2^{b\_shl} \rfloor ) \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B} \\
& \qquad\text{ and } b_k \text{ and } a_k \text{ are the } k\text{th mantissas from }
\bar{B}\text{ and } \bar{A}\text{ respectively}
\end{aligned}\end{split}\]
Parameters:
a – [out] Complex output BFP vector \(\bar A\)
b – [in] Complex input BFP vector \(\bar B\)
b_shl – [in] Signed arithmetic left-shift to be applied to mantissas of \(\bar B\).
Multiply a complex 16-bit BFP vector element-wise by a real 16-bit BFP vector.
Each complex output element \(A_k\) of complex output BFP vector \(\bar A\) is set to the complex product of \(B_k\) and \(C_k\), the corresponding elements of complex input BFP vector \(\bar B\) and real input BFP vector \(\bar C\) respectively.
Multiply one complex 16-bit BFP vector element-wise by another.
Each complex output element \(A_k\) of complex output BFP vector \(\bar A\) is set to the complex product of \(B_k\) and \(C_k\), the corresponding elements of complex input BFP vectors \(\bar B\) and \(\bar C\) respectively.
a, b and c must have been initialized (see bfp_complex_s16_init()), and must be the same length.
This operation can be performed safely in-place on b or c.
Operation Performed
\[\begin{split}\begin{aligned}
& A_k \leftarrow B_k \cdot C_k \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C}
\end{aligned}\end{split}\]
Multiply one complex 16-bit BFP vector element-wise by the complex conjugate of another.
Each complex output element \(A_k\) of complex output BFP vector \(\bar A\) is set to the complex product of \(B_k\), the corresponding element of complex input BFP vector \(\bar B\), and \((C_k)^*\), the complex conjugate of the corresponding element of complex input BFP vector \(\bar C\).
Operation Performed
\[\begin{split}\begin{aligned}
& A_k \leftarrow B_k \cdot (C_k)^* \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C} \\
& \qquad\text{and } (C_k)^* \text{ is the complex conjugate of } C_k
\end{aligned}\end{split}\]
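The element-wise product and its conjugate variant differ only in the sign of the imaginary part of \(C_k\). The sketch below is a hypothetical reference model (ref_complex_mul is not a library function) that keeps full-precision 64-bit mantissa products and reports the resulting exponent as the sum of the input exponents; the library's reduction back to 16-bit mantissas is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Computes A_k = B_k * C_k, or B_k * conj(C_k) when conjugate_c is non-zero.
 * Returns the exponent that applies to the output mantissas. */
int ref_complex_mul(int64_t *a_re, int64_t *a_im,
                    const int16_t *b_re, const int16_t *b_im, int b_exp,
                    const int16_t *c_re, const int16_t *c_im, int c_exp,
                    size_t n, int conjugate_c) {
    assert(n > 0);
    for (size_t k = 0; k < n; k++) {
        /* Conjugation negates the imaginary part of C_k. */
        int64_t ci = conjugate_c ? -(int64_t)c_im[k] : (int64_t)c_im[k];
        a_re[k] = (int64_t)b_re[k] * c_re[k] - (int64_t)b_im[k] * ci;
        a_im[k] = (int64_t)b_re[k] * ci + (int64_t)b_im[k] * c_re[k];
    }
    return b_exp + c_exp;
}
```

For mantissas \(B_k = 3 + 4i\) and \(C_k = 1 - 2i\), the plain product is \(11 - 2i\) and the conjugate product is \(-5 + 10i\).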
Multiply a complex 16-bit BFP vector by a real scalar.
Each complex output element \(A_k\) of complex output BFP vector \(\bar A\) is set to the complex product of \(B_k\), the corresponding element of complex input BFP vector \(\bar B\), and real scalar \(\alpha\cdot 2^{\alpha\_exp}\), where \(\alpha\) and \(\alpha\_exp\) are the mantissa and exponent respectively of parameter alpha.
a and b must have been initialized (see bfp_complex_s16_init()), and must be the same length.
This operation can be performed safely in-place on b.
Multiply a complex 16-bit BFP vector by a complex scalar.
Each complex output element \(A_k\) of complex output BFP vector \(\bar A\) is set to the complex product of \(B_k\), the corresponding element of complex input BFP vector \(\bar B\), and complex scalar \(\alpha\cdot 2^{\alpha\_exp}\), where \(\alpha\) and \(\alpha\_exp\) are the complex mantissa and exponent respectively of parameter alpha.
a and b must have been initialized (see bfp_complex_s16_init()), and must be the same length.
This operation can be performed safely in-place on b.
Each complex output element \(A_k\) of complex output BFP vector \(\bar A\) is set to the sum of \(B_k\) and \(C_k\), the corresponding elements of complex input BFP vectors \(\bar B\) and \(\bar C\) respectively.
a, b and c must have been initialized (see bfp_complex_s16_init()), and must be the same length.
This operation can be performed safely in-place on b or c.
Subtract one complex 16-bit BFP vector from another.
Each complex output element \(A_k\) of complex output BFP vector \(\bar A\) is set to the difference between \(B_k\) and \(C_k\), the corresponding elements of complex input BFP vectors \(\bar B\) and \(\bar C\) respectively.
a, b and c must have been initialized (see bfp_complex_s16_init()), and must be the same length.
This operation can be performed safely in-place on b or c.
Convert a complex 16-bit BFP vector to a complex 32-bit BFP vector.
Each complex 32-bit output element \(A_k\) of complex output BFP vector \(\bar A\) is set to the value of \(B_k\), the corresponding element of complex 16-bit input BFP vector \(\bar B\), sign-extended to 32 bits.
Get the squared magnitude of each element of a complex 16-bit BFP vector.
Each element \(A_k\) of real output BFP vector \(\bar A\) is set to the squared magnitude of \(B_k\), the corresponding element of complex input BFP vector \(\bar B\).
\[\begin{split}\begin{aligned}
& A_k \leftarrow B_k \cdot (B_k)^* \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B} \\
& \qquad\text{ and } (B_k)^* \text{ is the complex conjugate of } B_k
\end{aligned}\end{split}\]
Get the magnitude of each element of a complex 16-bit BFP vector.
Each element \(A_k\) of real output BFP vector \(\bar A\) is set to the magnitude of \(B_k\), the corresponding element of complex input BFP vector \(\bar B\).
Get the sum of elements of a complex 16-bit BFP vector.
The elements of complex input BFP vector \(\bar B\) are summed together. The result is a complex 32-bit floating-point scalar \(a\), which is returned.
Get the complex conjugate of each element of a complex 16-bit BFP vector.
Each element \(A_k\) of complex output BFP vector \(\bar A\) is set to the complex conjugate of \(B_k\), the corresponding element of complex input BFP vector \(\bar B\).
Operation Performed
\[\begin{split}\begin{aligned}
& A_k \leftarrow B_k^* \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B} \\
& \qquad\text{and } B_k^* \text{ is the complex conjugate of } B_k
\end{aligned}\end{split}\]
This function initializes each of the fields of a.
Unlike bfp_complex_s16_t, complex 32-bit BFP vectors use a single buffer to store the real and imaginary parts of each mantissa, such that the imaginary part of element k follows the real part of element k in memory. data points to the memory buffer used to store elements of the vector, and must be at least length*8 bytes long.
exp is the exponent assigned to the BFP vector. The logical value associated with the kth complex element of the vector after initialization will be \( \left(data_{2k} + i\cdot data_{2k+1} \right)\cdot2^{exp} \).
If calc_hr is false, a->hr is initialized to 0. Otherwise, the headroom of the BFP vector is calculated and used to initialize a->hr.
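The interleaved layout described above can be illustrated by decoding one element from a flat buffer. This is a hypothetical sketch: ref_cplx_t and ref_complex_s32_element are illustrative names, and the buffer is viewed as plain int32_t words rather than through the library's own types.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    double re;
    double im;
} ref_cplx_t;

/* Logical value of element k: (data[2k] + i * data[2k+1]) * 2^exp. */
ref_cplx_t ref_complex_s32_element(const int32_t *data, size_t length,
                                   int exp, size_t k) {
    assert(k < length);
    ref_cplx_t v;
    v.re = ldexp((double)data[2 * k], exp);      /* real part comes first  */
    v.im = ldexp((double)data[2 * k + 1], exp);  /* imaginary part follows */
    return v;
}
```

A 3-element vector therefore needs six int32_t words, which is the length*8 bytes stated above. With the buffer {1, 2, 3, 4, 5, 6} and exponent 1, element 1 decodes to \(6 + 8i\).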
Dynamically allocate a complex 32-bit BFP vector from the heap.
If allocation was unsuccessful, the data field of the returned vector will be NULL, and the length field will be zero. Otherwise, data will point to the allocated memory and the length field will be the user-specified length. The length argument must not be zero.
Neither the BFP exponent, headroom, nor the elements of the allocated mantissa vector are set by this function. To set the BFP vector elements to a known value, use bfp_complex_s32_set() on the returned BFP vector.
BFP vectors allocated using this function must be deallocated using bfp_complex_s32_dealloc() to avoid a memory leak.
To initialize a BFP vector using static memory allocation, use bfp_complex_s32_init() instead.
Dynamic allocation of BFP vectors relies on allocation from the heap, and offers no guarantees about the execution time. Use of this function in any time-critical section of code is highly discouraged.
Parameters:
length – [in] The length of the BFP vector to be allocated (in elements)
BFP vectors whose mantissa buffer was (successfully) dynamically allocated have a flag set which indicates as much. This function can safely be called on any bfp_complex_s32_t which has not had its flags or data manually manipulated, including:
Modify a complex 32-bit BFP vector to use a specified exponent.
This function forces complex BFP vector \(\bar A\) to use a specified exponent. The mantissa vector \(\bar a\) will be bit-shifted left or right to compensate for the changed exponent.
This function can be used, for example, before calling a fixed-point arithmetic function to ensure the underlying mantissa vector has the needed Q-format. As another example, this may be useful when communicating with peripheral devices (e.g. via I2S) that require sample data to be in a specified format.
Note that this sets the current encoding, and does not fix the exponent permanently (i.e. subsequent operations may change the exponent as usual).
If the required fixed-point Q-format is QX.Y, where Y is the number of fractional bits in the resulting mantissas, then the associated exponent (and value for parameter exp) is -Y.
a points to input BFP vector \(\bar A\), with complex mantissa vector \(\bar a\) and exponent \(a\_exp\). a is updated in place to produce resulting BFP vector \( \tilde{A} \) with complex mantissa vector \( \tilde{a} \) and exponent \(\tilde{a}\_exp\).
exp is \(\tilde{a}\_exp\), the required exponent. \(\Delta{}p = \tilde{a}\_exp - a\_exp\) is the required change in exponent.
If \(\Delta{}p = 0\), the BFP vector is left unmodified.
If \(\Delta{}p > 0\), the required exponent is larger than the current exponent and an arithmetic right-shift of \(\Delta{}p\) bits is applied to the mantissas \(\bar a\). When applying a right-shift, precision may be lost by discarding the \(\Delta{}p\) least significant bits.
If \(\Delta{}p < 0\), the required exponent is smaller than the current exponent and a left-shift of \(-\Delta{}p\) bits is applied to the mantissas \(\bar a\). When left-shifting, saturation logic will be applied such that any element that can’t be represented exactly with the new exponent will saturate to the 32-bit saturation bounds.
The exponent and headroom of a are updated by this function.
Operation Performed
\[\begin{split}\begin{aligned}
& \Delta{}p = \tilde{a}\_exp - a\_exp \\
& \tilde{a_k} \leftarrow sat_{32}( a_k \cdot 2^{-\Delta{}p} ) \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{A} \text{ (in elements) }
\end{aligned}\end{split}\]
The headroom of a complex vector is the number of bits that the real and imaginary parts of each of its elements can be left-shifted without losing any information. It conveys information about the range of values that vector may contain, which is useful for determining how best to preserve precision in potentially lossy block floating-point operations.
In a BFP context, headroom applies to mantissas only, not exponents.
In particular, if the complex 32-bit mantissa vector \(\bar x\) has \(N\) bits of headroom, then for any element \(x_k\) of \(\bar x\)
\(-2^{31-N} \le Re\{x_k\} < 2^{31-N}\)
and
\(-2^{31-N} \le Im\{x_k\} < 2^{31-N}\)
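And for any element \(X_k = x_k \cdot 2^{x\_exp}\) of a complex BFP vector \(\bar X\)
\(-2^{31+x\_exp-N} \le Re\{X_k\} < 2^{31+x\_exp-N}\)
and
\(-2^{31+x\_exp-N} \le Im\{X_k\} < 2^{31+x\_exp-N}\)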
Apply a left-shift to the mantissas of a complex 32-bit BFP vector.
Each complex mantissa of input BFP vector \(\bar B\) is left-shifted b_shl bits and stored in the corresponding element of output BFP vector \(\bar A\).
This operation can be used to add or remove headroom from a BFP vector.
b_shl is the number of bits that the real and imaginary parts of each mantissa will be left-shifted. This shift is signed and arithmetic, so negative values for b_shl will right-shift the mantissas.
a and b must have been initialized (see bfp_complex_s32_init()), and must be the same length.
This operation can be performed safely in-place on b.
Note that this operation bypasses the logic protecting the caller from saturation or underflows. Output values saturate to the symmetric 32-bit range (the open interval \((-2^{31}, 2^{31})\)). To avoid saturation, b_shl should be no greater than the headroom of b (b->hr).
Operation Performed
\[\begin{split}\begin{aligned}
& Re\{a_k\} \leftarrow sat_{32}( \lfloor Re\{b_k\} \cdot 2^{b\_shl} \rfloor ) \\
& Im\{a_k\} \leftarrow sat_{32}( \lfloor Im\{b_k\} \cdot 2^{b\_shl} \rfloor ) \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B} \\
& \qquad\text{ and } b_k \text{ and } a_k \text{ are the } k\text{th mantissas from }
\bar{B}\text{ and } \bar{A}\text{ respectively}
\end{aligned}\end{split}\]
Parameters:
a – [out] Complex output BFP vector \(\bar A\)
b – [in] Complex input BFP vector \(\bar B\)
b_shl – [in] Signed arithmetic left-shift to be applied to mantissas of \(\bar B\).
Multiply a complex 32-bit BFP vector element-wise by a real 32-bit BFP vector.
Each complex output element \(A_k\) of complex output BFP vector \(\bar A\) is set to the complex product of \(B_k\) and \(C_k\), the corresponding elements of complex input BFP vector \(\bar B\) and real input BFP vector \(\bar C\) respectively.
Multiply one complex 32-bit BFP vector element-wise by another.
Each complex output element \(A_k\) of complex output BFP vector \(\bar A\) is set to the complex product of \(B_k\) and \(C_k\), the corresponding elements of complex input BFP vectors \(\bar B\) and \(\bar C\) respectively.
a, b and c must have been initialized (see bfp_complex_s32_init()), and must be the same length.
This operation can be performed safely in-place on b or c.
Operation Performed
\[\begin{split}\begin{aligned}
& A_k \leftarrow B_k \cdot C_k \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C}
\end{aligned}\end{split}\]
Multiply one complex 32-bit BFP vector element-wise by the complex conjugate of another.
Each complex output element \(A_k\) of complex output BFP vector \(\bar A\) is set to the complex product of \(B_k\), the corresponding element of complex input BFP vector \(\bar B\), and \((C_k)^*\), the complex conjugate of the corresponding element of complex input BFP vector \(\bar C\).
a, b and c must have been initialized (see bfp_complex_s32_init()), and must be the same length.
This operation can be performed safely in-place on b or c.
Operation Performed
\[\begin{split}\begin{aligned}
& A_k \leftarrow B_k \cdot (C_k)^* \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B}\text{ and }\bar{C} \\
& \qquad\text{and } (C_k)^* \text{ is the complex conjugate of } C_k
\end{aligned}\end{split}\]
Multiply a complex 32-bit BFP vector by a real scalar.
Each complex output element \(A_k\) of complex output BFP vector \(\bar A\) is set to the complex product of \(B_k\), the corresponding element of complex input BFP vector \(\bar B\), and real scalar \(\alpha\cdot 2^{\alpha\_exp}\), where \(\alpha\) and \(\alpha\_exp\) are the mantissa and exponent respectively of parameter alpha.
a and b must have been initialized (see bfp_complex_s32_init()), and must be the same length.
This operation can be performed safely in-place on b.
Multiply a complex 32-bit BFP vector by a complex scalar.
Each complex output element \(A_k\) of complex output BFP vector \(\bar A\) is set to the complex product of \(B_k\), the corresponding element of complex input BFP vector \(\bar B\), and complex scalar \(\alpha\cdot 2^{\alpha\_exp}\), where \(\alpha\) and \(\alpha\_exp\) are the complex mantissa and exponent respectively of parameter alpha.
a and b must have been initialized (see bfp_complex_s32_init()), and must be the same length.
This operation can be performed safely in-place on b.
Each complex output element \(A_k\) of complex output BFP vector \(\bar A\) is set to the sum of \(B_k\) and \(C_k\), the corresponding elements of complex input BFP vectors \(\bar B\) and \(\bar C\) respectively.
a, b and c must have been initialized (see bfp_complex_s32_init()), and must be the same length.
This operation can be performed safely in-place on b or c.
Subtract one complex 32-bit BFP vector from another.
Each complex output element \(A_k\) of complex output BFP vector \(\bar A\) is set to the difference between \(B_k\) and \(C_k\), the corresponding elements of complex input BFP vectors \(\bar B\) and \(\bar C\) respectively.
a, b and c must have been initialized (see bfp_complex_s32_init()), and must be the same length.
This operation can be performed safely in-place on b or c.
Convert a complex 32-bit BFP vector to a complex 16-bit BFP vector.
Each complex 16-bit output element \(A_k\) of complex output BFP vector \(\bar A\) is set to the value of \(B_k\), the corresponding element of complex 32-bit input BFP vector \(\bar B\), with its bit-depth reduced to 16 bits.
Get the squared magnitude of each element of a complex 32-bit BFP vector.
Each element \(A_k\) of real output BFP vector \(\bar A\) is set to the squared magnitude of \(B_k\), the corresponding element of complex input BFP vector \(\bar B\).
\[\begin{split}\begin{aligned}
& A_k \leftarrow B_k \cdot (B_k)^* \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B} \\
& \qquad\text{ and } (B_k)^* \text{ is the complex conjugate of } B_k
\end{aligned}\end{split}\]
Get the magnitude of each element of a complex 32-bit BFP vector.
Each element \(A_k\) of real output BFP vector \(\bar A\) is set to the magnitude of \(B_k\), the corresponding element of complex input BFP vector \(\bar B\).
Get the sum of elements of a complex 32-bit BFP vector.
The elements of complex input BFP vector \(\bar B\) are summed together. The result is a complex 64-bit floating-point scalar \(a\), which is returned.
Get the complex conjugate of each element of a complex 32-bit BFP vector.
Each element \(A_k\) of complex output BFP vector \(\bar A\) is set to the complex conjugate of \(B_k\), the corresponding element of complex input BFP vector \(\bar B\).
Operation Performed
\[\begin{split}\begin{aligned}
& A_k \leftarrow B_k^* \\
& \qquad\text{for } k \in 0\ ...\ (N-1) \\
& \qquad\text{where } N \text{ is the length of } \bar{B} \\
& \qquad\text{and } B_k^* \text{ is the complex conjugate of } B_k
\end{aligned}\end{split}\]
Create complex 32-bit BFP vector from real and imaginary parts.
Create a complex 32-bit BFP vector as the sum of a real vector \(\bar B\) and imaginary vector \(\bar{C} i\).
a, b and c must have been initialized (see bfp_complex_s32_init() and bfp_s32_init()), and must be the same length. &a->data[0] must be a double-word-aligned address.
Operation Performed
\[\begin{aligned}
& \bar{A} \leftarrow \bar{B} + \bar{C} i
\end{aligned}\]
This function performs a 6-point forward type-II DCT on input vector \(\bar x\), and populates output vector \(\bar y\) with the result. To avoid possible overflow or saturation, output \(\bar y\) is scaled down by a factor of \(2^4\) (see dct6_exp).
This operation may be safely performed in-place if x and y point to the same vector.
This function performs an 8-point forward type-II DCT on input vector \(\bar x\), and populates output vector \(\bar y\) with the result. To avoid possible overflow or saturation, output \(\bar y\) is scaled down by a factor of \(2^4\) (see dct8_exp).
This operation may be safely performed in-place if x and y point to the same vector.
This function performs a 12-point forward type-II DCT on input vector \(\bar x\), and populates output vector \(\bar y\) with the result. To avoid possible overflow or saturation, output \(\bar y\) is scaled down by a factor of \(2^7\) (see dct12_exp).
This operation may be safely performed in-place if x and y point to the same vector.
This function performs a 16-point forward type-II DCT on input vector \(\bar x\), and populates output vector \(\bar y\) with the result. To avoid possible overflow or saturation, output \(\bar y\) is scaled down by a factor of \(2^7\) (see dct16_exp).
This operation may be safely performed in-place if x and y point to the same vector.
This function performs a 24-point forward type-II DCT on input vector \(\bar x\), and populates output vector \(\bar y\) with the result. To avoid possible overflow or saturation, output \(\bar y\) is scaled down by a factor of \(2^{10}\) (see dct24_exp).
This operation may be safely performed in-place if x and y point to the same vector.
This function performs a 32-point forward type-II DCT on input vector \(\bar x\), and populates output vector \(\bar y\) with the result. To avoid possible overflow or saturation, output \(\bar y\) is scaled down by a factor of \(2^{10}\) (see dct32_exp).
This operation may be safely performed in-place if x and y point to the same vector.
This function performs a 48-point forward type-II DCT on input vector \(\bar x\), and populates output vector \(\bar y\) with the result. To avoid possible overflow or saturation, output \(\bar y\) is scaled down by a factor of \(2^{13}\) (see dct48_exp).
This operation may be safely performed in-place if x and y point to the same vector.
This function performs a 64-point forward type-II DCT on input vector \(\bar x\), and populates output vector \(\bar y\) with the result. To avoid possible overflow or saturation, output \(\bar y\) is scaled down by a factor of \(2^{13}\) (see dct64_exp).
This operation may be safely performed in-place if x and y point to the same vector.
This function performs a 6-point inverse DCT (same as type-III DCT) on input vector \(\bar x\), and populates output vector \(\bar y\) with the result.
This operation may be safely performed in-place if x and y point to the same vector.
This function performs an 8-point inverse DCT (same as type-III DCT) on input vector \(\bar x\), and populates output vector \(\bar y\) with the result.
This operation may be safely performed in-place if x and y point to the same vector.
This function performs a 12-point inverse DCT (same as type-III DCT) on input vector \(\bar x\), and populates output vector \(\bar y\) with the result.
This operation may be safely performed in-place if x and y point to the same vector.
This function performs a 16-point inverse DCT (same as type-III DCT) on input vector \(\bar x\), and populates output vector \(\bar y\) with the result.
This operation may be safely performed in-place if x and y point to the same vector.
This function performs a 24-point inverse DCT (same as type-III DCT) on input vector \(\bar x\), and populates output vector \(\bar y\) with the result.
This operation may be safely performed in-place if x and y point to the same vector.
This function performs a 32-point inverse DCT (same as type-III DCT) on input vector \(\bar x\), and populates output vector \(\bar y\) with the result.
This operation may be safely performed in-place if x and y point to the same vector.
This function performs a 48-point inverse DCT (same as type-III DCT) on input vector \(\bar x\), and populates output vector \(\bar y\) with the result.
This operation may be safely performed in-place if x and y point to the same vector.
This function performs a 64-point inverse DCT (same as type-III DCT) on input vector \(\bar x\), and populates output vector \(\bar y\) with the result.
This operation may be safely performed in-place if x and y point to the same vector.
This function performs a 2-dimensional 8-by-8 type-II DCT on 8-bit input tensor \(\bar x\) (with elements \(x_{rc}\)). Output tensor \(\bar y\) (with elements \(y_{rc}\)) is populated with the result.
This 2D DCT is performed by first applying a 1D 8-point DCT across each row of \(\bar x\), and then applying a 1D 8-point DCT to each column of that intermediate tensor.
The output is scaled by a factor of \(2^{-\mathtt{sat}-8}\). With \(\mathtt{sat}=0\) this scaling is just enough to avoid any possible saturation. If saturation is considered acceptable, or known a priori to not be possible, negative values for \(\mathtt{sat}\) can be used to increase precision on the output.
This operation may be safely performed in-place if x and y point to the same vector.
This function performs a 2-dimensional 8-by-8 type-III (inverse) DCT on 8-bit input tensor \(\bar x\) (with elements \(x_{rc}\)). Output tensor \(\bar y\) (with elements \(y_{rc}\)) is populated with the result.
This 2D DCT is performed by first applying a 1D 8-point DCT across each row of \(\bar x\), and then applying a 1D 8-point DCT to each column of that intermediate tensor.
The output is scaled by a factor of \(2^{-\mathtt{sat}}\). With \(\mathtt{sat}=0\) this scaling is just enough to avoid any possible saturation. If saturation is considered acceptable, or known a priori to not be possible, negative values for \(\mathtt{sat}\) can be used to increase precision on the output.
This operation may be safely performed in-place if x and y point to the same vector.
Let \(\bar x\) be the input to dct6_forward() and \(\bar y\) the output. If \(x\_exp\) and \(y\_exp\) are the exponents associated with \(\bar x\) and \(\bar y\) respectively, then the following relation holds: \(y\_exp = x\_exp + dct6\_exp\)
Let \(\bar x\) be the input to dct8_forward() and \(\bar y\) the output. If \(x\_exp\) and \(y\_exp\) are the exponents associated with \(\bar x\) and \(\bar y\) respectively, then the following relation holds: \(y\_exp = x\_exp + dct8\_exp\)
Let \(\bar x\) be the input to dct12_forward() and \(\bar y\) the output. If \(x\_exp\) and \(y\_exp\) are the exponents associated with \(\bar x\) and \(\bar y\) respectively, then the following relation holds: \(y\_exp = x\_exp + dct12\_exp\)
Let \(\bar x\) be the input to dct16_forward() and \(\bar y\) the output. If \(x\_exp\) and \(y\_exp\) are the exponents associated with \(\bar x\) and \(\bar y\) respectively, then the following relation holds: \(y\_exp = x\_exp + dct16\_exp\)
Let \(\bar x\) be the input to dct24_forward() and \(\bar y\) the output. If \(x\_exp\) and \(y\_exp\) are the exponents associated with \(\bar x\) and \(\bar y\) respectively, then the following relation holds: \(y\_exp = x\_exp + dct24\_exp\)
Let \(\bar x\) be the input to dct32_forward() and \(\bar y\) the output. If \(x\_exp\) and \(y\_exp\) are the exponents associated with \(\bar x\) and \(\bar y\) respectively, then the following relation holds: \(y\_exp = x\_exp + dct32\_exp\)
Let \(\bar x\) be the input to dct48_forward() and \(\bar y\) the output. If \(x\_exp\) and \(y\_exp\) are the exponents associated with \(\bar x\) and \(\bar y\) respectively, then the following relation holds: \(y\_exp = x\_exp + dct48\_exp\)
Let \(\bar x\) be the input to dct64_forward() and \(\bar y\) the output. If \(x\_exp\) and \(y\_exp\) are the exponents associated with \(\bar x\) and \(\bar y\) respectively, then the following relation holds: \(y\_exp = x\_exp + dct64\_exp\)
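The exponent relations above can be checked numerically. The following is a minimal sketch in plain C and is not part of the library: it assumes that dct6_exp equals 4 (inferred from the \(2^4\) scaling noted for dct6_forward()), and DCT6_EXP_ASSUMED, bfp_value() and dct6_output_exp() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value of dct6_exp, inferred from the 2^4 down-scaling described
 * for dct6_forward(). Check the library headers before relying on it. */
#define DCT6_EXP_ASSUMED 4

/* Hypothetical helper: the real value mant * 2^exp of one BFP element. */
static double bfp_value(int32_t mant, int exp)
{
    assert(exp > -1000 && exp < 1000);
    double v = (double)mant;
    for (; exp > 0; exp--) v *= 2.0;
    for (; exp < 0; exp++) v /= 2.0;
    return v;
}

/* Exponent to associate with the output of a 6-point forward DCT,
 * following y_exp = x_exp + dct6_exp. */
static int dct6_output_exp(int x_exp)
{
    return x_exp + DCT6_EXP_ASSUMED;
}
```

For instance, if an unscaled result would have had mantissa 4800 at exponent -10, the scaled-down mantissa 300 at exponent dct6_output_exp(-10) represents the same real value.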
Performs a forward real Discrete Fourier Transform on a real 32-bit sequence.
Performs an \(N\)-point forward real DFT on the real 32-bit BFP vector x, where \(N\) is x->length. The operation is performed in-place, resulting in an \(N/2\)-element complex 32-bit BFP vector.
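Operation Performed
\[\begin{split}\begin{aligned}
& X[f] = \sum_{n=0}^{N-1} \left( x[n]\cdot e^{-j2\pi fn/N} \right) \\
& \text{ for } 0 \le f \le N/2
\end{aligned}\end{split}\]
where \(x[n]\) is the BFP vector initially represented by x, and \(X[f]\) is the DFT of \(x[n]\) represented by the returned pointer.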
The exponent, headroom, length and data contents of x are all updated by this function, though x->data will continue to point to the same address.
x->length must be a power of 2, and must be no larger than (1<<MAX_DIT_FFT_LOG2).
This function returns a bfp_complex_s32_t pointer. This points to the same address as x. This is intended as a convenience for user code.
Upon completion, the spectrum data is encoded in x->data as specified for real DFTs in spectrum_packing. That is, x->data[f] for 1<=f<(x->length) represent \(X[f]\) for \(1 \le f < (N/2)\) and x->data[0] represents \(X[0] + j X[N/2]\).
Example
// Initialize time domain data with samples.
int32_t buffer[N] = { ... };
bfp_s32_t samples;
bfp_s32_init(&samples, buffer, 0, N, 1);
// Perform the forward DFT
{
bfp_complex_s32_t* spectrum = bfp_fft_forward_mono(&samples);
// `samples` should no longer be used.
// Operate on frequency domain data using `spectrum`
...
// Perform the inverse DFT to go back to time domain
bfp_fft_inverse_mono(spectrum); // returns (bfp_s32_t*) which is the address of `samples`
}
// Use `samples` again to use new time domain data.
...
Parameters:
x – [inout] The BFP vector \(x[n]\) to be DFTed.
Returns:
Address of input BFP vector x, cast as bfp_complex_s32_t*.
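As a small worked illustration of the packing described above, take \(N = 4\) and \(x[n] = [1, 2, 3, 4]\). The values are chosen here for illustration only, and any scaling the function applies to the mantissas and exponent is ignored:

\[\begin{split}\begin{aligned}
& X[0] = 1 + 2 + 3 + 4 = 10 \\
& X[1] = 1 + 2(-j) + 3(-1) + 4(j) = -2 + 2j \\
& X[2] = 1 - 2 + 3 - 4 = -2
\end{aligned}\end{split}\]

Both \(X[0]\) and \(X[N/2] = X[2]\) are purely real, so the packed result needs only \(N/2 = 2\) complex elements: x->data[0] holds \(X[0] + j X[2] = 10 - 2j\) and x->data[1] holds \(X[1] = -2 + 2j\).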
Performs an inverse real Discrete Fourier Transform on a complex 32-bit sequence.
Performs an \(N\)-point inverse real DFT on the complex 32-bit BFP vector x, where \(N\) is 2*x->length. The operation is performed in-place, resulting in an \(N\)-element real 32-bit BFP vector.
Operation Performed
\[\begin{split}\begin{aligned}
& x[n] = \frac{1}{N}\sum_{f=0}^{N/2} \left( X[f]\cdot e^{j2\pi fn/N} \right) \\
& \text{ for } 0 \le n < N
\end{aligned}\end{split}\]
where \(X[f]\) is the BFP vector initially represented by x, and \(x[n]\) is the IDFT of \(X[f]\) represented by the returned pointer.
The exponent, headroom, length and data contents of x are all updated by this function, though x->data will continue to point to the same address.
x->length must be a power of 2, and must be no larger than (1<<(MAX_DIT_FFT_LOG2-1)).
This function returns a bfp_s32_t pointer. This points to the same address as x. This is intended as a convenience for user code.
When calling, the spectrum data must be encoded in x->data as specified for real DFTs in spectrum_packing. That is, x->data[f] for 1<=f<(x->length) represent \(X[f]\) for \(1 \le f < N/2\), and x->data[0] represents \(X[0] + j X[N/2]\).
Example
// Initialize time domain data with samples.
int32_t buffer[N] = { ... };
bfp_s32_t samples;
bfp_s32_init(&samples, buffer, 0, N, 1);
// Perform the forward DFT
{
bfp_complex_s32_t* spectrum = bfp_fft_forward_mono(&samples);
// `samples` should no longer be used.
// Operate on frequency domain data using `spectrum`
...
// Perform the inverse DFT to go back to time domain
bfp_fft_inverse_mono(spectrum); // returns (bfp_s32_t*) which is the address of `samples`
}
// Use `samples` again to use new time domain data.
...
Parameters:
x – [inout] The BFP vector \(X[f]\) to be IDFTed.
Returns:
Address of input BFP vector x, cast as bfp_s32_t*.
Performs a forward complex Discrete Fourier Transform on a complex 32-bit sequence.
Performs an \(N\)-point forward complex DFT on the complex 32-bit BFP vector x, where \(N\) is x->length. The operation is performed in-place.
Operation Performed
\[\begin{split}\begin{aligned}
& X[f] = \sum_{n=0}^{N-1} \left( x[n]\cdot e^{-j2\pi fn/N} \right) \\
& \text{ for } 0 \le f < N
\end{aligned}\end{split}\]
where \(x[n]\) is the BFP vector initially represented by x, and \(X[f]\) is the DFT of \(x[n]\), also represented by x upon completion.
The exponent, headroom and data contents of x are updated by this function. x->data will continue to point to the same address.
x->length ( \(N\)) must be a power of 2, and must be no larger than (1<<MAX_DIT_FFT_LOG2).
Upon completion, the spectrum data is encoded in x as specified in spectrum_packing. That is, x->data[f] for 0<=f<(x->length) represent \(X[f]\) for \(0 \le f < N\).
Example
// Initialize complex time domain data with samples.
complex_s32_t buffer[N] = { ... };
bfp_complex_s32_t vector;
bfp_complex_s32_init(&vector, buffer, 0, N, 1);
// Perform the forward DFT
bfp_fft_forward_complex(&vector);
// Operate on frequency domain data
...
// Perform the inverse DFT to go back to time domain
bfp_fft_inverse_complex(&vector);
// `vector` contains (complex) time-domain data again
...
Performs an inverse complex Discrete Fourier Transform on a complex 32-bit sequence.
Performs an \(N\)-point inverse complex DFT on the complex 32-bit BFP vector x, where \(N\) is x->length. The operation is performed in-place.
Operation Performed
\[\begin{split}\begin{aligned}
& x[n] = \frac{1}{N}\sum_{f=0}^{N-1} \left( X[f]\cdot e^{j2\pi fn/N} \right) \\
& \text{ for } 0 \le n < N
\end{aligned}\end{split}\]
where \(X[f]\) is the BFP vector initially represented by x, and \(x[n]\) is the IDFT of \(X[f]\), also represented by x upon completion.
The exponent, headroom and data contents of x are updated by this function. x->data will continue to point to the same address.
x->length must be a power of 2, and must be no larger than (1<<MAX_DIT_FFT_LOG2).
The data initially encoded in x are interpreted as specified in spectrum_packing. That is, x->data[f] for 0<=f<(x->length) represent \(X[f]\) for \(0 \le f < N\).
Example
// Initialize complex time domain data with samples.
complex_s32_t buffer[N] = { ... };
bfp_complex_s32_t vector;
bfp_complex_s32_init(&vector, buffer, 0, N, 1);
// Perform the forward DFT
bfp_fft_forward_complex(&vector);
// Operate on frequency domain data
...
// Perform the inverse DFT to go back to time domain
bfp_fft_inverse_complex(&vector);
// `vector` contains (complex) time-domain data again
...
Performs a forward real Discrete Fourier Transform on a pair of real 32-bit sequences.
Performs an \(N\)-point forward real DFT on the real 32-bit BFP vectors \(\bar a\) and \(\bar b\), where \(N\) is a->length (which must equal b->length). The resulting spectra, \(\bar A\) and \(\bar B\), are placed in a and b. Each spectrum is an \(N/2\)-element complex 32-bit BFP vector. To access the spectra, the pointers a and b should be cast to bfp_complex_s32_t* following a call to this function.
Operation Performed
\[\begin{split}\begin{aligned}
& A[f] = \sum_{n=0}^{N-1} \left( a[n]\cdot e^{-j2\pi fn/N} \right) \text{ for } 0 \le f \le N/2 \\
& B[f] = \sum_{n=0}^{N-1} \left( b[n]\cdot e^{-j2\pi fn/N} \right) \text{ for } 0 \le f \le N/2
\end{aligned}\end{split}\]
where \(a[n]\) and \(b[n]\) are the two time-domain sequences represented by input BFP vectors a and b, and \(A[f]\) and \(B[f]\) are the DFT of \(a[n]\) and \(b[n]\) respectively.
a->length ( \(N\)) must be equal to b->length, must be a power of 2, and must be no larger than (1<<MAX_DIT_FFT_LOG2).
The parameters a and b are used as both inputs and outputs. To access the result of the FFT, a and b should be cast to bfp_complex_s32_t*. The structs’ metadata (e.g. exp, hr, length) are updated by this function to reflect this change of interpretation. The bfp_s32_t references should be considered corrupted after this call (at least until bfp_fft_inverse_stereo() is called).
The spectrum data is encoded in a->data and b->data as specified for real DFTs in spectrum_packing. That is, a->data[f] for 1<=f<(a->length) represent \(A[f]\) for \(1 \le f < (N/2)\) and a->data[0] represents \(A[0] + j A[N/2]\). Likewise for the encoding of b->data.
This function requires a scratch buffer large enough to contain \(N\) complex_s32_t elements.
Deprecated:
Example
// Initialize time domain data with samples.
int32_t bufferA[N] = { ... };
int32_t bufferB[N] = { ... };
complex_s32_t scratch[N]; // scratch buffer -- contents don't matter
bfp_s32_t channel_A, channel_B;
bfp_s32_init(&channel_A, bufferA, 0, N, 1);
bfp_s32_init(&channel_B, bufferB, 0, N, 1);
// Perform the forward DFT
bfp_fft_forward_stereo(&channel_A, &channel_B, scratch);
// channel_A and channel_B should now be considered clobbered as the structs are now
// effectively bfp_complex_s32_t
bfp_complex_s32_t* chanA = (bfp_complex_s32_t*) &channel_A;
bfp_complex_s32_t* chanB = (bfp_complex_s32_t*) &channel_B;
// Operate on frequency domain data using `chanA` and `chanB`
...
// Perform the inverse DFT to go back to time domain
bfp_fft_inverse_stereo(chanA, chanB, scratch); // chanA and chanB are already pointers
// Use channel_A and channel_B again to use new time domain data.
...
Note
Use of this function is not currently recommended. It functions correctly, but a recent change in this library’s API (namely, dropping support for channel-pair vectors) means this function is no more computationally efficient than calling bfp_fft_forward_mono() on each input vector separately. Additionally, this function currently requires a scratch buffer, whereas the mono FFT does not.
Parameters:
a – [inout] [Input] Time-domain BFP vector \(\bar a\). [Output] Frequency domain BFP vector \(\bar A\)
b – [inout] [Input] Time-domain BFP vector \(\bar b\). [Output] Frequency domain BFP vector \(\bar B\)
scratch – Scratch buffer of at least a->length complex_s32_t elements
Performs an inverse real Discrete Fourier Transform on a pair of complex 32-bit sequences.
Performs an \(N\)-point inverse real DFT on the 32-bit complex BFP vectors \(\bar A\) and \(\bar B\) (A_fft and B_fft respectively), where \(N\) is 2*A_fft->length. The resulting real signals, \(\bar a\) and \(\bar b\), are placed in A_fft and B_fft. Each time-domain result is an \(N\)-element real 32-bit BFP vector. To access the time-domain results, the pointers A_fft and B_fft should be cast to bfp_s32_t* following a call to this function.
Operation Performed
\[\begin{split}\begin{aligned}
& a[n] = \frac{1}{N}\sum_{f=0}^{N/2-1} \left( A[f]\cdot e^{j2\pi fn/N} \right) \text{ for } 0 \le n < N \\
& b[n] = \frac{1}{N}\sum_{f=0}^{N/2-1} \left( B[f]\cdot e^{j2\pi fn/N} \right) \text{ for } 0 \le n < N
\end{aligned}\end{split}\]
where \(A[f]\) and \(B[f]\) are the frequency spectra represented by BFP vectors A_fft and B_fft, and \(a[n]\) and \(b[n]\) are the IDFT of \(A[f]\) and \(B[f]\).
A_fft->length ( \(N/2\)) must be a power of 2, and must be no larger than (1<<(MAX_DIT_FFT_LOG2-1)).
The parameters A_fft and B_fft are used as both inputs and outputs. To access the result of the IFFT, A_fft and B_fft should be cast to bfp_s32_t*. The structs’ metadata (e.g. exp, hr, length) are updated by this function to reflect this change of interpretation. The bfp_complex_s32_t references should be considered corrupted after this call.
The spectrum data encoded in A_fft->data and B_fft->data are interpreted as specified for real DFTs in spectrum_packing. That is, A_fft->data[f] for 1<=f<(A_fft->length) represent \(A[f]\) for \(1 \le f < (N/2)\) and A_fft->data[0] represents \(A[0] + j A[N/2]\). Likewise for the encoding of B_fft->data.
This function requires a scratch buffer large enough to contain 2*A_fft->length complex_s32_t elements.
Deprecated:
Example
// Initialize time domain data with samples.
int32_t bufferA[N] = { ... };
int32_t bufferB[N] = { ... };
complex_s32_t scratch[N]; // scratch buffer -- contents don't matter
bfp_s32_t channel_A, channel_B;
bfp_s32_init(&channel_A, bufferA, 0, N, 1);
bfp_s32_init(&channel_B, bufferB, 0, N, 1);
// Perform the forward DFT
bfp_fft_forward_stereo(&channel_A, &channel_B, scratch);
// channel_A and channel_B should now be considered clobbered as the structs are now
// effectively bfp_complex_s32_t
bfp_complex_s32_t* chanA = (bfp_complex_s32_t*) &channel_A;
bfp_complex_s32_t* chanB = (bfp_complex_s32_t*) &channel_B;
// Operate on frequency domain data using `chanA` and `chanB`
...
// Perform the inverse DFT to go back to time domain
bfp_fft_inverse_stereo(chanA, chanB, scratch); // chanA and chanB are already pointers
// Use channel_A and channel_B again to use new time domain data.
...
Note
Use of this function is not currently recommended. It functions correctly, but a recent change in this library’s API (namely, dropping support for channel-pair vectors) means this function is no more computationally efficient than calling bfp_fft_forward_mono() on each input vector separately. Additionally, this function currently requires a scratch buffer, whereas the mono FFT does not.
The DFT of a real signal is periodic with period FFT_N (the FFT length) and has a complex conjugate symmetry about index 0. These two properties guarantee that the imaginary part of both the DC component (index 0) and the Nyquist component (index FFT_N/2) of the spectrum are zero. To compute the forward FFT in-place, bfp_fft_forward_mono() packs the real part of the Nyquist rate component of the output spectrum into the imaginary part of the DC component.
This may be undesirable when operating on the signal’s complex spectrum. Use this function to unpack the Nyquist component. This function will also adjust the BFP vector’s length to reflect this unpacking.
NOTE: If you intend to unpack the spectrum using this function, the buffer for the time-domain BFP vector must have length FFT_N+2, rather than FFT_N (int32_t elements), but the two extra elements should NOT be reflected in the time-domain BFP vector’s length field.
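The unpacking step described above can be sketched in plain C. This is an illustrative model only, not the library's implementation: cplx32_t stands in for the library's complex element type, and unpack_real_spectrum() is a hypothetical name. Note that FFT_N+2 int32_t elements correspond to FFT_N/2+1 complex elements, which is why the extra buffer space is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the library's complex element type (illustration only). */
typedef struct { int32_t re; int32_t im; } cplx32_t;

/*
 * Unpack a real-DFT spectrum in place.
 * On entry:  spec[0] = X[0] + j*X[fft_n/2], spec[1..fft_n/2-1] = X[f].
 * On return: spec[0] = X[0] and spec[fft_n/2] = X[fft_n/2], both with a
 *            zero imaginary part. The buffer must hold fft_n/2 + 1 elements.
 * Returns the new number of valid bins (fft_n/2 + 1).
 */
static unsigned unpack_real_spectrum(cplx32_t *spec, unsigned fft_n)
{
    assert(spec != 0 && fft_n >= 2 && (fft_n & (fft_n - 1)) == 0);
    unsigned half = fft_n / 2;
    spec[half].re = spec[0].im;  /* Nyquist bin moves to its own slot */
    spec[half].im = 0;
    spec[0].im = 0;              /* DC bin is purely real */
    return half + 1;
}
```

After unpacking, every bin from DC to Nyquist can be processed uniformly as an ordinary complex value, at the cost of one extra complex element.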
Compute a forward DFT using the decimation-in-time FFT algorithm.
This function computes the N-point forward DFT of a complex input signal using the decimation-in-time FFT algorithm. The result is computed in-place.
Operation Performed
\[\begin{split}\begin{aligned}
& X[f] = \frac{1}{2^{\alpha}} \sum_{n=0}^{N-1} \left( x[n]\cdot e^{-j2\pi fn/N} \right) \\
& \text{ for } 0 \le f < N
\end{aligned}\end{split}\]
x[] is interpreted to be a block floating-point vector with shared exponent *exp and with *hr bits of headroom initially in x[]. During computation, this function monitors the headroom of the data and compensates to avoid overflows and underflows by bit-shifting the data up or down as appropriate. In the equation above, \(\alpha\) represents the (net) number of bits that the data was right-shifted by.
Upon completion, *hr is updated with the final headroom in x[], and the exponent *exp is incremented by \(\alpha\).
Note
In order to guarantee that saturation will not occur, x[] must have an initial headroom of at least 2 bits.
Parameters:
x – [inout] The N-element complex input vector to be transformed.
N – [in] The size of the DFT to be performed.
hr – [inout] Pointer to the initial headroom in x[].
exp – [inout] Pointer to the initial exponent associated with x[].
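The headroom and exponent bookkeeping described above, including the 2-bit headroom precondition, can be sketched in plain C. This is an illustrative model only, not the library's implementation: headroom_s32() and reserve_headroom() are hypothetical helpers, and the sketch assumes that right-shifting a negative int32_t is an arithmetic shift (true for common compilers, but implementation-defined in C).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of bits v can be left-shifted without overflow. */
static unsigned headroom_s32(int32_t v)
{
    int32_t m = (v < 0) ? ~v : v;   /* fold negatives onto non-negatives */
    unsigned hr = 0;
    while (hr < 31 && ((m >> (30 - hr)) & 1) == 0) hr++;
    return hr;
}

/* Hypothetical helper: right-shift x[] just enough to leave min_hr bits of
 * headroom, adding the shift to *exp so that x[k] * 2^(*exp) is preserved
 * (up to truncation). Sets *hr to the resulting headroom and returns the
 * number of bits shifted. */
static unsigned reserve_headroom(int32_t x[], unsigned n, unsigned min_hr,
                                 unsigned *hr, int *exp)
{
    assert(n > 0 && min_hr < 31);
    unsigned cur = 31;
    for (unsigned k = 0; k < n; k++) {
        unsigned h = headroom_s32(x[k]);
        if (h < cur) cur = h;       /* vector headroom is the minimum */
    }
    unsigned shr = (cur < min_hr) ? (min_hr - cur) : 0;
    for (unsigned k = 0; k < n; k++)
        x[k] >>= shr;               /* assumes arithmetic right-shift */
    *hr = cur + shr;
    *exp += (int)shr;
    return shr;
}
```

A caller working with raw mantissa arrays could apply something like reserve_headroom(x, n, 2, &hr, &exp) before the transform, trading up to two bits of precision for a guarantee against saturation.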
Compute an inverse DFT using the decimation-in-time IFFT algorithm.
This function computes the N-point inverse DFT of a complex spectrum using the decimation-in-time IFFT algorithm. The result is computed in-place.
Operation Performed
\[\begin{split}\begin{aligned}
& x[n] = \frac{1}{2^{\alpha}} \sum_{f=0}^{N-1} \left( X[f]\cdot e^{j2\pi fn/N} \right) \\
& \text{ for } 0 \le n < N
\end{aligned}\end{split}\]
x[] is interpreted to be a block floating-point vector with shared exponent *exp and with *hr bits of headroom initially in x[]. During computation, this function monitors the headroom of the data and compensates to avoid overflows and underflows by bit-shifting the data up or down as appropriate. In the equation above, \(\alpha\) represents the (net) number of bits that the data was right-shifted by.
Upon completion, *hr is updated with the final headroom in x[], and the exponent *exp is incremented by \(\alpha - \log_2(N)\).
Note
In order to guarantee that saturation will not occur, x[] must have an initial headroom of at least 2 bits.
Parameters:
x – [inout] The N-element complex input vector to be transformed.
N – [in] The size of the inverse DFT to be performed.
hr – [inout] Pointer to the initial headroom in x[].
exp – [inout] Pointer to the initial exponent associated with x[].
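The exponent adjustment for the inverse transform follows from folding the \(1/N\) factor of the inverse DFT into the exponent rather than into the mantissas. A short derivation, writing \(e\) for the initial value of *exp and \(\tilde{x}[n]\) for the real-valued inverse DFT (both symbols are introduced here for illustration), with \(x[n]\) the output mantissas as defined in the equation above:

\[\begin{split}\begin{aligned}
\tilde{x}[n] &= \frac{1}{N}\sum_{f=0}^{N-1} \left( X[f]\cdot 2^{e} \cdot e^{j2\pi fn/N} \right) \\
&= 2^{e - \log_2(N) + \alpha} \cdot \frac{1}{2^{\alpha}} \sum_{f=0}^{N-1} \left( X[f]\cdot e^{j2\pi fn/N} \right) \\
&= x[n] \cdot 2^{e + \alpha - \log_2(N)}
\end{aligned}\end{split}\]

For example, with \(N = 256\) and a net right-shift of \(\alpha = 3\) bits, the exponent changes by \(3 - 8 = -5\).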
Compute a forward DFT using the decimation-in-frequency FFT algorithm.
This function computes the N-point forward DFT of a complex input signal using the decimation-in-frequency FFT algorithm. The result is computed in-place.
Operation Performed
\[\begin{split}\begin{aligned}
& X[f] = \frac{1}{2^{\alpha}} \sum_{n=0}^{N-1} \left( x[n]\cdot e^{-j2\pi fn/N} \right) \\
& \text{ for } 0 \le f < N
\end{aligned}\end{split}\]
x[] is interpreted to be a block floating-point vector with shared exponent *exp and with *hr bits of headroom initially in x[]. During computation, this function monitors the headroom of the data and compensates to avoid overflows and underflows by bit-shifting the data up or down as appropriate. In the equation above, \(\alpha\) represents the (net) number of bits that the data was right-shifted by.
Upon completion, *hr is updated with the final headroom in x[], and the exponent *exp is incremented by \(\alpha\).
Note
In order to guarantee that saturation will not occur, x[] must have an initial headroom of at least 2 bits.
Parameters:
x – [inout] The N-element complex input vector to be transformed.
N – [in] The size of the DFT to be performed.
hr – [inout] Pointer to the initial headroom in x[].
exp – [inout] Pointer to the initial exponent associated with x[].
Compute an inverse DFT using the decimation-in-frequency IFFT algorithm.
This function computes the N-point inverse DFT of a complex spectrum using the decimation-in-frequency IFFT algorithm. The result is computed in-place.
Operation Performed
\[\begin{split}\begin{aligned}
& x[n] = \frac{1}{2^{\alpha}} \sum_{f=0}^{N-1} \left( X[f]\cdot e^{j2\pi fn/N} \right) \\
& \text{ for } 0 \le n < N
\end{aligned}\end{split}\]
x[] is interpreted to be a block floating-point vector with shared exponent *exp and with *hr bits of headroom initially in x[]. During computation, this function monitors the headroom of the data and compensates to avoid overflows and underflows by bit-shifting the data up or down as appropriate. In the equation above, \(\alpha\) represents the (net) number of bits that the data was right-shifted by.
Upon completion, *hr is updated with the final headroom in x[], and the exponent *exp is incremented by \(\alpha - \log_2(N)\).
Note
In order to guarantee that saturation will not occur, x[] must have an initial headroom of at least 2 bits.
Parameters:
x – [inout] The N-element complex input vector to be transformed.
N – [in] The size of the inverse DFT to be performed.
hr – [inout] Pointer to the initial headroom in x[].
exp – [inout] Pointer to the initial exponent associated with x[].
Add a new input sample to a 32-bit FIR filter without processing an output sample.
This function adds a new input sample to the filter’s state without computing a new output sample. This is a constant-time operation and can be used to quickly pre-load a filter with sample data.
See filter_fir_s32_t for more information about FIR filters and their operation.
The new input sample new_sample is added to this filter’s state, and a new output sample is computed and returned as specified in filter_biquad_s32_t.
This function processes a single filter block containing (up to) 8 biquad filter sections. For biquad filters containing 2 or more filter blocks (more than 8 biquad filter sections), see filter_biquads_s32().
This function implements a 32-bit Biquad filter with saturation.
Works the same as filter_biquad_s32(), but saturates the output to the symmetric 32-bit range at the cost of several compute cycles. The cost will depend on the number of biquads and the target architecture.
Filter Model
This struct represents an N-tap 32-bit discrete-time FIR Filter.
At each time step, the FIR filter consumes a single 32-bit input sample and produces a single 32-bit output sample.
To process a new input sample and compute a new output sample, use filter_fir_s32(). To add a new input sample to the filter without computing a new output sample, use filter_fir_s32_add_sample().
An N-tap FIR filter contains N 32-bit coefficients (pointed to by coef) and N words of state data (pointed to by state). The state data is a vector of the N most recent input samples. When processing a new input sample at time step t, x[t] is the new input sample, x[t-1] is the previous input sample, and so on, up to x[t-(N-1)], which is the oldest input considered when computing the new output sample (see note 1 below). The coefficients form a vector b[], where b[k] is the coefficient by which the kth oldest input sample is multiplied. There is an additional parameter shift which scales the output as described below. Both the coefficients and shift are considered to be constants which do not change after initialization (although nothing should break if they are changed to new valid values).
At time step t, the output sample y[t] is computed based on the inner product (i.e. sum of element-wise products) of the coefficients and state data as follows (a more detailed description is below):
\[
y[t] = \left( \sum_{k=0}^{N-1} x[t-k] \cdot b[k] \right) \gg \mathtt{shift}
\]
Importantly, all three of the operators above (addition, multiplication and the rightwards bit-shift) have slightly idiosyncratic meanings.
The products have a built-in rounding arithmetic right-shift of 30 bits, where ties round toward positive infinity. This is a hardware feature which allows for longer filters (larger N) without sacrificing coefficient precision. These element-wise products accumulate into 8 40-bit accumulators, which saturate the sums at symmetric 40-bit bounds (see saturation). The order in which the taps are accumulated is unspecified (see note 2 below).
After each tap has been accumulated, the 8 accumulators are then added together to get a 64-bit penultimate result (with 43 useful bits). Finally, an unsigned rounding arithmetic right-shift of shift bits is applied to the 64-bit sum, and the final result is saturated to the symmetric 32-bit range (-INT32_MAX to INT32_MAX inclusive).
Below is a more detailed description of the operations performed (not including the saturation logic applied by the accumulators).
\[\begin{split}
& y[t] = sat_{32} \left(
round \left(
\left(
\sum_{k=0}^{N-1} round(x[t-k] \cdot b[k] \cdot 2^{-30})
\right) \cdot 2^{-shift}
\right)
\right) \\
& \qquad\text{where } sat_{32}() \text{ saturates to } \pm(2^{31}-1) \\
& \qquad\text{ and } round() \text{ rounds to the nearest integer, with ties rounding towards } +\!\infty
\end{split}\]
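The equation above can be modeled in plain C as follows. This is a reference sketch for checking expected values, not the library's implementation: round_shift(), sat32() and fir_ref() are hypothetical names, and the sketch deliberately ignores the saturation applied by the 40-bit accumulators and the unspecified accumulation order.

```c
#include <assert.h>
#include <stdint.h>

/* round(v * 2^-s), with ties rounding towards +infinity. */
static int64_t round_shift(int64_t v, unsigned s)
{
    assert(s < 62);
    if (s == 0) return v;
    int64_t d = (int64_t)1 << s;
    int64_t t = v + d / 2;        /* add half before flooring */
    int64_t q = t / d;            /* C division truncates towards zero ... */
    if (t % d < 0) q--;           /* ... so step down to get floor() */
    return q;
}

/* Saturate to the symmetric 32-bit range [-INT32_MAX, INT32_MAX]. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < -INT32_MAX) return -INT32_MAX;
    return (int32_t)v;
}

/* Reference model of one output sample. x[k] holds x[t-k], so x[0] is the
 * newest input; b[k] are the tap coefficients. */
static int32_t fir_ref(const int32_t x[], const int32_t b[],
                       unsigned n_taps, unsigned shift)
{
    int64_t acc = 0;
    for (unsigned k = 0; k < n_taps; k++)
        acc += round_shift((int64_t)x[k] * b[k], 30);  /* per-tap product */
    return sat32(round_shift(acc, shift));
}
```

With four taps each equal to 1<<28 (0.25 when read with 30 fractional bits), shift set to 0 and inputs {4, 8, 12, 16}, the per-tap products are 1, 2, 3 and 4, so the model returns 10.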
Operations
Initialize: A filter_fir_s32_t filter is initialized with a call to filter_fir_s32_init(). The caller supplies information about the filter, including the number of taps and pointers to the coefficients and a state buffer. It is typically recommended that the state buffer be cleared to all 0s before initializing.
Add Sample: To add a new input sample without computing a new output sample, use filter_fir_s32_add_sample(). This is a constant-time operation which does not depend on the number of filter taps. This may be useful in some situations, for example, to quickly pre-load the filter’s state buffer with multiple samples, without incurring the cost of computing an output with each added sample.
Process Sample: To process a new input sample and produce a new output sample, use filter_fir_s32().
Fields
After initialization via filter_fir_s32_init(), the contents of the filter_fir_s32_t struct are considered to be opaque, and may change between major versions. In general, user code should not need to access its members.
num_taps is the order of the filter, or the number of taps. It is also the (minimum) size of the buffers to which coef and state point, in elements (where each element is 4 bytes). The time required to process an input sample and produce an output sample is approximately linear in num_taps (see Performance below).
head is the index into state at which the next sample will be added.
shift is the unsigned arithmetic rounding saturating right-shift applied to the internal accumulator to get a final output.
coef is a pointer to a buffer (supplied by the user at initialization) containing the tap coefficients. The coefficients are stored in forward order, with lower indices corresponding to newer samples. coef[0], then, corresponds to b[0], coef[1] to b[1], and so on. None of the functions which operate on filter_fir_s32_t structs in this library will modify the contents of the buffer to which coef points. This buffer must be at least num_taps words long.
state is a pointer to a buffer (supplied by the user at initialization) containing the state data — a history of the num_taps most recent input samples. state is used in a circular fashion with head indicating the index at which the next sample will be inserted.
Performance
More work remains to fully characterize the time performance of this FIR filter, but asymptotically (i.e. with a large number of filter taps) processing a new input sample to produce a new output sample takes approximately 3 thread cycles per 8 filter taps.
That assumes that both the coefficients (pointed to by coef) and state buffer (pointed to by state) are stored directly in SRAM.
Coefficient Scaling
Suppose you’re starting with a floating-point FIR filter model with coefficients B[k] which operates on a sequence of 32-bit integer input samples x[t] to get a result Y[t] where
\[ Y[t] = \sum_{k=0}^{N-1} x[t-k] \cdot B[k] \]
Because of the 30-bit right-shift and the right-shift of the final accumulator by shift bits, the coefficients b[k] to use with this library can be thought of as fixed-point values with 30+shift fractional bits.
The floating-point coefficients B[k] can then be naively converted to fixed-point coefficients b[k]
shift = 0
b[k] = (int32_t) round(ldexp(B[k], 30))
After this, any further doubling of the coefficients can be compensated for without changing the overall gain by incrementing shift.
To maximize precision, you’ll typically want shift to be as large as possible such that, in the worst case to be considered, neither the internal accumulator (which, for safety, should generally be assumed to be 42 bits) nor the final 32-bit output (once shift is applied) saturates.
The right choice depends on several factors, such as your filter’s gain and the statistics of the sequence x[t] (e.g. any headroom x[t] is known a priori to have).
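As a rough illustration of the conversion described above, the following sketch quantizes floating-point coefficients to fixed-point values with 30 + shift fractional bits. The helper name `quantize_fir_coefs_s32` is hypothetical and not part of this library; it only reports failure when a scaled coefficient no longer fits in 32 bits, and it does not check for accumulator or output saturation.

```c
#include <stdint.h>

// Hypothetical helper, not part of the library: converts floating-point
// coefficients B[k] to fixed-point b[k] with (30 + shift) fractional bits.
// Returns 0 on success, or -1 if shift is too large or a coefficient
// does not fit in an int32_t after scaling.
static int quantize_fir_coefs_s32(int32_t b[], const double B[],
                                  unsigned num_taps, unsigned shift)
{
  if (shift > 31) return -1;

  // 2^(30+shift); multiplying by it has the same effect as ldexp(B[k], 30 + shift).
  const double scale = (double)((uint64_t)1 << (30 + shift));

  for (unsigned k = 0; k < num_taps; k++) {
    const double v = B[k] * scale;
    // Round to nearest, with ties away from zero (as round() does).
    const double r = (v >= 0.0) ? (v + 0.5) : (v - 0.5);
    if (r >= 2147483648.0 || r <= -2147483649.0)
      return -1;
    b[k] = (int32_t)r;  // the cast truncates toward zero, completing the rounding
  }
  return 0;
}
```

Calling this with shift = 0 reproduces the naive conversion; each increment of shift doubles every b[k], and the larger final right-shift keeps the overall gain unchanged.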
Filter Conversion
This library includes a python script which converts existing floating-point FIR filter coefficients into a suitable representation and generates code for easily initializing and executing the filter. See Note: Digital Filter Conversion for more.
Usage Example
#define N 256                           // Tap count
#define B_VAL ldexp(1.0/N, 30+7)        // Value for (all) coefficients

const int32_t b[N] =                    // The filter coefficients
    { B_VAL, B_VAL, B_VAL, ..., B_VAL };
const right_shift_t shift = 7;          // The (unsigned) right-shift applied to the final accumulator
int32_t state_buff[N] = { 0 };          // Filter state buffer, initialized to 0's
filter_fir_s32_t filter;                // The filter struct

#define SAMPLE_COUNT 1024
int32_t x[SAMPLE_COUNT] = { ... };      // Some sequence of input samples

// Initialize
filter_fir_s32_init(&filter, state_buff, N, b, shift);

// Just add the first 64 without processing output samples. (not necessary)
for (unsigned i = 0; i < 64; i++)
    filter_fir_s32_add_sample(&filter, x[i]);

// Process the rest, generating a sequence of filtered output samples
int32_t y[SAMPLE_COUNT] = { 0 };        // Output samples (first 64 never get updated here)
for (unsigned i = 64; i < SAMPLE_COUNT; i++)
    y[i] = filter_fir_s32(&filter, x[i]);

// Do something with output sequence...
This example creates a simple 256-tap filter which averages the most recent 256 samples.
Each b[k] is \(2^{29}\), and the final accumulator is right-shifted 7 bits. In the worst case, all input samples are \(-2^{31}\). In that case, the final accumulator value is \( 256 \cdot (2^{29} \cdot -2^{31} \cdot 2^{-30}) = -2^{38} \), well below the saturation limit of the accumulator. After shift is applied, that becomes \(-2^{38} \cdot 2^{-7} = -2^{31}\). Finally, the 32-bit symmetric saturation logic is applied, making the final output value \(-2^{31}+1\).
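The worst-case arithmetic above can be checked with ordinary 64-bit integer math. This is only a sketch of the numbers in this example, not a model of the library's implementation; the function names are hypothetical.

```c
#include <stdint.h>

// Symmetric 32-bit saturation: clamps to [-(2^31 - 1), 2^31 - 1].
static int32_t sat32_symmetric(int64_t v)
{
  if (v > INT32_MAX)  return INT32_MAX;
  if (v < -INT32_MAX) return -INT32_MAX;
  return (int32_t)v;
}

// Accumulator value when all 256 taps are 2^29 and all inputs are -2^31.
static int64_t worst_case_accumulator(void)
{
  const int64_t b = (int64_t)1 << 29;
  const int64_t x = -((int64_t)1 << 31);
  int64_t acc = 0;
  for (int k = 0; k < 256; k++) {
    // Each product is scaled by 2^-30; the division is exact here.
    acc += (b * x) / ((int64_t)1 << 30);
  }
  return acc;  // -2^38
}

// Final output after the 7-bit right-shift and 32-bit symmetric saturation.
static int32_t worst_case_output(void)
{
  // The division is exact and yields -2^31, one step past the symmetric limit.
  const int64_t shifted = worst_case_accumulator() / ((int64_t)1 << 7);
  return sat32_symmetric(shifted);
}
```

Here `worst_case_accumulator()` evaluates to -2^38 and `worst_case_output()` to -2^31 + 1, matching the values derived in the text.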
Notes
state is a circular buffer, and so the index of x[t] within state changes with each input sample. The state field of this struct is considered to be opaque — its exact usage may change between versions.
Ordinarily integer sums are associative, so the order in which elements are added does not affect the final result. The sum that the FIR filters use, however, is saturating, with the saturation logic being applied throughout the sum. This saturation is a hard non-linearity and is not associative. The details of exactly when each tap is accumulated and into which accumulator are complicated and subject to change. It is best to construct a filter such that no ordering of the taps will saturate the accumulators.
This struct represents an N-tap 16-bit discrete-time FIR Filter.
At each time step, the FIR filter consumes a single 16-bit input sample and produces a single 16-bit output sample.
To process a new input sample and compute a new output sample, use filter_fir_s16(). To add a new input sample to the filter without computing a new output sample, use filter_fir_s16_add_sample().
An N-tap FIR filter contains N 16-bit coefficients (pointed to by coef) and N int16_ts of state data (pointed to by state). The state data is a vector of the N most recent input samples. When processing a new input sample at time step t, x[t] is the new input sample, x[t-1] is the previous input sample, and so on, up to x[t-(N-1)], which is the oldest input considered when computing the new output sample (see note 1 below). The coefficients form a vector b[], where b[k] is the coefficient by which the input sample x[t-k] is multiplied. There is an additional parameter shift which scales the output as described below. Both the coefficients and shift are considered to be constants which do not change after initialization (although nothing should break if they are changed to new valid values).
At time step t, the output sample y[t] is computed based on the inner product (i.e. sum of element-wise products) of the coefficients and state data as follows (a more detailed description is below):
Unlike the 32-bit FIR filters (see filter_fir_s32_t), the products x[t-k]*b[k] are the raw 32-bit products of the 16-bit elements. These element-wise products accumulate into a 32-bit accumulator which saturates the sums at symmetric 32-bit bounds (see saturation).
After all taps have been accumulated, a rounding arithmetic right-shift of shift bits is applied to the 32-bit sum, and the final result is saturated to the symmetric 16-bit range (the open interval \((-2^{15}, 2^{15})\)).
Below is a more detailed description of the operations performed (not including the saturation logic applied by the accumulators).
\[\begin{split}
& y[t] = sat_{16} \left(
round \left(
\left(
\sum_{k=0}^{N-1} x[t-k] \cdot b[k]
\right) \cdot 2^{-shift}
\right)
\right) \\
& \qquad\text{where } sat_{16}() \text{ saturates to } \pm(2^{15}-1) \\
& \qquad\text{ and } round() \text{ rounds to the nearest integer, with ties rounding towards } +\!\infty
\end{split}\]
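The operation above can be sketched in portable C as follows. This is a hypothetical reference model for checking expected outputs on a host, not the library's implementation: the names are illustrative, the history is passed as a plain array rather than the filter's state buffer, and the intermediate saturation of the accumulator is ignored (as in the equation).

```c
#include <stdint.h>

// Rounding right-shift: rounds to nearest, with ties toward +infinity.
static int64_t round_shr(int64_t v, unsigned shr)
{
  if (shr == 0) return v;
  const int64_t d = (int64_t)1 << shr;
  const int64_t biased = v + (d >> 1);
  int64_t q = biased / d;        // C division truncates toward zero...
  if (biased % d < 0) q -= 1;    // ...so adjust to obtain floor()
  return q;
}

// Hypothetical reference model of one output sample.
// x_hist[k] holds x[t-k], so x_hist[0] is the newest sample.
static int16_t fir_s16_reference(const int16_t x_hist[], const int16_t b[],
                                 unsigned num_taps, unsigned shift)
{
  int64_t acc = 0;  // wider than 32 bits, so this sketch never saturates mid-sum
  for (unsigned k = 0; k < num_taps; k++)
    acc += (int32_t)x_hist[k] * (int32_t)b[k];

  const int64_t y = round_shr(acc, shift);
  if (y >  INT16_MAX) return  INT16_MAX;   // saturate to +(2^15 - 1)
  if (y < -INT16_MAX) return -INT16_MAX;   // saturate to -(2^15 - 1)
  return (int16_t)y;
}
```

For example, four taps of 16384 with shift = 14 sum the four most recent samples, since each coefficient then represents 1.0.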
Operations
Initialize: A filter_fir_s16_t filter is initialized with a call to filter_fir_s16_init(). The caller supplies information about the filter, including the number of taps and pointers to the coefficients and a state buffer. It is typically recommended that the state buffer be cleared to all 0s before initializing.
Add Sample: To add a new input sample without computing a new output sample, use filter_fir_s16_add_sample(). Unlike filter_fir_s32_add_sample(), this is not a constant-time operation, and does depend on the number of filter taps. Nevertheless, this is faster than computing output samples, and may be useful in some situations, for example, to more quickly pre-load the filter’s state buffer with multiple samples, without incurring the cost of computing an output with each added sample.
Process Sample: To process a new input sample and produce a new output sample, use filter_fir_s16().
Fields
After initialization via filter_fir_s16_init(), the contents of the filter_fir_s16_t struct are considered to be opaque, and may change between major versions. In general, user code should not need to access its members.
num_taps is the order of the filter, or the number of taps. It is also the (minimum) size of the buffers to which coef and state point, in elements (where each element is 2 bytes). The time required to process an input sample and produce an output sample is approximately linear in num_taps (see Performance below).
shift is the unsigned arithmetic rounding saturating right-shift applied to internal accumulator to get a final output.
coef is a pointer to a buffer (supplied by the user at initialization) containing the tap coefficients. The coefficients are stored in forward order, with lower indices corresponding to newer samples. coef[0], then, corresponds to b[0], coef[1] to b[1], and so on. None of the functions which operate on filter_fir_s16_t structs in this library will modify the contents of the buffer to which coef points. This buffer must be at least num_taps elements long, and must begin at a word-aligned address.
state is a pointer to a buffer (supplied by the user at initialization) containing the state data — a history of the num_taps most recent input samples. state must begin at a word-aligned address.
Coefficient Scaling
Filter Conversion
This library includes a python script which converts existing floating-point FIR filter coefficients into a suitable representation and generates code for easily initializing and executing the filter. See Note: Digital Filter Conversion for more.
This library includes a python script which converts existing floating-point cascaded biquad filter coefficients into a suitable representation and generates code for easily initializing and executing the filter. See Note: Digital Filter Conversion for more.
Convert a 16-bit floating-point scalar to a 32-bit floating-point scalar.
Converts a 16-bit floating-point scalar, represented by the 16-bit mantissa b and exponent b_exp, into a 32-bit floating-point scalar, represented by the 32-bit returned mantissa and output exponent a_exp.
remove_hr, if nonzero, indicates that the output mantissa should have no headroom. Otherwise, the output mantissa will be the same as the input mantissa.
Parameters:
a_exp – [out] Output exponent
b – [in] 16-bit input mantissa
b_exp – [in] Input exponent
remove_hr – [in] Whether to remove headroom in output
Convert a 32-bit floating-point scalar to a 16-bit floating-point scalar.
Converts a 32-bit floating-point scalar, represented by the 32-bit mantissa b and exponent b_exp, into a 16-bit floating-point scalar, represented by the 16-bit returned mantissa and output exponent a_exp.
Compute the square root of a 32-bit floating-point scalar.
b and b_exp together represent the input \(b \cdot 2^{b\_exp}\). Likewise, a and a_exp together represent the result \(a \cdot 2^{a\_exp}\).
depth indicates the number of MSb’s which will be calculated. Smaller values here will execute more quickly at the cost of reduced precision. The maximum valid value for depth is S32_SQRT_MAX_DEPTH.
Operation Performed
\[\begin{aligned}
a \cdot 2^{a\_exp} \leftarrow \sqrt{\left( b \cdot 2^{b\_exp} \right)}
\end{aligned}\]
Parameters:
a_exp – [out] Output exponent \(a\_exp\)
b – [in] Input mantissa \(b\)
b_exp – [in] Input exponent \(b\_exp\)
depth – [in] Number of most significant bits to calculate
Convert angle from radians to a modified binary representation.
Some trig functions, such as sbrad_sin(), rather than taking an angle specified in radians (e.g. radian_q24_t), require their argument to be a modified representation of the angle, as an sbrad_t. The modified binary representation takes into account various properties of the \(sin(\theta)\) function to simplify certain operations.
For any angle \(\theta\) there is a unique angle \(\alpha\) where \(-1\le\alpha\le1\) and \(sin(\frac{\pi}{2}\alpha) = sin(\theta)\). This function essentially just maps the input angle \(\theta\) onto the corresponding angle \(\alpha\) in that region and returns the result in a Q1.31 format.
In this library, the unit of the resulting angle \(\alpha\) is referred to as an ‘sbrad’. ‘brad’ because \(\alpha\) is a kind of binary angular measurement, and ‘s’ because the symmetries of \(sin(\theta)\) are what’s being accounted for.
Parameters:
theta – [in] Input angle \(\theta\), in radians (Q8.24)
This function computes \(tan(\theta)\). The result is returned as a float_s32_t containing a mantissa and exponent.
The value of \(tan(\theta)\) is considered undefined where \(\theta=\frac{\pi}{2}+k\pi\) for any integer \(k\). An exception will be raised if \(\theta\) meets this condition.
Operation Performed
\[\begin{aligned}
& tan(\theta)
\end{aligned}\]
Parameters:
theta – [in] Input angle \(\theta\), in radians (Q8.24)
Evaluate the logistic function at the specified point.
This function computes the value of the logistic function \(y =\frac{1}{1+e^{-x}}\). This is a sigmoidal curve bounded below by \(y = 0\) and above by \(y = 1\).
The input \(x\) and output \(y\) are both Q8.24 fixed-point values.
If speed is greatly preferred to precision, q24_logistic_fast() can be used instead.
Operation Performed
\[\begin{aligned}
& y \leftarrow \frac{1}{1+e^{-x}}
\end{aligned}\]
Evaluate the logistic function at the specified point.
This function computes the value of the logistic function \(y =\frac{1}{1+e^{-x}}\). This is a sigmoidal curve bounded below by \(y = 0\) and above by \(y = 1\).
The input \(x\) and output \(y\) are both Q8.24 fixed-point values.
This implementation trades off precision for speed, approximating results in a piece-wise linear manner. If a precise result is desired, q24_logistic() should be used instead.
Operation Performed
\[\begin{aligned}
& y \leftarrow \frac{1}{1+e^{-x}}
\end{aligned}\]
This function computes the first \(N\) powers (starting with \(0\)) of the Q2.30 input \(b\). The results are output as \(\bar a\), also in Q2.30 format.
This function populates the elements of output vector \(\bar a\) with the odd powers of input \(b\). The first count odd powers of \(b\) are output. The highest power output will be \(2\cdot\mathtt{count}-1\).
The 64-bit product of each multiplication is right-shifted by shr bits and truncated to the 32 least significant bits. If \(b\) is a fixed-point value with shr fractional bits, then each \(a_k\) will have the same Q-format as input \(b\). shr must be non-negative.
This function neither rounds nor saturates results. It is up to the user to ensure overflows are avoided.
Typical use-case is computing a power series of a function with odd symmetry.
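A portable sketch of the odd-powers computation is given below. It is a hypothetical reference model, not the library's implementation: the function names are illustrative, and the order in which products are formed (squaring \(b\) once, then repeatedly multiplying by the square) is an assumption of this sketch. As described above, each product is right-shifted by shr bits and truncated, with no rounding or saturation.

```c
#include <stdint.h>

// Multiply two 32-bit values, shift the 64-bit product right by shr bits
// (flooring, like an arithmetic right-shift) and keep the 32 least significant bits.
static int32_t mul_shr_trunc(int32_t p, int32_t q, unsigned shr)
{
  const int64_t prod = (int64_t)p * (int64_t)q;
  const int64_t d = (int64_t)1 << shr;
  int64_t r = prod / d;
  if (prod % d < 0) r -= 1;
  return (int32_t)(uint32_t)(uint64_t)r;
}

// Hypothetical reference model: writes b^1, b^3, ..., b^(2*count-1) to a[],
// where b is a fixed-point value with shr fractional bits.
static void odd_powers_reference(int32_t a[], int32_t b, unsigned count, unsigned shr)
{
  if (count == 0) return;
  const int32_t b_sq = mul_shr_trunc(b, b, shr);   // b^2 in the same Q-format
  a[0] = b;
  for (unsigned k = 1; k < count; k++)
    a[k] = mul_shr_trunc(a[k - 1], b_sq, shr);     // each step raises the power by 2
}
```

With \(b = 0.5\) in Q2.30 and shr = 30, the first three outputs are 0.5, 0.125 and 0.03125 in the same format; negating \(b\) negates every output, which is the odd symmetry the typical use case relies on.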
Get a representation of the input \(x\) in normalized form A.
This function is used internally to transform a float value into a representation required for certain purposes.
In particular, this function behaves much like frexpf(), where it is guaranteed that the returned value \(a\) is either \(0\) or that \(0.5 \le \left| a \right| < 1.0\), and the output exponent \(p\) is such that \(x = a \cdot 2^{p}\).
In anticipation that future work may require alternative “normalized” representations, this form is being defined here as form A.
Parameters:
p – [out] Output exponent \(p\)
x – [in] Input value \(x\)
Throws ET_ARITHMETIC:
Raised if \(x\) or any element of \(\bar b\) is infinite or NaN.
This function updates an exponential moving average by applying a single new sample. \(x\) is taken as the previous EMA state, with \(y\) as the new sample. The EMA coefficient \(\alpha\) is applied to the term including \(x\).
coef is a fixed-point value in a UQ2.30 format (i.e. has an implied exponent of \(-30\)), and should be in the range \(0 \leq \alpha \leq 1\).
Operation Performed
\[\begin{aligned}
& a \leftarrow \alpha \cdot x + (1 - \alpha) \cdot y
\end{aligned}\]
Parameters:
x – [in] Input operand \(x\)
y – [in] Input operand \(y\)
coef – [in] EMA coefficient \(\alpha\) encoded in UQ2.30 format
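The update can be sketched with 64-bit integer arithmetic as follows. This is a hypothetical reference model rather than the library's implementation; the function name, the use of 32-bit operands, and the rounding mode applied when the 30 fractional bits are removed are all assumptions made for illustration.

```c
#include <stdint.h>

// Hypothetical reference model of a single EMA update.
// coef_q30 encodes alpha in UQ2.30: alpha = coef_q30 * 2^-30, with 0 <= alpha <= 1.
static int32_t ema_update_reference(int32_t x, int32_t y, uint32_t coef_q30)
{
  const int64_t one   = (int64_t)1 << 30;   // 1.0 in UQ2.30
  const int64_t alpha = (int64_t)coef_q30;

  // alpha * x + (1 - alpha) * y, still carrying 30 fractional bits
  const int64_t acc = alpha * x + (one - alpha) * y;

  // Remove the 30 fractional bits, rounding to nearest (ties toward +infinity).
  const int64_t biased = acc + (one >> 1);
  int64_t a = biased / one;
  if (biased % one < 0) a -= 1;             // floor for negative values
  return (int32_t)a;
}
```

With coef equal to 1.0 the previous state \(x\) is returned unchanged, and with coef equal to 0 the result is simply the new sample \(y\).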
This function computes the square root of \(x\). The result, \(a\) is returned.
The precision with which \(a\) is computed is configurable via the XMATH_BFP_SQRT_DEPTH_S32 configuration parameter. It indicates the number of most significant bits to be calculated.
Operation Performed
\[\begin{aligned}
& a \leftarrow \sqrt{x}
\end{aligned}\]
This function computes \(e^x\) for real input \(x\).
If \(x\) is known to be in the interval \(\left[-0.5,0.5\right]\), q30_exp_small() (which is used internally by this function) may be used instead for a speed boost.
Operation Performed
\[\begin{aligned}
& y \leftarrow e^x
\end{aligned}\]
This function reports the size of the number as \(a\), the number of bits required to store unsigned integer \(N\). This is equivalent to \( ceil\left(log_2\left(N\right)\right) \).
N is the input \(N\).
Operation Performed
\[\begin{split}\begin{aligned}
a \leftarrow
\begin{cases}
0 & N = 0 \\
\lceil log_2\left( N \right) \rceil & otherwise
\end{cases}
\end{aligned}\end{split}\]
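The piecewise definition above can be evaluated with a simple loop, as in the sketch below. The function name is hypothetical, and this is a reference model of the formula only, not the library's implementation (which is presumably far faster).

```c
#include <stdint.h>

// Hypothetical reference model: returns 0 for N = 0, otherwise ceil(log2(N)),
// i.e. the smallest a such that 2^a >= N.
static unsigned ceil_log2_reference(uint32_t N)
{
  if (N == 0) return 0;
  unsigned a = 0;
  while (a < 32 && ((uint64_t)1 << a) < N)
    a++;
  return a;
}
```

For instance, N = 1024 gives 10, while N = 1025 gives 11.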
Convert a 64-bit floating-point scalar to a 32-bit floating-point scalar.
Converts a 64-bit floating-point scalar, represented by the 64-bit mantissa b and exponent b_exp, into a 32-bit floating-point scalar, represented by the 32-bit returned mantissa and output exponent a_exp.
The tables below list the functions of the vector API. The “EW” column indicates whether the
operation acts element-wise.
The “Signature” column is intended as a hint which quickly conveys the kind of the conceptual inputs
to and outputs from the operation. The signatures are only intended to convey how many (conceptual)
inputs and outputs there are, and their dimensionality.
The functions themselves will typically take more arguments than these signatures indicate. For
example, most functions take vector lengths as input, and many take shift values which are used to
control growth of element bit-depth. Check the function’s full documentation to get more detailed
information.
The following symbols are used in the signatures:
Symbol – Description
\(\mathbb{S}\) – A scalar input or output value.
\(\mathbb{V}\) – A vector-valued input or output.
\(\mathbb{M}\) – A matrix-valued input or output.
\(\varnothing\) – Placeholder indicating no input or output.
For example, the operation signature \((\mathbb{V \times V \times S}) \to \mathbb{V}\) indicates
the operation takes two vector inputs and a scalar input, and the output is a vector.
\(\mathbb{V \times V \times V}\)\(\to \mathbb{V}\)
Note that several of the functions below take vectors of the split_acc_s32_t type. This
is a 32-bit vector type used for accumulating results of 8- or 16-bit operations in a manner
optimized for the XS3 VPU.
Determine whether each element of a signed 8-bit input vector is negative.
Each element \(a_k\) of 8-bit output vector \(\bar a\) is set to 1 if the corresponding element \(b_k\) of 8-bit input vector \(\bar b\) is negative, and is set to 0 otherwise.
a[] represents the 8-bit output vector \(\bar a\), with the element a[k] representing \(a_k\).
b[] represents the 8-bit input vector \(\bar b\), with the element b[k] representing \(b_k\).
Multiply-accumulate an 8-bit matrix by an 8-bit vector into 32-bit accumulators.
This function multiplies an 8-bit \(M \times N\) matrix \(\bar W\) by an 8-bit \(N\)-element column vector \(\bar v\) and adds it to the 32-bit accumulator vector \(\bar a\).
accumulators is the output vector \(\bar a\) to which the product \(\bar W\times\bar v\) is accumulated. Note that the accumulators are encoded in a format native to the xcore VPU. To initialize the accumulator vector to zeros, just zero the memory.
matrix is the matrix \(\bar W\).
input_vect is the vector \(\bar v\).
matrix and input_vect must both begin at word-aligned addresses.
M_rows and N_cols are the dimensions \(M\) and \(N\) of matrix \(\bar W\). \(M\) must be a multiple of 16, and \(N\) must be a multiple of 32.
The result of this multiplication is exact, so long as saturation does not occur.
Parameters:
accumulators – [inout] The accumulator vector \(\bar a\)
matrix – [in] The weight matrix \(\bar W\)
input_vect – [in] The input vector \(\bar v\)
M_rows – [in] The number of rows \(M\) in matrix \(\bar W\)
N_cols – [in] The number of columns \(N\) in matrix \(\bar W\)
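The mathematical operation can be sketched in plain C as below. This is a hypothetical reference model only: the accumulators here are ordinary int32_t values rather than the VPU-native encoding the library uses, the matrix is assumed to be stored row-major, the alignment and size-multiple requirements are not enforced, and no saturation is modelled.

```c
#include <stdint.h>

// Hypothetical reference model: acc[r] += sum over c of W[r][c] * v[c].
// matrix is assumed to be M_rows x N_cols in row-major order.
static void mat_mul_s8_reference(int32_t acc[], const int8_t matrix[],
                                 const int8_t input_vect[],
                                 unsigned M_rows, unsigned N_cols)
{
  for (unsigned r = 0; r < M_rows; r++) {
    int32_t sum = acc[r];  // accumulate on top of the existing value
    for (unsigned c = 0; c < N_cols; c++)
      sum += (int32_t)matrix[r * N_cols + c] * (int32_t)input_vect[c];
    acc[r] = sum;
  }
}
```

Because the function accumulates, zeroing acc[] beforehand yields the plain product \(\bar W\times\bar v\).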
Compute the element-wise absolute value of a 16-bit vector.
a[] and b[] represent the 16-bit vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in each of the vectors.
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the output vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
Compute the sum of the absolute values of elements of a 16-bit vector.
b[] represents the 16-bit vector \(\bar b\). b[] must begin at a word-aligned address.
length is the number of elements in \(\bar b\).
Operation Performed
\[\begin{aligned}
a \leftarrow \sum_{k=0}^{length-1} \left| b_k \right|
\end{aligned}\]
Block Floating-Point
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the returned value \(a\) is the 32-bit mantissa of floating-point value \(a \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
a[], b[] and c[] represent the 16-bit vectors \(\bar a\), \(\bar b\) and \(\bar c\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[] or c[].
length is the number of elements in each of the vectors.
b_shr and c_shr are the signed arithmetic right-shifts applied to each element of \(\bar b\) and \(\bar c\) respectively.
If \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\).
In this case, \(b\_shr\) and \(c\_shr\) must be chosen so that \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\). Adding or subtracting mantissas only makes sense if they are associated with the same exponent.
The function vect_s16_add_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
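The exponent-alignment requirement can be illustrated with a small host-side sketch. Everything here is hypothetical and simplified: `add_prepare_simple` ignores headroom entirely (unlike vect_s16_add_prepare()) and just reserves one extra bit for the sum, and the symmetric 16-bit saturation of the result is an assumption of this sketch.

```c
#include <stdint.h>

// Signed arithmetic right-shift; a negative shr is a left-shift.
// Assumes |shr| <= 15 so that shifted 16-bit values fit in 32 bits.
static int32_t shr_signed(int32_t v, int shr)
{
  if (shr >= 0) {
    const int32_t d = (int32_t)1 << shr;
    int32_t q = v / d;
    if (v % d < 0) q -= 1;   // floor, like an arithmetic right-shift
    return q;
  }
  return v * ((int32_t)1 << -shr);
}

// Chooses shifts so that a_exp = b_exp + b_shr = c_exp + c_shr.
static void add_prepare_simple(int *a_exp, int *b_shr, int *c_shr,
                               int b_exp, int c_exp)
{
  *a_exp = ((b_exp > c_exp) ? b_exp : c_exp) + 1;  // one spare bit for the sum
  *b_shr = *a_exp - b_exp;
  *c_shr = *a_exp - c_exp;
}

// Element-wise a[k] = (b[k] >> b_shr) + (c[k] >> c_shr), saturated to 16 bits.
static void vect_add_reference(int16_t a[], const int16_t b[], const int16_t c[],
                               unsigned length, int b_shr, int c_shr)
{
  for (unsigned k = 0; k < length; k++) {
    int32_t s = shr_signed(b[k], b_shr) + shr_signed(c[k], c_shr);
    if (s >  INT16_MAX) s =  INT16_MAX;
    if (s < -INT16_MAX) s = -INT16_MAX;
    a[k] = (int16_t)s;
  }
}
```

For example, adding \(1024 \cdot 2^{-10}\) (1.0) and \(4096 \cdot 2^{-12}\) (1.0) gives shifts of 1 and 3 and a result of \(1024 \cdot 2^{-9}\) (2.0).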
a[], b[] represent the 16-bit mantissa vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
c is the scalar \(c\) to be added to each element of \(\bar b\).
length is the number of elements in each of the vectors.
b_shr is the signed arithmetic right-shift applied to each element of \(\bar b\).
If elements of \(\bar b\) are the mantissas of BFP vector \( \bar{b} \cdot 2^{b\_exp} \), and \(c\) is the mantissa of floating-point value \(c \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\).
In this case, \(b\_shr\) and \(c\_shr\) must be chosen so that \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\). Adding or subtracting mantissas only makes sense if they are associated with the same exponent.
The function vect_s16_add_scalar_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Clamp the elements of a 16-bit vector to a specified range.
a[] and b[] represent the 16-bit vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in each of the vectors.
lower_bound and upper_bound are the lower and upper bounds of the clipping range respectively. These bounds are checked for each element of \(\bar b\) only after b_shr is applied.
b_shr is the signed arithmetic right-shift applied to elements of \(\bar b\) before being compared to the upper and lower bounds.
If \(\bar b\) are the mantissas for a BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the exponent \(a\_exp\) of the output BFP vector \(\bar{a} \cdot 2^{a\_exp}\) is given by \(a\_exp = b\_exp + b\_shr\).
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the output vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + b\_shr\).
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
lower_bound – [in] Lower bound of clipping range
upper_bound – [in] Upper bound of clipping range
b_shr – [in] Arithmetic right-shift applied to elements of \(\bar b\) prior to clipping
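The ordering of the shift and the bound checks can be seen in the following sketch. It is a hypothetical reference model with illustrative names, not the library's implementation; it assumes lower_bound is not greater than upper_bound.

```c
#include <stdint.h>

// Signed arithmetic right-shift; a negative shr is a left-shift.
// Assumes |shr| <= 15 so that shifted 16-bit values fit in 32 bits.
static int32_t shr_signed(int32_t v, int shr)
{
  if (shr >= 0) {
    const int32_t d = (int32_t)1 << shr;
    int32_t q = v / d;
    if (v % d < 0) q -= 1;   // floor, like an arithmetic right-shift
    return q;
  }
  return v * ((int32_t)1 << -shr);
}

// Hypothetical reference model of the clip operation.
static void vect_clip_reference(int16_t a[], const int16_t b[], unsigned length,
                                int16_t lower_bound, int16_t upper_bound, int b_shr)
{
  for (unsigned k = 0; k < length; k++) {
    int32_t v = shr_signed(b[k], b_shr);    // shift first...
    if (v < lower_bound) v = lower_bound;   // ...then compare against the bounds
    if (v > upper_bound) v = upper_bound;
    a[k] = (int16_t)v;
  }
}
```

Since the bounds apply to the shifted values, they are expressed relative to the output exponent \(a\_exp = b\_exp + b\_shr\), not the input exponent.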
b[] and c[] represent the 32-bit vectors \(\bar b\) and \(\bar c\) respectively. Each must begin at a word-aligned address.
length is the number of elements in each of the vectors.
Operation Performed
\[\begin{aligned}
a \leftarrow \sum_{k=0}^{length-1}\left( b_k \cdot c_k \right)
\end{aligned}\]
Block Floating-Point
If \(\bar b\) and \(\bar c\) are the mantissas of the BFP vectors \( \bar{b} \cdot 2^{b\_exp}\) and \(\bar{c}\cdot 2^{c\_exp}\), then result \(a\) is the mantissa of the result \(a \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + c\_exp\).
If needed, the bit-depth of \(a\) can then be reduced to 16 or 32 bits to get a new result \(a' \cdot 2^{a\_exp'}\) where \(a' = a \cdot 2^{-a\_shr}\) and \(a\_exp' = a\_exp + a\_shr\).
Notes
The sum \(a\) is accumulated simultaneously into 16 48-bit accumulators which are summed together at the final step. So long as length is less than roughly 2 million, no overflow or saturation of the resulting sum is possible.
Parameters:
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar b\) and \(\bar c\)
Calculate the energy (sum of squares of elements) of a 16-bit vector.
b[] represents the 16-bit vector \(\bar b\). b[] must begin at a word-aligned address.
length is the number of elements in \(\bar b\).
b_shr is the signed arithmetic right-shift applied to elements of \(\bar b\). b_shr should be chosen to avoid the possibility of saturation. See the note below.
If \(\bar b\) are the mantissas of the BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then floating-point result is \(a \cdot 2^{a\_exp}\), where the 32-bit mantissa \(a\) is returned by this function, and \(a\_exp = 2 \cdot (b\_exp + b\_shr) \).
Additional Details
If \(\bar b\) has \(b\_hr\) bits of headroom, then each product \((b_k')^2\) can be a maximum of \( 2^{30 - 2 \cdot (b\_hr + b\_shr)}\). So long as length is less than \(1 + 2\cdot (b\_hr + b\_shr) \), saturation should not be possible. Each increase of \(b\_shr\) by \(1\) doubles the number of elements that can be summed without risk of overflow.
If the caller’s mantissa vector is longer than that, the full result can be found by calling this function multiple times for partial results on sub-sequences of the input, and adding the results in user code.
In many situations the caller may have a priori knowledge that saturation is impossible (or very nearly so), in which case this guideline may be disregarded. However, such situations are application-specific and are well beyond the scope of this documentation, and as such are left to the user’s discretion.
The headroom of an N-bit integer is the number of bits that the integer’s value may be left-shifted without any information being lost. Equivalently, it is one less than the number of leading sign bits.
The headroom of an int16_t array is the minimum of the headroom of each of its int16_t elements.
This function efficiently traverses the elements of b[] to determine its headroom.
b[] represents the 16-bit vector \(\bar b\). b[] must begin at a word-aligned address.
length is the number of elements in b[].
Operation Performed
\[\begin{aligned}
a \leftarrow min\!\{ HR_{16}\left(b_0\right), HR_{16}\left(b_1\right), ...,
HR_{16}\left(b_{length-1}\right) \}
\end{aligned}\]
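The definition of headroom given above can be written directly as a loop that counts how many times a value can be doubled without leaving the 16-bit range. The sketch below is a hypothetical reference model with illustrative names; the library traverses the vector far more efficiently. Treating a zero element as having 15 bits of headroom follows from the leading-sign-bits definition.

```c
#include <stdint.h>

// Headroom of one int16_t: the number of left-shifts that lose no information,
// capped at 15 (the value obtained for 0 and -1).
static unsigned headroom_s16(int16_t v)
{
  unsigned hr = 0;
  int32_t x = v;
  while (hr < 15 && 2 * x >= INT16_MIN && 2 * x <= INT16_MAX) {
    x *= 2;
    hr++;
  }
  return hr;
}

// Headroom of a vector: the minimum headroom over all of its elements.
static unsigned vect_headroom_reference(const int16_t b[], unsigned length)
{
  unsigned min_hr = 15;
  for (unsigned k = 0; k < length; k++) {
    const unsigned h = headroom_s16(b[k]);
    if (h < min_hr) min_hr = h;
  }
  return min_hr;
}
```

For the vector {256, -4, 1000} the element headrooms are 6, 13 and 5, so the vector's headroom is 5.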
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = scale - b\_exp\).
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the returned value \(a\) is the 16-bit mantissa of floating-point value \(a \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Get the element-wise maximum of two 16-bit vectors.
a[], b[] and c[] represent the 16-bit mantissa vectors \(\bar a\), \(\bar b\) and \(\bar c\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[], but not on c[].
length is the number of elements in each of the vectors.
b_shr and c_shr are the signed arithmetic right-shifts applied to each element of \(\bar b\) and \(\bar c\) respectively.
If \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\).
The function vect_2vec_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Warning
For correct operation, this function requires at least 1 bit of headroom in each mantissa vector after the shifts have been applied.
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the returned value \(a\) is the 16-bit mantissa of floating-point value \(a \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Get the element-wise minimum of two 16-bit vectors.
a[], b[] and c[] represent the 16-bit mantissa vectors \(\bar a\), \(\bar b\) and \(\bar c\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[], but not on c[].
length is the number of elements in each of the vectors.
b_shr and c_shr are the signed arithmetic right-shifts applied to each element of \(\bar b\) and \(\bar c\) respectively.
If \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\).
The function vect_2vec_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Warning
For correct operation, this function requires at least 1 bit of headroom in each mantissa vector after the shifts have been applied.
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)
If inputs \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), and input \(\bar a\) is the accumulator BFP vector \(\bar{a} \cdot 2^{a\_exp}\), then the output values of \(\bar a\) have the exponent \(2^{a\_exp + acc\_shr}\).
For accumulation to make sense mathematically, \(bc\_sat\) must be chosen such that \( a\_exp + acc\_shr = b\_exp + c\_exp + bc\_sat \).
The function vect_complex_s16_macc_prepare() can be used to obtain values for \(a\_exp\), \(acc\_shr\) and \(bc\_sat\) based on the input exponents \(a\_exp\), \(b\_exp\) and \(c\_exp\) and the input headrooms \(a\_hr\), \(b\_hr\) and \(c\_hr\).
If inputs \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), and input \(\bar a\) is the accumulator BFP vector \(\bar{a} \cdot 2^{a\_exp}\), then the output values of \(\bar a\) have the exponent \(2^{a\_exp + acc\_shr}\).
For accumulation to make sense mathematically, \(bc\_sat\) must be chosen such that \( a\_exp + acc\_shr = b\_exp + c\_exp + bc\_sat \).
The function vect_complex_s16_nmacc_prepare() can be used to obtain values for \(a\_exp\), \(acc\_shr\) and \(bc\_sat\) based on the input exponents \(a\_exp\), \(b\_exp\) and \(c\_exp\) and the input headrooms \(a\_hr\), \(b\_hr\) and \(c\_hr\).
Multiply two 16-bit vectors together element-wise.
a[], b[] and c[] represent the 16-bit vectors \(\bar a\), \(\bar b\) and \(\bar c\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[] or c[].
length is the number of elements in each of the vectors.
a_shr is an unsigned arithmetic right-shift applied to the 32-bit accumulators holding the penultimate results.
If \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + c\_exp + a\_shr\).
The function vect_s16_mul_prepare() can be used to obtain values for \(a\_exp\) and \(a\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)
a_shr – [in] Right-shift applied to 32-bit products
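The exponent bookkeeping described above can be illustrated with a plain-C model. This is only a sketch, not the library implementation: the names sat16_model and mul_s16_model are hypothetical, and the round-half-up shift and symmetric saturation bounds are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: symmetric 16-bit saturation (the bounds are an assumption). */
static int16_t sat16_model(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < -INT16_MAX) return -INT16_MAX;
    return (int16_t)v;
}

/* Hypothetical reference model of an element-wise 16-bit multiply. */
static void mul_s16_model(int16_t a[], const int16_t b[], const int16_t c[],
                          unsigned length, unsigned a_shr)
{
    assert(a_shr < 31);
    for (unsigned k = 0; k < length; k++) {
        int32_t p = (int32_t)b[k] * (int32_t)c[k];        /* 32-bit product */
        int32_t rnd = a_shr ? (int32_t)1 << (a_shr - 1) : 0;
        /* Rounding right-shift (assumes >> on negative values is arithmetic). */
        a[k] = sat16_model((p + rnd) >> a_shr);
    }
}
```

With b_exp = c_exp = -15 and a_shr = 14, the mantissas 0x4000 and 0x4000 (0.5 each) give the mantissa 16384 with a_exp = -15 - 15 + 14 = -16, which represents 0.25.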
Rectification ensures that all outputs are non-negative, changing negative values to 0.
a[] and b[] represent the 16-bit vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in each of the vectors.
Each output element a[k] is set to the value of the corresponding input element b[k] if it is positive, and a[k] is set to zero otherwise.
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the output vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
a[] and b[] represent the 16-bit vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in each of the vectors.
c is the 16-bit scalar \(c\) by which elements of \(\bar b\) are multiplied.
a_shr is an unsigned arithmetic right-shift applied to the 32-bit accumulators holding the penultimate results.
If \(\bar b\) are the mantissas of a BFP vector \( \bar{b} \cdot 2^{b\_exp} \) and \(c\) is the mantissa of floating-point value \(c \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + c\_exp + a\_shr\).
The function vect_s16_scale_prepare() can be used to obtain values for \(a\_exp\) and \(a\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Scalar \(c\) by which elements of \(\bar b\) are multiplied
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
a_shr – [in] Right-shift applied to 32-bit products
If \(b\) is the mantissa of floating-point value \(b \cdot 2^{b\_exp}\), then the output vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input value \(b\)
length – [in] Number of elements in vector \(\bar a\)
Left-shift the elements of a 16-bit vector by a specified number of bits.
a[] and b[] represent the 16-bit vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in vectors \(\bar a\) and \(\bar b\).
b_shl is the signed arithmetic left-shift applied to each element of \(\bar b\).
If \(\bar b\) are the mantissas of a BFP vector \( \bar{b} \cdot 2^{b\_exp} \), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(\bar{a} = \bar{b} \cdot 2^{b\_shl}\) and \(a\_exp = b\_exp\).
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
b_shl – [in] Arithmetic left-shift applied to elements of \(\bar b\)
Right-shift the elements of a 16-bit vector by a specified number of bits.
a[] and b[] represent the 16-bit vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in vectors \(\bar a\) and \(\bar b\).
b_shr is the signed arithmetic right-shift applied to each element of \(\bar b\).
If \(\bar b\) are the mantissas of a BFP vector \( \bar{b} \cdot 2^{b\_exp} \), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(\bar{a} = \bar{b} \cdot 2^{-b\_shr}\) and \(a\_exp = b\_exp\).
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
b_shr – [in] Arithmetic right-shift applied to elements of \(\bar b\)
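Because b_shr is signed, a negative value acts as a left-shift. The following plain-C model of a single element is a sketch only: the name shr_s16_elem_model is hypothetical, and the saturation to symmetric 16-bit bounds is an assumption borrowed from the \(sat_{16}\) notation used elsewhere in this reference.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a signed arithmetic right-shift of one 16-bit element. */
static int16_t shr_s16_elem_model(int16_t b, int b_shr)
{
    assert(b_shr > -16 && b_shr < 16);
    int32_t v = b;
    if (b_shr >= 0) {
        v >>= b_shr;                   /* floor(b * 2^-b_shr); assumes arithmetic >> */
    } else {
        v *= (int32_t)1 << -b_shr;     /* negative b_shr behaves as a left-shift */
    }
    if (v > INT16_MAX) v = INT16_MAX;  /* saturation is an assumption of this sketch */
    if (v < -INT16_MAX) v = -INT16_MAX;
    return (int16_t)v;
}
```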
Compute the square roots of elements of a 16-bit vector.
a[] and b[] represent the 16-bit vectors \(\bar a\) and \(\bar b\) respectively. Each vector must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in each of the vectors.
b_shr is the signed arithmetic right-shift applied to elements of \(\bar b\).
depth is the number of most significant bits to calculate of each \(a_k\). For example, a depth value of 8 will compute only the most significant byte (8 bits) of the result, with the remaining byte as 0. The maximum value for this parameter is VECT_SQRT_S16_MAX_DEPTH (31). The time cost of this operation is approximately proportional to the number of bits computed.
Operation Performed
\[\begin{split}\begin{aligned}
& b_k' \leftarrow sat_{16}(\lfloor b_k \cdot 2^{-b\_shr} \rfloor) \\
& a_k \leftarrow \begin{cases}
\sqrt{ b_k' } & b_k' \ge 0 \\
0 & otherwise\end{cases} \\
& \qquad\text{ for }k\in 0\ ...\ (length-1) \\
& \qquad\text{ where } \sqrt{\cdot} \text{ computes the most significant } depth
\text{ bits of the square root.}
\end{aligned}\end{split}\]
Block Floating-Point
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = (b\_exp + b\_shr - 14)/2\).
Note that because exponents must be integers, \(b\_exp + b\_shr\) must be even.
The function vect_s16_sqrt_prepare() can be used to obtain values for \(a\_exp\) and \(b\_shr\) based on the input exponent \(b\_exp\) and headroom \(b\_hr\).
Notes
This function assumes roots are real. Negative input elements will result in corresponding outputs of 0.
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
b_shr – [in] Right-shift applied to \(\bar b\)
depth – [in] Number of bits of each output value to compute
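The role of depth can be illustrated with a bit-by-bit square root in plain C. This is a sketch under stated assumptions, not the library's algorithm: the name sqrt_s16_elem_model is hypothetical, and the scaling by \(2^{14}\) is derived from \(a\_exp = (b\_exp + b\_shr - 14)/2\), which implies \(a_k = \sqrt{b_k' \cdot 2^{14}}\). The model computes at most 15 result bits, since a non-negative 16-bit result has only 15 magnitude bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model; b_prime is the element after b_shr has been applied. */
static int16_t sqrt_s16_elem_model(int16_t b_prime, unsigned depth)
{
    assert(depth >= 1 && depth <= 15);
    if (b_prime < 0) return 0;                   /* roots are assumed real */
    uint32_t target = (uint32_t)b_prime << 14;   /* accounts for the -14 in a_exp */
    uint32_t root = 0;
    for (unsigned i = 0; i < depth; i++) {
        /* Try the next most significant bit, starting from bit 14. */
        uint32_t cand = root | ((uint32_t)1 << (14 - i));
        if (cand * cand <= target) root = cand;  /* keep the bit if it fits */
    }
    return (int16_t)root;
}
```

Each loop iteration resolves one more bit of the result, which is consistent with the time cost being roughly proportional to depth.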
a[], b[] and c[] represent the 16-bit vectors \(\bar a\), \(\bar b\) and \(\bar c\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[] or c[].
length is the number of elements in each of the vectors.
b_shr and c_shr are the signed arithmetic right-shifts applied to each element of \(\bar b\) and \(\bar c\) respectively.
If \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\).
In this case, \(b\_shr\) and \(c\_shr\) must be chosen so that \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\). Adding or subtracting mantissas only makes sense if they are associated with the same exponent.
The function vect_s16_sub_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
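One way to satisfy the constraint \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\) is sketched below. The type shift_plan_t and the function align_exponents_sketch are hypothetical and are not the library's vect_s16_sub_prepare(); the policy of reserving one extra bit for the result is an assumption of this example.

```c
#include <assert.h>

/* Hypothetical result type for this sketch. */
typedef struct { int a_exp; int b_shr; int c_shr; } shift_plan_t;

/* Hypothetical helper: pick shifts that give both inputs the same exponent. */
static shift_plan_t align_exponents_sketch(int b_exp, unsigned b_hr,
                                           int c_exp, unsigned c_hr)
{
    /* Smallest exponent each input can take without losing its top bits. */
    int b_min = b_exp - (int)b_hr;
    int c_min = c_exp - (int)c_hr;
    shift_plan_t p;
    /* One extra bit leaves room for growth in b - c (an assumed policy). */
    p.a_exp = (b_min > c_min ? b_min : c_min) + 1;
    p.b_shr = p.a_exp - b_exp;
    p.c_shr = p.a_exp - c_exp;
    assert(b_exp + p.b_shr == c_exp + p.c_shr);
    return p;
}
```

For example, b_exp = -10 with 2 bits of headroom and c_exp = -12 with no headroom yield a_exp = -11, b_shr = -1 and c_shr = 1.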
b[] represents the 16-bit vector \(\bar b\). b[] must begin at a word-aligned address.
length is the number of elements in \(\bar b\).
Operation Performed
\[\begin{aligned}
a \leftarrow \sum_{k=0}^{length-1} b_k
\end{aligned}\]
Block Floating-Point
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the returned value \(a\) is the 32-bit mantissa of floating-point value \(a \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Accumulate a 16-bit vector chunk into a 32-bit accumulator chunk.
16-bit vector chunk \(\bar b\) is shifted and accumulated into 32-bit accumulator vector chunk \(\bar a\) (acc). This function is used for efficiently accumulating multiple (possibly many) 16-bit vectors together.
The accumulator vector \(\bar a\) stores its elements across two 16-bit vector chunks, which corresponds to how the accumulators are stored internally across VPU registers vD and vR. See split_acc_s32_t for details about the accumulator structure.
The signed arithmetic right-shift b_shr is applied to \(\bar b\) prior to being accumulated into \(\bar a\). When \(\bar b\) and \(\bar a\) are the mantissas of block floating-point vectors, using b_shr allows those vectors to have different exponents. This is also important when this function is to be called periodically, where each \(\bar b\) may have a different exponent.
b_shr must meet the condition -14<=b_shr<=14 or the behavior of this function is undefined.
The input vpu_ctrl tracks the VPU’s control register state during accumulation. In particular, it is used for keeping track of the headroom of the accumulator vector \(\bar a\). When beginning a sequence of accumulation calls, the value passed in should be initialized to VPU_INT16_CTRL_INIT. On completion, this function returns the updated VPU control register state, which should be passed in as vpu_ctrl on the next accumulation call.
VPU Control Value
The idea is that each call to this function processes only a single ‘chunk’ (in 16-bit mode, a 16-element block) at a time, but the caller usually wants to know the headroom of a whole vector, which may comprise many such chunks. So vpu_ctrl is a value which persists through each of these calls to track the whole vector.
Once all chunks have been accumulated, the VPU_INT16_HEADROOM_FROM_CTRL() macro can be used to get the headroom of the accumulator vector. Note that this will produce a maximum value of 15.
If many vector chunks \(\bar b\) are accumulated into the same accumulators (when using block floating-point, it may be only a few accumulations if the exponent associated with \(\bar b\) is significantly larger than that associated with \(\bar a\)), saturation becomes possible.
Accumulating Many Values
When saturation is possible, the user must monitor the headroom of \(\bar a\) (using the returned value and VPU_INT16_HEADROOM_FROM_CTRL()) to detect when there is no further headroom. As long as there is at least 1 bit of headroom, a call to this function cannot saturate.
Typically, when using block floating-point, this will be handled by:
Right-shift the values of \(\bar a\) using vect_s32_shr()
Increment the exponent associated with \(\bar a\) by the same amount right-shifted
Convert \(\bar a\) back into the split accumulator format using vect_s32_split_accs()
When accumulating, setting b_shr to the exponent associated with \(\bar a\) minus the exponent associated with \(\bar b\) (so that \(b\_exp + b\_shr = a\_exp\)) will automatically adjust for the new exponent of \(\bar a\).
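The headroom-monitoring pattern can be modeled in plain C as follows. This is a simplified sketch, not the library API: acc_model_t, headroom_s32_model and accumulate_model are hypothetical names, the accumulators are held as ordinary int32_t values rather than in the split format, and b_shr is computed so that b_exp + b_shr equals the accumulator exponent.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_LEN 16   /* elements per chunk in 16-bit mode */

/* Hypothetical plain-C stand-in for the accumulator chunk and its exponent. */
typedef struct {
    int32_t acc[CHUNK_LEN];
    int exp;
} acc_model_t;

/* Headroom of an int32_t array: minimum over its elements. */
static unsigned headroom_s32_model(const int32_t x[], unsigned n)
{
    unsigned hr = 31;
    for (unsigned k = 0; k < n; k++) {
        uint32_t m = (uint32_t)(x[k] < 0 ? ~x[k] : x[k]);
        unsigned h = 31;
        while (m != 0) { m >>= 1; h--; }
        if (h < hr) hr = h;
    }
    return hr;
}

static void accumulate_model(acc_model_t *a, const int16_t b[CHUNK_LEN], int b_exp)
{
    /* With no headroom left, right-shift by one bit and increment the exponent. */
    if (headroom_s32_model(a->acc, CHUNK_LEN) == 0) {
        for (unsigned k = 0; k < CHUNK_LEN; k++) a->acc[k] >>= 1;
        a->exp += 1;
    }
    int b_shr = a->exp - b_exp;   /* aligns b to the accumulator exponent */
    assert(b_shr >= -14 && b_shr <= 14);
    for (unsigned k = 0; k < CHUNK_LEN; k++) {
        int32_t v = b[k];
        a->acc[k] += (b_shr >= 0) ? (v >> b_shr) : v * ((int32_t)1 << -b_shr);
    }
}
```

Because the model restores at least 1 bit of headroom before each addition and limits b_shr to the range -14 to 14, the additions in this sketch cannot overflow.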
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the resulting vector \(\bar a\) are the 32-bit mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\). If \(a\_exp = b\_exp - 8\), then this operation has effectively not changed the values represented.
Notes
The multiplication by \(2^8\) is an artifact of the VPU’s behavior. It turns out to be significantly more efficient to include the factor of \(2^8\). If this is unwanted, vect_s32_shr() can be used with a b_shr value of 8 to remove the scaling afterwards.
The headroom of output vector \(\bar a\) is not returned by this function. The headroom of the output is always 8 bits greater than the headroom of the input.
Parameters:
a – [out] 32-bit output vector \(\bar a\)
b – [in] 16-bit input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
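The \(2^8\) scaling can be seen in a plain-C model of the conversion. The function name s16_to_s32_model is hypothetical and the code is only an illustration of the arithmetic, not the VPU implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: widen each element and scale it by 2^8. */
static void s16_to_s32_model(int32_t a[], const int16_t b[], unsigned length)
{
    assert(a && b);
    for (unsigned k = 0; k < length; k++) {
        a[k] = (int32_t)b[k] * 256;   /* a_k = b_k * 2^8, so a_exp = b_exp - 8 */
    }
}
```

Pairing the output with the exponent b_exp - 8 leaves the represented values unchanged.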
Extract an 8-bit vector containing the most significant byte of a 16-bit vector.
This is a utility function used, for example, in optimizing mixed-width products. The most significant byte of each element is extracted (without rounding or saturation) and inserted into the output vector.
Extract an 8-bit vector containing the least significant byte of a 16-bit vector.
This is a utility function used, for example, in optimizing mixed-width products. The least significant byte of each element is extracted (without rounding or saturation) and inserted into the output vector.
Compute the element-wise absolute value of a 32-bit vector.
a[] and b[] represent the 32-bit vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in each of the vectors.
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the output vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the returned value \(a\) is the 64-bit mantissa of floating-point value \(a \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Additional Details
Internally the sum accumulates into 8 separate 40-bit accumulators. These accumulators apply symmetric 40-bit saturation logic (with bounds \(\pm (2^{39}-1)\)) with each added element. At the end, the 8 accumulators are summed together into the 64-bit value \(a\) which is returned by this function. No saturation logic is applied at this final step.
Because symmetric 32-bit saturation logic is applied when computing the absolute value, in the corner case where each element is INT32_MIN, each of the 8 accumulators can accumulate \(256\) elements before saturation is possible. Therefore, with \(b\_hr\) bits of headroom, no saturation of intermediate results is possible with fewer than \(2^{11 + b\_hr}\) elements in \(\bar b\).
If the length of \(\bar b\) is greater than \(2^{11 + b\_hr}\), the sum can be computed piece-wise in several calls to this function, with the partial results summed in user code.
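A piece-wise sum might be organized as in the sketch below. Both function names are hypothetical: abs_sum_piece_model stands in for a single call to the library routine on one piece, and abs_sum_piecewise keeps each piece strictly shorter than \(2^{11 + b\_hr}\) elements.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one call on a single piece of the vector. */
static int64_t abs_sum_piece_model(const int32_t b[], unsigned length)
{
    int64_t sum = 0;
    for (unsigned k = 0; k < length; k++) {
        int64_t v = b[k];
        sum += (v < 0) ? -v : v;
    }
    return sum;
}

/* Split the input into pieces short enough to rule out intermediate saturation. */
static int64_t abs_sum_piecewise(const int32_t b[], unsigned length, unsigned b_hr)
{
    assert(b_hr <= 19);                           /* keeps the shift in range */
    const unsigned max_piece = (1u << (11 + b_hr)) - 1;
    int64_t total = 0;
    while (length > 0) {
        unsigned n = (length > max_piece) ? max_piece : length;
        total += abs_sum_piece_model(b, n);       /* partial results add in user code */
        b += n;
        length -= n;
    }
    return total;
}
```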
a[], b[] and c[] represent the 32-bit mantissa vectors \(\bar a\), \(\bar b\) and \(\bar c\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[] or c[].
length is the number of elements in each of the vectors.
b_shr and c_shr are the signed arithmetic right-shifts applied to each element of \(\bar b\) and \(\bar c\) respectively.
If \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\).
In this case, \(b\_shr\) and \(c\_shr\) must be chosen so that \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\). Adding or subtracting mantissas only makes sense if they are associated with the same exponent.
The function vect_s32_add_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
a[], b[] represent the 32-bit mantissa vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
c is the scalar \(c\) to be added to each element of \(\bar b\).
length is the number of elements in each of the vectors.
b_shr is the signed arithmetic right-shift applied to each element of \(\bar b\).
If elements of \(\bar b\) are the mantissas of BFP vector \( \bar{b} \cdot 2^{b\_exp} \), and \(c\) is the mantissa of floating-point value \(c \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\).
In this case, \(b\_shr\) and \(c\_shr\) must be chosen so that \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\). Adding or subtracting mantissas only makes sense if they are associated with the same exponent.
The function vect_s32_add_scalar_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Clamp the elements of a 32-bit vector to a specified range.
a[] and b[] represent the 32-bit vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in each of the vectors.
lower_bound and upper_bound are the lower and upper bounds of the clipping range respectively. These bounds are checked for each element of \(\bar b\) only after b_shr is applied.
b_shr is the signed arithmetic right-shift applied to elements of \(\bar b\) before being compared to the upper and lower bounds.
If \(\bar b\) are the mantissas for a BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the exponent \(a\_exp\) of the output BFP vector \(\bar{a} \cdot 2^{a\_exp}\) is given by \(a\_exp = b\_exp + b\_shr\).
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the output vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + b\_shr\).
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
lower_bound – [in] Lower bound of clipping range
upper_bound – [in] Upper bound of clipping range
b_shr – [in] Arithmetic right-shift applied to elements of \(\bar b\) prior to clipping
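The order of operations (shift first, then compare against the bounds) can be sketched for a single element in plain C. The name clip_s32_elem_model is hypothetical and the code is a model of the description above, not the library routine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: apply b_shr, then clamp to [lower_bound, upper_bound]. */
static int32_t clip_s32_elem_model(int32_t b, int32_t lower_bound,
                                   int32_t upper_bound, int b_shr)
{
    assert(lower_bound <= upper_bound);
    assert(b_shr > -32 && b_shr < 32);
    int64_t v = b;
    /* Negative b_shr acts as a left-shift; assumes >> on negatives is arithmetic. */
    v = (b_shr >= 0) ? (v >> b_shr) : v * ((int64_t)1 << -b_shr);
    if (v < lower_bound) v = lower_bound;
    if (v > upper_bound) v = upper_bound;
    return (int32_t)v;
}
```

Since the bounds are compared against the shifted values, they are expressed with the output exponent \(a\_exp = b\_exp + b\_shr\).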
If \(\bar b\) and \(\bar c\) are the mantissas of the BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c}\cdot 2^{c\_exp}\), then the returned value \(a\) is the 64-bit mantissa of the result \(a \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + c\_exp + b\_shr + c\_shr + 30\).
If needed, the bit-depth of \(a\) can then be reduced to 32 bits to get a new result \(a' \cdot 2^{a\_exp'}\) where \(a' = a \cdot 2^{-a\_shr}\) and \(a\_exp' = a\_exp + a\_shr\).
The function vect_s32_dot_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Additional Details
The 30-bit rounding right-shift applied to each of the 64-bit products \(b_k \cdot c_k\) is a feature of the hardware and cannot be avoided. As such, if the input vectors \(\bar b\) and \(\bar c\) together have too much headroom (i.e. \(b\_hr + c\_hr\)), the sum may effectively vanish. To avoid this situation, negative values of b_shr and c_shr may be used (with the stipulation that \(b\_shr \ge -b\_hr\) and \(c\_shr \ge -c\_hr\) if saturation of \(b_k'\) and \(c_k'\) is to be avoided). The less headroom \(b_k'\) and \(c_k'\) have, the greater the precision of the final result.
Internally, each product \((b_k' \cdot c_k' \cdot 2^{-30})\) accumulates into one of eight 40-bit accumulators (which are all used simultaneously) which apply symmetric 40-bit saturation logic (with bounds \(\approx 2^{39}\)) with each value added. The saturating arithmetic employed is not associative and no indication is given if saturation occurs at an intermediate step. To avoid saturation errors, length should be no greater than \(2^{10+b\_hr+c\_hr}\), where \(b\_hr\) and \(c\_hr\) are the headroom of \(\bar b\) and \(\bar c\) respectively.
If the caller’s mantissa vectors are longer than that, the full inner product can be found by calling this function multiple times for partial inner products on sub-sequences of the input vectors, and adding the results in user code.
In many situations the caller may have a priori knowledge that saturation is impossible (or very nearly so), in which case this guideline may be disregarded. However, such situations are application-specific and are well beyond the scope of this documentation, and as such are left to the user’s discretion.
Parameters:
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar b\) and \(\bar c\)
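The effect of the 30-bit rounding right-shift on the result exponent can be illustrated with a plain-C model. The names dot_s32_model and dot_s32_exponent are hypothetical; the model takes the already-shifted elements \(b_k'\) and \(c_k'\), uses a single 64-bit accumulator, and does not reproduce the 40-bit saturation of the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the inner product of already-shifted mantissas. */
static int64_t dot_s32_model(const int32_t b[], const int32_t c[], unsigned length)
{
    assert(b && c);
    int64_t acc = 0;
    for (unsigned k = 0; k < length; k++) {
        int64_t p = (int64_t)b[k] * c[k];        /* 64-bit product */
        /* 30-bit rounding right-shift (assumes >> on negatives is arithmetic). */
        acc += (p + ((int64_t)1 << 29)) >> 30;
    }
    return acc;
}

/* Exponent associated with the returned mantissa. */
static int dot_s32_exponent(int b_exp, int c_exp, int b_shr, int c_shr)
{
    return b_exp + c_exp + b_shr + c_shr + 30;
}
```

For instance, two single-element vectors holding the mantissa \(2^{30}\) with exponent -30 (the value 1.0) produce the mantissa \(2^{30}\) with exponent -30, which is again 1.0.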
If \(\bar b\) are the mantissas of the BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then floating-point result is \(a \cdot 2^{a\_exp}\), where the 64-bit mantissa \(a\) is returned by this function, and \(a\_exp = 30 + 2 \cdot (b\_exp + b\_shr) \).
The function vect_s32_energy_prepare() can be used to obtain values for \(a\_exp\) and \(b\_shr\) based on the input exponent \(b\_exp\) and the input headroom \(b\_hr\).
Additional Details
The 30-bit rounding right-shift applied to each element of the 64-bit products \((b_k')^2\) is a feature of the hardware and cannot be avoided. As such, if the input vector \(\bar b\) has too much headroom (i.e. \(2\cdot b\_hr\)), the sum may effectively vanish. To avoid this situation, negative values of b_shr may be used (with the stipulation that \(b\_shr \ge -b\_hr\) if saturation of \(b_k'\) is to be avoided). The less headroom \(b_k'\) has, the greater the precision of the final result.
Internally, each product \((b_k')^2 \cdot 2^{-30}\) accumulates into one of eight 40-bit accumulators (which are all used simultaneously) which apply symmetric 40-bit saturation logic (with bounds \(\approx 2^{39}\)) with each value added. The saturating arithmetic employed is not associative and no indication is given if saturation occurs at an intermediate step. To avoid saturation errors, length should be no greater than \(2^{10+2\cdot b\_hr}\), where \(b\_hr\) is the headroom of \(\bar b\).
If the caller’s mantissa vector is longer than that, the full result can be found by calling this function multiple times for partial results on sub-sequences of the input, and adding the results in user code.
In many situations the caller may have a priori knowledge that saturation is impossible (or very nearly so), in which case this guideline may be disregarded. However, such situations are application-specific and are well beyond the scope of this documentation, and as such are left to the user’s discretion.
The headroom of an N-bit integer is the number of bits that the integer’s value may be left-shifted without any information being lost. Equivalently, it is one less than the number of leading sign bits.
The headroom of an int32_t array is the minimum of the headroom of each of its int32_t elements.
This function efficiently traverses the elements of x[] to determine its headroom.
x[] represents the 32-bit vector \(\bar x\). x[] must begin at a word-aligned address.
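The definition of headroom given above can be written out directly in plain C. The function headroom_s32_sketch is a hypothetical, unoptimized model for illustration; the library routine is expected to be far more efficient.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: headroom of an int32_t array by the definition above. */
static unsigned headroom_s32_sketch(const int32_t x[], unsigned length)
{
    assert(x != 0 || length == 0);
    unsigned hr = 31;                     /* 32 sign bits minus one, as for zero */
    for (unsigned k = 0; k < length; k++) {
        /* Map negatives to non-negatives with the same number of sign bits. */
        uint32_t m = (uint32_t)(x[k] < 0 ? ~x[k] : x[k]);
        unsigned h = 31;
        while (m != 0) { m >>= 1; h--; }  /* each significant bit costs one bit */
        if (h < hr) hr = h;               /* array headroom is the minimum */
    }
    return hr;
}
```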
Compute the inverse of elements of a 32-bit vector.
a[] and b[] represent the 32-bit mantissa vectors \(\bar a\) and \(\bar b\) respectively. Each vector must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in each of the vectors.
scale is a scaling parameter used to maximize the precision of the result.
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = scale - b\_exp\).
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the returned value \(a\) is the 32-bit mantissa of floating-point value \(a \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Get the element-wise maximum of two 32-bit vectors.
a[], b[] and c[] represent the 32-bit mantissa vectors \(\bar a\), \(\bar b\) and \(\bar c\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[], but not on c[].
length is the number of elements in each of the vectors.
b_shr and c_shr are the signed arithmetic right-shifts applied to each element of \(\bar b\) and \(\bar c\) respectively.
If \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\).
The function vect_2vec_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Warning
For correct operation, this function requires at least 1 bit of headroom in each mantissa vector after the shifts have been applied.
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the returned value \(a\) is the 32-bit mantissa of floating-point value \(a \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Get the element-wise minimum of two 32-bit vectors.
a[], b[] and c[] represent the 32-bit mantissa vectors \(\bar a\), \(\bar b\) and \(\bar c\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[], but not on c[].
length is the number of elements in each of the vectors.
b_shr and c_shr are the signed arithmetic right-shifts applied to each element of \(\bar b\) and \(\bar c\) respectively.
If \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\).
The function vect_2vec_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Warning
For correct operation, this function requires at least 1 bit of headroom in each mantissa vector after the shifts have been applied.
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)
Multiply one 32-bit vector element-wise by another.
a[], b[] and c[] represent the 32-bit mantissa vectors \(\bar a\), \(\bar b\) and \(\bar c\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[] or c[].
length is the number of elements in each of the vectors.
b_shr and c_shr are the signed arithmetic right-shifts applied to each element of \(\bar b\) and \(\bar c\) respectively.
If \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + c\_exp + b\_shr + c\_shr + 30\).
The function vect_s32_mul_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)
If inputs \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), and input \(\bar a\) is the accumulator BFP vector \(\bar{a} \cdot 2^{a\_exp}\), then the output values of \(\bar a\) have the exponent \(a\_exp + acc\_shr\).
For accumulation to make sense mathematically, \(bc\_sat\) must be chosen such that \( a\_exp + acc\_shr = b\_exp + c\_exp + bc\_sat \).
The function vect_complex_s16_macc_prepare() can be used to obtain values for \(a\_exp\), \(acc\_shr\) and \(bc\_sat\) based on the input exponents \(a\_exp\), \(b\_exp\) and \(c\_exp\) and the input headrooms \(a\_hr\), \(b\_hr\) and \(c\_hr\).
If inputs \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), and input \(\bar a\) is the accumulator BFP vector \(\bar{a} \cdot 2^{a\_exp}\), then the output values of \(\bar a\) have the exponent \(a\_exp + acc\_shr\).
For accumulation to make sense mathematically, \(bc\_sat\) must be chosen such that \( a\_exp + acc\_shr = b\_exp + c\_exp + bc\_sat \).
The function vect_complex_s16_macc_prepare() can be used to obtain values for \(a\_exp\), \(acc\_shr\) and \(bc\_sat\) based on the input exponents \(a\_exp\), \(b\_exp\) and \(c\_exp\) and the input headrooms \(a\_hr\), \(b\_hr\) and \(c\_hr\).
a[] and b[] represent the 32-bit mantissa vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in each of the vectors.
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the output vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
a[] and b[] represent the 32-bit mantissa vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in each of the vectors.
c is the 32-bit scalar \(c\) by which each element of \(\bar b\) is multiplied.
b_shr and c_shr are the signed arithmetic right-shifts applied to each element of \(\bar b\) and to \(c\).
If \(\bar b\) are the mantissas of a BFP vector \( \bar{b} \cdot 2^{b\_exp} \) and \(c\) is the mantissa of floating-point value \(c \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + c\_exp + b\_shr + c\_shr + 30\).
The function vect_s32_scale_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Set all elements of a 32-bit vector to the specified value.
a[] represents the 32-bit output vector \(\bar a\). a[] must begin at a word-aligned address.
b is the new value to set each element of \(\bar a\) to.
Operation Performed
\[\begin{split}\begin{aligned}
& a_k \leftarrow b \\
& \qquad\text{ for }k\in 0\ ...\ (length-1)
\end{aligned}\end{split}\]
Block Floating-Point
If \(b\) is the mantissa of floating-point value \(b \cdot 2^{b\_exp}\), then the output vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Left-shift the elements of a 32-bit vector by a specified number of bits.
a[] and b[] represent the 32-bit vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in vectors \(\bar a\) and \(\bar b\).
b_shl is the signed arithmetic left-shift applied to each element of \(\bar b\).
If \(\bar b\) are the mantissas of a BFP vector \( \bar{b} \cdot 2^{b\_exp} \), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(\bar{a} = \bar{b} \cdot 2^{b\_shl}\) and \(a\_exp = b\_exp\).
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
b_shl – [in] Arithmetic left-shift applied to elements of \(\bar b\)
Right-shift the elements of a 32-bit vector by a specified number of bits.
a[] and b[] represent the 32-bit vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in vectors \(\bar a\) and \(\bar b\).
b_shr is the signed arithmetic right-shift applied to each element of \(\bar b\).
If \(\bar b\) are the mantissas of a BFP vector \( \bar{b} \cdot 2^{b\_exp} \), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(\bar{a} = \bar{b} \cdot 2^{-b\_shr}\) and \(a\_exp = b\_exp\).
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
b_shr – [in] Arithmetic right-shift applied to elements of \(\bar b\)
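The signed shift convention used by the left-shift and right-shift operations above can be illustrated with a plain-C model of a single element. This is only a sketch: the helper names are hypothetical, and saturating a left-shift to the symmetric 32-bit range is an assumption made here, not something this section states.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-element model of a signed arithmetic right-shift.
   A negative shr is treated as a left-shift, with the result saturated
   to the symmetric range [-INT32_MAX, INT32_MAX] (an assumption). */
static int32_t shr_s32_model(int32_t x, int shr)
{
    assert(shr > -32 && shr < 32);
    if (shr >= 0) {
        /* Floor division by 2^shr without relying on >> of negative values. */
        return (x >= 0) ? (x >> shr) : ~((~x) >> shr);
    }
    int64_t wide = (int64_t)x * ((int64_t)1 << -shr);
    if (wide > INT32_MAX)  return INT32_MAX;
    if (wide < -INT32_MAX) return -INT32_MAX;
    return (int32_t)wide;
}

/* A left-shift by shl is the same operation as a right-shift by -shl. */
static int32_t shl_s32_model(int32_t x, int shl)
{
    return shr_s32_model(x, -shl);
}

/* Apply the shift to a whole vector; in-place use (a == b) is fine. */
static void vect_shl_model(int32_t a[], const int32_t b[],
                           unsigned length, int shl)
{
    for (unsigned k = 0; k < length; k++)
        a[k] = shl_s32_model(b[k], shl);
}
```

Because the exponent is unchanged by these operations, shifting the mantissas left by one bit doubles the values the BFP vector represents.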
Compute the square root of elements of a 32-bit vector.
a[] and b[] represent the 32-bit mantissa vectors \(\bar a\) and \(\bar b\) respectively. Each vector must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in each of the vectors.
b_shr is the signed arithmetic right-shift applied to elements of \(\bar b\).
depth is the number of most significant bits to calculate of each \(a_k\). For example, a depth value of 8 will only compute the 8 most significant bits (the top byte) of the result, with the remaining 3 bytes as 0. The maximum value for this parameter is VECT_SQRT_S32_MAX_DEPTH (31). The time cost of this operation is approximately proportional to the number of bits computed.
Operation Performed
\[\begin{split}\begin{aligned}
& b_k' \leftarrow sat_{32}(\lfloor b_k \cdot 2^{-b\_shr} \rfloor) \\
& a_k \leftarrow \sqrt{ b_k' } \\
& \qquad\text{ for }k\in 0\ ...\ (length-1) \\
& \qquad\text{ where } sqrt() \text{ computes the first } depth \text{ bits of the square root.}
\end{aligned}\end{split}\]
Block Floating-Point
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = (b\_exp + b\_shr - 30)/2\).
Note that because exponents must be integers, \(b\_exp + b\_shr\) must be even.
The function vect_s32_sqrt_prepare() can be used to obtain values for \(a\_exp\) and \(b\_shr\) based on the input exponent \(b\_exp\) and headroom \(b\_hr\).
Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
b_shr – [in] Right-shift applied to \(\bar b\)
depth – [in] Number of bits of each output value to compute
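The effect of depth can be sketched with a bit-by-bit square root in plain C. The exponent relation \(a\_exp = (b\_exp + b\_shr - 30)/2\) implies \(a_k^2 = b_k' \cdot 2^{30}\), and the model below uses that scaling. The function names are hypothetical, and returning 0 for non-positive inputs is an assumption of this sketch rather than documented behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: compute the top `depth` bits of a = sqrt(b * 2^30),
   which is the scaling implied by a_exp = (b_exp + b_shr - 30) / 2.
   Non-positive inputs return 0 here (an assumption of this sketch). */
static int32_t sqrt_s32_model(int32_t b, unsigned depth)
{
    assert(depth >= 1 && depth <= 31);
    if (b <= 0) return 0;

    const uint64_t target = (uint64_t)b << 30;
    uint32_t a = 0;
    /* Decide one result bit per iteration, from bit 30 downwards. */
    for (unsigned i = 0; i < depth; i++) {
        uint32_t trial = a | (UINT32_C(1) << (30 - i));
        if ((uint64_t)trial * trial <= target) a = trial;
    }
    return (int32_t)a;
}

/* Output exponent for a given input exponent and shift; the sum must be even. */
static int sqrt_exp_model(int b_exp, int b_shr)
{
    assert(((b_exp + b_shr) % 2) == 0);
    return (b_exp + b_shr - 30) / 2;
}
```

Each loop iteration settles one more bit of the result, which is why the cost grows roughly in proportion to depth.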
a[], b[] and c[] represent the 32-bit mantissa vectors \(\bar a\), \(\bar b\) and \(\bar c\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[] or c[].
length is the number of elements in each of the vectors.
b_shr and c_shr are the signed arithmetic right-shifts applied to each element of \(\bar b\) and \(\bar c\) respectively.
If \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\).
In this case, \(b\_shr\) and \(c\_shr\) must be chosen so that \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\). Adding or subtracting mantissas only makes sense if they are associated with the same exponent.
The function vect_s32_sub_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
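To make the exponent-matching requirement concrete, the sketch below picks a common output exponent and the two shifts from the input exponents and headrooms. It is a hypothetical stand-in written for illustration, not the library's prepare function. The type and function names are invented, and reserving one extra bit for the carry is an assumption of this sketch.

```c
#include <assert.h>

/* Hypothetical plan for an add or subtract: a shared output exponent and
   the shifts that map both inputs onto it. */
typedef struct { int a_exp; int b_shr; int c_shr; } addsub_plan_t;

static addsub_plan_t addsub_plan(int b_exp, unsigned b_hr,
                                 int c_exp, unsigned c_hr)
{
    /* Smallest exponent each input can take without losing its top bits. */
    int b_min = b_exp - (int)b_hr;
    int c_min = c_exp - (int)c_hr;

    addsub_plan_t p;
    /* One extra bit leaves room for the carry of a sum or difference. */
    p.a_exp = ((b_min > c_min) ? b_min : c_min) + 1;
    p.b_shr = p.a_exp - b_exp;
    p.c_shr = p.a_exp - c_exp;

    /* Both inputs now share the output exponent. */
    assert(b_exp + p.b_shr == c_exp + p.c_shr);
    return p;
}
```

For example, with \(b\_exp = -10\), \(b\_hr = 2\), \(c\_exp = -14\) and \(c\_hr = 0\), this gives \(a\_exp = -11\), \(b\_shr = -1\) and \(c\_shr = 3\).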
b[] represents the 32-bit mantissa vector \(\bar b\). b[] must begin at a word-aligned address.
length is the number of elements in \(\bar b\).
Operation Performed
\[\begin{aligned}
a \leftarrow \sum_{k=0}^{length-1} b_k
\end{aligned}\]
Block Floating-Point
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the returned value \(a\) is the 64-bit mantissa of floating-point value \(a \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Additional Details
Internally, each element accumulates into one of eight 40-bit accumulators (which are all used simultaneously) which apply symmetric 40-bit saturation logic (with bounds \(\approx 2^{39}\)) with each value added. The saturating arithmetic employed is not associative and no indication is given if saturation occurs at an intermediate step. To avoid the possibility of saturation errors, length should be no greater than \(2^{11+b\_hr}\), where \(b\_hr\) is the headroom of \(\bar b\).
If the caller’s mantissa vector is longer than that, the full result can be found by calling this function multiple times for partial results on sub-sequences of the input, and adding the results in user code.
In many situations the caller may have a priori knowledge that saturation is impossible (or very nearly so), in which case this guideline may be disregarded. However, such situations are application-specific and are well beyond the scope of this documentation, and as such are left to the user’s discretion.
Parameters:
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vector \(\bar b\)
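The guideline above (partial sums over sub-sequences no longer than \(2^{11+b\_hr}\) elements, added together in user code) might look like the following. This is a host-side sketch: sum_chunk_model is a plain 64-bit loop standing in for the library call, and all names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the library's vector sum; a plain 64-bit loop, used only so
   that this sketch compiles on any host. */
static int64_t sum_chunk_model(const int32_t b[], size_t length)
{
    int64_t acc = 0;
    for (size_t k = 0; k < length; k++) acc += b[k];
    return acc;
}

/* Largest chunk length that follows the 2^(11 + b_hr) guideline. */
static size_t max_chunk_len(unsigned b_hr)
{
    assert(b_hr <= 31);
    unsigned bits = 11 + b_hr;
    /* Clamp so the shift stays valid on hosts with a 32-bit size_t. */
    if (bits > 30) bits = 30;
    return (size_t)1 << bits;
}

/* Sum a long vector as a series of partial sums added in user code. */
static int64_t sum_in_chunks(const int32_t b[], size_t length, unsigned b_hr)
{
    const size_t chunk = max_chunk_len(b_hr);
    int64_t total = 0;
    for (size_t start = 0; start < length; start += chunk) {
        size_t n = length - start;
        if (n > chunk) n = chunk;
        total += sum_chunk_model(&b[start], n);
    }
    return total;
}
```

With no headroom (\(b\_hr = 0\)) the chunk length is 2048 elements, and each additional bit of headroom doubles it.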
Interleave the elements of two vectors into a single vector.
Elements of 32-bit input vectors \(\bar b\) and \(\bar c\) are interleaved into 32-bit output vector \(\bar a\). Each element of \(\bar b\) has a right-shift of \(b\_shr\) applied, and each element of \(\bar c\) has a right-shift of \(c\_shr\) applied.
Alternatively (and equivalently), this function can be conceived of as taking two real vectors \(\bar b\) and \(\bar c\) and forming a new complex vector \(\bar a\) where \(\bar{a} = \bar{b} + i\cdot\bar{c}\).
If vectors \(\bar b\) and \(\bar c\) each have \(N\) elements, then the resulting \(\bar a\) will have either \(2N\) int32_t elements or (equivalently) \(N\) complex_s32_t elements (and must have space for such).
Each element \(b_k\) of \(\bar b\) will end up as element \(a_{2k}\) of \(\bar a\) (with the bit-shift applied). Each element \(c_k\) will end up as element \(a_{2k+1}\) of \(\bar a\) (also with its bit-shift applied).
a[] is the output vector \(\bar a\).
b[] and c[] are the input vectors \(\bar b\) and \(\bar c\) respectively.
a, b and c must each begin at a double word-aligned (8 byte) address. (see DWORD_ALIGNED).
length is the number \(N\) of int32_t elements in \(\bar b\) and \(\bar c\).
b_shr is the signed arithmetic right-shift applied to elements of \(\bar b\).
c_shr is the signed arithmetic right-shift applied to elements of \(\bar c\).
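The index mapping described above can be written out as a short reference loop. This is a sketch only: zip_model and shr_floor are hypothetical names, and for brevity the model accepts non-negative shifts only, whereas the documented parameters are signed.

```c
#include <assert.h>
#include <stdint.h>

/* Floor-rounding right-shift for shr >= 0. Negative shifts (left-shifts)
   are omitted to keep this sketch short. */
static int32_t shr_floor(int32_t x, unsigned shr)
{
    assert(shr < 32);
    return (x >= 0) ? (x >> shr) : ~((~x) >> shr);
}

/* Hypothetical model of the interleave: a[2k] <- b[k] >> b_shr and
   a[2k+1] <- c[k] >> c_shr. a[] must have room for 2*length elements. */
static void zip_model(int32_t a[], const int32_t b[], const int32_t c[],
                      unsigned length, unsigned b_shr, unsigned c_shr)
{
    for (unsigned k = 0; k < length; k++) {
        a[2 * k]     = shr_floor(b[k], b_shr);
        a[2 * k + 1] = shr_floor(c[k], c_shr);
    }
}
```

Viewed as complex data, a[2k] holds the real part and a[2k+1] the imaginary part of element k.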
Deinterleave the real and imaginary parts of a complex 32-bit vector into two separate vectors.
Complex 32-bit input vector \(\bar c\) has its real and imaginary parts (which correspond to the even and odd-indexed elements, if reinterpreted as an int32_t array) split apart to create real 32-bit output vectors \(\bar a\) and \(\bar b\), such that \(\bar{a} = Re\{\bar{c}\}\) and \(\bar{b} = Im\{\bar{c}\}\).
a[] and b[] are the real output vectors \(\bar a\) and \(\bar b\) which receive the real and imaginary parts respectively of \(\bar c\). a and b must each begin at a word-aligned address.
c[] is the complex input vector \(\bar c\). c must begin at a double word-aligned address.
length is the number \(N\) of int32_t elements in \(\bar a\) and \(\bar b\) and the number of complex_s32_t in \(\bar c\).
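The deinterleave is the inverse mapping, sketched below in plain C. The struct is an illustrative stand-in for the complex element type, and the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative complex element type: real part first, imaginary second. */
typedef struct { int32_t re; int32_t im; } cplx_s32_model_t;

/* Hypothetical model of the deinterleave: a[k] <- Re{c[k]}, b[k] <- Im{c[k]}. */
static void unzip_model(int32_t a[], int32_t b[],
                        const cplx_s32_model_t c[], unsigned length)
{
    assert(a != b);  /* the two outputs must be distinct buffers */
    for (unsigned k = 0; k < length; k++) {
        a[k] = c[k].re;
        b[k] = c[k].im;
    }
}
```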
32-bit input vector \(\bar x\) is convolved with a short fixed-point kernel \(\bar b\) to produce 32-bit output vector \(\bar y\). In other words, this function applies the \(K\)th-order FIR filter with coefficients given by \(\bar b\) to the input signal \(\bar x\). The convolution is “valid” in the sense that no output elements are emitted where the filter taps extend beyond the bounds of the input vector, resulting in an output vector \(\bar y\) with fewer elements.
The maximum filter order \(K\) supported by this function is \(7\).
y[] is the output vector \(\bar y\). If input \(\bar x\) has \(N\) elements, and the filter has \(K\) elements, then \(\bar y\) has \(N-2P\) elements, where \(P = \lfloor K / 2 \rfloor\).
x[] is the input vector \(\bar x\) with length \(N\).
b_q30[] is the vector \(\bar b\) of filter coefficients. The coefficients of \(\bar b\) are encoded in a Q2.30 fixed-point format. The effective value of the \(i\)th coefficient is then \(b_i \cdot 2^{-30}\).
x_length is the length \(N\) of \(\bar x\) in elements.
b_length is the length \(K\) of \(\bar b\) in elements (i.e. the number of filter taps). b_length must be one of \( \{ 1, 3, 5, 7 \} \).
To avoid the possibility of saturating any output elements, \(\bar b\) may be constrained such that \( \sum_{i=0}^{K-1} \left|b_i\right| \leq 2^{30} \).
This operation can be applied safely in-place on x[].
Parameters:
y – [out] Output vector \(\bar y\)
x – [in] Input vector \(\bar x\)
b_q30 – [in] Filter coefficient vector \(\bar b\)
x_length – [in] The number of elements \(N\) in vector \(\bar x\)
b_length – [in] The number of elements \(K\) in \(\bar b\)
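A host-side reference for the "valid" mode helps show where the \(N-2P\) output length comes from. The sketch makes several assumptions: the taps are applied without reversal (as in the formula given for the "same" mode), the coefficients respect \( \sum \left|b_i\right| \leq 2^{30} \) so the 64-bit accumulator cannot overflow, and the scaling truncates toward zero. The library's exact rounding is not modelled, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 64-bit value to the 32-bit range. */
static int32_t sat_s32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Hypothetical model of a "valid" convolution with Q2.30 taps:
   y[k] = sum_l x[k + l] * b[l] * 2^-30, for k in 0 .. N - 2P - 1.
   Returns the number of output elements written. */
static unsigned convolve_valid_model(int32_t y[], const int32_t x[],
                                     const int32_t b_q30[],
                                     unsigned x_length, unsigned b_length)
{
    assert(b_length == 1 || b_length == 3 || b_length == 5 || b_length == 7);
    assert(x_length >= b_length);

    const unsigned P = b_length / 2;
    const unsigned y_length = x_length - 2 * P;

    for (unsigned k = 0; k < y_length; k++) {
        int64_t acc = 0;
        for (unsigned l = 0; l < b_length; l++)
            acc += (int64_t)x[k + l] * b_q30[l];
        /* Scale back from Q2.30; division truncates toward zero here. */
        y[k] = sat_s32(acc / ((int64_t)1 << 30));
    }
    return y_length;
}
```

Output y[k] only reads x[k] onwards, which is consistent with the statement that the operation is safe in-place on x[].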
32-bit input vector \(\bar x\) is convolved with a short fixed-point kernel \(\bar b\) to produce 32-bit output vector \(\bar y\). In other words, this function applies the \(K\)th-order FIR filter with coefficients given by \(\bar b\) to the input signal \(\bar x\). The convolution mode is “same” in that the input vector is effectively padded such that the input and output vectors are the same length. The padding behavior is one of those given by pad_mode_e.
The maximum filter order \(K\) supported by this function is \(7\).
y[] and x[] are the output and input vectors \(\bar y\) and \(\bar x\) respectively.
b_q30[] is the vector \(\bar b\) of filter coefficients. The coefficients of \(\bar b\) are encoded in a Q2.30 fixed-point format. The effective value of the \(i\)th coefficient is then \(b_i \cdot 2^{-30}\).
x_length is the length \(N\) of \(\bar x\) and \(\bar y\) in elements.
b_length is the length \(K\) of \(\bar b\) in elements (i.e. the number of filter taps). b_length must be one of \( \{ 1, 3, 5, 7 \} \).
padding_mode is one of the values from the pad_mode_e enumeration. The padding mode indicates the filter input values for filter taps that have extended beyond the bounds of the input vector \(\bar x\). See pad_mode_e for a list of supported padding modes and associated behaviors.
Operation Performed
\[\begin{split}\begin{aligned}
& \tilde{x}_i = \begin{cases}
\text{determined by padding mode} & i < 0 \\
\text{determined by padding mode} & i \ge N \\
x_i & otherwise \end{cases} \\
& y_k \leftarrow \sum_{l=0}^{K-1} (\tilde{x}_{(k+l-P)} \cdot b_l \cdot 2^{-30} ) \\
& \qquad\text{ for }k\in 0\ ...\ (N-1) \\
& \qquad\text{ where }P = \lfloor K/2 \rfloor
\end{aligned}\end{split}\]
Additional Details
To avoid the possibility of saturating any output elements, \(\bar b\) may be constrained such that \( \sum_{i=0}^{K-1} \left|b_i\right| \leq 2^{30} \).
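The operation above can be modelled on a host with zero padding standing in for the padding mode. Zero padding is used here purely as an illustrative choice; the behaviours actually available are those listed in pad_mode_e. The function names are hypothetical, the scaling truncates toward zero, and the model requires distinct input and output buffers.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 64-bit value to the 32-bit range. */
static int32_t sat_s32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Hypothetical "same"-mode model: out-of-range inputs are read as zero and
   y[] has the same length as x[]. */
static void convolve_same_zero_model(int32_t y[], const int32_t x[],
                                     const int32_t b_q30[],
                                     unsigned x_length, unsigned b_length)
{
    assert(b_length == 1 || b_length == 3 || b_length == 5 || b_length == 7);
    assert(y != x);  /* this model is not safe in-place */

    const int P = (int)(b_length / 2);
    for (unsigned k = 0; k < x_length; k++) {
        int64_t acc = 0;
        for (unsigned l = 0; l < b_length; l++) {
            int i = (int)k + (int)l - P;
            int32_t xi = (i < 0 || i >= (int)x_length) ? 0 : x[i];
            acc += (int64_t)xi * b_q30[l];
        }
        /* Scale back from Q2.30; division truncates toward zero here. */
        y[k] = sat_s32(acc / ((int64_t)1 << 30));
    }
}
```

With taps (0.25, 0.5, 0.25) and input {4, 8, 12, 16, 20}, the interior outputs match the "valid" result and only the first and last elements are affected by the padding.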
Merge a vector of split 32-bit accumulators into a vector of int32_t’s.
Convert a vector of split_acc_s32_t into a vector of int32_t. This is useful when a function (e.g. mat_mul_s8_x_s8_yield_s32) outputs a vector of accumulators in the XS3 VPU’s native split 32-bit format, which has the upper half of each accumulator in the first 32 bytes and the lower half in the following 32 bytes.
This function is most efficient (in terms of cycles/accumulator) when length is a multiple of 16. In any case, length will be rounded up such that a multiple of 16 accumulators will always be merged.
This function can safely merge accumulators in-place.
Split a vector of int32_t’s into a vector of split_acc_s32_t.
Convert a vector of int32_t into a vector of split_acc_s32_t, the native format for the XS3 VPU’s 32-bit accumulators. This is useful when a function (e.g. mat_mul_s8_x_s8_yield_s32) takes in a vector of accumulators in that native format.
This function is most efficient (in terms of cycles/accumulator) when length is a multiple of 16. In any case, length will be rounded up such that a multiple of 16 accumulators will always be split.
This function can safely split accumulators in-place.
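The split layout described above (upper halves in the first 32 bytes, lower halves in the following 32 bytes) can be pictured with a small host-side model of one group of 16 accumulators. The struct and its field names are hypothetical and serve only to illustrate the byte layout; they are not taken from the library headers.

```c
#include <assert.h>
#include <stdint.h>

#define ACC_GROUP 16  /* accumulators per 64-byte group (32 bytes + 32 bytes) */

/* Illustrative layout only; the field names hi and lo are hypothetical. */
typedef struct {
    int16_t  hi[ACC_GROUP];  /* upper 16 bits of each accumulator */
    uint16_t lo[ACC_GROUP];  /* lower 16 bits of each accumulator */
} split_acc_model_t;

/* Split 16 int32_t values into the two-half layout. */
static void split_model(split_acc_model_t *dst, const int32_t src[ACC_GROUP])
{
    for (int k = 0; k < ACC_GROUP; k++) {
        uint32_t u = (uint32_t)src[k];
        dst->hi[k] = (int16_t)(uint16_t)(u >> 16);
        dst->lo[k] = (uint16_t)(u & 0xFFFFu);
    }
}

/* Merge the two halves back into 16 int32_t values. */
static void merge_model(int32_t dst[ACC_GROUP], const split_acc_model_t *src)
{
    for (int k = 0; k < ACC_GROUP; k++) {
        uint32_t u = ((uint32_t)(uint16_t)src->hi[k] << 16) | src->lo[k];
        dst[k] = (int32_t)u;
    }
}
```

Splitting and then merging returns the original values, which is the round trip an application performs when it prepares accumulators for a VPU routine and reads them back afterwards.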
Compute a power series sum on a vector of Q2.30 values.
This function is used to compute a power series summation on a vector \(\bar b\). \(\bar b\) contains Q2.30 values. \(\bar c\) is a vector containing coefficients to be multiplied by powers of \(\bar b\), and may have any associated exponent. The output is vector \(\bar a\) and has the same exponent as \(\bar c\).
c[] is an array with shape (term_count,VPU_INT32_EPV), where the second axis contains the same value replicated across all VPU_INT32_EPV elements. That is, c[k][i]=c[k][j] for i and j in 0..(VPU_INT32_EPV-1). This is for performance reasons. (For the purpose of this explanation, \(\bar c\) is considered to be single-dimensional, without redundancy.)
Compute the logarithm (in the specified base) of a vector of float_s32_t.
This function computes the logarithm of a vector \(\bar b\) of float_s32_t values. The base of the computed logarithm is given by parameter inv_ln_base_q30. The result is written to output \(\bar a\), a vector of Q8.24 values.
If the desired base is \(D\), then inv_ln_base_q30, represented here by \(R\), should be \(\mathtt{Q30}\left(\frac{1}{ln\left(D\right)}\right)\). That is: the inverse of the natural logarithm of the desired base, expressed as a Q2.30 value. Typically the desired base is known at compile time, so this value will usually be a precomputed constant.
The resulting \(a_k\) for \(b_k \le 0\) is undefined.
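As a worked example of \(\mathtt{Q30}\left(\frac{1}{ln\left(D\right)}\right)\), the sketch below converts \(1/ln(2)\) and \(1/ln(10)\) to Q2.30. The helper name to_q30 and the macro names are hypothetical. The decimal constants are written out so that the block does not depend on <math.h>.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a non-negative value below 2 to Q2.30 with round-to-nearest. */
static int32_t to_q30(double x)
{
    assert(x >= 0.0 && x < 2.0);
    int64_t q = (int64_t)(x * 1073741824.0 + 0.5);  /* 1073741824 = 2^30 */
    if (q > INT32_MAX) q = INT32_MAX;
    return (int32_t)q;
}

/* 1/ln(D) for two common bases, as decimal constants. */
#define INV_LN_2   1.4426950408889634   /* 1 / ln(2)  */
#define INV_LN_10  0.4342944819032518   /* 1 / ln(10) */
```

For instance, to_q30(INV_LN_2) evaluates to 0x5C551D95, which would serve as a precomputed constant for a base-2 logarithm.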
Compute the natural logarithm of a vector of float_s32_t.
This function computes the natural logarithm of a vector \(\bar b\) of float_s32_t values. The result is written to output \(\bar a\), a vector of Q8.24 values.
The resulting \(a_k\) for \(b_k \le 0\) is undefined.
Compute the base 2 logarithm of a vector of float_s32_t.
This function computes the base 2 logarithm of a vector \(\bar b\) of float_s32_t values. The result is written to output \(\bar a\), a vector of Q8.24 values.
The resulting \(a_k\) for \(b_k \le 0\) is undefined.
Compute the base 10 logarithm of a vector of float_s32_t.
This function computes the base 10 logarithm of a vector \(\bar b\) of float_s32_t values. The result is written to output \(\bar a\), a vector of Q8.24 values.
The resulting \(a_k\) for \(b_k \le 0\) is undefined.
Compute the logarithm (in the specified base) of a block floating-point vector.
This function computes the logarithm of the block floating-point vector \(\bar{b}\cdot 2^{b\_exp}\). The base of the computed logarithm is given by parameter inv_ln_base_q30. The result is written to output \(\bar a\), a vector of Q8.24 values.
If the desired base is \(D\), then inv_ln_base_q30, represented here by \(R\), should be \(\mathtt{Q30}\left(\frac{1}{ln\left(D\right)}\right)\). That is: the inverse of the natural logarithm of the desired base, expressed as a Q2.30 value. Typically the desired base is known at compile time, so this value will usually be a precomputed constant.
The resulting \(a_k\) for \(b_k \le 0\) is undefined.
Compute the natural logarithm of a block floating-point vector.
This function computes the natural logarithm of the block floating-point vector \(\bar{b}\cdot 2^{b\_exp}\). The result is written to output \(\bar a\), a vector of Q8.24 values.
The resulting \(a_k\) for \(b_k \le 0\) is undefined.
Compute the base 2 logarithm of a block floating-point vector.
This function computes the base 2 logarithm of the block floating-point vector \(\bar{b}\cdot 2^{b\_exp}\). The result is written to output \(\bar a\), a vector of Q8.24 values.
The resulting \(a_k\) for \(b_k \le 0\) is undefined.
Compute the base 10 logarithm of a block floating-point vector.
This function computes the base 10 logarithm of the block floating-point vector \(\bar{b}\cdot 2^{b\_exp}\). The result is written to output \(\bar a\), a vector of Q8.24 values.
The resulting \(a_k\) for \(b_k \le 0\) is undefined.
This function computes \(e^{b_k \cdot 2^{-30}}\) for each \(b_k\) in input vector \(\bar b\). The results are placed in output vector \(\bar a\) as Q2.30 values.
This function is meant to compute \(e^x\) for values of \(x\) in the interval \( \left[-0.5, 0.5\right] \). The error grows quickly outside of this range.
This function converts a 32-bit mantissa vector \(\bar b\) into a 16-bit mantissa vector \(\bar a\). Conceptually, the output BFP vector \(\bar{a}\cdot 2^{a\_exp}\) represents the same values as the input BFP vector \(\bar{b}\cdot 2^{b\_exp}\), only with a reduced bit-depth.
In most cases \(b\_shr\) should be \(16 - b\_hr\), where \(b\_hr\) is the headroom of the 32-bit input mantissa vector \(\bar b\).
The output exponent \(a\_exp\) will be given by
\( a\_exp = b\_exp + b\_shr \)
Parameter Details
a[] represents the 16-bit output mantissa vector \(\bar a\).
b[] represents the 32-bit input mantissa vector \(\bar b\).
a[] and b[] must each begin at a word-aligned address.
length is the number of elements in each of the vectors.
b_shr is the signed arithmetic right-shift applied to elements of \(\bar b\).
If \(\bar b\) are the 32-bit mantissas of a BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the resulting vector \(\bar a\) are the 16-bit mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + b\_shr\).
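The choice \(b\_shr = 16 - b\_hr\) can be checked with a small host-side model. The names below are hypothetical, the shift is restricted to non-negative values, and flooring plus symmetric 16-bit saturation are assumptions of the sketch rather than documented behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Headroom of one 32-bit value: how far it can be left-shifted losslessly. */
static unsigned hr_s32_model(int32_t x)
{
    uint32_t u = (x < 0) ? ~(uint32_t)x : (uint32_t)x;
    unsigned hr = 0;
    while (hr < 31 && !(u & UINT32_C(0x40000000))) { u <<= 1; hr++; }
    return hr;
}

/* Hypothetical model of the depth reduction for 0 <= b_shr < 32.
   Returns the output exponent a_exp = b_exp + b_shr. */
static int s32_to_s16_model(int16_t a[], const int32_t b[], unsigned length,
                            int b_exp, unsigned b_shr)
{
    assert(b_shr < 32);
    for (unsigned k = 0; k < length; k++) {
        int32_t v = (b[k] >= 0) ? (b[k] >> b_shr) : ~((~b[k]) >> b_shr);
        if (v > INT16_MAX)  v = INT16_MAX;    /* saturate to 16 bits */
        if (v < -INT16_MAX) v = -INT16_MAX;
        a[k] = (int16_t)v;
    }
    return b_exp + (int)b_shr;
}
```

An input whose largest element is 0x00010000 has 14 bits of headroom, so \(b\_shr = 2\) and that element becomes 0x4000, leaving the 16-bit output with no headroom.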
Perform forward FFT on a vector of IEEE754 floats.
This function takes real input vector \(\bar x\) and performs a forward FFT on the signal in-place to get output vector \(\bar{X} = FFT\{\bar{x}\}\). This implementation is accelerated by converting the IEEE754 float vector into a block floating-point representation to compute the FFT. The resulting BFP spectrum is then converted back to IEEE754 single-precision floats. The operation is performed in-place on x[].
Whereas the input x[] is an array of fft_length float elements, the output (placed in x[]) is an array of fft_length/2 complex_float_t elements, so the input should be cast after calling this.
    const unsigned FFT_N = 512;
    float time_series[FFT_N] = { ... };
    fft_f32_forward(time_series, FFT_N);
    complex_float_t* freq_spectrum = (complex_float_t*) &time_series[0];
    const unsigned FREQ_BINS = FFT_N / 2;
    // e.g. freq_spectrum[FREQ_BINS-1].re
This function takes complex input vector \(\bar X\) and performs an inverse real FFT on the spectrum in-place to get output vector \(\bar{x} = IFFT\{\bar{X}\}\). This implementation is accelerated by converting the IEEE754 float vector into a block floating-point representation to compute the IFFT. The resulting BFP signal is then converted back to IEEE754 single-precision floats. The operation is performed in-place on X[].
Get the maximum (32-bit BFP) exponent from a vector of IEEE754 floats.
This function is used to determine the BFP exponent to use when converting a vector of IEEE754 single-precision floats into a 32-bit BFP vector.
The exponent returned, if used with vect_f32_to_vect_s32(), is the one which will result in no headroom in the BFP vector — that is, the minimum permissible exponent for the BFP vector. The minimum permissible exponent is derived from the maximum exponent found in the float elements themselves.
More specifically, the FSEXP instruction is used on each element to determine its exponent. The value returned is the maximum exponent given by the FSEXP instruction plus 30.
If required, when converting to a 32-bit BFP vector, additional headroom can be included by adding the amount of required headroom to the exponent returned by this function.
Parameters:
b – [in] Input vector of IEEE754 single-precision floats \(\bar b\)
Convert a vector of IEEE754 single-precision floats into a 32-bit BFP vector.
This function converts a vector of IEEE754 single-precision floats \(\bar b\) into the mantissa vector \(\bar a\) of a 32-bit BFP vector, given BFP vector exponent \(a\_exp\). Conceptually, the elements of output vector \(\bar{a} \cdot 2^{a\_exp}\) represent the same values as those of the input vector.
Because the output exponent \(a\_exp\) is shared by all elements of the output vector, even though the output vector has 32-bit mantissas, precision may be lost on some elements if the exponents of the input elements \(b_k\) span a wide range.
The function vect_f32_max_exponent() can be used to determine the value for \(a\_exp\) which minimizes headroom of the output vector.
Compute the inner product of two IEEE754 float vectors.
This function takes two vectors of IEEE754 single-precision floats and computes their inner product — the sum of the elementwise products. The FMACC instruction is used, granting full precision in the addition.
This function takes two vectors \(\bar b\) and \(\bar c\) of complex IEEE754 single-precision floats and computes the element-wise sum of the two vectors.
a[] is the output vector \(\bar a\) into which results are placed.
b[] and c[] are the complex input vectors \(\bar b\) and \(\bar c\) respectively.
a, b and c each must begin at a double-word-aligned address.
This operation can be performed safely in-place on b[] or c[].
Multiplies together two complex IEEE754 float vectors.
This function takes two complex float vectors \(\bar b\) and \(\bar c\) as inputs. Each output element \(a_k\) is computed as \(b_k\) multiplied by \(c_k\) (using complex multiplication).
a[] is the output vector \(\bar a\) into which results are placed.
b[] and c[] are the complex input vectors \(\bar b\) and \(\bar c\) respectively.
a, b and c each must begin at a double-word-aligned address.
This operation can be performed safely in-place on b[] or c[].
Conjugate multiplies together two complex IEEE754 float vectors.
This function takes two complex float vectors \(\bar b\) and \(\bar c\) as inputs. Each output element \(a_k\) is computed as \(b_k\) multiplied by the complex conjugate of \(c_k\) (using complex multiplication).
a[] is the output vector \(\bar a\) into which results are placed.
b[] and c[] are the complex input vectors \(\bar b\) and \(\bar c\) respectively.
a, b and c each must begin at a double-word-aligned address.
This operation can be performed safely in-place on b[] or c[].
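The conjugate multiplication can be spelled out per element as \(a_k = b_k \cdot c_k^*\), that is \(Re\{a_k\} = Re\{b_k\}Re\{c_k\} + Im\{b_k\}Im\{c_k\}\) and \(Im\{a_k\} = Im\{b_k\}Re\{c_k\} - Re\{b_k\}Im\{c_k\}\). The following sketch is a scalar reference for that arithmetic. The struct and function names are hypothetical stand-ins.

```c
#include <assert.h>

/* Illustrative complex float element; the name is hypothetical. */
typedef struct { float re; float im; } cplx_f32_model_t;

/* Scalar reference: a[k] = b[k] * conj(c[k]). Inputs are read into locals
   first, so a may alias b or c. */
static void conj_mul_model(cplx_f32_model_t a[], const cplx_f32_model_t b[],
                           const cplx_f32_model_t c[], unsigned length)
{
    assert(a && b && c);
    for (unsigned k = 0; k < length; k++) {
        const float br = b[k].re, bi = b[k].im;
        const float cr = c[k].re, ci = c[k].im;
        a[k].re = br * cr + bi * ci;
        a[k].im = bi * cr - br * ci;
    }
}
```

For \(b_k = 1 + 2i\) and \(c_k = 3 + 4i\) this yields \(11 + 2i\).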
Adds the product of two complex IEEE754 float vectors to a third float vector.
This function takes three complex float vectors \(\bar a\), \(\bar b\) and \(\bar c\) as inputs. Each output element \(a_k\) is computed as input \(a_k\) plus \(b_k\) multiplied by \(c_k\).
a[] is accumulator vector \(\bar a\), serving as both input and output.
b[] and c[] are the complex input vectors \(\bar b\) and \(\bar c\) respectively.
a, b and c each must begin at a double-word-aligned address.
Adds the product of one complex IEEE754 float vector and the complex conjugate of another to a third float vector.
This function takes three complex float vectors \(\bar a\), \(\bar b\) and \(\bar c\) as inputs. Each output element \(a_k\) is computed as input \(a_k\) plus \(b_k\) multiplied by the complex conjugate of \(c_k\).
a[] is accumulator vector \(\bar a\), serving as both input and output.
b[] and c[] are the complex input vectors \(\bar b\) and \(\bar c\) respectively.
a, b and c each must begin at a double-word-aligned address.
Convert a 32-bit BFP vector into a vector of IEEE754 single-precision floats.
This function converts a 32-bit mantissa vector and exponent \(\bar b \cdot 2^{b\_exp}\) into a vector of 32-bit IEEE754 single-precision floating-point elements \(\bar a\). Conceptually, the elements of output vector \(\bar a\) represent the same values as those of the input vector.
Because IEEE754 single-precision floats hold fewer mantissa bits, this operation may result in a loss of precision for some elements.
a_real[] and a_imag[] together represent the complex 16-bit output mantissa vector \(\bar a\). Each \(Re\{a_k\}\) is a_real[k], and each \(Im\{a_k\}\) is a_imag[k].
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\). Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
c_real[] and c_imag[] together represent the complex 16-bit input mantissa vector \(\bar c\). Each \(Re\{c_k\}\) is c_real[k], and each \(Im\{c_k\}\) is c_imag[k].
Each of the input vectors must begin at a word-aligned address. This operation can be performed safely in-place on inputs b_real[], b_imag[], c_real[] and c_imag[].
length is the number of elements in each of the vectors.
b_shr and c_shr are the signed arithmetic right-shifts applied to each element of \(\bar b\) and \(\bar c\) respectively.
If \(\bar b\) and \(\bar c\) are the complex 16-bit mantissas of BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the complex 16-bit mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\).
In this case, \(b\_shr\) and \(c\_shr\) must be chosen so that \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\). Adding or subtracting mantissas only makes sense if they are associated with the same exponent.
The function vect_complex_s16_add_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
a[] and b[] represent the complex 16-bit mantissa vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
c is the complex scalar \(c\) to be added to each element of \(\bar b\).
length is the number of elements in each of the vectors.
b_shr is the signed arithmetic right-shift applied to each element of \(\bar b\).
If elements of \(\bar b\) are the complex mantissas of BFP vector \( \bar{b} \cdot 2^{b\_exp}\), and \(c\) is the mantissa of floating-point value \(c \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\).
In this case, \(b\_shr\) and \(c\_shr\) must be chosen so that \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\). Adding or subtracting mantissas only makes sense if they are associated with the same exponent.
The function vect_complex_s16_add_scalar_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Multiply one complex 16-bit vector element-wise by the complex conjugate of another.
a_real[] and a_imag[] together represent the complex 16-bit output mantissa vector \(\bar a\). Each \(Re\{a_k\}\) is a_real[k], and each \(Im\{a_k\}\) is a_imag[k].
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\). Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
c_real[] and c_imag[] together represent the complex 16-bit input mantissa vector \(\bar c\). Each \(Re\{c_k\}\) is c_real[k], and each \(Im\{c_k\}\) is c_imag[k].
Each of the input vectors must begin at a word-aligned address. This operation can be performed safely in-place on inputs b_real[], b_imag[], c_real[] and c_imag[].
length is the number of elements in each of the vectors.
a_shr is the unsigned arithmetic right-shift applied to the 32-bit accumulators holding the penultimate results.
If \(\bar b\) and \(\bar c\) are the complex 16-bit mantissas of BFP vectors \(\bar{b} \cdot 2^{b\_exp}\) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the complex 16-bit mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + c\_exp + a\_shr\).
The function vect_complex_s16_mul_prepare() can be used to obtain values for \(a\_exp\) and \(a\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
The headroom of an N-bit integer is the number of bits that the integer’s value may be left-shifted without any information being lost. Equivalently, it is one less than the number of leading sign bits.
The headroom of a complex_s16_t struct is the minimum of the headroom of each of its 16-bit fields, re and im.
The headroom of a complex_s16_t array is the minimum of the headroom of each of its complex_s16_t elements.
This function efficiently traverses the elements of \(\bar b\) to determine its headroom.
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\).
length is the number of elements in b_real[] and b_imag[].
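The headroom definitions above translate directly into a short reference loop. This is a plain-C sketch with hypothetical names, not the optimized traversal the library performs. Returning 15 for an empty vector is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Headroom of one 16-bit value: one less than its number of leading sign bits. */
static unsigned hr_s16_model(int16_t x)
{
    uint16_t u = (x < 0) ? (uint16_t)~x : (uint16_t)x;
    unsigned hr = 0;
    while (hr < 15 && !(u & 0x4000u)) { u = (uint16_t)(u << 1); hr++; }
    return hr;
}

/* Headroom of a complex 16-bit vector stored as separate real and imaginary
   arrays: the minimum headroom over every real and imaginary part. */
static unsigned complex_s16_headroom_model(const int16_t b_real[],
                                           const int16_t b_imag[],
                                           unsigned length)
{
    assert(length == 0 || (b_real && b_imag));
    unsigned hr = 15;
    for (unsigned k = 0; k < length; k++) {
        unsigned hr_re = hr_s16_model(b_real[k]);
        unsigned hr_im = hr_s16_model(b_imag[k]);
        if (hr_re < hr) hr = hr_re;
        if (hr_im < hr) hr = hr_im;
    }
    return hr;
}
```

The element with the largest magnitude determines the result, so a single value such as 0x0800 limits the whole vector to 3 bits of headroom.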
Compute the magnitude of each element of a complex 16-bit vector.
a[] represents the real 16-bit output mantissa vector \(\bar a\).
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\). Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
Each of the input vectors must begin at a word-aligned address. This operation can be performed safely in-place on inputs b_real[] or b_imag[].
length is the number of elements in each of the vectors.
b_shr is the signed arithmetic right-shift applied to elements of \(\bar b\).
rot_table must point to a pre-computed table of complex vectors used in calculating the magnitudes. table_rows is the number of rows in the table. This library is distributed with a default version of the required rotation table. The following symbols can be used to refer to it in user code:
Faster computation (with reduced precision) can be achieved by generating a smaller version of the table. A python script is provided to generate this table.
If \(\bar b\) are the complex 16-bit mantissas of a BFP vector \( \bar{b} \cdot 2^{b\_exp} \), then the resulting vector \(\bar a\) are the real 16-bit mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + b\_shr\).
The function vect_complex_s16_mag_prepare() can be used to obtain values for \(a\_exp\) and \(b\_shr\) based on the input exponent \(b\_exp\) and headroom \(b\_hr\).
Multiply one complex 16-bit vector element-wise by another, and add the result to an accumulator.
acc_real[] and acc_imag[] together represent the complex 16-bit accumulator mantissa vector \(\bar a\). Each \(Re\{a_k\}\) is acc_real[k], and each \(Im\{a_k\}\) is acc_imag[k].
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\). Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
c_real[] and c_imag[] together represent the complex 16-bit input mantissa vector \(\bar c\). Each \(Re\{c_k\}\) is c_real[k], and each \(Im\{c_k\}\) is c_imag[k].
Each of the input vectors must begin at a word-aligned address.
length is the number of elements in each of the vectors.
acc_shr is the signed arithmetic right-shift applied to the accumulators \(a_k\).
bc_sat is the unsigned arithmetic right-shift applied to the product of \(b_k\) and \(c_k\) before being added to the accumulator.
If inputs \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot 2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), and input \(\bar a\) is the accumulator BFP vector \(\bar{a} \cdot 2^{a\_exp}\), then the output values of \(\bar a\) have the exponent \(a\_exp + acc\_shr\).
For accumulation to make sense mathematically, \(bc\_sat\) must be chosen such that \( a\_exp + acc\_shr = b\_exp + c\_exp + bc\_sat \).
The function vect_complex_s16_macc_prepare() can be used to obtain values for \(a\_exp\), \(acc\_shr\) and \(bc\_sat\) based on the input exponents \(a\_exp\), \(b\_exp\) and \(c\_exp\) and the input headrooms \(a\_hr\), \(b\_hr\) and \(c\_hr\).
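The exponent constraint for accumulation can be rearranged to give \(bc\_sat\) once the other quantities are fixed. The helper below is a hypothetical illustration of that arithmetic only; it does not choose \(acc\_shr\) from the headrooms the way a prepare function would.

```c
#include <assert.h>

/* Solve a_exp + acc_shr = b_exp + c_exp + bc_sat for bc_sat.
   bc_sat is an unsigned shift, so the result must not be negative. */
static int macc_bc_sat(int a_exp, int acc_shr, int b_exp, int c_exp)
{
    int bc_sat = (a_exp + acc_shr) - (b_exp + c_exp);
    assert(bc_sat >= 0);
    return bc_sat;
}
```

For example, with \(a\_exp = -20\), \(acc\_shr = 1\), \(b\_exp = -14\) and \(c\_exp = -15\), the products carry exponent \(-29\) and must be shifted right by \(bc\_sat = 10\) bits to line up with the accumulator's new exponent of \(-19\).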
Multiply one complex 16-bit vector element-wise by another, and subtract the result from an accumulator.
acc_real[] and acc_imag[] together represent the complex 16-bit accumulator mantissa vector \(\bar a\). Each \(Re\{a_k\}\) is acc_real[k], and each \(Im\{a_k\}\) is acc_imag[k].
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\). Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
c_real[] and c_imag[] together represent the complex 16-bit input mantissa vector \(\bar c\). Each \(Re\{c_k\}\) is c_real[k], and each \(Im\{c_k\}\) is c_imag[k].
Each of the input vectors must begin at a word-aligned address.
length is the number of elements in each of the vectors.
acc_shr is the signed arithmetic right-shift applied to the accumulators \(a_k\).
bc_sat is the unsigned arithmetic right-shift applied to the product of \(b_k\) and \(c_k\) before being subtracted from the accumulator.
If inputs \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot
2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), and input \(\bar a\) is the accumulator BFP vector \(\bar{a} \cdot 2^{a\_exp}\), then the output values of \(\bar a\) have the exponent \(2^{a\_exp + acc\_shr}\).
For accumulation to make sense mathematically, \(bc\_sat\) must be chosen such that \(
a\_exp + acc\_shr = b\_exp + c\_exp + bc\_sat \).
The function vect_complex_s16_nmacc_prepare() can be used to obtain values for \(a\_exp\), \(acc\_shr\) and \(bc\_sat\) based on the input exponents \(a\_exp\), \(b\_exp\) and \(c\_exp\) and the input headrooms \(a\_hr\), \(b\_hr\) and \(c\_hr\).
Multiply one complex 16-bit vector element-wise by the complex conjugate of another, and add the result to an accumulator.
acc_real[] and acc_imag[] together represent the complex 16-bit accumulator mantissa vector \(\bar a\). Each \(Re\{a_k\}\) is acc_real[k], and each \(Im\{a_k\}\) is acc_imag[k].
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\). Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
c_real[] and c_imag[] together represent the complex 16-bit input mantissa vector \(\bar c\). Each \(Re\{c_k\}\) is c_real[k], and each \(Im\{c_k\}\) is c_imag[k].
Each of the input vectors must begin at a word-aligned address.
length is the number of elements in each of the vectors.
acc_shr is the signed arithmetic right-shift applied to the accumulators \(a_k\).
bc_sat is the unsigned arithmetic right-shift applied to the product of \(b_k\) and \(c_k^*\) before being added to the accumulator.
If inputs \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot
2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), and input \(\bar a\) is the accumulator BFP vector \(\bar{a} \cdot 2^{a\_exp}\), then the output values of \(\bar a\) have the exponent \(2^{a\_exp + acc\_shr}\).
For accumulation to make sense mathematically, \(bc\_sat\) must be chosen such that \(
a\_exp + acc\_shr = b\_exp + c\_exp + bc\_sat \).
The function vect_complex_s16_macc_prepare() can be used to obtain values for \(a\_exp\), \(acc\_shr\) and \(bc\_sat\) based on the input exponents \(a\_exp\), \(b\_exp\) and \(c\_exp\) and the input headrooms \(a\_hr\), \(b\_hr\) and \(c\_hr\).
Multiply one complex 16-bit vector element-wise by the complex conjugate of another, and subtract the result from an accumulator.
acc_real[] and acc_imag[] together represent the complex 16-bit accumulator mantissa vector \(\bar a\). Each \(Re\{a_k\}\) is acc_real[k], and each \(Im\{a_k\}\) is acc_imag[k].
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\). Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
c_real[] and c_imag[] together represent the complex 16-bit input mantissa vector \(\bar c\). Each \(Re\{c_k\}\) is c_real[k], and each \(Im\{c_k\}\) is c_imag[k].
Each of the input vectors must begin at a word-aligned address.
length is the number of elements in each of the vectors.
acc_shr is the signed arithmetic right-shift applied to the accumulators \(a_k\).
bc_sat is the unsigned arithmetic right-shift applied to the product of \(b_k\) and \(c_k^*\) before being subtracted from the accumulator.
If inputs \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot
2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), and input \(\bar a\) is the accumulator BFP vector \(\bar{a} \cdot 2^{a\_exp}\), then the output values of \(\bar a\) have the exponent \(2^{a\_exp + acc\_shr}\).
For accumulation to make sense mathematically, \(bc\_sat\) must be chosen such that \(
a\_exp + acc\_shr = b\_exp + c\_exp + bc\_sat \).
The function vect_complex_s16_macc_prepare() can be used to obtain values for \(a\_exp\), \(acc\_shr\) and \(bc\_sat\) based on the input exponents \(a\_exp\), \(b\_exp\) and \(c\_exp\) and the input headrooms \(a\_hr\), \(b\_hr\) and \(c\_hr\).
Multiply one complex 16-bit vector element-wise by another.
a_real[] and a_imag[] together represent the complex 16-bit output mantissa vector \(\bar a\). Each \(Re\{a_k\}\) is a_real[k], and each \(Im\{a_k\}\) is a_imag[k].
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\). Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
c_real[] and c_imag[] together represent the complex 16-bit input mantissa vector \(\bar c\). Each \(Re\{c_k\}\) is c_real[k], and each \(Im\{c_k\}\) is c_imag[k].
Each of the input vectors must begin at a word-aligned address. This operation can be performed safely in-place on inputs b_real[], b_imag[], c_real[] and c_imag[].
length is the number of elements in each of the vectors.
a_shr is the unsigned arithmetic right-shift applied to the 32-bit accumulators holding intermediate results.
If \(\bar b\) and \(\bar c\) are the complex 16-bit mantissas of BFP vectors \(\bar{b} \cdot 2^{b\_exp}\) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + c\_exp + a\_shr\).
The function vect_complex_s16_mul_prepare() can be used to obtain values for \(a\_exp\) and \(a\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
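The exponent relation \(a\_exp = b\_exp + c\_exp + a\_shr\) can be checked on a single element. The sketch below is illustrative only: the type cplx16_fp_t and the function ref_complex_s16_mul_one() are hypothetical names, the shift truncates, and the caller is assumed to pick an a_shr large enough for the result to fit in 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container: one complex 16-bit mantissa with its exponent. */
typedef struct {
    int16_t re;
    int16_t im;
    int exp;
} cplx16_fp_t;

/* Multiply two complex mantissas and track the resulting exponent. */
cplx16_fp_t ref_complex_s16_mul_one(cplx16_fp_t b, cplx16_fp_t c, unsigned a_shr)
{
    assert(a_shr < 32);
    /* Full-precision complex product. */
    int64_t re = (int64_t)b.re * c.re - (int64_t)b.im * c.im;
    int64_t im = (int64_t)b.re * c.im + (int64_t)b.im * c.re;
    /* Bring the product back into 16 bits. */
    re >>= a_shr;
    im >>= a_shr;
    assert(re >= INT16_MIN && re <= INT16_MAX);
    assert(im >= INT16_MIN && im <= INT16_MAX);
    /* Each bit shifted out raises the exponent by one. */
    cplx16_fp_t a = { (int16_t)re, (int16_t)im, b.exp + c.exp + (int)a_shr };
    return a;
}
```

With b = 0x4000 at exponent -14 (the value 1.0) and c = 0x4000i at exponent -14 (the value i), choosing a_shr = 14 gives the mantissa 0x4000i at exponent -14, which is again i.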
Multiply a complex 16-bit vector element-wise by a real 16-bit vector.
a_real[] and a_imag[] together represent the complex 16-bit output mantissa vector \(\bar a\). Each \(Re\{a_k\}\) is a_real[k], and each \(Im\{a_k\}\) is a_imag[k].
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\). Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
c_real[] represents the real 16-bit input mantissa vector \(\bar c\).
Each of the input vectors must begin at a word-aligned address. This operation can be performed safely in-place on inputs b_real[], b_imag[] and c_real[].
length is the number of elements in each of the vectors.
a_shr is the unsigned arithmetic right-shift applied to the 32-bit accumulators holding the penultimate results.
If \(\bar b\) are the complex 16-bit mantissas of a BFP vector \(\bar{b} \cdot 2^{b\_exp}\) and \(\bar c\) are the real 16-bit mantissas of a BFP vector \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + c\_exp + a\_shr\).
The function vect_s16_real_mul_prepare() can be used to obtain values for \(a\_exp\) and \(a\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Multiply a complex 16-bit vector by a real scalar.
a_real[] and a_imag[] together represent the complex 16-bit output mantissa vector \(\bar a\). Each \(Re\{a_k\}\) is a_real[k], and each \(Im\{a_k\}\) is a_imag[k].
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\). Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
Each of the input vectors must begin at a word-aligned address. This operation can be performed safely in-place on inputs b_real[] and b_imag[].
c is the real 16-bit input mantissa \(c\).
length is the number of elements in each of the vectors.
a_shr is an unsigned arithmetic right-shift applied to the 32-bit accumulators holding the penultimate results.
If \(\bar b\) are the complex 16-bit mantissas of a BFP vector \(\bar{b} \cdot 2^{b\_exp}\) and \(c\) is the real 16-bit mantissa of floating-point value \(c \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + c\_exp + a\_shr\).
The function vect_complex_s16_real_scale_prepare() can be used to obtain values for \(a\_exp\) and \(a\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Parameters:
a_real – [out] Real part of complex output vector \(\bar a\)
Multiply a complex 16-bit vector by a complex 16-bit scalar.
a_real[] and a_imag[] together represent the complex 16-bit output mantissa vector \(\bar a\). Each \(Re\{a_k\}\) is a_real[k], and each \(Im\{a_k\}\) is a_imag[k].
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\). Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
Each of the input vectors must begin at a word-aligned address. This operation can be performed safely in-place on inputs b_real[] and b_imag[].
c_real and c_imag are the real and imaginary parts of the complex 16-bit input mantissa \(c\).
length is the number of elements in each of the vectors.
a_shr is the unsigned arithmetic right-shift applied to the 32-bit accumulators holding the penultimate results.
If \(\bar b\) are the complex 16-bit mantissas of a BFP vector \( \bar{b} \cdot 2^{b\_exp} \) and \(c\) is the complex 16-bit mantissa of floating-point value \(c \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot
2^{a\_exp}\), where \(a\_exp = b\_exp + c\_exp + a\_shr\).
The function vect_complex_s16_scale_prepare() can be used to obtain values for \(a\_exp\) and \(a\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Parameters:
a_real – [out] Real part of complex output vector \(\bar a\)
Set each element of a complex 16-bit vector to a specified value.
a_real[] and a_imag[] together represent the complex 16-bit output mantissa vector \(\bar a\). Each \(Re\{a_k\}\) is a_real[k], and each \(Im\{a_k\}\) is a_imag[k]. Each must begin at a word-aligned address.
b_real and b_imag are the real and imaginary parts of the complex 16-bit input mantissa \(b\). Each a_real[k] will be set to b_real. Each a_imag[k] will be set to b_imag.
length is the number of elements in a_real[] and a_imag[].
If \(b\) is the mantissa of floating-point value \(b \cdot 2^{b\_exp}\), then the output vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Parameters:
a_real – [out] Real part of complex output vector \(\bar a\)
Left-shift each element of a complex 16-bit vector by a specified number of bits.
a_real[] and a_imag[] together represent the complex 16-bit output mantissa vector \(\bar a\). Each \(Re\{a_k\}\) is a_real[k], and each \(Im\{a_k\}\) is a_imag[k].
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\). Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
Each of the input vectors must begin at a word-aligned address. This operation can be performed safely in-place on inputs b_real[] and b_imag[].
length is the number of elements in \(\bar a\) and \(\bar b\).
b_shl is the signed arithmetic left-shift applied to each element of \(\bar b\).
If \(\bar b\) are the complex 16-bit mantissas of a BFP vector \( \bar{b} \cdot 2^{b\_exp} \), then the resulting vector \(\bar a\) are the complex 16-bit mantissas of BFP vector \(\bar{a}
\cdot 2^{a\_exp}\), where \(\bar{a} = \bar{b} \cdot 2^{b\_shl}\) and \(a\_exp = b\_exp\).
Parameters:
a_real – [out] Real part of complex output vector \(\bar a\)
Right-shift each element of a complex 16-bit vector by a specified number of bits.
a_real[] and a_imag[] together represent the complex 16-bit output mantissa vector \(\bar a\). Each \(Re\{a_k\}\) is a_real[k], and each \(Im\{a_k\}\) is a_imag[k].
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\). Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
Each of the input vectors must begin at a word-aligned address. This operation can be performed safely in-place on inputs b_real[] and b_imag[].
length is the number of elements in \(\bar a\) and \(\bar b\).
b_shr is the signed arithmetic right-shift applied to each element of \(\bar b\).
If \(\bar b\) are the complex 16-bit mantissas of a BFP vector \( \bar{b} \cdot 2^{b\_exp} \), then the resulting vector \(\bar a\) are the complex 16-bit mantissas of BFP vector \(\bar{a}
\cdot 2^{a\_exp}\), where \(\bar{a} = \bar{b} \cdot 2^{-b\_shr}\) and \(a\_exp = b\_exp\).
Parameters:
a_real – [out] Real part of complex output vector \(\bar a\)
Get the squared magnitudes of elements of a complex 16-bit vector.
a[] represents the real 16-bit output mantissa vector \(\bar a\).
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\). Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
Each of the input vectors must begin at a word-aligned address.
length is the number of elements in each of the vectors.
a_shr is the unsigned arithmetic right-shift applied to the 32-bit accumulators holding the penultimate results.
If \(\bar b\) are the complex 16-bit mantissas of a BFP vector \( \bar{b} \cdot 2^{b\_exp} \), then the resulting vector \(\bar a\) are the real 16-bit mantissas of BFP vector \(\bar{a}
\cdot 2^{a\_exp}\), where \(a\_exp = 2 \cdot b\_exp + a\_shr\).
The function vect_complex_s16_squared_mag_prepare() can be used to obtain values for \(a\_exp\) and \(a\_shr\) based on the input exponent \(b\_exp\) and headroom \(b\_hr\).
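The relation \(a\_exp = 2 \cdot b\_exp + a\_shr\) follows from squaring the mantissas and then shifting the 32-bit result back into 16 bits. A minimal portable model is sketched below. The function name ref_complex_s16_squared_mag() is hypothetical, the shift truncates, and saturation at INT16_MAX is an assumption of the sketch rather than a statement about the library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference model: a_k <- (Re{b_k}^2 + Im{b_k}^2) >> a_shr */
void ref_complex_s16_squared_mag(int16_t a[],
                                 const int16_t b_real[], const int16_t b_imag[],
                                 unsigned length, unsigned a_shr)
{
    assert(a_shr < 32);
    for (unsigned k = 0; k < length; k++) {
        /* The squared magnitude is never negative. */
        int64_t sq = (int64_t)b_real[k] * b_real[k] + (int64_t)b_imag[k] * b_imag[k];
        sq >>= a_shr;
        /* Clamp if the chosen shift was too small for this element. */
        a[k] = (sq > INT16_MAX) ? INT16_MAX : (int16_t)sq;
    }
}
```

For the element 12288 + 16384i the squared magnitude is \(25 \cdot 2^{24}\), so a_shr = 14 yields the mantissa 25600. With a_shr = 0 the same element would be clamped, which is why a suitable shift has to be chosen from the input headroom.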
a_real[] and a_imag[] together represent the complex 16-bit output mantissa vector \(\bar a\). Each \(Re\{a_k\}\) is a_real[k], and each \(Im\{a_k\}\) is a_imag[k].
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\). Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
c_real[] and c_imag[] together represent the complex 16-bit input mantissa vector \(\bar c\). Each \(Re\{c_k\}\) is c_real[k], and each \(Im\{c_k\}\) is c_imag[k].
Each of the input vectors must begin at a word-aligned address. This operation can be performed safely in-place on inputs b_real[], b_imag[], c_real[] and c_imag[].
length is the number of elements in each of the vectors.
b_shr and c_shr are the signed arithmetic right-shifts applied to each element of \(\bar b\) and \(\bar c\) respectively.
If \(\bar b\) and \(\bar c\) are the complex 16-bit mantissas of BFP vectors \( \bar{b} \cdot
2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the complex 16-bit mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\).
In this case, \(b\_shr\) and \(c\_shr\) must be chosen so that \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\). Adding or subtracting mantissas only makes sense if they are associated with the same exponent.
The function vect_complex_s16_sub_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
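One plausible way to pick shifts that satisfy \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\) is sketched below. This is not the library's algorithm: the type addsub_shifts_t and the function choose_addsub_shifts() are hypothetical, and the rule used (remove all headroom from each input, take the larger exponent, then add one bit for the carry or borrow) is an assumption made for illustration.

```c
#include <assert.h>

/* Hypothetical result type for the sketch. */
typedef struct {
    int a_exp;
    int b_shr;
    int c_shr;
} addsub_shifts_t;

addsub_shifts_t choose_addsub_shifts(int b_exp, unsigned b_hr,
                                     int c_exp, unsigned c_hr)
{
    /* Smallest exponent each input can take without losing significant bits. */
    int b_min = b_exp - (int)b_hr;
    int c_min = c_exp - (int)c_hr;
    /* Use the larger of the two, plus one bit of room for the carry or borrow. */
    int a_exp = ((b_min > c_min) ? b_min : c_min) + 1;
    addsub_shifts_t s = { a_exp, a_exp - b_exp, a_exp - c_exp };
    /* Both inputs must land on the same exponent. */
    assert(b_exp + s.b_shr == c_exp + s.c_shr);
    return s;
}
```

For b_exp = -10 with 2 bits of headroom and c_exp = -12 with none, this gives a_exp = -11, b_shr = -1 (a left-shift of one bit) and c_shr = 1.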
Get the sum of elements of a complex 16-bit vector.
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\), and must both begin at a word-aligned address. Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the returned value \(a\) is the complex 32-bit mantissa of floating-point value \(a \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Parameters:
b_real – [in] Real part of complex input vector \(\bar b\)
b_imag – [in] Imaginary part of complex input vector \(\bar b\)
length – [in] Number of elements in vector \(\bar b\).
Convert a complex 16-bit vector into a complex 32-bit vector.
a[] represents the complex 32-bit output vector \(\bar a\). It must begin at a double word (8-byte) aligned address.
b_real[] and b_imag[] together represent the complex 16-bit input mantissa vector \(\bar b\). Each \(Re\{b_k\}\) is b_real[k], and each \(Im\{b_k\}\) is b_imag[k].
length is the number of elements in each of the vectors.
If \(\bar b\) are the complex 16-bit mantissas of a BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the resulting vector \(\bar a\) are the complex 32-bit mantissas of BFP vector \(\bar{a}
\cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Notes
The headroom of output vector \(\bar a\) is not returned by this function. The headroom of the output is always 16 bits greater than the headroom of the input.
Parameters:
a – [out] Complex output vector \(\bar a\).
b_real – [in] Real part of complex input vector \(\bar b\).
b_imag – [in] Imaginary part of complex input vector \(\bar b\).
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
a[], b[] and c[] represent the complex 32-bit mantissa vectors \(\bar a\), \(\bar b\) and \(\bar c\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[] or c[].
length is the number of elements in each of the vectors.
b_shr and c_shr are the signed arithmetic right-shifts applied to each element of \(\bar b\) and \(\bar c\) respectively.
If \(\bar b\) and \(\bar c\) are the complex 32-bit mantissas of BFP vectors \( \bar{b} \cdot
2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the complex 32-bit mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\).
In this case, \(b\_shr\) and \(c\_shr\) must be chosen so that \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\). Adding or subtracting mantissas only makes sense if they are associated with the same exponent.
The function vect_complex_s32_add_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
a[] and b[] represent the complex 32-bit mantissa vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
c is the complex scalar \(c\) to be added to each element of \(\bar b\).
length is the number of elements in each of the vectors.
b_shr is the signed arithmetic right-shift applied to each element of \(\bar b\).
If elements of \(\bar b\) are the complex mantissas of BFP vector \( \bar{b} \cdot
2^{b\_exp}\), and \(c\) is the mantissa of floating-point value \(c \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\).
In this case, \(b\_shr\) and \(c\_shr\) must be chosen so that \(a\_exp = b\_exp + b\_shr = c\_exp + c\_shr\). Adding or subtracting mantissas only makes sense if they are associated with the same exponent.
The function vect_complex_s32_add_scalar_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Multiply one complex 32-bit vector element-wise by the complex conjugate of another.
a[], b[] and c[] represent the 32-bit mantissa vectors \(\bar a\), \(\bar b\) and \(\bar c\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[] or c[].
length is the number of elements in each of the vectors.
b_shr and c_shr are the signed arithmetic right-shifts applied to each element of \(\bar b\) and \(\bar c\) respectively.
If \(\bar b\) and \(\bar c\) are the complex 32-bit mantissas of BFP vectors \(\bar{b} \cdot 2^{b\_exp}\) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + c\_exp + b\_shr + c\_shr\).
The function vect_complex_s32_conj_mul_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
The headroom of an N-bit integer is the number of bits that the integer’s value may be left-shifted without any information being lost. Equivalently, it is one less than the number of leading sign bits.
The headroom of a complex_s32_t struct is the minimum of the headroom of each of its 32-bit fields, re and im.
The headroom of a complex_s32_t array is the minimum of the headroom of each of its complex_s32_t elements.
This function efficiently traverses the elements of \(\bar x\) to determine its headroom.
x[] represents the complex 32-bit vector \(\bar x\). x[] must begin at a word-aligned address.
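The three definitions above compose directly, as the following portable sketch shows. It is a plain C model, not the optimized routine: cplx32_t is a hypothetical stand-in for complex_s32_t, and the function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for complex_s32_t. */
typedef struct {
    int32_t re;
    int32_t im;
} cplx32_t;

/* Headroom of one 32-bit integer: leading sign bits minus one. */
static unsigned hr_s32(int32_t x)
{
    /* Fold negative values onto non-negative ones with the same sign-bit count. */
    uint32_t u = (x < 0) ? ~(uint32_t)x : (uint32_t)x;
    unsigned hr = 31;
    while (u != 0) {
        u >>= 1;
        hr--;
    }
    return hr;
}

/* Headroom of a complex vector: the minimum over every re and im field. */
unsigned ref_complex_s32_headroom(const cplx32_t x[], unsigned length)
{
    assert(x != NULL || length == 0);
    unsigned hr = 31; /* headroom of an all-zero (or empty) vector */
    for (unsigned k = 0; k < length; k++) {
        unsigned re_hr = hr_s32(x[k].re);
        unsigned im_hr = hr_s32(x[k].im);
        if (re_hr < hr) hr = re_hr;
        if (im_hr < hr) hr = im_hr;
    }
    return hr;
}
```

For example, the value 1 has 31 leading sign bits and therefore 30 bits of headroom, while INT32_MIN has a single sign bit and no headroom at all.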
If inputs \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot
2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), and input \(\bar a\) is the accumulator BFP vector \(\bar{a} \cdot 2^{a\_exp}\), then the output values of \(\bar a\) have the exponent \(2^{a\_exp + acc\_shr}\).
For accumulation to make sense mathematically, \(b\_shr\) and \(c\_shr\) must be chosen such that \( a\_exp + acc\_shr = b\_exp + c\_exp + b\_shr + c\_shr \).
The function vect_complex_s32_macc_prepare() can be used to obtain values for \(a\_exp\), \(acc\_shr\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(a\_exp\), \(b\_exp\) and \(c\_exp\) and the input headrooms \(a\_hr\), \(b\_hr\) and \(c\_hr\).
If inputs \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot
2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), and input \(\bar a\) is the accumulator BFP vector \(\bar{a} \cdot 2^{a\_exp}\), then the output values of \(\bar a\) have the exponent \(2^{a\_exp + acc\_shr}\).
For accumulation to make sense mathematically, \(b\_shr\) and \(c\_shr\) must be chosen such that \( a\_exp + acc\_shr = b\_exp + c\_exp + b\_shr + c\_shr \).
The function vect_complex_s32_macc_prepare() can be used to obtain values for \(a\_exp\), \(acc\_shr\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(a\_exp\), \(b\_exp\) and \(c\_exp\) and the input headrooms \(a\_hr\), \(b\_hr\) and \(c\_hr\).
If inputs \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot
2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), and input \(\bar a\) is the accumulator BFP vector \(\bar{a} \cdot 2^{a\_exp}\), then the output values of \(\bar a\) have the exponent \(2^{a\_exp + acc\_shr}\).
For accumulation to make sense mathematically, \(b\_shr\) and \(c\_shr\) must be chosen such that \( a\_exp + acc\_shr = b\_exp + c\_exp + b\_shr + c\_shr \).
The function vect_complex_s32_conj_macc_prepare() can be used to obtain values for \(a\_exp\), \(acc\_shr\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(a\_exp\), \(b\_exp\) and \(c\_exp\) and the input headrooms \(a\_hr\), \(b\_hr\) and \(c\_hr\).
If inputs \(\bar b\) and \(\bar c\) are the mantissas of BFP vectors \( \bar{b} \cdot
2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), and input \(\bar a\) is the accumulator BFP vector \(\bar{a} \cdot 2^{a\_exp}\), then the output values of \(\bar a\) have the exponent \(2^{a\_exp + acc\_shr}\).
For accumulation to make sense mathematically, \(b\_shr\) and \(c\_shr\) must be chosen such that \( a\_exp + acc\_shr = b\_exp + c\_exp + b\_shr + c\_shr \).
The function vect_complex_s32_conj_nmacc_prepare() can be used to obtain values for \(a\_exp\), \(acc\_shr\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(a\_exp\), \(b\_exp\) and \(c\_exp\) and the input headrooms \(a\_hr\), \(b\_hr\) and \(c\_hr\).
Compute the magnitude of each element of a complex 32-bit vector.
a[] represents the real 32-bit output mantissa vector \(\bar a\).
b[] represents the complex 32-bit input mantissa vector \(\bar b\).
a[] and b[] must each begin at a word-aligned address.
length is the number of elements in each of the vectors.
b_shr is the signed arithmetic right-shift applied to elements of \(\bar b\).
rot_table must point to a pre-computed table of complex vectors used in calculating the magnitudes. table_rows is the number of rows in the table. This library is distributed with a default version of the required rotation table. The following symbols can be used to refer to it in user code:
Faster computation (with reduced precision) can be achieved by generating a smaller version of the table. A python script is provided to generate this table.
If \(\bar b\) are the complex 32-bit mantissas of a BFP vector \( \bar{b} \cdot 2^{b\_exp} \), then the resulting vector \(\bar a\) are the real 32-bit mantissas of BFP vector \(\bar{a}
\cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + b\_shr\).
The function vect_complex_s32_mag_prepare() can be used to obtain values for \(a\_exp\) and \(b\_shr\) based on the input exponent \(b\_exp\) and headroom \(b\_hr\).
Multiply one complex 32-bit vector element-wise by another.
a[], b[] and c[] represent the 32-bit mantissa vectors \(\bar a\), \(\bar b\) and \(\bar c\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[] or c[].
length is the number of elements in each of the vectors.
b_shr and c_shr are the signed arithmetic right-shifts applied to each element of \(\bar b\) and \(\bar c\) respectively.
If \(\bar b\) and \(\bar c\) are the complex 32-bit mantissas of BFP vectors \(\bar{b} \cdot 2^{b\_exp}\) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + c\_exp + b\_shr + c\_shr\).
The function vect_complex_s32_mul_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
If \(\bar b\) are the complex 32-bit mantissas of a BFP vector \(\bar{b} \cdot 2^{b\_exp}\) and \(\bar c\) are the real 32-bit mantissas of a BFP vector \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + c\_exp + b\_shr + c\_shr\).
The function vect_complex_s32_real_mul_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
If \(\bar b\) are the complex 32-bit mantissas of a BFP vector \(\bar{b} \cdot 2^{b\_exp}\) and \(c\) is the real 32-bit mantissa of floating-point value \(c \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + c\_exp + b\_shr + c\_shr\).
The function vect_complex_s32_real_scale_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Parameters:
a – [out] Complex output vector \(\bar a\)
b – [in] Complex input vector \(\bar b\)
c – [in] Complex input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\), and \(\bar c\)
If \(\bar b\) are the complex 32-bit mantissas of a BFP vector \( \bar{b} \cdot 2^{b\_exp} \) and \(c\) is the complex 32-bit mantissa of floating-point value \(c \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot
2^{a\_exp}\), where \(a\_exp = b\_exp + c\_exp + b\_shr + c\_shr\).
The function vect_complex_s32_mul_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
Parameters:
a – [out] Complex output vector \(\bar a\).
b – [in] Complex input vector \(\bar b\).
c_real – [in] Real part of \(c\)
c_imag – [in] Imaginary part of \(c\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\).
If \(b\) is the mantissa of floating-point value \(b \cdot 2^{b\_exp}\), then the output vector \(\bar a\) are the mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp\).
Parameters:
a – [out] Complex output vector \(\bar a\)
b_real – [in] Value to set real part of elements of \(\bar a\) to
b_imag – [in] Value to set imaginary part of elements of \(\bar a\) to
Left-shift each element of a complex 32-bit vector by a specified number of bits.
a[] and b[] represent the complex 32-bit mantissa vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in \(\bar a\) and \(\bar b\).
b_shl is the signed arithmetic left-shift applied to each element of \(\bar b\).
If \(\bar b\) are the complex 32-bit mantissas of a BFP vector \( \bar{b} \cdot 2^{b\_exp} \), then the resulting vector \(\bar a\) are the complex 32-bit mantissas of BFP vector \(\bar{a}
\cdot 2^{a\_exp}\), where \(\bar{a} = \bar{b} \cdot 2^{b\_shl}\) and \(a\_exp = b\_exp\).
Parameters:
a – [out] Complex output vector \(\bar a\)
b – [in] Complex input vector \(\bar b\)
length – [in] Number of elements in vector \(\bar b\)
Right-shift each element of a complex 32-bit vector by a specified number of bits.
a[] and b[] represent the complex 32-bit mantissa vectors \(\bar a\) and \(\bar b\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[].
length is the number of elements in \(\bar a\) and \(\bar b\).
b_shr is the signed arithmetic right-shift applied to each element of \(\bar b\).
If \(\bar b\) are the complex 32-bit mantissas of a BFP vector \( \bar{b} \cdot 2^{b\_exp} \), then the resulting vector \(\bar a\) are the complex 32-bit mantissas of BFP vector \(\bar{a}
\cdot 2^{a\_exp}\), where \(\bar{a} = \bar{b} \cdot 2^{-b\_shr}\) and \(a\_exp = b\_exp\).
Parameters:
a – [out] Complex output vector \(\bar a\)
b – [in] Complex input vector \(\bar b\)
length – [in] Number of elements in vector \(\bar b\)
Computes the squared magnitudes of elements of a complex 32-bit vector.
a[] represents the real 32-bit output mantissa vector \(\bar a\). b[] represents the complex 32-bit input mantissa vector \(\bar b\). Each must begin at a word-aligned address.
length is the number of elements in each of the vectors.
b_shr is the signed arithmetic right-shift applied to each element of \(\bar b\).
If \(\bar b\) are the complex 32-bit mantissas of a BFP vector \( \bar{b} \cdot 2^{b\_exp} \), then the resulting vector \(\bar a\) are the real 32-bit mantissas of BFP vector \(\bar{a}
\cdot 2^{a\_exp}\), where \(a\_exp = 2 \cdot (b\_exp + b\_shr)\).
The function vect_complex_s32_squared_mag_prepare() can be used to obtain values for \(a\_exp\) and \(b\_shr\) based on the input exponent \(b\_exp\) and headroom \(b\_hr\).
a[], b[] and c[] represent the complex 32-bit mantissa vectors \(\bar a\), \(\bar b\) and \(\bar c\) respectively. Each must begin at a word-aligned address. This operation can be performed safely in-place on b[] or c[].
length is the number of elements in each of the vectors.
b_shr and c_shr are the signed arithmetic right-shifts applied to each element of \(\bar b\) and \(\bar c\) respectively.
If \(\bar b\) and \(\bar c\) are the complex 32-bit mantissas of BFP vectors \( \bar{b} \cdot
2^{b\_exp} \) and \(\bar{c} \cdot 2^{c\_exp}\), then the resulting vector \(\bar a\) are the complex 32-bit mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\).
In this case, \(b\_shr\) and \(c\_shr\)must be chosen so that \(a\_exp = b\_exp +
b\_shr = c\_exp + c\_shr\). Adding or subtracting mantissas only makes sense if they are associated with the same exponent.
The function vect_complex_s32_sub_prepare() can be used to obtain values for \(a\_exp\), \(b\_shr\) and \(c\_shr\) based on the input exponents \(b\_exp\) and \(c\_exp\) and the input headrooms \(b\_hr\) and \(c\_hr\).
If \(\bar b\) are the mantissas of BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then \(a\) is the complex 64-bit mantissa of floating-point value \(a \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + b\_shr\).
The function vect_complex_s32_sum_prepare() can be used to obtain values for \(a\_exp\) and \(b\_shr\) based on the input exponent \(b\_exp\) and the input headroom \(b\_hr\).
Additional Details
Internally the sum accumulates into four separate complex 40-bit accumulators. These accumulators apply symmetric 40-bit saturation logic (with bounds \(\pm(2^{39}-1)\)) with each added element. At the end, the 4 accumulators are summed together into the 64-bit fields of a. No saturation logic is applied at this final step.
In the most extreme case, each \(b_k\) may be \(-2^{31}\). \(256\) of these added into the same accumulator is \(-2^{39}\) which would saturate to \(-2^{39}+1\), introducing 1 LSb of error (which may or may not be acceptable given a particular circumstance). The final result for each part then may be as large as \(4\cdot(-2^{39}+1) = -2^{41}+4 \), each fitting into a 42-bit signed integer.
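The worst case described above can be checked with a small host-side model. The sketch below is not the library's implementation: `acc40_add()` and `acc40_sum_repeated()` are hypothetical helpers that emulate a single symmetric 40-bit saturating accumulator held in an `int64_t`.

```c
#include <assert.h>
#include <stdint.h>

#define ACC_MAX ((INT64_C(1) << 39) - 1)   /* +(2^39 - 1) */
#define ACC_MIN (-ACC_MAX)                 /* -(2^39 - 1), symmetric lower bound */

/* Hypothetical model: add x to a 40-bit accumulator with symmetric saturation. */
static int64_t acc40_add(int64_t acc, int32_t x)
{
    assert(acc >= ACC_MIN && acc <= ACC_MAX);
    int64_t sum = acc + (int64_t)x;
    if (sum > ACC_MAX) return ACC_MAX;
    if (sum < ACC_MIN) return ACC_MIN;
    return sum;
}

/* Accumulate `count` copies of `value` into one accumulator. */
static int64_t acc40_sum_repeated(int32_t value, unsigned count)
{
    int64_t acc = 0;
    for (unsigned k = 0; k < count; k++) {
        acc = acc40_add(acc, value);
    }
    return acc;
}
```

With `value = INT32_MIN`, 255 additions are still exact, the 256th saturates by 1 LSb, and four such accumulators added without saturation give \(-2^{41}+4\).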
Reverses the order of the tail of a complex 32-bit vector.
Reverses the order of elements in the tail of the complex 32-bit vector \(\bar x\). The tail of \(\bar x\), in this context, is all elements of \(\bar x\) except for \(x_0\). In other words, the first element \(x_0\) remains where it is, and the remaining \(length-1\) elements are rearranged to have their order reversed.
This function is used when performing a forward or inverse FFT on a single sequence of real values (i.e. the mono FFT), and operates in-place on x[].
Parameter Details
x[] represents the complex 32-bit vector \(\bar x\), which is both an input to and an output of this function. x[] must begin at a word-aligned address.
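The rearrangement can be illustrated with a plain C reference model. This is only a sketch of the behaviour described above, not the optimized library routine; `cplx32_t` and `tail_reverse_ref()` are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the library's complex 32-bit element type. */
typedef struct { int32_t re; int32_t im; } cplx32_t;

/* Reverse x[1..length-1] in place; x[0] stays where it is. */
static void tail_reverse_ref(cplx32_t x[], size_t length)
{
    assert(x != NULL || length == 0);
    if (length < 3) return;          /* a tail of 0 or 1 elements is unchanged */
    size_t lo = 1;
    size_t hi = length - 1;
    while (lo < hi) {
        cplx32_t tmp = x[lo];
        x[lo] = x[hi];
        x[hi] = tmp;
        lo++;
        hi--;
    }
}
```

For a 5-element input with elements \(x_0 \dots x_4\), the result is \(x_0, x_4, x_3, x_2, x_1\).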
Get the complex conjugate of a complex 32-bit vector.
The complex conjugate of a complex scalar \(z = x + yi\) is \(z^* = x - yi\). This function computes the complex conjugate of each element of \(\bar b\) (negates the imaginary part of each element) and places the result in \(\bar a\).
a[] is the complex 32-bit output vector \(\bar a\).
b[] is the complex 32-bit input vector \(\bar b\).
Both a and b must point to word-aligned addresses.
length is the number of elements in \(\bar a\) and \(\bar b\).
Convert a complex 32-bit vector into a complex 16-bit vector.
This function converts a complex 32-bit mantissa vector \(\bar b\) into a complex 16-bit mantissa vector \(\bar a\). Conceptually, the output BFP vector \(\bar{a}\cdot 2^{a\_exp}\) represents the same value as the input BFP vector \(\bar{b}\cdot 2^{b\_exp}\), only with a reduced bit-depth.
In most cases \(b\_shr\) should be \(16 - b\_hr\), where \(b\_hr\) is the headroom of the 32-bit input mantissa vector \(\bar b\). The output exponent \(a\_exp\) will then be given by
\( a\_exp = b\_exp + b\_shr \)
Parameter Details
a_real[] and a_imag[] together represent the complex 16-bit output mantissa vector \(\bar a\), with the real part of each \(a_k\) going in a_real[] and the imaginary part going in a_imag[].
b[] represents the complex 32-bit mantissa vector \(\bar b\).
a_real[], a_imag[] and b[] must each begin at a word-aligned address.
length is the number of elements in each of the vectors.
b_shr is the signed arithmetic right-shift applied to elements of \(\bar b\).
If \(\bar b\) are the complex 32-bit mantissas of a BFP vector \(\bar{b} \cdot 2^{b\_exp}\), then the resulting vector \(\bar a\) are the complex 16-bit mantissas of BFP vector \(\bar{a} \cdot 2^{a\_exp}\), where \(a\_exp = b\_exp + b\_shr\).
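The choice \(b\_shr = 16 - b\_hr\) and the resulting exponent can be tried out on a host with the following sketch. It is a behavioural model under stated assumptions, not the library code: the helper names and `cplx32_t` are hypothetical, the shift truncates rather than rounds, and saturation is assumed to be symmetric.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int32_t re; int32_t im; } cplx32_t;  /* hypothetical stand-in */

/* Headroom of one 32-bit value: the number of redundant leading sign bits. */
static unsigned hr_s32(int32_t x)
{
    uint32_t u = (x < 0) ? ~(uint32_t)x : (uint32_t)x;
    unsigned hr = 0;
    while (hr < 31 && (u & (UINT32_C(1) << (30 - hr))) == 0) hr++;
    return hr;
}

/* Headroom of a complex vector: the smallest headroom of any real or imaginary part. */
static unsigned hr_complex_s32(const cplx32_t b[], size_t length)
{
    unsigned hr = 31;
    for (size_t k = 0; k < length; k++) {
        unsigned r = hr_s32(b[k].re);
        unsigned i = hr_s32(b[k].im);
        if (r < hr) hr = r;
        if (i < hr) hr = i;
    }
    return hr;
}

/* Shift right by shr (left if shr < 0), then clamp to the symmetric 16-bit range.
 * Assumes >> on a negative value is an arithmetic shift, as on mainstream compilers. */
static int16_t shift_to_s16(int32_t x, int shr)
{
    int64_t v = (shr >= 0) ? ((int64_t)x >> shr)
                           : ((int64_t)x * (INT64_C(1) << -shr));
    if (v > INT16_MAX) v = INT16_MAX;
    if (v < -INT16_MAX) v = -INT16_MAX;
    return (int16_t)v;
}

/* Convert b[] to 16-bit mantissas using b_shr = 16 - b_hr; returns a_exp. */
static int to_s16_sketch(int16_t a_real[], int16_t a_imag[],
                         const cplx32_t b[], size_t length, int b_exp)
{
    assert(a_real != NULL && a_imag != NULL && b != NULL);
    int b_shr = 16 - (int)hr_complex_s32(b, length);
    for (size_t k = 0; k < length; k++) {
        a_real[k] = shift_to_s16(b[k].re, b_shr);
        a_imag[k] = shift_to_s16(b[k].im, b_shr);
    }
    return b_exp + b_shr;            /* a_exp = b_exp + b_shr */
}
```

With 14 bits of headroom in the input, for example, the model uses \(b\_shr = 2\), so an input exponent of \(-20\) becomes an output exponent of \(-18\).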
Multiply an 8-bit matrix by a 16-bit vector for a 32-bit result vector.
This function multiplies an 8-bit \(M \times N\) matrix \(\bar W\) by a 16-bit \(N\)-element column vector \(\bar v\) and returns the result as a 32-bit \(M\)-element vector \(\bar a\).
output is the output vector \(\bar a\).
matrix is the matrix \(\bar W\).
input_vect is the vector \(\bar v\).
matrix and input_vect must both begin at word-aligned addresses.
M_rows and N_cols are the dimensions \(M\) and \(N\) of matrix \(\bar W\). \(M\) must be a multiple of 16, and \(N\) must be a multiple of 32.
scratch is a pointer to a word-aligned buffer that this function may use to store intermediate results. This buffer must be at least \(N\) bytes long.
The result of this multiplication is exact, so long as saturation does not occur.
Parameters:
output – [inout] The output vector \(\bar a\)
matrix – [in] The weight matrix \(\bar W\)
input_vect – [in] The input vector \(\bar v\)
M_rows – [in] The number of rows \(M\) in matrix \(\bar W\)
N_cols – [in] The number of columns \(N\) in matrix \(\bar W\)
scratch – [in] Scratch buffer required by this function.
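For checking results on a host, the operation can be modelled with a straightforward loop. The sketch below is a reference model only: `mat_mul_s8_x_s16_ref()` is a hypothetical name, a row-major layout of the matrix is assumed, saturation is assumed to be symmetric and applied once per output, and the size and alignment restrictions of the optimized function are not enforced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference model: output[r] = sum over c of matrix[r][c] * input_vect[c]. */
static void mat_mul_s8_x_s16_ref(int32_t output[],
                                 const int8_t matrix[],      /* M_rows x N_cols, row-major (assumed) */
                                 const int16_t input_vect[],
                                 size_t M_rows, size_t N_cols)
{
    assert(output != NULL && matrix != NULL && input_vect != NULL);
    for (size_t r = 0; r < M_rows; r++) {
        int64_t acc = 0;             /* wide accumulator, so the sum itself is exact */
        for (size_t c = 0; c < N_cols; c++) {
            acc += (int32_t)matrix[r * N_cols + c] * (int32_t)input_vect[c];
        }
        if (acc > INT32_MAX) acc = INT32_MAX;      /* assumed symmetric saturation */
        if (acc < -INT32_MAX) acc = -INT32_MAX;
        output[r] = (int32_t)acc;
    }
}
```

Because the accumulator is wider than any achievable sum of 8-bit by 16-bit products for practical sizes, the model is exact whenever the true result fits in 32 bits, which matches the statement that the result is exact so long as saturation does not occur.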
Add a scalar to a vector. This works for 8, 16 or 32 bits, real or complex.
length_bytes is the total number of bytes to be output. So, for 16-bit vectors, length_bytes is twice the number of elements, whereas for complex 32-bit vectors, length_bytes is 8 times the number of elements.
c and d are the values that populate the internal buffer to be added to the input vector as follows: Internally an 8 word (32 byte) buffer is allocated (on the stack). Even-indexed words are populated with c and odd-indexed words are populated with d. For real vectors, c and d should be the same value — the reason for d is to allow this same function to work for complex 32-bit vectors. This also means that for 16-bit vectors, the value to be added needs to be duplicated in both the higher 2 bytes and lower 2 bytes of the word.
mode_bits should be 0x0000 for 32-bit mode, 0x0100 for 16-bit mode or 0x0200 for 8-bit mode.
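The packing rules for c, d and length_bytes are easy to get wrong, so a small sketch may help. All names below are hypothetical helpers, not part of the library. The 8-bit case (replicating the byte into all four byte lanes of the word) is an assumption made by analogy with the 16-bit rule stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* mode_bits values as listed above (the enumerator names are hypothetical). */
enum { MODE_BITS_S32 = 0x0000, MODE_BITS_S16 = 0x0100, MODE_BITS_S8 = 0x0200 };

/* 16-bit scalar: duplicate the value in the upper and lower 2 bytes of the word. */
static uint32_t pack_scalar_s16(int16_t value)
{
    uint32_t half = (uint16_t)value;
    return (half << 16) | half;
}

/* 8-bit scalar (assumed rule): replicate the value into all 4 bytes of the word. */
static uint32_t pack_scalar_s8(int8_t value)
{
    return (uint32_t)(uint8_t)value * UINT32_C(0x01010101);
}

/* length_bytes for a vector of `elements` items of `bytes_per_element` bytes each,
 * e.g. 2 for real 16-bit elements or 8 for complex 32-bit elements. */
static size_t vector_length_bytes(size_t elements, size_t bytes_per_element)
{
    assert(bytes_per_element >= 1 && bytes_per_element <= 8);
    return elements * bytes_per_element;
}
```

For a real 16-bit vector, both c and d would be set to `pack_scalar_s16(value)`. For a complex 32-bit vector, c would carry the real part and d the imaginary part, assuming the real part of each element is stored first.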
Obtain the output exponent, input shift and modified bounds used by vect_s16_clip().
This function is used in conjunction with vect_s16_clip() to bound the elements of a 16-bit BFP vector to a specified range.
This function computes a_exp, b_shr, lower_bound and upper_bound.
a_exp is the exponent associated with the 16-bit mantissa vector \(\bar a\) computed by vect_s16_clip().
b_shr is the shift parameter required by vect_s16_clip() to achieve the output exponent a_exp.
lower_bound and upper_bound are the 16-bit mantissas which indicate the lower and upper clipping bounds respectively. The values are modified by this function, and the resulting values should be passed along to vect_s16_clip().
b_exp is the exponent associated with the input mantissa vector \(\bar b\).
bound_exp is the exponent associated with the bound mantissas lower_bound and upper_bound.
b_hr is the headroom of \(\bar b\). If unknown, it can be obtained using vect_s16_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Obtain the output exponent and scaling parameter used by vect_s16_inverse().
This function is used in conjunction with vect_s16_inverse() to compute the inverse of elements of a 16-bit BFP vector.
This function computes a_exp and scale.
a_exp is the exponent associated with output mantissa vector \(\bar a\), and must be chosen to avoid overflow in the smallest element of the input vector, which when inverted becomes the largest output element. To maximize precision, this function chooses a_exp to be the smallest exponent known to avoid saturation. The a_exp chosen by this function is derived from the exponent and smallest element of the input vector.
scale is a scaling parameter used by vect_s16_inverse() to achieve the chosen output exponent.
b[] is the input mantissa vector \(\bar b\).
b_exp is the exponent associated with the input mantissa vector \(\bar b\).
Obtain the output exponent and shifts needed by vect_s16_macc().
This function is used in conjunction with vect_s16_macc() to perform an element-wise multiply-accumulate of 16-bit BFP vectors.
This function computes new_acc_exp, acc_shr and bc_sat, which are selected to maximize precision in the resulting accumulator vector without causing saturation of final or intermediate values. Normally the caller will pass these outputs to their corresponding inputs of vect_s16_macc().
acc_exp is the exponent associated with the accumulator mantissa vector \(\bar a\) prior to the operation, whereas new_acc_exp is the exponent corresponding to the updated accumulator vector.
b_exp and c_exp are the exponents associated with the input mantissa vectors \(\bar b\) and \(\bar c\) respectively.
acc_hr, b_hr and c_hr are the headrooms of \(\bar a\), \(\bar b\) and \(\bar c\) respectively. If the headroom of any of these vectors is unknown, it can be obtained by calling vect_s16_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the acc_shr and bc_sat produced by this function can be adjusted according to the following:
    // Presumed to be set somewhere
    exponent_t acc_exp, b_exp, c_exp;
    headroom_t acc_hr, b_hr, c_hr;
    exponent_t desired_exp;
    ...
    // Call prepare
    right_shift_t acc_shr, bc_sat;
    vect_s16_macc_prepare(&acc_exp, &acc_shr, &bc_sat, acc_exp, b_exp, c_exp, acc_hr, b_hr, c_hr);
    // Modify results
    right_shift_t mant_shr = desired_exp - acc_exp;
    acc_exp += mant_shr;
    acc_shr += mant_shr;
    bc_sat += mant_shr;
    // acc_shr and bc_sat may now be used in a call to vect_s16_macc()
When applying the above adjustment, the following conditions should be maintained:
bc_sat >= 0 (bc_sat is an unsigned right-shift)
acc_shr > -acc_hr (Shifting any further left may cause saturation)
It is up to the user to ensure any such modification does not result in saturation or unacceptable loss of precision.
Obtain the output exponent and output shift used by vect_s16_mul().
This function is used in conjunction with vect_s16_mul() to perform an element-wise multiplication of two 16-bit BFP vectors.
This function computes a_exp and a_shr.
a_exp is the exponent associated with mantissa vector \(\bar a\), and must be chosen to be large enough to avoid overflow when elements of \(\bar a\) are computed. To maximize precision, this function chooses a_exp to be the smallest exponent known to avoid saturation (see exception below). The a_exp chosen by this function is derived from the exponents and headrooms associated with the input vectors.
a_shr is an arithmetic right-shift applied by vect_s16_mul() to the 32-bit products of input elements to achieve the chosen output exponent a_exp.
b_exp and c_exp are the exponents associated with the input mantissa vectors \(\bar b\) and \(\bar c\) respectively.
b_hr and c_hr are the headroom of \(\bar b\) and \(\bar c\) respectively. If the headroom of \(\bar b\) or \(\bar c\) is unknown, it can be obtained by calling vect_s16_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the a_shr produced by this function can be adjusted according to the following:
    exponent_t a_exp;
    right_shift_t a_shr;
    vect_s16_mul_prepare(&a_exp, &a_shr, b_exp, c_exp, b_hr, c_hr);
    exponent_t desired_exp = ...; // Value known a priori
    a_shr = a_shr + (desired_exp - a_exp);
    a_exp = desired_exp;
When applying the above adjustment, the following conditions should be maintained:
a_shr >= 0
Be aware that using a smaller value than strictly necessary for a_shr can result in saturation, and using larger values may result in unnecessary underflows or loss of precision.
Notes
Using the outputs of this function, an output mantissa which would otherwise be INT16_MIN will instead saturate to -INT16_MAX. This is due to the symmetric saturation logic employed by the VPU and is a hardware feature. This is a corner case which is usually unlikely and results in 1 LSb of error when it occurs.
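The corner case can be reproduced with a scalar model of one output element. This is a sketch under assumptions, not the VPU's exact behaviour: `sat16_symmetric()` and `mul_s16_model()` are hypothetical names, the right-shift truncates, and `>>` on a negative value is assumed to be an arithmetic shift.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to the symmetric 16-bit range [-INT16_MAX, INT16_MAX]. */
static int16_t sat16_symmetric(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < -INT16_MAX) return -INT16_MAX;
    return (int16_t)x;
}

/* One output mantissa: 32-bit product, right-shifted by a_shr, then saturated. */
static int16_t mul_s16_model(int16_t b, int16_t c, unsigned a_shr)
{
    assert(a_shr < 32);
    int32_t product = (int32_t)b * (int32_t)c;
    return sat16_symmetric(product >> a_shr);
}
```

For example, \(-16384 \times 16384 = -2^{28}\); shifted right by 13 bits this is \(-2^{15}\), which equals INT16_MIN and is therefore reported as -INT16_MAX, an error of 1 LSb.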
Obtain the output exponent and output shift used by vect_s16_scale().
This function is used in conjunction with vect_s16_scale() to perform multiplication of a 16-bit BFP vector \(\bar{b} \cdot 2^{b\_exp}\) by a 16-bit scalar \(c \cdot 2^{c\_exp}\). The result is another 16-bit BFP vector \(\bar{a} \cdot 2^{a\_exp}\).
This function computes a_exp and a_shr.
a_exp is the exponent associated with mantissa vector \(\bar a\), and must be chosen to be large enough to avoid overflow when elements of \(\bar a\) are computed. To maximize precision, this function chooses a_exp to be the smallest exponent known to avoid saturation (see exception below). The a_exp chosen by this function is derived from the exponents and headrooms associated with the inputs.
a_shr is an arithmetic right-shift applied by vect_s16_scale() to the 32-bit products of input elements to achieve the chosen output exponent a_exp.
b_exp and c_exp are the exponents associated with \(\bar b\) and \(c\) respectively.
b_hr and c_hr are the headroom of \(\bar b\) and \(c\) respectively. If the headroom of \(\bar b\) or \(c\) is unknown, it can be obtained by calling vect_s16_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the a_shr produced by this function can be adjusted according to the following:
    exponent_t a_exp;
    right_shift_t a_shr;
    vect_s16_scale_prepare(&a_exp, &a_shr, b_exp, c_exp, b_hr, c_hr);
    exponent_t desired_exp = ...; // Value known a priori
    a_shr = a_shr + (desired_exp - a_exp);
    a_exp = desired_exp;
When applying the above adjustment, the following conditions should be maintained:
a_shr >= 0
Be aware that using a smaller value than strictly necessary for a_shr can result in saturation, and using larger values may result in unnecessary underflows or loss of precision.
Notes
Using the outputs of this function, an output mantissa which would otherwise be INT16_MIN will instead saturate to -INT16_MAX. This is due to the symmetric saturation logic employed by the VPU and is a hardware feature. This is a corner case which is usually unlikely and results in 1 LSb of error when it occurs.
Obtain the output exponent and shift parameter used by vect_s16_sqrt().
This function is used in conjunction with vect_s16_sqrt() to compute the square root of elements of a 16-bit BFP vector.
This function computes a_exp and b_shr.
a_exp is the exponent associated with output mantissa vector \(\bar a\), and should be chosen to maximize the precision of the results. To that end, this function chooses a_exp to be the smallest exponent known to avoid saturation of the resulting mantissa vector \(\bar a\). It is derived from the exponent and headroom of the input BFP vector.
b_shr is the shift parameter required by vect_s16_sqrt() to achieve the chosen output exponent a_exp.
b_exp is the exponent associated with the input mantissa vector \(\bar b\).
b_hr is the headroom of \(\bar b\). If it is unknown, it can be obtained using vect_s16_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the b_shr produced by this function can be adjusted according to the following:
    exponent_t a_exp;
    right_shift_t b_shr;
    vect_s16_sqrt_prepare(&a_exp, &b_shr, b_exp, b_hr);
    exponent_t desired_exp = ...; // Value known a priori
    b_shr = b_shr + (desired_exp - a_exp);
    a_exp = desired_exp;
When applying the above adjustment, the following condition should be maintained:
b_hr + b_shr >= 0
Be aware that using smaller values than strictly necessary for b_shr can result in saturation, and using larger values may result in unnecessary underflows or loss of precision.
Also, if a larger exponent is used than necessary, a larger depth parameter (see vect_s16_sqrt()) will be required to achieve the same precision, as the results are computed bit by bit, starting with the most significant bit.
\(\bar b\) and \(\bar c\) are the input mantissa vectors with exponents \(b\_exp\) and \(c\_exp\), which are shared by each element of their respective vectors. \(\bar a\) is the output mantissa vector with exponent \(a\_exp\). Two additional properties, \(b\_hr\) and \(c\_hr\), which are the headroom of mantissa vectors \(\bar b\) and \(\bar c\) respectively, are required by this function.
In order to avoid any overflows in the output mantissas, the output exponent \(a\_exp\) must be chosen such that the largest (in the sense of absolute value) possible output mantissa will fit into the allotted space (e.g. 32 bits for vect_s32_add()). Once \(a\_exp\) is chosen, the input bit-shifts \(b\_shr\) and \(c\_shr\) are calculated to achieve that resulting exponent.
This function chooses \(a\_exp\) to be the minimum exponent known to avoid overflows, given the input exponents ( \(b\_exp\) and \(c\_exp\)) and input headroom ( \(b\_hr\) and \(c\_hr\)).
This function is used to calculate the output exponent and input bit-shifts for each of the following functions:
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the b_shr and c_shr produced by this function can be adjusted according to the following:
    exponent_t desired_exp = ...; // Value known a priori
    right_shift_t new_b_shr = b_shr + (desired_exp - a_exp);
    right_shift_t new_c_shr = c_shr + (desired_exp - a_exp);
When applying the above adjustment, the following conditions should be maintained:
b_hr + b_shr >= 0
c_hr + c_shr >= 0
Be aware that using smaller values than strictly necessary for b_shr and c_shr can result in saturation, and using larger values may result in unnecessary underflows or loss of precision.
Notes
If \(b\_hr\) or \(c\_hr\) are unknown, they can be calculated using the appropriate headroom function (e.g. vect_complex_s16_headroom() for complex 16-bit vectors) or the value 0 can always be safely used (but may result in reduced precision).
b_shr – [out] Signed arithmetic right-shift to be applied to elements of \(\bar b\). Used by the function which computes the output mantissas \(\bar a\)
c_shr – [out] Signed arithmetic right-shift to be applied to elements of \(\bar c\). Used by the function which computes the output mantissas \(\bar a\)
Obtain the output exponent, input shift and modified bounds used by vect_s32_clip().
This function is used in conjunction with vect_s32_clip() to bound the elements of a 32-bit BFP vector to a specified range.
This function computes a_exp, b_shr, lower_bound and upper_bound.
a_exp is the exponent associated with the 32-bit mantissa vector \(\bar a\) computed by vect_s32_clip().
b_shr is the shift parameter required by vect_s32_clip() to achieve the output exponent a_exp.
lower_bound and upper_bound are the 32-bit mantissas which indicate the lower and upper clipping bounds respectively. The values are modified by this function, and the resulting values should be passed along to vect_s32_clip().
b_exp is the exponent associated with the input mantissa vector \(\bar b\).
bound_exp is the exponent associated with the bound mantissas lower_bound and upper_bound.
b_hr is the headroom of \(\bar b\). If unknown, it can be obtained using vect_s32_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Obtain the output exponent and input shift used by vect_s32_dot().
This function is used in conjunction with vect_s32_dot() to compute the inner product of two 32-bit BFP vectors.
This function computes a_exp, b_shr and c_shr.
a_exp is the exponent associated with the 64-bit mantissa \(a\) returned by vect_s32_dot(), and must be chosen to be large enough to avoid saturation when \(a\) is computed. To maximize precision, this function chooses a_exp to be the smallest exponent known to avoid saturation (see exception below). The a_exp chosen by this function is derived from the exponents and headrooms associated with the input vectors.
b_shr and c_shr are the shift parameters required by vect_s32_dot() to achieve the chosen output exponent a_exp.
b_exp and c_exp are the exponents associated with the input mantissa vectors \(\bar b\) and \(\bar c\) respectively.
b_hr and c_hr are the headroom of \(\bar b\) and \(\bar c\) respectively. If either is unknown, it can be obtained using vect_s32_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
length is the number of elements in the input mantissa vectors \(\bar b\) and \(\bar c\).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the b_shr and c_shr produced by this function can be adjusted according to the following:
    exponent_t desired_exp = ...; // Value known a priori
    right_shift_t new_b_shr = b_shr + (desired_exp - a_exp);
    right_shift_t new_c_shr = c_shr + (desired_exp - a_exp);
When applying the above adjustment, the following conditions should be maintained:
b_hr + b_shr >= 0
c_hr + c_shr >= 0
Be aware that using smaller values than strictly necessary for b_shr or c_shr can result in saturation, and using larger values may result in unnecessary underflows or loss of precision.
Obtain the output exponent and input shift used by vect_s32_energy().
This function is used in conjunction with vect_s32_energy() to compute the inner product of a 32-bit BFP vector with itself.
This function computes a_exp and b_shr.
a_exp is the exponent associated with the 64-bit mantissa \(a\) returned by vect_s32_energy(), and must be chosen to be large enough to avoid saturation when \(a\) is computed. To maximize precision, this function chooses a_exp to be the smallest exponent known to avoid saturation (see exception below). The a_exp chosen by this function is derived from the exponent and headroom associated with the input vector.
b_shr is the shift parameter required by vect_s32_energy() to achieve the chosen output exponent a_exp.
b_exp is the exponent associated with the input mantissa vector \(\bar b\).
b_hr is the headroom of \(\bar b\). If it is unknown, it can be obtained using vect_s32_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
length is the number of elements in the input mantissa vector \(\bar b\).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the b_shr produced by this function can be adjusted according to the following:
    exponent_t desired_exp = ...; // Value known a priori
    right_shift_t new_b_shr = b_shr + (desired_exp - a_exp);
When applying the above adjustment, the following condition should be maintained:
b_hr + b_shr >= 0
Be aware that using smaller values than strictly necessary for b_shr can result in saturation, and using larger values may result in unnecessary underflows or loss of precision.
This function is used in conjunction with vect_s32_inverse() to compute the inverse of elements of a 32-bit BFP vector.
This function computes a_exp and scale.
a_exp is the exponent associated with output mantissa vector \(\bar a\), and must be chosen to avoid overflow in the smallest element of the input vector, which when inverted becomes the largest output element. To maximize precision, this function chooses a_exp to be the smallest exponent known to avoid saturation. The a_exp chosen by this function is derived from the exponent and smallest element of the input vector.
scale is a scaling parameter used by vect_s32_inverse() to achieve the chosen output exponent.
b[] is the input mantissa vector \(\bar b\).
b_exp is the exponent associated with the input mantissa vector \(\bar b\).
Obtain the output exponent and shifts needed by vect_s32_macc().
This function is used in conjunction with vect_s32_macc() to perform an element-wise multiply-accumulate of 32-bit BFP vectors.
This function computes new_acc_exp, acc_shr, b_shr and c_shr, which are selected to maximize precision in the resulting accumulator vector without causing saturation of final or intermediate values. Normally the caller will pass these outputs to their corresponding inputs of vect_s32_macc().
acc_exp is the exponent associated with the accumulator mantissa vector \(\bar a\) prior to the operation, whereas new_acc_exp is the exponent corresponding to the updated accumulator vector.
b_exp and c_exp are the exponents associated with the input mantissa vectors \(\bar b\) and \(\bar c\) respectively.
acc_hr, b_hr and c_hr are the headrooms of \(\bar a\), \(\bar b\) and \(\bar c\) respectively. If the headroom of any of these vectors is unknown, it can be obtained by calling vect_s32_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the acc_shr, b_shr and c_shr produced by this function can be adjusted according to the following:
    // Presumed to be set somewhere
    exponent_t acc_exp, b_exp, c_exp;
    headroom_t acc_hr, b_hr, c_hr;
    exponent_t desired_exp;
    ...
    // Call prepare
    right_shift_t acc_shr, b_shr, c_shr;
    vect_s32_macc_prepare(&acc_exp, &acc_shr, &b_shr, &c_shr, acc_exp, b_exp, c_exp, acc_hr, b_hr, c_hr);
    // Modify results
    right_shift_t mant_shr = desired_exp - acc_exp;
    acc_exp += mant_shr;
    acc_shr += mant_shr;
    b_shr += mant_shr;
    c_shr += mant_shr;
    // acc_shr, b_shr and c_shr may now be used in a call to vect_s32_macc()
When applying the above adjustment, the following conditions should be maintained:
acc_shr > -acc_hr (Shifting any further left may cause saturation)
b_shr >= -b_hr (Shifting any further left may cause saturation)
c_shr >= -c_hr (Shifting any further left may cause saturation)
It is up to the user to ensure any such modification does not result in saturation or unacceptable loss of precision.
Obtain the output exponent and input shifts used by vect_s32_mul().
This function is used in conjunction with vect_s32_mul() to perform an element-wise multiplication of two 32-bit BFP vectors.
This function computes a_exp, b_shr and c_shr.
a_exp is the exponent associated with mantissa vector \(\bar a\), and must be chosen to be large enough to avoid overflow when elements of \(\bar a\) are computed. To maximize precision, this function chooses a_exp to be the smallest exponent known to avoid saturation (see exception below). The a_exp chosen by this function is derived from the exponents and headrooms associated with the input vectors.
b_shr and c_shr are the shift parameters required by vect_s32_mul() to achieve the chosen output exponent a_exp.
b_exp and c_exp are the exponents associated with the input mantissa vectors \(\bar b\) and \(\bar c\) respectively.
b_hr and c_hr are the headroom of \(\bar b\) and \(\bar c\) respectively. If the headroom of \(\bar b\) or \(\bar c\) is unknown, it can be obtained by calling vect_s32_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the b_shr and c_shr produced by this function can be adjusted according to the following:
    exponent_t desired_exp = ...; // Value known a priori
    right_shift_t new_b_shr = b_shr + (desired_exp - a_exp);
    right_shift_t new_c_shr = c_shr + (desired_exp - a_exp);
When applying the above adjustment, the following conditions should be maintained:
b_hr + b_shr >= 0
c_hr + c_shr >= 0
Be aware that using smaller values than strictly necessary for b_shr and c_shr can result in saturation, and using larger values may result in unnecessary underflows or loss of precision.
Notes
Using the outputs of this function, an output mantissa which would otherwise be INT32_MIN will instead saturate to -INT32_MAX. This is due to the symmetric saturation logic employed by the VPU and is a hardware feature. This is a corner case which is usually unlikely and results in 1 LSb of error when it occurs.
Obtain the output exponent and shift parameter used by vect_s32_sqrt().
This function is used in conjunction with vect_s32_sqrt() to compute the square root of elements of a 32-bit BFP vector.
This function computes a_exp and b_shr.
a_exp is the exponent associated with output mantissa vector \(\bar a\), and should be chosen to maximize the precision of the results. To that end, this function chooses a_exp to be the smallest exponent known to avoid saturation of the resulting mantissa vector \(\bar a\). It is derived from the exponent and headroom of the input BFP vector.
b_shr is the shift parameter required by vect_s32_sqrt() to achieve the chosen output exponent a_exp.
b_exp is the exponent associated with the input mantissa vector \(\bar b\).
b_hr is the headroom of \(\bar b\). If it is unknown, it can be obtained using vect_s32_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the b_shr produced by this function can be adjusted according to the following:
    exponent_t a_exp;
    right_shift_t b_shr;
    vect_s32_sqrt_prepare(&a_exp, &b_shr, b_exp, b_hr);
    exponent_t desired_exp = ...; // Value known a priori
    b_shr = b_shr + (desired_exp - a_exp);
    a_exp = desired_exp;
When applying the above adjustment, the following condition should be maintained:
b_hr + b_shr >= 0
Be aware that using smaller values than strictly necessary for b_shr can result in saturation, and using larger values may result in unnecessary underflows or loss of precision.
Also, if a larger exponent is used than necessary, a larger depth parameter (see vect_s32_sqrt()) will be required to achieve the same precision, as the results are computed bit by bit, starting with the most significant bit.
Obtain the output exponent and input shifts required to perform a binary add-like operation.
This function computes the output exponent and input shifts required for BFP operations which take two vectors as input, where the operation is “add-like”.
Here, “add-like” operations are loosely defined as those which require input vectors to share an exponent before their mantissas can be meaningfully used to perform that operation.
For example, consider adding \( 3 \cdot 2^{x} + 4 \cdot 2^{y} \). If \(x = y\), then the mantissas can be added directly to get a meaningful result \( (3+4) \cdot 2^{x} \). If \(x \ne y\) however, adding the mantissas together is meaningless. Before the mantissas can be added in this case, one or both of the input mantissas must be shifted so that the representations correspond to the same exponent. Likewise, similar logic applies to binary comparisons.
This is in contrast to a “multiply-like” operation, which does not have this same requirement (e.g. \(a \cdot 2^x \cdot b \cdot 2^y = ab \cdot 2^{x+y}\), regardless of whether \(x=y\)).
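The distinction can be made concrete with scalars. In the sketch below, `bfp_scalar_t`, `add_like()` and `multiply_like()` are hypothetical names used only for illustration. The add aligns both operands to the larger exponent by right-shifting, and neither function models headroom or overflow.

```c
#include <assert.h>
#include <stdint.h>

/* A value mant * 2^exp (hypothetical type for this example only). */
typedef struct { int32_t mant; int exp; } bfp_scalar_t;

/* Add-like: bring both mantissas to a common exponent first, then add. */
static bfp_scalar_t add_like(bfp_scalar_t b, bfp_scalar_t c)
{
    bfp_scalar_t a;
    a.exp = (b.exp > c.exp) ? b.exp : c.exp;
    int b_shr = a.exp - b.exp;       /* >= 0 by construction */
    int c_shr = a.exp - c.exp;       /* >= 0 by construction */
    assert(b_shr < 32 && c_shr < 32);
    a.mant = (b.mant >> b_shr) + (c.mant >> c_shr);
    return a;
}

/* Multiply-like: no alignment needed, the exponents simply add. */
static bfp_scalar_t multiply_like(bfp_scalar_t b, bfp_scalar_t c)
{
    bfp_scalar_t a;
    a.mant = b.mant * c.mant;
    a.exp = b.exp + c.exp;
    return a;
}
```

For \(3 \cdot 2^{2} + 4 \cdot 2^{0}\), the second mantissa is shifted right by 2 to give \((3 + 1) \cdot 2^{2} = 16\), whereas the product is simply \(12 \cdot 2^{2} = 48\) with no shifting at all.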
\(\bar b\) and \(\bar c\) are the input mantissa vectors with exponents \(b\_exp\) and \(c\_exp\), which are shared by each element of their respective vectors. \(\bar a\) is the output mantissa vector with exponent \(a\_exp\). Two additional properties, \(b\_hr\) and \(c\_hr\), which are the headroom of mantissa vectors \(\bar b\) and \(\bar c\) respectively, are required by this function.
In addition to \(a\_exp\), this function computes \(b\_shr\) and \(c\_shr\), signed arithmetic right-shifts applied to the mantissa vectors \(\bar b\) and \(\bar c\) so that the add-like \(\oplus\) operation can be applied.
This function chooses \(a\_exp\) to be the minimum exponent which can be used to express both \(\bar B\) and \(\bar C\) without saturation of their mantissas, and which leaves both \(\bar b\) and \(\bar c\) with at least extra_operand_hr bits of headroom. The shifts \(b\_shr\) and \(c\_shr\) are derived from \(a\_exp\) using \(b\_exp\) and \(c\_exp\).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the b_shr and c_shr produced by this function can be adjusted according to the following:
exponent_t desired_exp = ...; // Value known a priori
right_shift_t new_b_shr = b_shr + (desired_exp - a_exp);
right_shift_t new_c_shr = c_shr + (desired_exp - a_exp);
When applying the above adjustment, the following conditions should be maintained:
b_hr+b_shr>=0
c_hr+c_shr>=0
Be aware that using smaller values than strictly necessary for b_shr and c_shr can result in saturation, and using larger values may result in unnecessary underflows or loss of precision.
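As a rough illustration of the adjustment and its conditions, the following sketch applies the offset to both shifts and rejects the change if either condition would be violated. The helper `sketch_adjust_shifts()` is a hypothetical name, and plain `int` and `unsigned` stand in for the library's exponent, shift and headroom types.

```c
#include <assert.h>

// Hypothetical helper: move the output exponent from a_exp to desired_exp by
// adding the same offset to both operand shifts. Returns 1 and updates the
// shifts if b_hr + b_shr >= 0 and c_hr + c_shr >= 0 still hold afterwards;
// returns 0 and leaves the shifts untouched otherwise.
static int sketch_adjust_shifts(int a_exp, int desired_exp,
                                unsigned b_hr, unsigned c_hr,
                                int *b_shr, int *c_shr)
{
    int delta = desired_exp - a_exp;
    int new_b_shr = *b_shr + delta;
    int new_c_shr = *c_shr + delta;

    // A negative shift is a left-shift, which must not exceed the headroom.
    if ((int)b_hr + new_b_shr < 0 || (int)c_hr + new_c_shr < 0) {
        return 0;
    }
    *b_shr = new_b_shr;
    *c_shr = new_c_shr;
    return 1;
}
```

Raising the exponent (a positive offset) always satisfies the two conditions but discards low-order bits, whereas lowering it is only possible while headroom remains.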
Notes
If \(b\_hr\) or \(c\_hr\) are unknown, they can be calculated using the appropriate headroom function (e.g. vect_complex_s16_headroom() for complex 16-bit vectors) or the value 0 can always be safely used (but may result in reduced precision).
b_shr – [out] Signed arithmetic right-shift to be applied to elements of \(\bar b\). Used by the function which computes the output mantissas \(\bar a\)
c_shr – [out] Signed arithmetic right-shift to be applied to elements of \(\bar c\). Used by the function which computes the output mantissas \(\bar a\)
b_exp – [in] Exponent of BFP vector \(\bar b\)
c_exp – [in] Exponent of BFP vector \(\bar c\)
b_hr – [in] Headroom of BFP vector \(\bar b\)
c_hr – [in] Headroom of BFP vector \(\bar c\)
extra_operand_hr – [in] The minimum amount of headroom that will be left in the mantissa vectors following the arithmetic right-shift, as required by some operations.
This function is used in conjunction with vect_complex_s16_macc() to perform an element-wise multiply-accumulate of complex 16-bit BFP vectors.
This function computes new_acc_exp, acc_shr and bc_sat, which are selected to maximize precision in the resulting accumulator vector without causing saturation of final or intermediate values. Normally the caller will pass these outputs to their corresponding inputs of vect_complex_s16_macc().
acc_exp is the exponent associated with the accumulator mantissa vector \(\bar a\) prior to the operation, whereas new_acc_exp is the exponent corresponding to the updated accumulator vector.
b_exp and c_exp are the exponents associated with the complex input mantissa vectors \(\bar b\) and \(\bar c\) respectively.
acc_hr, b_hr and c_hr are the headrooms of \(\bar a\), \(\bar b\) and \(\bar c\) respectively. If the headroom of any of these vectors is unknown, it can be obtained by calling vect_complex_s16_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the acc_shr and bc_sat produced by this function can be adjusted according to the following:
// Presumed to be set somewhere
exponent_t acc_exp, b_exp, c_exp;
headroom_t acc_hr, b_hr, c_hr;
exponent_t desired_exp;
...
// Call prepare
right_shift_t acc_shr, bc_sat;
vect_complex_s16_macc_prepare(&acc_exp, &acc_shr, &bc_sat, acc_exp, b_exp, c_exp, acc_hr, b_hr, c_hr);
// Modify results
right_shift_t mant_shr = desired_exp - acc_exp;
acc_exp += mant_shr;
acc_shr += mant_shr;
bc_sat += mant_shr;
// acc_shr and bc_sat may now be used in a call to vect_complex_s16_macc()
When applying the above adjustment, the following conditions should be maintained:
bc_sat>=0 (bc_sat is an unsigned right-shift)
acc_shr>-acc_hr (Shifting any further left may cause saturation)
It is up to the user to ensure any such modification does not result in saturation or unacceptable loss of precision.
This function is used in conjunction with vect_complex_s16_mul() to perform a complex element-wise multiplication of two complex 16-bit BFP vectors.
This function computes a_exp and a_shr.
a_exp is the exponent associated with mantissa vector \(\bar a\), and must be chosen to be large enough to avoid overflow when elements of \(\bar a\) are computed. To maximize precision, this function chooses a_exp to be the smallest exponent known to avoid saturation (see exception below). The a_exp chosen by this function is derived from the exponents and headrooms associated with the input vectors.
a_shr is the shift parameter required by vect_complex_s16_mul() to achieve the chosen output exponent a_exp.
b_exp and c_exp are the exponents associated with the input mantissa vectors \(\bar b\) and \(\bar c\) respectively.
b_hr and c_hr are the headroom of \(\bar b\) and \(\bar c\) respectively. If the headroom of \(\bar b\) or \(\bar c\) is unknown, they can be obtained by calling vect_complex_s16_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the a_shr produced by this function can be adjusted according to the following:
exponent_t desired_exp = ...; // Value known a priori
right_shift_t new_a_shr = a_shr + (desired_exp - a_exp);
When applying the above adjustment, the following conditions should be maintained:
new_a_shr>=0
Be aware that using smaller values than strictly necessary for a_shr can result in saturation, and using larger values may result in unnecessary underflows or loss of precision.
Notes
Using the outputs of this function, an output mantissa which would otherwise be INT16_MIN will instead saturate to -INT16_MAX. This is due to the symmetric saturation logic employed by the VPU and is a hardware feature. This is a corner case which is usually unlikely and results in 1 LSb of error when it occurs.
This function is used in conjunction with vect_complex_s16_real_mul() to perform a complex element-wise multiplication of a complex 16-bit BFP vector by a real 16-bit vector.
This function computes a_exp and a_shr.
a_exp is the exponent associated with mantissa vector \(\bar a\), and must be chosen to be large enough to avoid overflow when elements of \(\bar a\) are computed. To maximize precision, this function chooses a_exp to be the smallest exponent known to avoid saturation (see exception below). The a_exp chosen by this function is derived from the exponents and headrooms associated with the input vectors.
a_shr is the shift parameter required by vect_complex_s16_real_mul() to achieve the chosen output exponent a_exp.
b_exp and c_exp are the exponents associated with the input mantissa vectors \(\bar b\) and \(\bar c\) respectively.
b_hr and c_hr are the headroom of \(\bar b\) and \(\bar c\) respectively. If the headroom of \(\bar b\) or \(\bar c\) is unknown, they can be obtained by calling vect_complex_s16_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the a_shr produced by this function can be adjusted according to the following:
exponent_t desired_exp = ...; // Value known a priori
right_shift_t new_a_shr = a_shr + (desired_exp - a_exp);
When applying the above adjustment, the following conditions should be maintained:
new_a_shr>=0
Be aware that using smaller values than strictly necessary for a_shr can result in saturation, and using larger values may result in unnecessary underflows or loss of precision.
Notes
Using the outputs of this function, an output mantissa which would otherwise be INT16_MIN will instead saturate to -INT16_MAX. This is due to the symmetric saturation logic employed by the VPU and is a hardware feature. This is a corner case which is usually unlikely and results in 1 LSb of error when it occurs.
This function is used in conjunction with vect_complex_s16_squared_mag() to compute the squared magnitude of each element of a complex 16-bit BFP vector.
This function computes a_exp and a_shr.
a_exp is the exponent associated with mantissa vector \(\bar a\), and is chosen to maximize precision when elements of \(\bar a\) are computed. The a_exp chosen by this function is derived from the exponent and headroom associated with the input vector.
a_shr is the shift parameter required by vect_complex_s16_squared_mag() to achieve the chosen output exponent a_exp.
b_exp is the exponent associated with the input mantissa vector \(\bar b\).
b_hr is the headroom of \(\bar b\). If the headroom of \(\bar b\) is unknown it can be calculated using vect_complex_s16_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the a_shr produced by this function can be adjusted according to the following:
exponent_t a_exp;
right_shift_t a_shr;
vect_s16_mul_prepare(&a_exp, &a_shr, b_exp, c_exp, b_hr, c_hr);
exponent_t desired_exp = ...; // Value known a priori
a_shr = a_shr + (desired_exp - a_exp);
a_exp = desired_exp;
When applying the above adjustment, the following condition should be maintained:
a_shr>=0
Using larger values than strictly necessary for a_shr may result in unnecessary underflows or loss of precision.
This function is used in conjunction with vect_complex_s32_macc() to perform an element-wise multiply-accumulate of complex 32-bit BFP vectors.
This function computes new_acc_exp, acc_shr, b_shr and c_shr, which are selected to maximize precision in the resulting accumulator vector without causing saturation of final or intermediate values. Normally the caller will pass these outputs to their corresponding inputs of vect_complex_s32_macc().
acc_exp is the exponent associated with the accumulator mantissa vector \(\bar a\) prior to the operation, whereas new_acc_exp is the exponent corresponding to the updated accumulator vector.
b_exp and c_exp are the exponents associated with the complex input mantissa vectors \(\bar b\) and \(\bar c\) respectively.
acc_hr, b_hr and c_hr are the headrooms of \(\bar a\), \(\bar b\) and \(\bar c\) respectively. If the headroom of any of these vectors is unknown, it can be obtained by calling vect_complex_s32_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the acc_shr, b_shr and c_shr produced by this function can be adjusted according to the following:
// Presumed to be set somewhere
exponent_t acc_exp, b_exp, c_exp;
headroom_t acc_hr, b_hr, c_hr;
exponent_t desired_exp;
...
// Call prepare
right_shift_t acc_shr, b_shr, c_shr;
vect_complex_s32_macc_prepare(&acc_exp, &acc_shr, &b_shr, &c_shr, acc_exp, b_exp, c_exp, acc_hr, b_hr, c_hr);
// Modify results
right_shift_t mant_shr = desired_exp - acc_exp;
acc_exp += mant_shr;
acc_shr += mant_shr;
b_shr += mant_shr;
c_shr += mant_shr;
// acc_shr, b_shr and c_shr may now be used in a call to vect_complex_s32_macc()
When applying the above adjustment, the following conditions should be maintained:
acc_shr>-acc_hr (Shifting any further left may cause saturation)
b_shr >= -b_hr (Shifting any further left may cause saturation)
c_shr >= -c_hr (Shifting any further left may cause saturation)
It is up to the user to ensure any such modification does not result in saturation or unacceptable loss of precision.
This function is used in conjunction with vect_complex_s32_mag() to compute the magnitude of each element of a complex 32-bit BFP vector.
This function computes a_exp and b_shr.
a_exp is the exponent associated with mantissa vector \(\bar a\), and is chosen to maximize precision when elements of \(\bar a\) are computed. The a_exp chosen by this function is derived from the exponent and headroom associated with the input vector.
b_shr is the shift parameter required by vect_complex_s32_mag() to achieve the chosen output exponent a_exp.
b_exp is the exponent associated with the input mantissa vector \(\bar b\).
b_hr is the headroom of \(\bar b\). If the headroom of \(\bar b\) is unknown it can be calculated using vect_complex_s32_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the b_shr produced by this function can be adjusted according to the following:
exponent_t desired_exp = ...; // Value known a priori
right_shift_t new_b_shr = b_shr + (desired_exp - a_exp);
When applying the above adjustment, the following condition should be maintained:
b_hr+b_shr>=0
Using larger values than strictly necessary for b_shr may result in unnecessary underflows or loss of precision.
This function is used in conjunction with vect_complex_s32_mul() to perform a complex element-wise multiplication of two complex 32-bit BFP vectors.
This function computes a_exp, b_shr and c_shr.
a_exp is the exponent associated with mantissa vector \(\bar a\), and must be chosen to be large enough to avoid overflow when elements of \(\bar a\) are computed. To maximize precision, this function chooses a_exp to be the smallest exponent known to avoid saturation (see exception below). The a_exp chosen by this function is derived from the exponents and headrooms associated with the input vectors.
b_shr and c_shr are the shift parameters required by vect_complex_s32_mul() to achieve the chosen output exponent a_exp.
b_exp and c_exp are the exponents associated with the input mantissa vectors \(\bar b\) and \(\bar c\) respectively.
b_hr and c_hr are the headroom of \(\bar b\) and \(\bar c\) respectively. If the headroom of \(\bar b\) or \(\bar c\) is unknown, they can be obtained by calling vect_complex_s32_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the b_shr and c_shr produced by this function can be adjusted according to the following:
exponent_t desired_exp = ...; // Value known a priori
right_shift_t new_b_shr = b_shr + (desired_exp - a_exp);
right_shift_t new_c_shr = c_shr + (desired_exp - a_exp);
When applying the above adjustment, the following conditions should be maintained:
b_hr+b_shr>=0
c_hr+c_shr>=0
Be aware that using smaller values than strictly necessary for b_shr and c_shr can result in saturation, and using larger values may result in unnecessary underflows or loss of precision.
Notes
Using the outputs of this function, an output mantissa which would otherwise be INT32_MIN will instead saturate to -INT32_MAX. This is due to the symmetric saturation logic employed by the VPU and is a hardware feature. This is a corner case which is usually unlikely and results in 1 LSb of error when it occurs.
This function is used in conjunction with vect_complex_s32_real_mul() to perform the element-wise multiplication of a complex 32-bit BFP vector by a real 32-bit BFP vector.
This function computes a_exp, b_shr and c_shr.
a_exp is the exponent associated with mantissa vector \(\bar a\), and must be chosen to be large enough to avoid overflow when elements of \(\bar a\) are computed. To maximize precision, this function chooses a_exp to be the smallest exponent known to avoid saturation (see exception below). The a_exp chosen by this function is derived from the exponents and headrooms associated with the input vectors.
b_shr and c_shr are the shift parameters required by vect_complex_s32_real_mul() to achieve the chosen output exponent a_exp.
b_exp and c_exp are the exponents associated with the input mantissa vectors \(\bar b\) and \(\bar c\) respectively.
b_hr and c_hr are the headroom of \(\bar b\) and \(\bar c\) respectively. If the headroom of \(\bar b\) or \(\bar c\) is unknown, they can be obtained by calling vect_complex_s32_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the b_shr and c_shr produced by this function can be adjusted according to the following:
exponent_t desired_exp = ...; // Value known a priori
right_shift_t new_b_shr = b_shr + (desired_exp - a_exp);
right_shift_t new_c_shr = c_shr + (desired_exp - a_exp);
When applying the above adjustment, the following conditions should be maintained:
b_hr+b_shr>=0
c_hr+c_shr>=0
Be aware that using smaller values than strictly necessary for b_shr and c_shr can result in saturation, and using larger values may result in unnecessary underflows or loss of precision.
Notes
Using the outputs of this function, an output mantissa which would otherwise be INT32_MIN will instead saturate to -INT32_MAX. This is due to the symmetric saturation logic employed by the VPU and is a hardware feature. This is a corner case which is usually unlikely and results in 1 LSb of error when it occurs.
This function is used in conjunction with vect_complex_s32_scale() to perform a complex multiplication of a complex 32-bit BFP vector by a complex 32-bit scalar.
This function computes a_exp, b_shr and c_shr.
a_exp is the exponent associated with mantissa vector \(\bar a\), and must be chosen to be large enough to avoid overflow when elements of \(\bar a\) are computed. To maximize precision, this function chooses a_exp to be the smallest exponent known to avoid saturation (see exception below). The a_exp chosen by this function is derived from the exponents and headrooms associated with the input vectors.
b_shr and c_shr are the shift parameters required by vect_complex_s32_scale() to achieve the chosen output exponent a_exp.
b_exp and c_exp are the exponents associated with the input mantissa vectors \(\bar b\) and \(\bar c\) respectively.
b_hr and c_hr are the headroom of \(\bar b\) and \(\bar c\) respectively. If the headroom of \(\bar b\) or \(\bar c\) is unknown, they can be obtained by calling vect_complex_s32_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the b_shr and c_shr produced by this function can be adjusted according to the following:
exponent_t desired_exp = ...; // Value known a priori
right_shift_t new_b_shr = b_shr + (desired_exp - a_exp);
right_shift_t new_c_shr = c_shr + (desired_exp - a_exp);
When applying the above adjustment, the following conditions should be maintained:
b_hr+b_shr>=0
c_hr+c_shr>=0
Be aware that using smaller values than strictly necessary for b_shr and c_shr can result in saturation, and using larger values may result in unnecessary underflows or loss of precision.
Notes
Using the outputs of this function, an output mantissa which would otherwise be INT32_MIN will instead saturate to -INT32_MAX. This is due to the symmetric saturation logic employed by the VPU and is a hardware feature. This is a corner case which is usually unlikely and results in 1 LSb of error when it occurs.
This function is used in conjunction with vect_complex_s32_squared_mag() to compute the squared magnitude of each element of a complex 32-bit BFP vector.
This function computes a_exp and b_shr.
a_exp is the exponent associated with mantissa vector \(\bar a\), and is chosen to maximize precision when elements of \(\bar a\) are computed. The a_exp chosen by this function is derived from the exponent and headroom associated with the input vector.
b_shr is the shift parameter required by vect_complex_s32_squared_mag() to achieve the chosen output exponent a_exp.
b_exp is the exponent associated with the input mantissa vector \(\bar b\).
b_hr is the headroom of \(\bar b\). If the headroom of \(\bar b\) is unknown it can be calculated using vect_complex_s32_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the b_shr produced by this function can be adjusted according to the following:
exponent_t desired_exp = ...; // Value known a priori
right_shift_t new_b_shr = b_shr + (desired_exp - a_exp);
When applying the above adjustment, the following condition should be maintained:
b_hr+b_shr>=0
Using larger values than strictly necessary for b_shr may result in unnecessary underflows or loss of precision.
This function is used in conjunction with vect_complex_s32_sum() to compute the sum of elements of a complex 32-bit BFP vector.
This function computes a_exp and b_shr.
a_exp is the exponent associated with the 64-bit mantissa \(a\) returned by vect_complex_s32_sum(), and must be chosen to be large enough to avoid saturation when \(a\) is computed. To maximize precision, this function chooses a_exp to be the smallest exponent known to avoid saturation (see exception below). The a_exp chosen by this function is derived from the exponents and headrooms associated with the input vector.
b_shr is the shift parameter required by vect_complex_s32_sum() to achieve the chosen output exponent a_exp.
b_exp is the exponent associated with the input mantissa vector \(\bar b\).
b_hr is the headroom of \(\bar b\). If the headroom of \(\bar b\) is unknown it can be calculated using vect_complex_s32_headroom(). Alternatively, the value 0 can always be safely used (but may result in reduced precision).
length is the number of elements in the input mantissa vector \(\bar b\).
Adjusting Output Exponents
If a specific output exponent desired_exp is needed for the result (e.g. for emulating fixed-point arithmetic), the b_shr produced by this function can be adjusted according to the following:
exponent_t desired_exp = ...; // Value known a priori
right_shift_t new_b_shr = b_shr + (desired_exp - a_exp);
When applying the above adjustment, the following conditions should be maintained:
b_hr+b_shr>=0
Be aware that using smaller values than strictly necessary for b_shr can result in saturation, and using larger values may result in unnecessary underflows or loss of precision.
Compute the inner product between two vector chunks.
This function computes the inner product of two vector chunks, \(\bar b\) and \(\bar c\).
Conceptually, elements of \(\bar b\) may have any number of fractional bits (int, fixed-point, mantissas of a BFP vector) so long as they’re all the same. Elements of \(\bar c\) are Q2.30 fixed-point values. Given that, the returned value \(a\) will have the same number of fractional bits as \(\bar b\).
Only the lowest 32 bits of the sum \(a\) are returned.
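The fixed-point bookkeeping described above can be sketched in portable C. The names `SKETCH_CHUNK_LEN` and `sketch_chunk_inner_prod()` are hypothetical, and the rounding and accumulator width used here are assumptions made for illustration; the VPU implementation may round and accumulate differently.

```c
#include <assert.h>
#include <stdint.h>

// Assumed chunk length for 32-bit elements (8 elements per chunk).
#define SKETCH_CHUNK_LEN 8

// Inner product of chunk b (any fixed number of fractional bits) with chunk
// c_q30 (Q2.30). Dividing each product by 2^30 removes the fractional bits
// contributed by c_q30, so the result keeps the fractional bits of b.
static int32_t sketch_chunk_inner_prod(const int32_t b[SKETCH_CHUNK_LEN],
                                       const int32_t c_q30[SKETCH_CHUNK_LEN])
{
    int64_t acc = 0;
    for (int k = 0; k < SKETCH_CHUNK_LEN; k++) {
        // Division truncates toward zero (an assumption of this sketch).
        acc += ((int64_t)b[k] * c_q30[k]) / ((int64_t)1 << 30);
    }
    // Keep only the lowest 32 bits of the sum.
    return (int32_t)(uint32_t)acc;
}
```

With every element of `c_q30` set to `1 << 30` (the Q2.30 encoding of 1.0), the sketch simply returns the sum of the elements of `b`.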
Compute the natural log of a vector chunk of 32-bit values.
This function computes the natural logarithm of each of the 8 elements in vector chunk \(\bar b\). The result is returned as an 8-element chunk \(\bar a\) of Q8.24 values.
b_exp is the exponent associated with elements of \(\bar b\).
Any input \(b_k \le 0\) will result in a corresponding output \(a_k = \mathtt{INT32\_MIN}\).
Compute the natural log of a vector chunk of float_s32_t.
This function computes the natural logarithm of each of the VPU_INT32_EPV elements in vector chunk \(\bar b\). The result is returned as an 8-element chunk \(\bar a\) of Q8.24 values.
Any input \(b_k \le 0\) will result in a corresponding output \(a_k = \mathtt{INT32\_MIN}\).
Compute a power series on a vector chunk of Q2.30 values.
This function is used to compute a power series summation on a vector chunk (VPU_INT32_EPV-element vector) \(\bar b\). \(\bar b\) contains Q2.30 values. \(\bar c\) is a vector containing coefficients to be multiplied by powers of \(\bar b\), and may have any associated exponent. The output is vector chunk \(\bar a\) and has the same exponent as \(\bar c\).
c[] is an array with shape (term_count,VPU_INT32_EPV), where the second axis contains the same value replicated across all VPU_INT32_EPV elements. That is, c[k][i]=c[k][j] for i and j in 0..(VPU_INT32_EPV-1). This is for performance reasons. (For the purpose of this explanation, \(\bar c\) is considered to be single-dimensional, without redundancy.)
Compute \(e^b\) on a vector chunk of Q2.30 values.
This function computes \(e^{b_k}\) for each element of a vector chunk (VPU_INT32_EPV-element vector) \(\bar b\) of Q2.30 values near \(0\). The result is computed using the power series approximation of \(e^x\) near zero. It is recommended that this function only be used for \( -0.5 \le b_k\cdot{}2^{-30} \le 0.5\).
The output vector chunk \(\bar a\) is also in a Q2.30 format.
Convert fixed-point value to double-precision float.
This macro is meant to allow for parameterized access to the more specific conversion macros, such as F8(), F24(), F31() and so on. Being parameterized allows the user to specify the Q-format (fractional bit count) using another macro. For example:
Convert floating-point value to fixed-point value.
This macro is meant to allow for parameterized access to the more specific conversion macros, such as Q8(), Q24(), Q31() and so on. Being parameterized allows the user to specify the Q-format (fractional bit count) using another macro. For example:
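One way such parameterized conversion macros can be built is sketched below. Every `SKETCH_`-prefixed macro and `GAIN_FRAC_BITS` is a hypothetical stand-in defined only for this illustration; the library's own definitions may differ, for example in how they round.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical specific conversion macros (these truncate; no rounding).
#define SKETCH_Q8(f)   ((int32_t)((f) * 256.0))
#define SKETCH_Q24(f)  ((int32_t)((f) * 16777216.0))
#define SKETCH_F8(x)   ((double)(x) / 256.0)
#define SKETCH_F24(x)  ((double)(x) / 16777216.0)

// Two-level pasting, so that a macro argument such as GAIN_FRAC_BITS is
// expanded to its value before being joined onto the macro name.
#define SKETCH_PASTE(prefix, n)  prefix##n
#define SKETCH_Q(n)  SKETCH_PASTE(SKETCH_Q, n)
#define SKETCH_F(n)  SKETCH_PASTE(SKETCH_F, n)

// The Q-format is chosen in one place by another macro.
#define GAIN_FRAC_BITS 24

// Convert a gain of 1.5 to the selected Q-format.
static int32_t sketch_gain_fixed(void)
{
    return SKETCH_Q(GAIN_FRAC_BITS)(1.5);
}

// Convert a fixed-point value in the selected Q-format back to double.
static double sketch_gain_float(int32_t fixed)
{
    return SKETCH_F(GAIN_FRAC_BITS)(fixed);
}
```

Changing `GAIN_FRAC_BITS` to 8 in this sketch switches every conversion to the Q8 variants without touching the call sites.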
The number of leading sign bits for a complex integer is defined as the minimum of the number of leading sign bits for its real part and for its imaginary part.
The number of leading sign bits for a complex integer is defined as the minimum of the number of leading sign bits for its real part and for its imaginary part.
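The definition above can be written out as a short sketch for the 16-bit case. The type `sketch_complex_s16_t` and both functions are hypothetical names for this illustration, and the sketch assumes the count includes the sign bit itself (so a value of 0 has 16 leading sign bits).

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical complex type for this sketch.
typedef struct {
    int16_t re;
    int16_t im;
} sketch_complex_s16_t;

// Count the leading bits of x that are equal to its sign bit,
// including the sign bit itself.
static unsigned sketch_cls_s16(int16_t x)
{
    // For negative values, invert so that sign-bit copies become zeros.
    uint16_t u = (x < 0) ? (uint16_t)~(uint16_t)x : (uint16_t)x;
    unsigned count = 1;  // the sign bit
    for (int bit = 14; bit >= 0 && ((u >> bit) & 1u) == 0; bit--) {
        count++;
    }
    return count;
}

// Leading sign bits of a complex integer: the minimum over both parts.
static unsigned sketch_cls_complex_s16(sketch_complex_s16_t x)
{
    unsigned re_cls = sketch_cls_s16(x.re);
    unsigned im_cls = sketch_cls_s16(x.im);
    return (re_cls < im_cls) ? re_cls : im_cls;
}
```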
This function takes an integer index and reverses its bits least-significant bits to form a new integer, which is returned. All more significant bits are ignored.
This is useful for algorithms, such as the FFT, whose implementation requires reordering of elements by reversing the bits of the indices.
Parameters:
index – [in] Input value
bits – [in] The number of least-significant bits to reverse.
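The behaviour described above can be sketched in portable C as follows. The function name `sketch_bitrev()` is hypothetical, and this loop-based version is only an illustration of the operation, not the library's implementation.

```c
#include <assert.h>

// Reverse the `bits` least-significant bits of `index`.
// Any more significant bits of `index` are ignored.
static unsigned sketch_bitrev(unsigned index, unsigned bits)
{
    assert(bits <= 32);
    unsigned result = 0;
    for (unsigned i = 0; i < bits; i++) {
        // Take bit i of the input and append it as the lowest bit of the result.
        result = (result << 1) | ((index >> i) & 1u);
    }
    return result;
}
```

For an 8-point FFT (`bits` equal to 3), index 1 (binary 001) maps to 4 (binary 100) and index 3 (binary 011) maps to 6 (binary 110).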
Indicates whether the BFP functions should check vector lengths for errors.
Iff true, BFP functions will check (assert()) to ensure that each BFP vector argument does not violate any length constraints. Most often this simply ensures that, where BFP functions take multiple vectors as parameters, each of the vectors has the same length.
Defaults to false (0).
XMATH_BFP_SQRT_DEPTH_S16
The number of most significant bits which are computed by bfp_s16_sqrt().
The function bfp_s16_sqrt() computes results one bit at a time, starting with bit 14 (the second-to-most significant bit). Because this is a relatively expensive operation, it may be desirable to trade off precision of results for a speed-up.
The time cost of bfp_s16_sqrt() is approximately linear with respect to the depth.
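The trade-off between depth and precision can be seen in a generic bit-by-bit square root. The function `sketch_sqrt_bits()` below is a hypothetical illustration operating on plain unsigned integers; it is not the library's algorithm, which works on signed BFP mantissas.

```c
#include <assert.h>
#include <stdint.h>

// Compute the `depth` most significant bits of floor(sqrt(x)), one bit per
// iteration, starting from the top bit of the 16-bit result. Bits below the
// requested depth are left as zero.
static uint16_t sketch_sqrt_bits(uint32_t x, unsigned depth)
{
    uint16_t result = 0;
    for (unsigned i = 0; i < depth && i < 16; i++) {
        // Tentatively set the next bit and keep it if the square still fits.
        uint16_t trial = (uint16_t)(result | (1u << (15 - i)));
        if ((uint32_t)trial * trial <= x) {
            result = trial;
        }
    }
    return result;
}
```

Each additional bit of depth costs one more iteration, which is why the run time of such a method grows roughly linearly with the depth.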
The number of most significant bits which are computed by bfp_s32_sqrt().
The function bfp_s32_sqrt() computes results one bit at a time, starting with bit 30 (the second-to-most significant bit). Because this is a relatively expensive operation, it may be desirable to trade off precision of results for a speed-up.
The time cost of bfp_s32_sqrt() is approximately linear with respect to the depth.
This library makes use of the XMOS architecture’s vector processing unit (VPU). All loads and stores to and from the XS3 VPU have the requirement that the loaded/stored addresses must be aligned to a 4-byte boundary (word-aligned).
In the current version of the API, this leads to the requirement that most API functions require vectors (or the data backing a BFP vector) to begin at word-aligned addresses. Vectors are not required, however, to have a size (in bytes) that is a multiple of 4.
Writing Alignment-safe Code
The alignment requirement is ultimately always on the data that backs a vector. For the low-level API, that is the pointers passed to the functions themselves. For the high-level API, that is the memory to which the data field (or the real and imag fields in the case of bfp_complex_s16_t) points, specified when the BFP vector is initialized.
Arrays of type int32_t and complex_s32_t will normally be guaranteed to be word-aligned by the compiler. However, if the user manually specifies the address at which an int32_t array begins, the vector may not be word-aligned. It is the responsibility of the user to ensure proper alignment of data.
For int16_t arrays, the compiler does not by default guarantee that the array starts on a word-aligned address. To force word-alignment on arrays of this type, use __attribute__((aligned(4))) in the variable definition, as in the following.
int16_t __attribute__((aligned(4))) data[100];
Occasionally, 8-byte (double word) alignment is required. In this case, neither int32_t nor int16_t is necessarily guaranteed to align as required. Similar to the above, this can be hinted to the compiler as in the following.
int32_t __attribute__((aligned(8))) data[100];
This library also provides the macros WORD_ALIGNED and DWORD_ALIGNED which force 4- and 8-byte alignment respectively as above.
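The alignment rules above can be checked at run time with a small sketch. All names here (`sketch_samples`, `sketch_dword_buf`, `sketch_bytes`, `sketch_is_aligned()`) are hypothetical, and the attribute syntax assumes a GCC-compatible compiler.

```c
#include <assert.h>
#include <stdint.h>

// Word-aligned 16-bit buffer and double-word-aligned 32-bit buffer.
int16_t __attribute__((aligned(4))) sketch_samples[100];
int32_t __attribute__((aligned(8))) sketch_dword_buf[100];

// A byte buffer that itself starts on a word boundary.
int8_t __attribute__((aligned(4))) sketch_bytes[64];

// Returns non-zero if ptr is a multiple of `bytes`.
static int sketch_is_aligned(const void *ptr, uintptr_t bytes)
{
    return ((uintptr_t)ptr % bytes) == 0;
}
```

In this sketch `sketch_is_aligned(sketch_samples, 4)` holds, whereas the address `&sketch_bytes[1]` is one byte past a word boundary, so a vector placed there would not satisfy the word-alignment requirement.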
With ordinary integer arithmetic the block floating-point logic chooses exponents and operand shifts to prevent integer overflow with worst-case input values. However, the XS3 VPU uses symmetrically saturating integer arithmetic.
Saturating arithmetic is that where partial results of the applied operation use a bit depth greater than the output bit depth, and values that can’t be properly expressed with the output bit depth are set to the nearest expressible value.
For example, in ordinary C integer arithmetic, a function which multiplies two 32-bit integers may internally compute the full 64-bit product and then clamp values to the range (INT32_MIN,INT32_MAX) before returning a 32-bit result.
Symmetrically saturating arithmetic also includes the property that the lower bound of the expressible range is the negative of the upper bound of the expressible range.
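A minimal sketch of symmetric saturation to 16 bits follows. The function `sketch_sat16_sym()` is a hypothetical name used for illustration; on the device the saturation is performed by the VPU hardware rather than by code like this.

```c
#include <assert.h>
#include <stdint.h>

// Clamp a wider intermediate result to the symmetric 16-bit range
// [-INT16_MAX, INT16_MAX]. Note that INT16_MIN itself is excluded.
static int16_t sketch_sat16_sym(int32_t x)
{
    if (x > INT16_MAX) {
        return INT16_MAX;
    }
    if (x < -INT16_MAX) {
        return -INT16_MAX;
    }
    return (int16_t)x;
}
```

Because the range is symmetric, every saturated result can be negated without overflow: an input of INT16_MIN becomes -INT16_MAX, whose negation INT16_MAX is still representable.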
One of the major troubles with non-saturating integer arithmetic is that in a twos complement encoding, there exists a non-zero integer (e.g. INT16_MIN in 16-bit twos complement arithmetic) value \(x\) for which \(-1 \cdot x = x\). Serious arithmetic errors can result when this case is not accounted for.
One of the results of symmetric saturation, on the other hand, is that there is a corner case where (using the same exponent and shift logic as non-saturating arithmetic) saturation may occur for a particular combination of input mantissas. The corner case is different for different operations.
When the corner case occurs, the minimum (and largest magnitude) value of the resulting vector is 1 LSb greater than its ideal value (e.g. -0x3FFF instead of -0x4000 for 16-bit arithmetic). The error in this output element’s mantissa is then 1 LSb, or \(2^p\), where \(p\) is the exponent of the resulting BFP vector.
Of course, the very nature of BFP arithmetic routinely involves errors of this magnitude.
In its general form, the \(N\)-point Discrete Fourier Transform is an operation applied to a complex \(N\)-point signal \(x[n]\) to produce a complex spectrum \(X[f]\). Any spectrum \(X[f]\) which is the result of a \(N\)-point DFT has the property that \(X[f+N] = X[f]\). Thus, the complete representation of the \(N\)-point DFT of \(x[n]\) requires \(N\) complex elements.
Complex DFT and IDFT
In this library, when performing a complex DFT (e.g. using fft_bfp_forward_complex()), the spectral representation that results is a straightforward mapping:
X[f]\(\longleftarrow X[f]\) for \(0 \le f < N\)
where X is an \(N\)-element array of complex_s32_t, where the real part of \(X[f]\) is in X[f].re and the imaginary part in X[f].im.
Likewise, when performing an \(N\)-point complex inverse DFT, that is also the representation that is expected.
Real DFT and IDFT
Oftentimes we instead wish to compute the DFT of real signals. In addition to the periodicity property ( \(X[f+N] = X[f]\)), the DFT of a real signal also has a complex conjugate symmetry such that \(X[-f] = X^*[f]\), where \(X^*[f]\) is the complex conjugate of \(X[f]\). This symmetry makes it redundant (and thus undesirable) to store such symmetric pairs of elements. This would allow us to get away with only explicitly storing \(X[f]\) for \(0 \le f \le N/2\) in \((N/2)+1\) complex elements.
Unfortunately, using such a representation has the undesirable property that the DFT of an \(N\)-point real signal cannot be computed in-place, as the representation requires more memory than we started with.
However, if we take the periodicity and complex conjugate symmetry properties together:
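Written out, the step can be reconstructed from the two properties stated above. This is a reconstruction rather than the original typesetting, and it assumes \(N\) is even. Setting \(f = 0\) in the symmetry property, and then combining periodicity with symmetry at \(f = N/2\), gives:

\[
X[0] = X^*[0] \;\Rightarrow\; \operatorname{Im}\{X[0]\} = 0
\]
\[
X[N/2] = X[N/2 - N] = X[-N/2] = X^*[N/2] \;\Rightarrow\; \operatorname{Im}\{X[N/2]\} = 0
\]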
Because both \(X[0]\) and \(X[N/2]\) are guaranteed to be real, we can recover the benefit of in-place computation in our representation by packing the real part of \(X[N/2]\) into the imaginary part of \(X[0]\).
Therefore, the functions in this library that produce the spectra of real signals (such as fft_bfp_forward_mono() and fft_bfp_forward_stereo()) will pack the spectra in a slightly less straight-forward manner (as compared with the complex DFTs):
X[f]\(\longleftarrow X[f]\) for \(1 \le f < N/2\)
X[0]\(\longleftarrow X[0] + j X[N/2]\)
where X is an \(N/2\)-element array of complex_s32_t.
Likewise, this is the encoding expected when computing the \(N\)-point inverse DFT, such as by fft_bfp_inverse_mono() or fft_bfp_inverse_stereo().
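The packing described above can be undone with a few lines of C. The sketch below expands a packed \(N/2\)-element spectrum into the \((N/2)+1\) explicit bins \(X[0]\) through \(X[N/2]\). The type `cplx_s32` is a local stand-in that mirrors the `re` and `im` fields of complex_s32_t so that the example compiles on its own, and `unpack_real_spectrum` is a hypothetical helper, not a library function.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in with the same field names as the library's complex_s32_t. */
typedef struct { int32_t re; int32_t im; } cplx_s32;

/* Expand a packed real spectrum (n/2 elements) into n/2 + 1 explicit bins.
 * n is the DFT length; 'full' must have room for n/2 + 1 elements and
 * must not overlap 'packed'. */
static void unpack_real_spectrum(const cplx_s32 *packed, cplx_s32 *full, unsigned n)
{
    const unsigned half = n / 2;

    /* X[0] is purely real and lives in the real part of packed[0]. */
    full[0].re = packed[0].re;
    full[0].im = 0;

    /* X[N/2] is purely real and was packed into the imaginary part of packed[0]. */
    full[half].re = packed[0].im;
    full[half].im = 0;

    /* Bins 1 .. N/2 - 1 are stored directly. */
    for (unsigned f = 1; f < half; f++) {
        full[f] = packed[f];
    }
}
```

The unpacked form needs one more complex element than the packed form, which is exactly why the packed encoding is the one used for in-place computation.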
Note
When performing a stereo DFT or inverse DFT, the spectra of the two signals are encoded into adjacent blocks of memory so that the result can still be computed in-place, with the second spectrum (i.e. the one associated with ‘channel b’) occupying the higher memory address.
When computing DFTs this library relies on one or both of a pair of look-up tables which contain portions of the Discrete Fourier Transform matrix. Longer FFT lengths require larger look-up tables. When building using CMake, the maximum FFT length can be specified as a CMake option, and these tables are auto-generated at build time.
If not using CMake, you can manually generate these files using a python script included with the library. The script is located at lib_xcore_math/python/gen_fft_table.py. If generated manually, you must add the generated .c file as a source, and the directory containing xmath_fft_lut.h must be added as an include directory when compiling the library’s files.
Note that the header file must be named xmath_fft_lut.h as it is included via #include "xmath_fft_lut.h".
By default the tables contain the coefficients necessary to perform forward or inverse DFTs of up to 1024 points. If larger DFTs are required, or if the maximum required DFT size is known to be less than 1024 points, the MAX_FFT_LEN_LOG2 CMake option can be modified from its default value of 10.
The two look-up tables correspond to the decimation-in-time and decimation-in-frequency FFT algorithms, and the run-time symbols for the tables are xmath_dit_fft_lut and xmath_dif_fft_lut respectively. Each table contains \(N-4\) complex 32-bit values, where \(N\) is the maximum FFT length, giving a size of \(8\cdot (N-4)\) bytes per table.
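As a quick check of the memory cost, the size formula above can be evaluated for a given maximum FFT length. The helper `fft_lut_bytes` below is hypothetical and simply applies \(8\cdot (N-4)\) with \(N = 2^{\text{log2}}\); it does not query the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes occupied by ONE FFT look-up table, using the
 * 8 * (N - 4) figure quoted above, where N = 2^max_fft_len_log2. */
static size_t fft_lut_bytes(unsigned max_fft_len_log2)
{
    /* N must be at least 4 for the formula to make sense. */
    assert(max_fft_len_log2 >= 2);

    const size_t n = (size_t)1 << max_fft_len_log2;
    return 8u * (n - 4u);
}
```

With the default MAX_FFT_LEN_LOG2 of 10 this gives 8160 bytes per table, and raising it to 14 gives 131040 bytes per table. If both the decimation-in-time and decimation-in-frequency tables are built, the total is twice that.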
To manually regenerate the tables for a maximum FFT length of \(16384\) (\(=2^{14}\)), supporting only the decimation-in-time algorithm, for example, use the following:
This library supports optimized implementations of 16- and 32-bit FIR filters, as well as cascaded 32-bit biquad filters. Each of these filter implementations requires that the filter coefficients be represented in a compatible form.
To assist with that, several Python scripts are distributed with this library which can be used to convert existing floating-point filter coefficients into code which is easily callable from within an xCore application.
Each script reads in floating-point filter coefficients from a file and computes a new representation for the filter with coefficients which attempt to maximize precision and are compatible with the lib_xcore_math filtering API.
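The general idea of converting floating-point coefficients into a fixed-point form that preserves precision can be sketched as follows. This is an illustration only and is not the algorithm used by the distributed scripts. The function name `quantize_coefs_s32`, the shared-exponent format, and the choice of one bit of headroom are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: quantize floating-point coefficients to int32 mantissas
 * that share one exponent, so that coef[i] ~= mant[i] * 2^exponent.
 * The largest magnitude is scaled into [2^29, 2^30), which uses most of the
 * 32-bit range while leaving one bit of headroom. Returns the exponent. */
static int quantize_coefs_s32(const double *coef, int32_t *mant, unsigned count)
{
    /* Find the largest coefficient magnitude. */
    double max_abs = 0.0;
    for (unsigned i = 0; i < count; i++) {
        double a = coef[i] < 0 ? -coef[i] : coef[i];
        if (a > max_abs) max_abs = a;
    }

    /* Search for the power-of-two scale; invariant: scale == 2^(-exponent). */
    int exponent = 0;
    double scale = 1.0;
    if (max_abs > 0.0) {
        while (max_abs * scale <  536870912.0)  { scale *= 2.0; exponent--; } /* below 2^29 */
        while (max_abs * scale >= 1073741824.0) { scale *= 0.5; exponent++; } /* at or above 2^30 */
    }

    /* Scale each coefficient and round to the nearest integer. */
    for (unsigned i = 0; i < count; i++) {
        double v = coef[i] * scale;
        mant[i] = (int32_t)(v < 0 ? v - 0.5 : v + 0.5);
    }
    return exponent;
}
```

For the coefficients 0.5, -0.25 and 0.125 this sketch selects an exponent of -30, so the mantissas are \(2^{29}\), \(-2^{28}\) and \(2^{27}\), and each coefficient is represented exactly.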
Each script outputs two files which can be included in your own xCore application. The first output is a C source (.c) file containing the computed filter parameters and several function definitions for initializing and executing the generated filter. The second output is a C header (.h) file which can be #included into your own application to give access to those functions.
Additionally, each script also takes a user-provided filter name as an input parameter. The output files (as well as the function names within) include the filter name so that more than one filter can be generated and executed using this mechanism.
As an example, take the following command to generate a 32-bit FIR filter:
This command creates a filter named “MyFilter”, with coefficients taken from a file filter_coefs.txt. Two output files will be generated, MyFilter.c and MyFilter.h. Including MyFilter.h provides access to 3 functions, MyFilter_init(), MyFilter_add_sample(), and MyFilter() which correspond to the library functions filter_fir_s32_init(), filter_fir_s32_add_sample() and filter_fir_s32() respectively.
Use the --help flag with the scripts for more detailed descriptions of inputs and other options.
Several example applications are offered to demonstrate use of the lib_xcore_math APIs through simple code examples.
app_bfp_demo - Demonstration of the block floating-point arithmetic API
app_vect_demo - Demonstration of the low-level vectorized arithmetic API
app_fft_demo - Demonstration of the Fast Fourier Transform API
app_filter_demo - Demonstration of the filtering API
This section assumes you have downloaded and installed the XMOS XTC tools
(see README for required version).
Installation instructions can be found here.
The purpose of this example application is to demonstrate how the arithmetic functions of
lib_xcore_math’s block floating-point API may be used.
In it, three 32-bit BFP vectors are allocated, initialized and filled with random data. Then several
BFP operations are applied using those vectors as inputs and/or outputs.
The example only demonstrates the real 32-bit arithmetic BFP functions (that is, functions with
names bfp_s32_*). The real 16-bit (bfp_s16_*), complex 32-bit (bfp_complex_s32_*) and
complex 16-bit (bfp_complex_s16_*) functions all use similar naming conventions.
The purpose of this example application is to demonstrate how the arithmetic functions of
lib_xcore_math’s lower-level vector API may be used.
In general, the low-level arithmetic API comprises the functions in this library whose names begin with vect_*, such as vect_s32_mul() for element-wise multiplication of 32-bit vectors, and vect_complex_s16_scale() for multiplying a complex 16-bit vector by a complex scalar.
We assume that where the low-level API is being used it is because some behavior other than the
default behavior of the high-level block floating-point API is required. Given that, rather than
showcasing the breadth of operations available, this example examines first how to achieve
comparable behavior to the BFP API, and then ways in which that behavior can be modified.