Library Notes

Note: Vector Alignment

This library utilizes the XMOS XS3 architecture’s vector processing unit (VPU). All loads and stores to and from the XS3 VPU have the requirement that the loaded/stored addresses must be aligned to a 4-byte boundary (word-aligned).

In the current version of the API, this leads to the requirement that most API functions require vectors (or the data backing a BFP vector) to begin at word-aligned addresses. Vectors are not required, however, to have a size (in bytes) that is a multiple of 4.

Writing Alignment-safe Code

The alignment requirement is ultimately always on the data that backs a vector. For the low-level API, that is the pointers passed to the functions themselves. For the high-level API, that is the memory to which the data field (or the real and imag fields in the case of bfp_complex_s16_t) points, specified when the BFP vector is initialized.

Arrays of type int32_t and complex_s32_t will normally be guaranteed to be word-aligned by the compiler. However, if the user manually specifies the beginning of an int32_t array, as in the following..

uint8_t byte_buffer[100];
int32_t* integer_array = (int32_t*) &byte_buffer[1];

.. it is the responsibility of the user to ensure proper alignment of data.

For int16_t arrays, the compiler does not by default guarantee that the array starts on a word-aligned address. To force word-alignment on arrays of this type, use __attribute__((align 4)) in the variable definition, as in the following.

int16_t __attribute__((align 4)) data[100];

Occasionally, 8-byte (double word) alignment is required. In this case, neither int32_t nor int16_t is necessarily guaranteed to align as required. Similar to the above, this can be hinted to the compiler as in the following.

int32_t __attribute__((align 8)) data[100];

Note: Symmetrically Saturating Arithmetic

With ordinary integer arithmetic the block floating-point logic chooses exponents and operand shifts to prevent integer overflow with worst-case input values. However, the XS3 VPU utilizes symmetrically saturating integer arithmetic.

Saturating arithmetic is that where partial results of the applied operation use a bit depth greater than the output bit depth, and values that can’t be properly expressed with the output bit depth are set to the nearest expressible value.

For example, in ordinary C integer arithmetic, a function which multiplies two 32-bit integers may internally compute the full 64-bit product and then clamp values to the range (INT32_MIN, INT32_MAX) before returning a 32-bit result.

Symmetrically saturating arithmetic also includes the property that the lower bound of the expressible range is the negative of the upper bound of the expressible range.

One of the major troubles with non-saturating integer arithmetic is that in a twos complement encoding, there exists a non-zero integer (e.g. INT16_MIN in 16-bit twos complement arithmetic) value \(x\) for which \(-1 \cdot x = x\). Serious arithmetic errors can result when this case is not accounted for.

One of the results of symmetric saturation, on the other hand, is that there is a corner case where (using the same exponent and shift logic as non-saturating arithmetic) saturation may occur for a particular combination of input mantissas. The corner case is different for different operations.

When the corner case occurs, the minimum (and largest magnitude) value of the resulting vector is 1 LSb greater than its ideal value (e.g. -0x3FFF instead of -0x4000 for 16-bit arithmetic). The error in this output element’s mantissa is then 1 LSb, or \(2^p\), where \(p\) is the exponent of the resulting BFP vector.

Of course, the very nature of BFP arithmetic routinely involves errors of this magnitude.

Note: Spectrum Packing

In its general form, the \(N\)-point Discrete Fourier Transform is an operation applied to a complex \(N\)-point signal \(x[n]\) to produce a complex spectrum \(X[f]\). Any spectrum \(X[f]\) which is the result of a \(N\)-point DFT has the property that \(X[f+N] = X[f]\). Thus, the complete representation of the \(N\)-point DFT of \(X[n]\) requires \(N\) complex elements.

Complex DFT and IDFT

In this library, when performing a complex DFT (e.g. using bfp_fft_forward_complex()), the spectral representation that results in a straight-forward mapping:

X[f] \(\longleftarrow X[f]\) for \(0 \le f \lt N\)

where X is an \(N\)-element array of complex_s32_t, where the real part of \(X[f]\) is in X[f].re and the imaginary part in X[f].im.

Likewise, when performing an \(N\)-point complex inverse DFT, that is also the representation that is expected.

Real DFT and IDFT

Oftentimes we instead wish to compute the DFT of real signals. In addition to the periodicity property ( \(X[f+N] = X[f]\)), the DFT of a real signal also has a complex conjugate symmetry such that \(X[-f] = X^*[f]\), where \(X^*[f]\) is the complex conjugate of \(X[f]\). This symmetry makes it redundant (and thus undesirable) tostore such symmetric pairs of elements. This would allow us to get away with only explicitly storing \(X[f\) for \(0 \le f \le N/2\) in \((N/2)+1\) complex elements.

Unfortunately, using such a representation has the undesirable property that the DFT of an \(N\)-point real signal cannot be computed in-place, as the representation requires more memory than we started with.

However, if we take the periodicity and complex conjugate symmetry properties together:

\[\begin{split} & X[0] = X^*[0] \rightarrow Imag\{X[0]\} = 0 \\ & X[-(N/2) + N] = X[N/2] \\ & X[-N/2] = X^*[N/2] \rightarrow X[N/2] = X^*[N/2] \rightarrow Imag \{ X[N/2] \} = 0 \end{split}\]

Because both \(X[0]\) and \(X[N/2]\) are guaranteed to be real, we can recover the benefit of in-place computation in our representation by packing the real part of \(X[N/2]\) into the imaginary part of \(X[0]\).

Therefore, the functions in this library that produce the spectra of real signals (such as bfp_fft_forward_mono() and bfp_fft_forward_stereo()) will pack the spectra in a slightly less straight-forward manner (as compared with the complex DFTs):

X[f] \(\longleftarrow X[f]\) for \(1 \le f \lt N/2\)

X[0] \(\longleftarrow X[0] + j X[N/2]\)

where X is an \(N/2\)-element array of complex_s32_t.

Likewise, this is the encoding expected when computing the \(N\)-point inverse DFT, such as by bfp_fft_inverse_mono() or bfp_fft_inverse_stereo().


One additional note, when performing a stereo DFT or inverse DFT, so as to preserve the in-place computation of the result, the spectra of the two signals will be encoded into adjacent blocks of memory, with the second spectrum (i.e. associated with ‘channel b’) occupying the higher memory address.

Note: Library FFT Length Support

When computing DFTs this library relies on one or both of a pair of look-up tables which contain portions of the Discrete Fourier Transform matrix. Longer FFT lengths require larger look-up tables. When building using CMake, the maximum FFT length can be specified as a CMake option, and these tables are auto-generated at build time.

If not using CMake, you can manually generate these files using a python script included with the library. The script is located at lib_xs3_math/scripts/ If generated manually, you must add the generated .c file as a source, and the directory containing xs3_fft_lut.h must be added as an include directory when compiling the library’s files.

Note that the header file must be named xs3_fft_lut.h as it is included via #include "xs3_fft_lut.h".

By default the tables contain the coefficients necessary to perform forward or inverse DFTs of up to 1024 points. If larger DFTs are required, or if the maximum required DFT size is known to be less than 1024 points, the MAX_FFT_LEN_LOG2 CMake option can be modified from its default value of 10.

The two look-up tables correspond to the decimation-in-time and decimation-in-frequency FFT algorithms, and the run-time symbols for the tables are xs3_dit_fft_lut and xs3_dif_fft_lut respectively. Each table contains \(N-4\) complex 32-bit values, with a size of \(8\cdot (N-4)\) bytes each.

To manually regenerate the tables for amaximum FFT length of \(16384\) ( \(=2^{14}\)), supporting only the decimation-in-time algorithm, for example, use the following:

python lib_xs3_math/script/ --dit --max_fft_log2 14

Use the --help flag with for a more detailed description of its syntax and parameters.

Note: Digital Filter Conversion

This library supports optimized implementations of 16- and 32-bit FIR filters, as well as cascaded 32-bit biquad filters. Each of these filter implementations requires that the filter coefficients be represented in a compatible form.

To assist with that, several python scripts are distributed with this library which can be used to convert existing floating-point filter coefficients into a code which is easily callable from within an xCore application.

Each script reads in floating-point filter coefficients from a file and computes a new representation for the filter with coefficients which attempt to maximize precision and are compatible with the lib_xs3_math filtering API.

Each script outputs two files which can be included in your own xCore application. The first output is a C source (.c) file containing the computed filter parameters and several function definitions for initializing and executing the generated filter. The second output is a C header (.h) file which can be #included into your own application to give access to those functions.

Additionally, each script also takes a user-provided filter name as an input parameter. The output files (as well as the function names within) include the filter name so that more than one filter can be generated and executed using this mechanism.

As an example, take the following command to generate a 32-bit FIR filter:

python lib_xs3_math/script/ MyFilter filter_coefs.txt

This command creates a filter named “MyFilter”, with coefficients taken from a file filter_coefs.txt. Two output files will be generated, MyFilter.c and MyFilter.h. Including MyFilter.h provides access to 3 functions, MyFilter_init(), MyFilter_add_sample(), and MyFilter() which correspond to the library functions xs3_filter_fir_s32_init(), xs3_filter_fir_s32_add_sample() and xs3_filter_fir_s32() respectively.

Use the --help flag with the scripts for more detailed descriptions of inputs and other options.

Filter Type


32-bit FIR


16-bit FIR


32-bit Biquad