32-Bit IEEE 754 Float API

group vect_f32_api

Functions

complex_float_t *fft_f32_forward(float x[], const unsigned fft_length)

Perform forward FFT on a vector of IEEE754 floats.

This function takes real input vector \(\bar x\) and performs a forward FFT on the signal in-place to get output vector \(\bar{X} = FFT\{\bar{x}\}\). This implementation is accelerated by converting the IEEE754 float vector into a block floating-point representation to compute the FFT. The resulting BFP spectrum is then converted back to IEEE754 single-precision floats. The operation is performed in-place on x[].

See bfp_fft_forward_mono() for the details of the FFT.

Whereas the input x[] is an array of fft_length float elements, the output (placed in x[]) is an array of fft_length/2 complex_float_t elements, so x[] should be cast to complex_float_t* after the call (or the returned pointer used instead).

const unsigned FFT_N = 512;
float time_series[FFT_N] = { ... };
fft_f32_forward(time_series, FFT_N);
complex_float_t* freq_spectrum = (complex_float_t*) &time_series[0];
const unsigned FREQ_BINS = FFT_N/2;
// e.g.   freq_spectrum[FREQ_BINS-1].re

x[] must begin at a double-word-aligned address.
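
One way to satisfy the alignment requirement is to request it when the buffer is declared. The following is a minimal sketch, assuming a C11 compiler and that a double word is 8 bytes (two 32-bit words); `time_series_buf` and `is_dword_aligned()` are hypothetical names and not part of this API.

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_FFT_N 512

// Hypothetical buffer: alignas(8) requests 8-byte alignment.
// Assumption: a double word is 8 bytes.
static alignas(8) float time_series_buf[SKETCH_FFT_N];

// Hypothetical helper: true if p lies on an 8-byte boundary.
static bool is_dword_aligned(const void *p)
{
    return ((uintptr_t)p % 8u) == 0u;
}
```

A check such as `is_dword_aligned(time_series_buf)` can be placed in a debug assertion before the buffer is passed to an FFT call.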

Operation Performed

\[\begin{aligned} & \bar{X} \leftarrow FFT\{\bar{x}\} \end{aligned}\]

Parameters:

x – [inout] Input vector \(\bar x\)
fft_length – [in] The length of \(\bar x\)

Throws ET_LOAD_STORE:

Raised if x is not double-word-aligned (See Note: Vector Alignment)

Returns:

Pointer to frequency-domain spectrum (i.e. ((complex_float_t*) &x[0]))

float *fft_f32_inverse(complex_float_t X[], const unsigned fft_length)

Perform inverse FFT on a vector of complex_float_t.

This function takes complex input vector \(\bar X\) and performs an inverse real FFT on the spectrum in-place to get output vector \(\bar{x} = IFFT\{\bar{X}\}\). This implementation is accelerated by converting the IEEE754 float vector into a block floating-point representation to compute the IFFT. The resulting BFP signal is then converted back to IEEE754 single-precision floats. The operation is performed in-place on X[].

See bfp_fft_inverse_mono() for the details of the IFFT.

Input X[] is an array of fft_length/2 complex_float_t elements. The output (placed in X[]) is an array of fft_length float elements.

const unsigned FFT_N = 512;
complex_float_t freq_spectrum[FFT_N/2] = { ... };
fft_f32_inverse(freq_spectrum, FFT_N);
float* time_series = (float*) &freq_spectrum[0];

X[] must begin at a double-word-aligned address.

Parameters:

X – [inout] Input vector \(\bar X\)
fft_length – [in] The FFT length. Twice the element count of \(\bar X\).

Throws ET_LOAD_STORE:

Raised if X is not double-word-aligned (See Note: Vector Alignment)

Returns:

Pointer to time-domain signal (i.e. ((float*) &X[0]))

exponent_t vect_f32_max_exponent(const float b[], const unsigned length)

Get the maximum (32-bit BFP) exponent from a vector of IEEE754 floats.

This function is used to determine the BFP exponent to use when converting a vector of IEEE754 single-precision floats into a 32-bit BFP vector.

The exponent returned, if used with vect_f32_to_vect_s32(), results in no headroom in the BFP vector; that is, it is the minimum permissible exponent for the BFP vector. The minimum permissible exponent is derived from the maximum exponent found in the float elements themselves.

More specifically, the FSEXP instruction is used on each element to determine its exponent. The value returned is the maximum exponent given by the FSEXP instruction plus 30.

b[] must begin at a double-word-aligned address.

See also

vect_f32_to_vect_s32, vect_s32_to_vect_f32

Note

If required, when converting to a 32-bit BFP vector, additional headroom can be included by adding the amount of required headroom to the exponent returned by this function.
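
The idea of a "no headroom" exponent can be illustrated portably. The following is a minimal sketch rather than the library routine: it uses the standard `frexpf()` in place of the FSEXP instruction, and it assumes that "no headroom" means the largest-magnitude element maps to a mantissa magnitude in \([2^{30}, 2^{31})\). The name `sketch_f32_max_exponent()` and the behaviour for an all-zero vector are assumptions.

```c
#include <math.h>
#include <stddef.h>

// Portable sketch, not the library routine. Returns an exponent e for which
// the largest-magnitude element b[k] / 2^e has magnitude in [2^30, 2^31).
// Assumption: an all-zero (or empty) vector yields 0; the library may differ.
static int sketch_f32_max_exponent(const float b[], size_t length)
{
    int max_exp = 0;
    int found = 0;
    for (size_t k = 0; k < length; k++) {
        if (b[k] == 0.0f) {
            continue;                 // zeros place no constraint on the exponent
        }
        int e;
        (void)frexpf(b[k], &e);       // |b[k]| = m * 2^e, with m in [0.5, 1)
        if (!found || e > max_exp) {
            max_exp = e;
            found = 1;
        }
    }
    // Scaling m by 2^31 puts the mantissa magnitude in [2^30, 2^31).
    return found ? max_exp - 31 : 0;
}
```

For example, the largest element of `{1.0f, -3.5f, 0.25f}` is 3.5 = 0.875 · 2², so the sketch returns 2 − 31 = −29. Adding 2 to that result would leave two bits of headroom, as described in the note above.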

Parameters:

b – [in] Input vector of IEEE754 single-precision floats \(\bar b\)
length – [in] Number of elements in \(\bar b\)

Throws ET_LOAD_STORE:

Raised if b is not double-word-aligned (See Note: Vector Alignment)

Throws ET_ARITHMETIC:

Raised if any element of b is infinite or not-a-number.

Returns:

Exponent used for converting to 32-bit BFP vector.

void vect_f32_to_vect_s32(int32_t a[], const float b[], const unsigned length, const exponent_t a_exp)

Convert a vector of IEEE754 single-precision floats into a 32-bit BFP vector.

This function converts a vector of IEEE754 single-precision floats \(\bar b\) into the mantissa vector \(\bar a\) of a 32-bit BFP vector, given BFP vector exponent \(a\_exp\). Conceptually, the elements of output vector \(\bar{a} \cdot 2^{a\_exp}\) represent the same values as those of the input vector.

Because the output exponent \(a\_exp\) is shared by all elements of the output vector, even though the output vector has 32-bit mantissas, precision may be lost on some elements if the exponents of the input elements \(b_k\) span a wide range.

The function vect_f32_max_exponent() can be used to determine the value for \(a\_exp\) which minimizes headroom of the output vector.

Operation Performed

\[\begin{split}\begin{aligned} & a_k \leftarrow round(\frac{b_k}{2^{a\_exp}}) \\ & \qquad\text{ for }k\in 0\ ...\ (length-1) \end{aligned}\end{split}\]
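
The operation above can be sketched in portable C. This is an illustration under stated assumptions, not the library's implementation: `sketch_f32_to_s32()` is a hypothetical name, inputs are assumed finite, the exponent is assumed large enough that every scaled value fits in an `int32_t`, and the rounding mode (half away from zero) is a guess, since the source does not specify it.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

// Portable sketch of a_k = round(b_k / 2^a_exp); not the library routine.
// Assumptions: finite inputs, and a_exp large enough that each scaled value
// fits in an int32_t (e.g. a value from vect_f32_max_exponent()).
static void sketch_f32_to_s32(int32_t a[], const float b[],
                              size_t length, int a_exp)
{
    for (size_t k = 0; k < length; k++) {
        // ldexpf(x, n) computes x * 2^n, so -a_exp divides by 2^a_exp.
        double scaled = (double)ldexpf(b[k], -a_exp);
        // Round half away from zero (assumed mode), then truncate.
        a[k] = (int32_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
    }
}
```

With `a_exp = -30`, the inputs `{1.0f, -0.5f, 0.75f}` become the mantissas `{1073741824, -536870912, 805306368}`. The shared exponent is also where precision is lost: with `a_exp = -20`, the input `0.001f` becomes the mantissa 1049, which keeps only about 11 significant bits.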

Parameter Details

a[] represents the 32-bit output mantissa vector \(\bar a\).

b[] represents the IEEE754 float input vector \(\bar b\).

a[] and b[] must each begin at a double-word-aligned address.

This operation can be performed safely in-place on b[] (that is, a[] and b[] may refer to the same memory).

length is the number of elements in each of the vectors.

a_exp is the exponent associated with the output vector \(\bar a\).

See also

vect_f32_max_exponent, vect_s32_to_vect_f32

Parameters:

a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
a_exp – [in] Exponent \(a\_exp\) of output vector \(\bar a\)

Throws ET_LOAD_STORE:

Raised if a or b is not double-word-aligned (See Note: Vector Alignment)

Throws ET_ARITHMETIC:

Raised if any element of b is infinite or not-a-number.

float vect_f32_dot(const float b[], const float c[], const unsigned length)

Compute the inner product of two IEEE754 float vectors.

This function takes two vectors of IEEE754 single-precision floats and computes their inner product — the sum of the elementwise products. The FMACC instruction is used, granting full precision in the addition.

The inner product \(a\) is returned.

Operation Performed

\[\begin{aligned} & a \leftarrow \sum_{k=0}^{length-1} ( b_k \cdot c_k ) \end{aligned}\]
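
The sum above can be sketched portably. Note that this sketch swaps in a different technique from the FMACC instruction: it accumulates in `double`, where the product of two floats is exact, and rounds to `float` once at the end. The name `sketch_f32_dot()` is hypothetical, and the result may differ from the library's in the last bits.

```c
#include <stddef.h>

// Portable sketch of the inner product; not the library routine.
// Substitute technique: a double accumulator instead of fused multiply-add.
static float sketch_f32_dot(const float b[], const float c[], size_t length)
{
    double acc = 0.0;
    for (size_t k = 0; k < length; k++) {
        // A float * float product has at most 48 significant bits,
        // so it is represented exactly in a double.
        acc += (double)b[k] * (double)c[k];
    }
    return (float)acc;   // single rounding back to float
}
```

Keeping extra precision in the accumulator matters when large terms cancel: for `b = {1e8f, 1.0f, -1e8f}` and `c = {1.0f, 1.0f, 1.0f}` this sketch returns 1, whereas a plain `float` accumulator would lose the middle term.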

Parameters:

b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar b\) and \(\bar c\)

Returns:

The inner product

void vect_f32_add(float a[], const float b[], const float c[], const unsigned length)

Adds together two IEEE754 float vectors.

This function takes two vectors of IEEE754 single-precision floats and computes the element-wise sum of the two vectors.

a[] is the output vector \(\bar a\) into which results are placed.

b[] and c[] are the input vectors \(\bar b\) and \(\bar c\) respectively.

a, b and c each must begin at a double-word-aligned address.

This operation can be performed safely in-place on b[] or c[].

Operation Performed

\[\begin{split}\begin{aligned} & a_k \gets b_k + c_k \\ & \qquad\text{ for }k\in 0\ ...\ (length-1) \end{aligned}\end{split}\]

Parameters:

a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)

Throws ET_LOAD_STORE:

Raised if a, b or c is not double-word-aligned (See Note: Vector Alignment)

void vect_complex_f32_add(complex_float_t a[], const complex_float_t b[], const complex_float_t c[], const unsigned length)

Adds together two complex IEEE754 float vectors.

This function takes two vectors \(\bar b\) and \(\bar c\) of complex IEEE754 single-precision floats and computes the element-wise sum of the two vectors.

a[] is the output vector \(\bar a\) into which results are placed.

b[] and c[] are the complex input vectors \(\bar b\) and \(\bar c\) respectively.

a, b and c each must begin at a double-word-aligned address.

This operation can be performed safely in-place on b[] or c[].

Operation Performed

\[\begin{split}\begin{aligned} & a_k \gets b_k + c_k \\ & \qquad\text{ for }k\in 0\ ...\ (length-1) \end{aligned}\end{split}\]

Parameters:

a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)

Throws ET_LOAD_STORE:

Raised if a, b or c is not double-word-aligned (See Note: Vector Alignment)

void vect_complex_f32_mul(complex_float_t a[], const complex_float_t b[], const complex_float_t c[], const unsigned length)

Multiplies together two complex IEEE754 float vectors.

This function takes two complex float vectors \(\bar b\) and \(\bar c\) as inputs. Each output element \(a_k\) is computed as \(b_k\) multiplied by \(c_k\) (using complex multiplication).

a[] is the output vector \(\bar a\) into which results are placed.

b[] and c[] are the complex input vectors \(\bar b\) and \(\bar c\) respectively.

a, b and c each must begin at a double-word-aligned address.

This operation can be performed safely in-place on b[] or c[].

Operation Performed

\[\begin{split}\begin{aligned} & a_k \gets b_k \cdot c_k \\ & \qquad\text{ for }k\in 0\ ...\ (length-1) \end{aligned}\end{split}\]
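
The element-wise complex product, including the in-place case, can be sketched as follows. This is a portable illustration and not the library routine. The struct `sketch_complex_float_t` is a stand-in for `complex_float_t`: the `.re` member appears in this document, while the `.im` member and the member order are assumptions.

```c
#include <stddef.h>

// Stand-in for complex_float_t (assumed layout: re followed by im).
typedef struct {
    float re;
    float im;
} sketch_complex_float_t;

// Portable sketch of a_k = b_k * c_k; not the library routine.
static void sketch_complex_f32_mul(sketch_complex_float_t a[],
                                   const sketch_complex_float_t b[],
                                   const sketch_complex_float_t c[],
                                   size_t length)
{
    for (size_t k = 0; k < length; k++) {
        // Copy both operands first so that a[] may alias b[] or c[].
        const sketch_complex_float_t bk = b[k];
        const sketch_complex_float_t ck = c[k];
        a[k].re = bk.re * ck.re - bk.im * ck.im;
        a[k].im = bk.re * ck.im + bk.im * ck.re;
    }
}
```

For example, (1 + 2j)(3 + 4j) = −5 + 10j. Because each element is read completely before it is written, calling the sketch with `a` equal to `b` gives the same result.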

Parameters:

a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)

Throws ET_LOAD_STORE:

Raised if a, b or c is not double-word-aligned (See Note: Vector Alignment)

void vect_complex_f32_conj_mul(complex_float_t a[], const complex_float_t b[], const complex_float_t c[], const unsigned length)

Conjugate multiplies together two complex IEEE754 float vectors.

This function takes two complex float vectors \(\bar b\) and \(\bar c\) as inputs. Each output element \(a_k\) is computed as \(b_k\) multiplied by the complex conjugate of \(c_k\) (using complex multiplication).

a[] is the output vector \(\bar a\) into which results are placed.

b[] and c[] are the complex input vectors \(\bar b\) and \(\bar c\) respectively.

a, b and c each must begin at a double-word-aligned address.

This operation can be performed safely in-place on b[] or c[].

Operation Performed

\[\begin{split}\begin{aligned} & a_k \gets b_k \cdot (c_k^*) \\ & \qquad\text{ for }k\in 0\ ...\ (length-1) \end{aligned}\end{split}\]
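
Expanded into real and imaginary parts, the conjugate product above reads as follows, where \(Re\{\cdot\}\) and \(Im\{\cdot\}\) are notation introduced here for the real and imaginary parts of an element:

\[\begin{split}\begin{aligned} & Re\{a_k\} \gets Re\{b_k\} \cdot Re\{c_k\} + Im\{b_k\} \cdot Im\{c_k\} \\ & Im\{a_k\} \gets Im\{b_k\} \cdot Re\{c_k\} - Re\{b_k\} \cdot Im\{c_k\} \end{aligned}\end{split}\]

For example, with \(b_k = 1 + 2j\) and \(c_k = 3 + 4j\), the result is \(a_k = (3 + 8) + (6 - 4)j = 11 + 2j\).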

Parameters:

a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)

Throws ET_LOAD_STORE:

Raised if a, b or c is not double-word-aligned (See Note: Vector Alignment)

void vect_complex_f32_macc(complex_float_t a[], const complex_float_t b[], const complex_float_t c[], const unsigned length)

Adds the product of two complex IEEE754 float vectors to a third complex float vector.

This function takes three complex float vectors \(\bar a\), \(\bar b\) and \(\bar c\) as inputs. Each output element \(a_k\) is computed as input \(a_k\) plus \(b_k\) multiplied by \(c_k\).

a[] is accumulator vector \(\bar a\), serving as both input and output.

b[] and c[] are the complex input vectors \(\bar b\) and \(\bar c\) respectively.

a, b and c each must begin at a double-word-aligned address.

Operation Performed

\[\begin{split}\begin{aligned} & a_k \gets a_k + b_k \cdot c_k \\ & \qquad\text{ for }k\in 0\ ...\ (length-1) \end{aligned}\end{split}\]

Parameters:

a – [inout] Input/Output accumulator vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)

Throws ET_LOAD_STORE:

Raised if a, b or c is not double-word-aligned (See Note: Vector Alignment)

void vect_complex_f32_conj_macc(complex_float_t a[], const complex_float_t b[], const complex_float_t c[], const unsigned length)

Adds the product of one complex IEEE754 float vector and the complex conjugate of another to a third complex float vector.

This function takes three complex float vectors \(\bar a\), \(\bar b\) and \(\bar c\) as inputs. Each output element \(a_k\) is computed as input \(a_k\) plus \(b_k\) multiplied by the complex conjugate of \(c_k\).

a[] is accumulator vector \(\bar a\), serving as both input and output.

b[] and c[] are the complex input vectors \(\bar b\) and \(\bar c\) respectively.

a, b and c each must begin at a double-word-aligned address.

Operation Performed

\[\begin{split}\begin{aligned} & a_k \gets a_k + b_k \cdot (c_k^*) \\ & \qquad\text{ for }k\in 0\ ...\ (length-1) \end{aligned}\end{split}\]
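
The conjugate multiply-accumulate above can be sketched in portable C. This is an illustration, not the library routine. The struct `sketch_complex_float_t` is a stand-in for `complex_float_t`: the `.re` member appears in this document, while the `.im` member and the member order are assumptions.

```c
#include <stddef.h>

// Stand-in for complex_float_t (assumed layout: re followed by im).
typedef struct {
    float re;
    float im;
} sketch_complex_float_t;

// Portable sketch of a_k = a_k + b_k * conj(c_k); not the library routine.
static void sketch_complex_f32_conj_macc(sketch_complex_float_t a[],
                                         const sketch_complex_float_t b[],
                                         const sketch_complex_float_t c[],
                                         size_t length)
{
    for (size_t k = 0; k < length; k++) {
        // b * conj(c) = (b.re*c.re + b.im*c.im) + j(b.im*c.re - b.re*c.im)
        a[k].re += b[k].re * c[k].re + b[k].im * c[k].im;
        a[k].im += b[k].im * c[k].re - b[k].re * c[k].im;
    }
}
```

Starting from an accumulator element of 1 + 1j, with `b` = 1 + 2j and `c` = 3 + 4j, one call leaves 12 + 3j in the accumulator. Repeated calls keep adding into the same vector, which is why `a[]` is both input and output.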

Parameters:

a – [inout] Input/Output accumulator vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)

Throws ET_LOAD_STORE:

Raised if a, b or c is not double-word-aligned (See Note: Vector Alignment)

void vect_s32_to_vect_f32(float a[], const int32_t b[], const unsigned length, const exponent_t b_exp)

Convert a 32-bit BFP vector into a vector of IEEE754 single-precision floats.

This function converts a 32-bit mantissa vector and exponent \(\bar b \cdot 2^{b\_exp}\) into a vector of 32-bit IEEE754 single-precision floating-point elements \(\bar a\). Conceptually, the elements of output vector \(\bar a\) represent the same values as those of the input vector.

Because an IEEE754 single-precision float holds only 24 significant bits, fewer than a 32-bit BFP mantissa, this operation may result in a loss of precision for some elements.

Operation Performed

\[\begin{split}\begin{aligned} & a_k \leftarrow b_k \cdot 2^{b\_exp} \\ & \qquad\text{ for }k\in 0\ ...\ (length-1) \end{aligned}\end{split}\]
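
The operation above, and the precision loss it can incur, can be sketched portably. This is an illustration and not the library routine; `sketch_s32_to_f32()` is a hypothetical name, and the rounding shown is that of the C integer-to-float conversion under the default round-to-nearest mode, which may differ from the library's.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

// Portable sketch of a_k = b_k * 2^b_exp; not the library routine.
static void sketch_s32_to_f32(float a[], const int32_t b[],
                              size_t length, int b_exp)
{
    for (size_t k = 0; k < length; k++) {
        // The int32_t -> float conversion rounds to 24 significant bits;
        // ldexpf() then scales by 2^b_exp (exactly, barring overflow or
        // underflow).
        a[k] = ldexpf((float)b[k], b_exp);
    }
}
```

With `b_exp = -30`, the mantissa 1073741824 (2^30) converts to exactly 1.0f. The mantissa 16777217 (2^24 + 1) needs 25 significant bits, so it converts to the same float as 16777216.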

Parameter Details

a[] represents the output IEEE754 float vector \(\bar a\).

b[] represents the 32-bit input mantissa vector \(\bar b\).

a[] and b[] must each begin at a double-word-aligned address.

This operation can be performed safely in-place on b[] (that is, a[] and b[] may refer to the same memory).

length is the number of elements in each of the vectors.

b_exp is the exponent associated with the input vector \(\bar b\).

See also

vect_f32_to_vect_s32

Parameters:

a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
b_exp – [in] Exponent \(b\_exp\) of input vector \(\bar b\)

Throws ET_LOAD_STORE:

Raised if a or b is not double-word-aligned (See Note: Vector Alignment)