32-Bit IEEE 754 Float API
 group vect_f32_api
Functions

complex_float_t *fft_f32_forward(float x[], const unsigned fft_length)
Perform forward FFT on a vector of IEEE754 floats.
This function takes real input vector \(\bar x\) and performs a forward FFT on the signal in-place to get output vector \(\bar{X} = FFT{\bar{x}}\). This implementation is accelerated by converting the IEEE754 float vector into a block floating-point representation to compute the FFT. The resulting BFP spectrum is then converted back to IEEE754 single-precision floats. The operation is performed in-place on x[].
See bfp_fft_forward_mono() for the details of the FFT.
Whereas the input x[] is an array of fft_length float elements, the output (placed in x[]) is an array of fft_length/2 complex_float_t elements, so the input should be cast after calling this.
    const unsigned FFT_N = 512;
    float time_series[FFT_N] = { ... };
    fft_f32_forward(time_series, FFT_N);
    complex_float_t* freq_spectrum = (complex_float_t*) &time_series[0];
    const unsigned FREQ_BINS = FFT_N/2;
    // e.g. freq_spectrum[FREQ_BINS-1].re
x[] must begin at a double-word-aligned address.
 Operation Performed:
 \[\begin{flalign*} & \bar{X} \leftarrow FFT{\bar{x}} && \end{flalign*}\]
 Parameters:
x – [inout] Input vector \(\bar x\)
fft_length – [in] The length of \(\bar x\)
 Throws ET_LOAD_STORE:
Raised if x is not double-word-aligned (See Note: Vector Alignment)
 Returns:
Pointer to frequency-domain spectrum (i.e. ((complex_float_t*) &x[0]))
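The in-place buffer reuse described above can be sketched in portable C. This is only an illustration and does not call fft_f32_forward(): the complex_float_t layout (two consecutive floats named re and im) is assumed from the .re access in the example above, the 8-byte figure assumes a double word is two 32-bit words, and spectrum_view() is a hypothetical helper that performs the cast and checks the alignment requirement.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: real part followed by imaginary part. */
typedef struct { float re; float im; } complex_float_t;

/* Hypothetical helper: view a float buffer as complex frequency bins.
   fft_length floats occupy the same bytes as fft_length/2 complex elements. */
static complex_float_t *spectrum_view(float x[], unsigned fft_length,
                                      unsigned *bins)
{
    assert(((uintptr_t) x % 8) == 0);  /* double-word alignment, assumed 8 bytes */
    assert(fft_length % 2 == 0);
    *bins = fft_length / 2;
    return (complex_float_t *) &x[0];
}
```

A buffer declared with `_Alignas(8)` (C11) satisfies the alignment check in this sketch.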

float *fft_f32_inverse(complex_float_t X[], const unsigned fft_length)
Perform inverse FFT on a vector of complex_float_t.
This function takes complex input vector \(\bar X\) and performs an inverse real FFT on the spectrum in-place to get output vector \(\bar{x} = IFFT{\bar{X}}\). This implementation is accelerated by converting the IEEE754 float vector into a block floating-point representation to compute the IFFT. The resulting BFP signal is then converted back to IEEE754 single-precision floats. The operation is performed in-place on X[].
See bfp_fft_inverse_mono() for the details of the IFFT.
Input X[] is an array of fft_length/2 complex_float_t elements. The output (placed in X[]) is an array of fft_length float elements.
    const unsigned FFT_N = 512;
    complex_float_t freq_spectrum[FFT_N/2] = { ... };
    fft_f32_inverse(freq_spectrum, FFT_N);
    float* time_series = (float*) &freq_spectrum[0];
X[] must begin at a double-word-aligned address.
 Parameters:
X – [inout] Input vector \(\bar X\)
fft_length – [in] The FFT length. Twice the element count of \(\bar X\).
 Throws ET_LOAD_STORE:
Raised if X is not double-word-aligned (See Note: Vector Alignment)
 Returns:
Pointer to time-domain signal (i.e. ((float*) &X[0]))

exponent_t vect_f32_max_exponent(const float b[], const unsigned length)
Get the maximum (32-bit BFP) exponent from a vector of IEEE754 floats.
This function is used to determine the BFP exponent to use when converting a vector of IEEE754 single-precision floats into a 32-bit BFP vector.
The exponent returned, if used with vect_f32_to_vect_s32(), is the one which will result in no headroom in the BFP vector, that is, the minimum permissible exponent for the BFP vector. The minimum permissible exponent is derived from the maximum exponent found in the float elements themselves.
More specifically, the FSEXP instruction is used on each element to determine its exponent. The value returned is the maximum exponent given by the FSEXP instruction plus 30.
b[] must begin at a double-word-aligned address.
Note
If required, when converting to a 32-bit BFP vector, additional headroom can be included by adding the amount of required headroom to the exponent returned by this function.
 Parameters:
b – [in] Input vector of IEEE754 singleprecision floats \(\bar b\)
length – [in] Number of elements in \(\bar b\)
 Throws ET_LOAD_STORE:
Raised if b is not double-word-aligned (See Note: Vector Alignment)
 Throws ET_ARITHMETIC:
Raised if any element of b is infinite or not-a-number.
 Returns:
Exponent used for converting to 32-bit BFP vector.
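The idea of a minimum permissible exponent can be illustrated with a portable reference sketch. It reads the exponent field of each IEEE754 encoding directly and does not attempt to reproduce the FSEXP-based computation, so the function name ref_f32_max_exponent(), the exponent_t typedef and the handling of zeros and subnormals are all assumptions of this sketch rather than library behaviour.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef int exponent_t;  /* assumed to be a plain signed integer */

/* Hypothetical reference: smallest exponent for which every element still
   fits in a signed 32-bit mantissa. Zeros and subnormals are skipped; if no
   normal element exists, 0 is returned (a choice made for this sketch). */
static exponent_t ref_f32_max_exponent(const float b[], unsigned length)
{
    int found = 0;
    int max_e = 0;
    for (unsigned k = 0; k < length; k++) {
        uint32_t bits;
        memcpy(&bits, &b[k], sizeof bits);        /* read the IEEE754 encoding */
        int field = (int) ((bits >> 23) & 0xFF);  /* biased exponent field */
        assert(field != 0xFF);                    /* infinity and NaN rejected */
        if (field == 0) continue;                 /* zero or subnormal */
        int e = field - 127;                      /* |b[k]| = 1.f * 2^e */
        if (!found || e > max_e) { max_e = e; found = 1; }
    }
    /* |b[k]| < 2^(max_e + 1), so b[k] / 2^(max_e - 30) stays below 2^31. */
    return found ? (max_e - 30) : 0;
}
```

Adding 1 to the value returned by this sketch would leave one bit of headroom in the resulting mantissas, mirroring the note above.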

void vect_f32_to_vect_s32(int32_t a[], const float b[], const unsigned length, const exponent_t a_exp)
Convert a vector of IEEE754 single-precision floats into a 32-bit BFP vector.
This function converts a vector of IEEE754 single-precision floats \(\bar b\) into the mantissa vector \(\bar a\) of a 32-bit BFP vector, given BFP vector exponent \(a\_exp\). Conceptually, the elements of output vector \(\bar{a} \cdot 2^{a\_exp}\) represent the same values as those of the input vector.
Because the output exponent \(a\_exp\) is shared by all elements of the output vector, even though the output vector has 32-bit mantissas, precision may be lost on some elements if the exponents of the input elements \(b_k\) span a wide range.
The function vect_f32_max_exponent() can be used to determine the value for \(a\_exp\) which minimizes headroom of the output vector.
 Operation Performed:
 \[\begin{split}\begin{flalign*} & a_k \leftarrow round(\frac{b_k}{2^{a\_exp}}) \\ & \qquad\text{ for }k\in 0\ ...\ (length-1) && \end{flalign*}\end{split}\]
 Parameter Details

a[] represents the 32-bit output mantissa vector \(\bar a\).
b[] represents the IEEE754 float input vector \(\bar b\).
a[] and b[] must each begin at a double-word-aligned address. b[] can be safely updated in-place.
length is the number of elements in each of the vectors.
a_exp is the exponent associated with the output vector \(\bar a\).
 Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
a_exp – [in] Exponent \(a\_exp\) of output vector \(\bar a\)
 Throws ET_LOAD_STORE:
Raised if a or b is not double-word-aligned (See Note: Vector Alignment)
 Throws ET_ARITHMETIC:
Raised if any element of b is infinite or not-a-number.
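The conversion \(a_k \leftarrow round(b_k / 2^{a\_exp})\) can be written as a portable reference, which may help when checking results on a host machine. The name ref_f32_to_s32(), the exponent_t typedef, the round-half-away-from-zero rule and the saturation at the int32_t limits are assumptions of this sketch; the rounding and overflow behaviour of the library function are not specified here.

```c
#include <assert.h>
#include <stdint.h>

typedef int exponent_t;  /* assumed to be a plain signed integer */

/* Hypothetical reference for a_k = round(b_k / 2^a_exp). */
static void ref_f32_to_s32(int32_t a[], const float b[], unsigned length,
                           exponent_t a_exp)
{
    assert(length == 0 || (a && b));

    /* Build 2^(-a_exp) exactly by repeated halving or doubling. */
    double scale = 1.0;
    for (int i = 0; i < a_exp; i++) scale *= 0.5;
    for (int i = 0; i > a_exp; i--) scale *= 2.0;

    for (unsigned k = 0; k < length; k++) {
        double x = (double) b[k] * scale;          /* b_k / 2^a_exp */
        x += (x >= 0.0) ? 0.5 : -0.5;              /* round half away from zero */
        if (x >= 2147483647.0)  x = 2147483647.0;  /* saturate (sketch's choice) */
        if (x <= -2147483648.0) x = -2147483648.0;
        a[k] = (int32_t) x;                        /* truncation completes rounding */
    }
}
```

With inputs 1.0, -3.5 and 0.25 and an exponent of -29, the largest magnitude lands between 2^30 and 2^31, so the mantissa vector has no headroom.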

float vect_f32_dot(const float b[], const float c[], const unsigned length)
Compute the inner product of two IEEE754 float vectors.
This function takes two vectors of IEEE754 single-precision floats and computes their inner product, the sum of the element-wise products. The FMACC instruction is used, granting full precision in the addition.
The inner product \(a\) is returned.
 Operation Performed:
 \[\begin{flalign*} & a \leftarrow \sum_{k=0}^{length-1} ( b_k \cdot c_k ) && \end{flalign*}\]
 Parameters:
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar b\) and \(\bar c\)
 Returns:
The inner product
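For host-side checks, the sum of products can be approximated with a wider accumulator. This sketch uses a double accumulator as a stand-in for the FMACC-based accumulation; it is not claimed to be bit-identical to the library result, and ref_f32_dot() is a hypothetical name.

```c
#include <assert.h>

/* Hypothetical reference: inner product with a double accumulator.
   The product of two floats is exactly representable in a double,
   so rounding only occurs in the running sum and the final cast. */
static float ref_f32_dot(const float b[], const float c[], unsigned length)
{
    assert(length == 0 || (b && c));
    double acc = 0.0;
    for (unsigned k = 0; k < length; k++) {
        acc += (double) b[k] * (double) c[k];
    }
    return (float) acc;
}
```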

void vect_f32_add(float a[], const float b[], const float c[], const unsigned length)
Adds together two IEEE754 float vectors.
This function takes two vectors of IEEE754 single-precision floats and computes the element-wise sum of the two vectors.
a[] is the output vector \(\bar a\) into which results are placed.
b[] and c[] are the input vectors \(\bar b\) and \(\bar c\) respectively.
a, b and c each must begin at a double-word-aligned address.
This operation can be performed safely in-place on b[] or c[].
 Operation Performed:
 \[\begin{split}\begin{flalign*} & a_k \gets b_k + c_k \\ & \qquad\text{ for }k\in 0\ ...\ (length-1) && \end{flalign*}\end{split}\]
 Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)
 Throws ET_LOAD_STORE:
Raised if a, b or c is not double-word-aligned (See Note: Vector Alignment)
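The in-place property follows from each output element depending only on the inputs at the same index. A plain C reference loop (hypothetical name ref_f32_add(), no alignment checks) shows this; it is a sketch for comparison, not the library's vectorised implementation.

```c
#include <assert.h>

/* Hypothetical reference for a_k = b_k + c_k.
   Passing the same array as a[] and b[] (or c[]) is safe because
   index k is read before it is written and never revisited. */
static void ref_f32_add(float a[], const float b[], const float c[],
                        unsigned length)
{
    assert(length == 0 || (a && b && c));
    for (unsigned k = 0; k < length; k++) {
        a[k] = b[k] + c[k];
    }
}
```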

void vect_complex_f32_add(complex_float_t a[], const complex_float_t b[], const complex_float_t c[], const unsigned length)
Adds together two complex IEEE754 float vectors.
This function takes two vectors \(\bar b\) and \(\bar c\) of complex IEEE754 single-precision floats and computes the element-wise sum of the two vectors.
a[] is the output vector \(\bar a\) into which results are placed.
b[] and c[] are the complex input vectors \(\bar b\) and \(\bar c\) respectively.
a, b and c each must begin at a double-word-aligned address.
This operation can be performed safely in-place on b[] or c[].
 Operation Performed:
 \[\begin{split}\begin{flalign*} & a_k \gets b_k + c_k \\ & \qquad\text{ for }k\in 0\ ...\ (length-1) && \end{flalign*}\end{split}\]
 Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)
 Throws ET_LOAD_STORE:
Raised if a, b or c is not double-word-aligned (See Note: Vector Alignment)

void vect_complex_f32_mul(complex_float_t a[], const complex_float_t b[], const complex_float_t c[], const unsigned length)
Multiplies together two complex IEEE754 float vectors.
This function takes two complex float vectors \(\bar b\) and \(\bar c\) as inputs. Each output element \(a_k\) is computed as \(b_k\) multiplied by \(c_k\) (using complex multiplication).
a[] is the output vector \(\bar a\) into which results are placed.
b[] and c[] are the complex input vectors \(\bar b\) and \(\bar c\) respectively.
a, b and c each must begin at a double-word-aligned address.
This operation can be performed safely in-place on b[] or c[].
 Operation Performed:
 \[\begin{split}\begin{flalign*} & a_k \gets b_k \cdot c_k \\ & \qquad\text{ for }k\in 0\ ...\ (length-1) && \end{flalign*}\end{split}\]
 Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)
 Throws ET_LOAD_STORE:
Raised if a, b or c is not double-word-aligned (See Note: Vector Alignment)
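Complex multiplication expands to \((b_r c_r - b_i c_i) + j(b_r c_i + b_i c_r)\). A scalar reference sketch follows; the complex_float_t layout (fields re and im) and the name ref_complex_f32_mul() are assumptions, and no alignment is enforced.

```c
#include <assert.h>

typedef struct { float re; float im; } complex_float_t;  /* assumed layout */

/* Hypothetical reference for a_k = b_k * c_k (complex multiplication). */
static void ref_complex_f32_mul(complex_float_t a[], const complex_float_t b[],
                                const complex_float_t c[], unsigned length)
{
    assert(length == 0 || (a && b && c));
    for (unsigned k = 0; k < length; k++) {
        /* Copy both operands first so a[] may alias b[] or c[]. */
        const complex_float_t p = b[k];
        const complex_float_t q = c[k];
        a[k].re = p.re * q.re - p.im * q.im;
        a[k].im = p.re * q.im + p.im * q.re;
    }
}
```

Copying the operands before writing is what keeps this sketch correct when used in-place; writing a[k].re first and then reading b[k].re again would corrupt the imaginary part.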

void vect_complex_f32_conj_mul(complex_float_t a[], const complex_float_t b[], const complex_float_t c[], const unsigned length)
Conjugate multiplies together two complex IEEE754 float vectors.
This function takes two complex float vectors \(\bar b\) and \(\bar c\) as inputs. Each output element \(a_k\) is computed as \(b_k\) multiplied by the complex conjugate of \(c_k\) (using complex multiplication).
a[] is the output vector \(\bar a\) into which results are placed.
b[] and c[] are the complex input vectors \(\bar b\) and \(\bar c\) respectively.
a, b and c each must begin at a double-word-aligned address.
This operation can be performed safely in-place on b[] or c[].
 Operation Performed:
 \[\begin{split}\begin{flalign*} & a_k \gets b_k \cdot (c_k^*) \\ & \qquad\text{ for }k\in 0\ ...\ (length-1) && \end{flalign*}\end{split}\]
 Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)
 Throws ET_LOAD_STORE:
Raised if a, b or c is not double-word-aligned (See Note: Vector Alignment)
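Conjugating \(c_k\) flips the sign of its imaginary part, which changes two signs in the expansion: \((b_r c_r + b_i c_i) + j(b_i c_r - b_r c_i)\). The sketch below assumes a complex_float_t with fields re and im and uses the hypothetical name ref_complex_f32_conj_mul().

```c
#include <assert.h>

typedef struct { float re; float im; } complex_float_t;  /* assumed layout */

/* Hypothetical reference for a_k = b_k * conj(c_k). */
static void ref_complex_f32_conj_mul(complex_float_t a[],
                                     const complex_float_t b[],
                                     const complex_float_t c[],
                                     unsigned length)
{
    assert(length == 0 || (a && b && c));
    for (unsigned k = 0; k < length; k++) {
        const complex_float_t p = b[k];  /* copied so a[] may alias an input */
        const complex_float_t q = c[k];
        a[k].re = p.re * q.re + p.im * q.im;
        a[k].im = p.im * q.re - p.re * q.im;
    }
}
```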

void vect_complex_f32_macc(complex_float_t a[], const complex_float_t b[], const complex_float_t c[], const unsigned length)
Adds the product of two complex IEEE754 float vectors to a third float vector.
This function takes three complex float vectors \(\bar a\), \(\bar b\) and \(\bar c\) as inputs. Each output element \(a_k\) is computed as input \(a_k\) plus \(b_k\) multiplied by \(c_k\).
a[] is accumulator vector \(\bar a\), serving as both input and output.
b[] and c[] are the complex input vectors \(\bar b\) and \(\bar c\) respectively.
a, b and c each must begin at a double-word-aligned address.
 Operation Performed:
 \[\begin{split}\begin{flalign*} & a_k \gets a_k + b_k \cdot c_k \\ & \qquad\text{ for }k\in 0\ ...\ (length-1) && \end{flalign*}\end{split}\]
 Parameters:
a – [inout] Input/Output accumulator vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)
 Throws ET_LOAD_STORE:
Raised if a, b or c is not double-word-aligned (See Note: Vector Alignment)
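The accumulate form differs from a plain multiply only in that the product is added to the existing contents of a[]. A scalar reference sketch, assuming a complex_float_t with fields re and im and the hypothetical name ref_complex_f32_macc():

```c
#include <assert.h>

typedef struct { float re; float im; } complex_float_t;  /* assumed layout */

/* Hypothetical reference for a_k = a_k + b_k * c_k. */
static void ref_complex_f32_macc(complex_float_t a[],
                                 const complex_float_t b[],
                                 const complex_float_t c[],
                                 unsigned length)
{
    assert(length == 0 || (a && b && c));
    for (unsigned k = 0; k < length; k++) {
        const complex_float_t p = b[k];
        const complex_float_t q = c[k];
        a[k].re += p.re * q.re - p.im * q.im;  /* real part of the product */
        a[k].im += p.re * q.im + p.im * q.re;  /* imaginary part of the product */
    }
}
```

Because a[] is read as well as written, the accumulator should hold defined values (for example zeros) before the first call.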

void vect_complex_f32_conj_macc(complex_float_t a[], const complex_float_t b[], const complex_float_t c[], const unsigned length)
Adds the conjugate product of two complex IEEE754 float vectors to a third complex float vector.
This function takes three complex float vectors \(\bar a\), \(\bar b\) and \(\bar c\) as inputs. Each output element \(a_k\) is computed as input \(a_k\) plus \(b_k\) multiplied by the complex conjugate of \(c_k\).
a[] is accumulator vector \(\bar a\), serving as both input and output.
b[] and c[] are the complex input vectors \(\bar b\) and \(\bar c\) respectively.
a, b and c each must begin at a double-word-aligned address.
 Operation Performed:
 \[\begin{split}\begin{flalign*} & a_k \gets a_k + b_k \cdot (c_k^*) \\ & \qquad\text{ for }k\in 0\ ...\ (length-1) && \end{flalign*}\end{split}\]
 Parameters:
a – [inout] Input/Output accumulator vector \(\bar a\)
b – [in] Input vector \(\bar b\)
c – [in] Input vector \(\bar c\)
length – [in] Number of elements in vectors \(\bar a\), \(\bar b\) and \(\bar c\)
 Throws ET_LOAD_STORE:
Raised if a, b or c is not double-word-aligned (See Note: Vector Alignment)

void vect_s32_to_vect_f32(float a[], const int32_t b[], const unsigned length, const exponent_t b_exp)
Convert a 32-bit BFP vector into a vector of IEEE754 single-precision floats.
This function converts a 32-bit mantissa vector and exponent \(\bar b \cdot 2^{b\_exp}\) into a vector of 32-bit IEEE754 single-precision floating-point elements \(\bar a\). Conceptually, the elements of output vector \(\bar a\) represent the same values as those of the input vector.
Because IEEE754 single-precision floats hold fewer mantissa bits, this operation may result in a loss of precision for some elements.
 Operation Performed:
 \[\begin{split}\begin{flalign*} & a_k \leftarrow b_k \cdot 2^{b\_exp} \\ & \qquad\text{ for }k\in 0\ ...\ (length-1) && \end{flalign*}\end{split}\]
 Parameter Details

a[] represents the output IEEE754 float vector \(\bar a\).
b[] represents the 32-bit input mantissa vector \(\bar b\).
a[] and b[] must each begin at a double-word-aligned address. b[] can be safely updated in-place.
length is the number of elements in each of the vectors.
b_exp is the exponent associated with the input vector \(\bar b\).
 Parameters:
a – [out] Output vector \(\bar a\)
b – [in] Input vector \(\bar b\)
length – [in] Number of elements in vectors \(\bar a\) and \(\bar b\)
b_exp – [in] Exponent \(b\_exp\) of input vector \(\bar b\)
 Throws ET_LOAD_STORE:
Raised if a or b is not double-word-aligned (See Note: Vector Alignment)
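The loss of precision mentioned above can be seen with a portable reference for \(a_k \leftarrow b_k \cdot 2^{b\_exp}\). The name ref_s32_to_f32() and the exponent_t typedef are assumptions of this sketch, and the final rounding is whatever the host's double-to-float conversion performs (round-to-nearest on typical hosts), which may differ from the library.

```c
#include <assert.h>
#include <stdint.h>

typedef int exponent_t;  /* assumed to be a plain signed integer */

/* Hypothetical reference for a_k = b_k * 2^b_exp. */
static void ref_s32_to_f32(float a[], const int32_t b[], unsigned length,
                           exponent_t b_exp)
{
    assert(length == 0 || (a && b));

    /* Build 2^b_exp exactly by repeated doubling or halving. */
    double scale = 1.0;
    for (int i = 0; i < b_exp; i++) scale *= 2.0;
    for (int i = 0; i > b_exp; i--) scale *= 0.5;

    for (unsigned k = 0; k < length; k++) {
        /* The int32_t mantissa and the scaling are exact in double;
           the only rounding happens in the cast to float. */
        a[k] = (float) ((double) b[k] * scale);
    }
}
```

A mantissa such as 2147483647 needs 31 significant bits, while a float carries 24, so with an exponent of 0 it converts to 2147483648.0f under round-to-nearest.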
