Scalar IEEE 754 Float API
- group scalar_f32_api
Functions
-
void f32_unpack(int32_t *mantissa, exponent_t *exp, const float input)
Unpack an IEEE 754 single-precision float into a 32-bit mantissa and exponent.
- Example
    // Unpack 1.52345246 * 10^(-5)
    float val = 1.52345246e-5;
    int32_t mant;
    exponent_t exp;
    f32_unpack(&mant, &exp, val);
    // Cast so the argument matches %ld regardless of how int32_t is defined
    printf("%ld * 2^(%d) <-- %e\n", (long) mant, exp, val);
- Parameters:
mantissa – [out] Unpacked output mantissa
exp – [out] Unpacked output exponent
input – [in] Float value to be unpacked
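The mantissa/exponent split described above can be sketched in portable C with frexpf() and ldexpf(). This is only an illustration, not the library's implementation: the name unpack_f32_sketch is hypothetical, exponent_t is assumed to be a plain int, and the 24-bit mantissa scaling is an assumption that may differ from what f32_unpack() actually returns.

```c
#include <math.h>
#include <stdint.h>

// Hypothetical stand-in for f32_unpack(); the mantissa scaling is an assumption.
// Splits a finite float so that input == mantissa * 2^exp holds exactly.
static void unpack_f32_sketch(int32_t *mantissa, int *exp, float input)
{
    int e;
    float frac = frexpf(input, &e);          // input == frac * 2^e, 0.5 <= |frac| < 1
    *mantissa = (int32_t) ldexpf(frac, 24);  // a float has at most 24 significant bits
    *exp = (input == 0.0f) ? 0 : e - 24;     // zero gets a zero exponent for tidiness
}
```

Under these assumptions, ldexpf((float) mantissa, exp) reproduces the input exactly, because no significant bits are discarded.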
-
void f32_unpack_s16(int16_t *mantissa, exponent_t *exp, const float input)
Unpack an IEEE 754 single-precision float into a 16-bit mantissa and exponent.
- Example
    // Unpack 1.52345246 * 10^(-5)
    float val = 1.52345246e-5;
    int16_t mant;
    exponent_t exp;
    f32_unpack_s16(&mant, &exp, val);
    // int16_t is promoted to int when passed to printf, so %d is the matching specifier
    printf("%d * 2^(%d) <-- %e\n", mant, exp, val);
Note
This operation may result in a loss of precision.
- Parameters:
mantissa – [out] Unpacked output mantissa
exp – [out] Unpacked output exponent
input – [in] Float value to be unpacked
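The precision loss noted above comes from a float carrying up to 24 significant bits while an int16_t holds only 15 magnitude bits. A minimal sketch, assuming truncation toward zero and a 15-bit scaling (the name unpack_s16_sketch is hypothetical, and the library's rounding behaviour may differ):

```c
#include <math.h>
#include <stdint.h>

// Hypothetical stand-in for f32_unpack_s16(); rounding and scaling are assumptions.
static void unpack_s16_sketch(int16_t *mantissa, int *exp, float input)
{
    int e;
    float frac = frexpf(input, &e);          // 0.5 <= |frac| < 1, or frac == 0
    *mantissa = (int16_t) ldexpf(frac, 15);  // cast truncates to 15 magnitude bits
    *exp = (input == 0.0f) ? 0 : e - 15;
}
```

Values whose significand fits in 15 bits (such as 0.15625) survive the round trip unchanged; others are reconstructed with a relative error below 2^(-14).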
-
float_s32_t f32_to_float_s32(const float x)
Convert an IEEE 754 float to a float_s32_t.
- Parameters:
x – [in] Input value
- Throws ET_ARITHMETIC:
Raised if x is infinite or NaN
- Returns:
float_s32_t representation of x
-
float_s32_t f64_to_float_s32(const double x)
Convert an IEEE 754 double to a float_s32_t.
Note
This operation may result in precision loss.
- Parameters:
x – [in] Input value
- Throws ET_ARITHMETIC:
Raised if x is infinite or NaN
- Returns:
float_s32_t representation of x
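To illustrate why the conversion from double can lose precision, the sketch below packs a double (53 significant bits) into a 32-bit mantissa with an exponent. The type float_s32_sketch_t and the function name are hypothetical stand-ins, the field layout is an assumption, and an error code is returned where the library would raise ET_ARITHMETIC.

```c
#include <math.h>
#include <stdint.h>

// Hypothetical mirror of float_s32_t: the represented value is mant * 2^exp.
typedef struct { int32_t mant; int exp; } float_s32_sketch_t;

// Returns 0 on success, -1 if x is infinite or NaN.
static int f64_to_float_s32_sketch(float_s32_sketch_t *out, double x)
{
    if (isnan(x) || isinf(x)) return -1;
    int e;
    double frac = frexp(x, &e);              // x == frac * 2^e, 0.5 <= |frac| < 1
    out->mant = (int32_t) ldexp(frac, 31);   // keep 31 magnitude bits, truncate the rest
    out->exp = (x == 0.0) ? 0 : e - 31;
    return 0;
}
```

A value such as 0.75 converts exactly, while 1.0 / 3.0 comes back slightly smaller because its low-order bits are truncated.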
-
float f32_sin(const float theta)
Get the sine of a specified angle.
Computes \(\sin(\theta)\) using the power series expansion of \(\sin\) truncated to 8 terms.
This implementation is meant to make optimal use of the XS3 floating-point unit.
- Parameters:
theta – [in] Angle \(\theta\) to compute the sine of (in radians)
- Throws ET_ARITHMETIC:
Raised if \(\theta\) is infinite or NaN
- Returns:
Sine of the angle \(\theta\)
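As an illustration of an 8-term truncated sine series, the sketch below evaluates the Taylor expansion up to the \(\theta^{15}\) term with Horner's rule. It is not the library's implementation: the name sin_series_sketch is hypothetical, the coefficients used by f32_sin() are not given here and may differ, and no range reduction is attempted, so accuracy degrades for large \(|\theta|\).

```c
// Taylor series of sin truncated to 8 terms:
// theta - theta^3/3! + theta^5/5! - ... - theta^15/15!
static float sin_series_sketch(float theta)
{
    // Coefficients (-1)^k / (2k+1)! for k = 0..7
    static const float c[8] = {
        1.0f, -1.0f / 6.0f, 1.0f / 120.0f, -1.0f / 5040.0f,
        1.0f / 362880.0f, -1.0f / 39916800.0f,
        1.0f / 6227020800.0f, -1.0f / 1307674368000.0f
    };
    float x2 = theta * theta;
    float acc = 0.0f;
    for (int k = 7; k >= 0; k--)
        acc = acc * x2 + c[k];   // Horner's rule in theta^2
    return acc * theta;          // restore the odd power
}
```

For \(|\theta| \le \pi/2\) the first omitted term, \(\theta^{17}/17!\), is below \(10^{-11}\), so the truncation error is far smaller than single-precision rounding error.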
-
float f32_cos(const float theta)
Get the cosine of a specified angle.
Computes \(\cos(\theta) = \sin\left(\theta + \frac{\pi}{2}\right)\) using the power series expansion of \(\sin\) truncated to 8 terms.
This implementation is meant to make optimal use of the XS3 floating-point unit.
- Parameters:
theta – [in] Angle \(\theta\) to compute the cosine of (in radians)
- Throws ET_ARITHMETIC:
Raised if \(\theta\) is infinite or NaN
- Returns:
Cosine of the angle \(\theta\)
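The phase-shift identity used for the cosine follows from the angle-addition formula for the sine, with \(\cos\frac{\pi}{2} = 0\) and \(\sin\frac{\pi}{2} = 1\):
\[\begin{aligned} \sin\left(\theta + \frac{\pi}{2}\right) &= \sin(\theta)\cos\left(\frac{\pi}{2}\right) + \cos(\theta)\sin\left(\frac{\pi}{2}\right) \\ &= \cos(\theta) \end{aligned}\]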
-
float f32_log2(const float x)
Get the base-2 logarithm of the specified value.
This function computes \(\log_2(x)\) using the power series expansion of \(\log_2\) truncated to 11 terms.
- Parameters:
x – [in] Input value \(x\) to get the logarithm of.
- Throws ET_ARITHMETIC:
Raised if \(x\) is infinite or NaN
- Returns:
\(\log_2(x)\)
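The exact series used by f32_log2() is not stated here. As one possible illustration of an 11-term truncated series, the sketch below splits \(x > 0\) into \(m \cdot 2^e\) with frexpf() and sums the expansion \(\ln(m) = 2\sum_{k} \frac{z^{2k+1}}{2k+1}\), where \(z = \frac{m-1}{m+1}\). Both the choice of series and the name log2_series_sketch are assumptions.

```c
#include <math.h>

// Sketch only: the series actually used by f32_log2() is not specified.
// Valid for finite x > 0.
static float log2_series_sketch(float x)
{
    int e;
    float m = frexpf(x, &e);             // x == m * 2^e, 0.5 <= m < 1
    float z = (m - 1.0f) / (m + 1.0f);   // |z| <= 1/3, so the series converges quickly
    float z2 = z * z;
    float term = z;
    float sum = 0.0f;
    for (int k = 0; k < 11; k++) {       // 11 terms: z, z^3/3, ..., z^21/21
        sum += term / (float) (2 * k + 1);
        term *= z2;
    }
    // log2(x) = e + ln(m) / ln(2), with 1/ln(2) = 1.44269504...
    return (float) e + 2.0f * sum * 1.44269504f;
}
```

Because \(|z| \le 1/3\), the first omitted term is on the order of \(10^{-12}\), so the result is limited by single-precision rounding rather than by truncation.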
-
float f32_power_series(const float x, const float b[], const unsigned N)
Compute power series summation using specified coefficients.
This function computes the sum of a power series truncated to \(N\) terms, starting with the \(x^0\) term.
b is an \(N\)-element vector of coefficients \(\bar b\), each of which is multiplied by the corresponding power of \(x\). \(N\) is the length of \(\bar b\) and the number of terms to sum.
- Operation Performed
- \[\begin{aligned} & a \leftarrow \sum_{k=0}^{N-1}\left( x^k \cdot b_k \right) \end{aligned}\]
- Parameters:
x – [in] Input value \(x\).
b – [in] Vector of coefficients \(\bar b\).
N – [in] Number of power series terms to sum.
- Throws ET_ARITHMETIC:
Raised if \(x\) or any element of \(\bar b\) is infinite or NaN.
- Returns:
\(a\), the sum of the first \(N\) power series terms.
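The operation above can be sketched in portable C. Horner's rule is used here as one common way to evaluate such a sum; whether the library evaluates it this way is not stated, and the name power_series_sketch is hypothetical. The infinite/NaN checks are omitted.

```c
// Evaluates b[0] + b[1]*x + ... + b[N-1]*x^(N-1) using Horner's rule.
// Returns 0 when N == 0 (empty sum).
static float power_series_sketch(float x, const float b[], unsigned N)
{
    float a = 0.0f;
    for (unsigned k = N; k > 0; k--)
        a = a * x + b[k - 1];   // fold in coefficients from highest power to lowest
    return a;
}
```

For example, with coefficients {1, 2, 3} and x = 2 the result is 1 + 2*2 + 3*4 = 17.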
-
float f32_normA(exponent_t *p, const float x)
Get a representation of the input \(x\) in normalized form A.
This function is used internally to transform a float value into the representation required by certain operations.
In particular, this function behaves much like frexpf(): the returned value \(a\) is guaranteed to be either \(0\) or to satisfy \(0.5 \le \left| a \right| < 1.0\), and the output exponent \(p\) is such that \(x = a \cdot 2^{p}\).
In anticipation that future work may require alternative “normalized” representations, this form is defined here as form A.
- Parameters:
p – [out] Output exponent \(p\)
x – [in] Input value \(x\)
- Throws ET_ARITHMETIC:
Raised if \(x\) is infinite or NaN.
- Returns:
\(a\) in normalized form A.