Floating Point Number

December 20, 2022

Floating-point numbers are represented in a computer according to the IEEE 754 standard.

According to Wikipedia, an IEEE 754 format is specified by:

- a base (or radix) \(b\), which is either 2 (binary) or 10 (decimal) in IEEE 754;
- a precision \(p\);
- an exponent range from \(emin\) to \(emax\), with \(emin = 1 - emax\) for all IEEE 754 formats.
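
As a small illustration, the three parameters can be read off for Python's built-in `float`. This is only a sketch: it assumes that `float` is an IEEE 754 binary64 value, which holds on essentially all current platforms but is not guaranteed by the language. Note that `sys.float_info` follows the C convention, in which the significand lies in \([0.5, 1)\), so its exponent limits are one larger than the IEEE 754 \(emin\) and \(emax\).

```python
import sys

info = sys.float_info

# Assumption: Python's float is IEEE 754 binary64 on this platform.
b = info.radix            # base of the format
p = info.mant_dig         # precision, in base-b digits

# sys.float_info uses the C convention (significand in [0.5, 1)),
# so subtract 1 to get the IEEE 754 exponent limits.
emax = info.max_exp - 1
emin = info.min_exp - 1

print(f"b = {b}, p = {p}, emin = {emin}, emax = {emax}")
print("emin == 1 - emax:", emin == 1 - emax)
```

On a binary64 platform this reports \(b = 2\), \(p = 53\), \(emax = 1023\) and \(emin = -1022\), which satisfies \(emin = 1 - emax\).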

A format can represent the following values:

- Finite numbers, each described by three integers:
  - \(s\): a sign (zero or one)
  - \(c\): a significand (or coefficient) having no more than \(p\) digits when written in base \(b\) (i.e., an integer in the range 0 through \(b^p - 1\))
  - \(q\): an exponent such that \(emin \le q + p - 1 \le emax\)
  - The numerical value of such a finite number is \((-1)^s \times c \times b^q\).
  - There are two zero values, called signed zeros: the sign bit specifies whether a zero is +0 (positive zero) or -0 (negative zero).
- Two infinities: \(+\infty\) and \(-\infty\).
- Two kinds of NaN (not-a-number): a quiet NaN (qNaN) and a signaling NaN (sNaN).
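
To make the triple \((s, c, q)\) concrete, here is a minimal sketch that recovers it from a Python `float`. It assumes the float is stored as IEEE 754 binary64 (so \(b = 2\), \(p = 53\), \(emax = 1023\)); the helper names `decompose` and `exact_value` are made up for this example and are not part of any library.

```python
import math
import struct
from fractions import Fraction

# Assumed binary64 parameters (b = 2).
P = 53
EMAX = 1023
EMIN = 1 - EMAX
BIAS = 1023


def decompose(x: float):
    """Return integers (s, c, q) with x == (-1)**s * c * 2**q."""
    if math.isinf(x) or math.isnan(x):
        raise ValueError("only finite numbers have an (s, c, q) triple")
    # Reinterpret the 8 bytes of the double as a 64-bit unsigned integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = bits >> 63                     # 1 sign bit
    biased_exp = (bits >> 52) & 0x7FF  # 11 exponent bits
    fraction = bits & ((1 << 52) - 1)  # 52 stored significand bits
    if biased_exp == 0:
        # Zero or subnormal: no implicit leading 1, exponent pinned at emin.
        c = fraction
        q = EMIN - (P - 1)
    else:
        # Normal number: restore the implicit leading 1 bit.
        c = fraction | (1 << 52)
        q = (biased_exp - BIAS) - (P - 1)
    return s, c, q


def exact_value(s: int, c: int, q: int) -> Fraction:
    """Evaluate (-1)**s * c * 2**q exactly, as a rational number."""
    return (-1) ** s * Fraction(c) * Fraction(2) ** q


for x in (1.0, 0.1, -2.5, -0.0):
    s, c, q = decompose(x)
    print(x, (s, c, q), exact_value(s, c, q) == Fraction(x))
```

Using `Fraction` keeps the check exact, so the comparison confirms that the triple reproduces the stored value rather than a rounded approximation. The two zeros differ only in \(s\): both have \(c = 0\).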

The following illustration shows the layout of a 32-bit floating-point number: