Floating Point Number
Floating Point Number
Floating Point Number is represented in a computer according to IEEE 754.
According to the wikipedia, the format is specified by:
- base (or radix) \(b\), which is either 2 (binary) or 10 (decimal) in IEEE 754;
- a precision \(p\);
- an exponent range from \(emin\) to \(emax\), with \(emin = 1 - emax\) for all IEEE 754 formats.
The format can describe the next values
- Finite numbers
- finite numbers are described by three integers
- \(s\) : a sign (zero or one)
- \(c\) : a significant (or coefficient) having no more than \(p\) digits when written in base \(b\). (i.e., an integer in the range through 0 to \(b^p - 1\))
- \(q\) : an exponent such that \(emin \le q + p - 1 \le emax\).
- The numerical value of such a finite number is \((-1)^s \times c \times b^q\)
- There are two zero values, called signed zeros: the sign bit specifies whether a zero is +0 (positive zero) or -0 (negative zero).
- Two infinities: \(+\infty\) and \(-\infty\).
- Two kinds of NaN (not-a-number): a quiet NaN (qNaN) and a signaling NaN (sNaN).
The next illustration shows the layout for 32-bit floating point: