Floating Point Arithmetic
1. Idealized Floating Point
Let \(\beta\geq2\) be a fixed integer called the "base" (or "radix"), and \(t\geq1\) a fixed integer called the "precision". We define the set of Idealized Floating Point as consisting of rational numbers \[\mathbf{F}_{\beta,t} = \{0\}\cup\{\pm(m/\beta^{t})\beta^{c}\mid \beta^{t-1}\leq m\lt\beta^{t},\;m,c\in\ZZ\}. \] The quantity \(m/\beta^{t}\) is called the Mantissa (or "Significand"), and the integer \(c\) is called the Exponent.
This is "idealized" because the exponent is unbounded. The goal is to capture the difficulties we'll encounter when doing numerical analysis using IEEE-754 floating point, but ignore the rare difficulties of overflow and underflow.
- IEEE-754 floating point specifies that \(\beta=2\) or \(\beta=10\). For \(\beta=2\), single-precision has \(t=24\) and double-precision has \(t=53\). For \(\beta=10\), single-precision has \(t=7\) and double-precision has \(t=16\).
- Notation convention: I intentionally use a bold upright capital "F", not a blackboard bold, since it's a number system but fails to be a ring or field.
Idealized floating point has the following properties:
- It's self-similar: \(\beta\mathbf{F}_{\beta,t} = \mathbf{F}_{\beta,t}\)
- It doesn't contain all of the integers: \(\ZZ\not\subset\mathbf{F}_{\beta,t}\) (see the quick check after this list)
- \(\mathbf{F}_{\beta,t}\) is countably infinite
- If we instead allowed any \(0\lt m\lt\beta^{t}\), the representation would not be unique; i.e., we could have multiple different choices of \(m\) and exponent \(c\) which correspond to the same rational number \((m/\beta^{t})\beta^{c}\).
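As a quick sanity check of the second property (a sketch in Python, using IEEE-754 doubles, where \(\beta=2\) and \(t=53\)): the integer \(2^{53}+1\) would need 54 digits of mantissa, so converting it to a double rounds it to its neighbor \(2^{53}\).

    # beta = 2, t = 53: every integer up to 2**53 lies in F_{2,53}, but 2**53 + 1
    # does not -- converting it to a double rounds it back down to 2**53.
    n = 2**53
    print(float(n) == float(n + 1))   # True: n + 1 is not representable
    print(float(n + 2) - float(n))    # 2.0: the spacing between floats is now 2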
What does this look like? To gain some intuition, pick \(t=1\) and \(\beta=10\); then the positive floating point numbers are distributed like:
\begin{equation} \mathbf{F}_{\beta,t}\supset \{1/10, 2/10, \dots, 9/10\}\cup\{1, 2,\dots, 9\}\cup\{10,20,\dots,90\}\cup\dots. \end{equation}We see these are logarithmically distributed: they come arbitrarily close to zero, but the spacing between consecutive floating point numbers grows as the exponent grows.
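The following sketch (in Python; the helper name is mine) enumerates the positive elements of \(\mathbf{F}_{\beta,t}\) for \(\beta=10\), \(t=1\) over a few exponents, reproducing the list above.

    from fractions import Fraction

    beta, t = 10, 1

    def positive_elements(exponents):
        # yields (m / beta**t) * beta**c for beta**(t-1) <= m < beta**t
        for c in exponents:
            for m in range(beta**(t - 1), beta**t):
                yield Fraction(m, beta**t) * Fraction(beta)**c

    print([float(q) for q in positive_elements(range(0, 3))])
    # [0.1, 0.2, ..., 0.9, 1.0, 2.0, ..., 9.0, 10.0, 20.0, ..., 90.0]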
1.1. Encoding Real Numbers as Idealized Floating Point
For us to use floating point to compute anything, we need to encode real numbers into the floating point system. This requires some care with how to handle real numbers half-way between two floating point numbers.
We have a function \(fl\colon\RR\to\mathbf{F}_{\beta,t}\) which assigns to each real number its closest floating point number, breaking ties by rounding to the neighbor with an even mantissa ("banker's rounding", i.e., round-half-to-even).
The rounding scheme is not terribly important, though it is a fascinating topic since bad rounding schemes may unintentionally skew the distribution of floating point numbers. This could be a new source of error.
Now how do we determine the error of this encoding? We can use the self-similarity of idealized floating point numbers to determine the error for a number half-way between two adjacent floating point numbers, then scale it accordingly. Let's define this error for a real number half-way between \(1\in\mathbf{F}_{\beta,t}\) and the next larger floating point number:
We define the Machine Epsilon of \(\mathbf{F}_{\beta,t}\) to be the real number \(\varepsilon_{\text{machine}}\) defined as \[ \varepsilon_{\text{machine}} =_{df} \frac{\min\{x\in\mathbf{F}_{\beta,t}\mid x\gt1\}-1}{2} \] or explicitly \[ \varepsilon_{\text{machine}} = \frac{1}{2}\beta^{1-t}. \]
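For concreteness, here is a quick check (assuming Python 3.9+ for math.nextafter) that this definition gives \(\varepsilon_{\text{machine}} = 2^{-53}\approx1.11\times10^{-16}\) for IEEE-754 double precision:

    import math

    # binary64 has beta = 2 and t = 53, so eps_machine = (1/2) * 2**(1 - 53) = 2**-53
    next_above_one = math.nextafter(1.0, 2.0)   # the smallest double greater than 1
    eps_machine = (next_above_one - 1.0) / 2

    assert eps_machine == 2.0**-53
    print(eps_machine)                          # 1.1102230246251565e-16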
For any real number \(x\in\RR\), there exists an \(\varepsilon\in\RR\) with \(|\varepsilon|\lt\varepsilon_{\text{machine}}\) such that
\begin{equation} fl(x) = x(1+\varepsilon). \end{equation}That is, the difference between a real number and its closest floating point number is always smaller than machine epsilon in relative error.
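As an empirical sanity check (a sketch, not a proof), we can sample rationals standing in for arbitrary real numbers, round them to double precision, and confirm the relative error never exceeds \(\varepsilon_{\text{machine}} = 2^{-53}\):

    import random
    from fractions import Fraction

    eps_machine = Fraction(1, 2**53)   # beta = 2, t = 53 for binary64

    for _ in range(10_000):
        # a random rational standing in for an arbitrary real number x
        x = Fraction(random.randint(1, 10**30), random.randint(1, 10**30))
        fl_x = Fraction(float(x))      # exact value of the nearest double to x
        assert abs(fl_x - x) / x <= eps_machine

    print("relative error stayed within machine epsilon")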
We adopt the notation \(fl^{-1}\colon\mathbf{F}_{\beta,t}\to\RR\) for translating floating-point numbers back to real numbers. Also, if we want to combine two floating-point numbers using IEEE-754 arithmetic rules, we adopt the convention of "circling" the operation, e.g., \(\oplus\colon\mathbf{F}\times\mathbf{F}\to\mathbf{F}\) for addition. This is just to clarify which operation we're performing: floating point arithmetic or the arithmetic of real numbers.
(Fundamental Axiom of Floating Point Arithmetic) Let \(\star\in\{+,-,\times,/\}\) be some binary operator of real numbers, and let \(fl(\star)\) be the corresponding operation for floating point numbers.
For all floating point numbers \(x,y\in\mathbf{F}_{\beta,t}\), there exists an \(\varepsilon\in\RR\) with \(|\varepsilon|\leq\varepsilon_{\text{machine}}\) such that
\begin{equation} fl^{-1}(x\mathbin{fl(\star)} y) = (fl^{-1}(x)\star fl^{-1}(y))(1+\varepsilon). \end{equation}Two remarks are worth stressing at this point:
- The fundamental axiom of floating point arithmetic is a statement of relative error of floating point operations, not absolute error.
- We could use the fundamental axiom of floating point arithmetic as a definition of floating point operations: \(x\mathbin{fl(\star)}y := fl(fl^{-1}(x)\star fl^{-1}(y))\), i.e., treat the operands as real (rational) numbers, perform the operation exactly, then round the result back to the floating point encoding, as sketched below.
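Here is a minimal sketch of that definition for \(\beta=10\), \(t=7\), using Python's decimal module for the rounding and exact rationals for the real-number arithmetic (the helper names fl and fl_op are mine):

    import operator
    from decimal import Decimal, Context, ROUND_HALF_EVEN
    from fractions import Fraction

    CTX = Context(prec=7, rounding=ROUND_HALF_EVEN)   # 7 significant decimal digits

    def fl(x: Fraction) -> Decimal:
        # round the exact rational x to the nearest element of F_{10,7}
        return CTX.divide(Decimal(x.numerator), Decimal(x.denominator))

    def fl_op(op, x: Decimal, y: Decimal) -> Decimal:
        # x fl(*) y := fl( fl^{-1}(x) * fl^{-1}(y) ): operate exactly, then round
        return fl(op(Fraction(x), Fraction(y)))

    one, tiny = Decimal("1.000000"), Decimal("4E-7")
    print(fl_op(operator.add, one, tiny))   # 1.000000: the exact sum 1.0000004
                                            # rounds back down, since 4e-7 < eps_machine = 5e-7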
2. Representing numbers
2.1. Motivational example
The basic idea could be discussed by first considering a number system where a "number" is a triple (s, c, f) where
- s is either +1 or -1
- c is a 7-tuple of digits (c[0], ..., c[6]) where c[0] is nonzero
- f is a 4-tuple (f[0], f[1], ..., f[3]) where f[0] is either +1 or -1, and the remaining components are digits.
We intuitively think of this as a sort of truncated scientific notation s*(c[0].c[1]c[2]c[3]c[4]c[5]c[6])*pow(10, f[0]*(f[1]f[2]f[3])), where the c tuple is a decimal encoding of a real number and the digits f[1], f[2], f[3] encode a 3-digit natural number.
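As a quick illustration (a sketch; the decode helper and the example values are mine), such a triple unpacks into a rational number exactly as the scientific-notation reading suggests:

    from fractions import Fraction

    def decode(s, c, f):
        # s in {+1, -1}; c = (c[0], ..., c[6]) with c[0] != 0; f = (f[0], f[1], f[2], f[3])
        mantissa = Fraction(sum(d * 10**(6 - i) for i, d in enumerate(c)), 10**6)
        exponent = f[0] * (f[1] * 100 + f[2] * 10 + f[3])
        return s * mantissa * Fraction(10)**exponent

    # -3.141592 * 10^(+002) = -314.1592
    print(float(decode(-1, (3, 1, 4, 1, 5, 9, 2), (+1, 0, 0, 2))))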
Observe, if c[0] were zero, we could shift all the c tuple entries down by one (i.e., set c'[0] = c[1], c'[1] = c[2], …, c'[5] = c[6], then set c'[6] = 0 and use c' instead of c, at the cost of decreasing the f exponent by 1). Consequently, the only time c[0] is zero is when the number itself is zero.
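A sketch of that shifting step (the normalize helper is mine; exponent stands for the integer encoded by f):

    def normalize(c, exponent):
        # shift the digit tuple left until c[0] != 0 (or everything is zero),
        # lowering the exponent once per shift so the value is unchanged
        c = list(c)
        while c[0] == 0 and any(c):
            c = c[1:] + [0]
            exponent -= 1
        return tuple(c), exponent

    # 0.031400 * 10^2 becomes 3.140000 * 10^0, the same number
    print(normalize((0, 0, 3, 1, 4, 0, 0), 2))   # ((3, 1, 4, 0, 0, 0, 0), 0)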
There's no reason why we must use base-10; we could use any base b. We just require c[0] to be a nonzero base-b digit, and we interpret the order of magnitude in base-b, i.e., we now multiply by pow(b, f[0]*(f[1]f[2]f[3])). For base-2, this forces c[0] to be 1, and hence we can just assume there is an implicit leading 1 (i.e., we can pretend the c tuple describes the bits "after the decimal point").
2.2. Rounding, Truncating
Given some number with decimal expansion d[0]d[1]d[2]d[3]d[4]d[5]d[6]d[7]..., how do we encode it in our toy number system?
- Solution 1. Do we simply throw away the digits d[7] and beyond? This is called "truncating".
- Solution 2. Or we could examine d[7] and, if it is 5 or greater, add 1 to d[6] (which, if d[6] = 9, would become 0 and force us to "carry the one" to d[5], and so on). This is "rounding half away from zero" (informally, "rounding to infinity").
Either way, rounding/truncating is one source of error. But it's a price we always have to pay for using a computer.
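Both schemes are easy to experiment with using Python's decimal module, which lets us pick the precision and rounding rule of our toy system (a sketch; the constant is an arbitrary example):

    from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_UP

    x = Decimal("3.14159265358979")                       # d[0]d[1]d[2]... to encode

    truncate = Context(prec=7, rounding=ROUND_DOWN)       # Solution 1: drop d[7] onward
    round_away = Context(prec=7, rounding=ROUND_HALF_UP)  # Solution 2: inspect d[7]

    print(truncate.plus(x))     # 3.141592
    print(round_away.plus(x))   # 3.141593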
Let \(x\) be a real number. Then the encoding of it to our number system corresponds to mapping it to the number \(fl(x)\).
We will continue working in our system of 7-digit scientific numbers, with 3-digit exponents.
The choice of 7 digits and 3-digit exponents is not arbitrary: this turns out to be one of the standard floating point representations, the single-precision decimal float. Working in base-10 is more familiar to the reader than working in base-2 (at least for scientific computation), so we use it to clarify certain concepts.
2.3. Relative, Absolute Error
How do we measure how good (or bad) an approximation is to its true value? We have two possibilities.
Let \(x\) be a real number and \(x_{\ast}\) an approximation to \(x\). Then its Absolute Error is \(|x - x_{\ast}|\).
Let \(x\) be a real number and \(x_{\ast}\) an approximation to \(x\). Then its Relative Error is \(|x - x_{\ast}|/|x|\), provided \(x\neq0\).
When we have two floating point numbers close to each other, \(fl(x)\approx fl(y)\), say agreeing in their first 4 digits, then \(fl(x) - fl(y)\) has at most three significant digits of precision (we've lost more than half our precision). The absolute error of \(fl(x) - fl(y)\) compared to \(x - y\) is small, though the relative error is large.
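Here is a concrete sketch in the 7-digit decimal system (the values are arbitrary examples of mine): two numbers agreeing in their first 4 digits whose rounded difference retains only two significant digits, with a relative error vastly larger than \(\varepsilon_{\text{machine}} = 5\times10^{-7}\).

    from decimal import Decimal, Context, ROUND_HALF_EVEN

    ctx = Context(prec=7, rounding=ROUND_HALF_EVEN)   # our 7-digit decimal system

    x = Decimal("1.234567891")                        # the "true" real numbers
    y = Decimal("1.234512345")

    fx, fy = ctx.plus(x), ctx.plus(y)                 # fl(x) = 1.234568, fl(y) = 1.234512
    diff_fl = ctx.subtract(fx, fy)                    # 0.000056: two significant digits left
    diff_true = x - y                                 # 0.000055546

    print(diff_fl, diff_true)
    print(abs(diff_fl - diff_true) / diff_true)       # ~0.008, over ten thousand times eps_machine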
3. References
- Richard L. Burden and J. Douglas Faires, Numerical Analysis. 8th ed., Thomson, 2005.
- David Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic". March 1991, eprint.
- Lloyd N. Trefethen and David Bau III, Numerical Linear Algebra. SIAM, 1997. This is where I took inspiration for discussing "idealized floating point"; see their Lecture 13.