Floating-point data type
Nearly all programming languages provide at least one floating-point data type, intended primarily for the representation of real numbers. A floating-point type must be capable of taking on both positive and negative values, as well as values that are many orders of magnitude greater than unity and values that are many orders of magnitude less than unity. A value of a floating-point data type is often known as a floating-point number, but this term should be used with caution because (unlike integer or real number) it is a purely computational construct and has no natural mathematical equivalent. Some floating-point types, including the most common specification, IEEE 754, support additional values that are not real numbers, such as indeterminate values (a.k.a. Not a Number, or NaN) and infinity.
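As a minimal sketch of the non-real values mentioned above, the following C fragment assumes an implementation whose `double` follows IEEE 754. The names `fp_kind`, `classify`, `divide`, `differs_from_itself` and `demo_special_values` are made up for illustration; `isnan` and `isinf` are the standard `<math.h>` macros.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* Illustrative names only; none of these are part of a standard library. */
typedef enum {
    KIND_FINITE,
    KIND_POS_INFINITY,
    KIND_NEG_INFINITY,
    KIND_NAN
} fp_kind;

/* Sort a double into one of the four categories above. */
fp_kind classify(double x) {
    if (isnan(x)) return KIND_NAN;
    if (isinf(x)) return x > 0 ? KIND_POS_INFINITY : KIND_NEG_INFINITY;
    return KIND_FINITE;
}

/* Plain division. Under IEEE 754 arithmetic (assumed here), dividing by
 * zero produces an infinity or a NaN instead of trapping. */
double divide(double a, double b) {
    return a / b;
}

/* Under IEEE 754, NaN is the only value that compares unequal to itself. */
bool differs_from_itself(double x) {
    return x != x;
}

/* Usage sketch: these assertions are expected to hold on IEEE 754 platforms. */
void demo_special_values(void) {
    assert(classify(divide(1.0, 0.0)) == KIND_POS_INFINITY);
    assert(classify(divide(0.0, 0.0)) == KIND_NAN);
    assert(differs_from_itself(divide(0.0, 0.0)));
}
```

Compiler options that relax IEEE semantics (such as fast-math modes) may change the behaviour of the self-comparison, so the sketch should be read with that caveat.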
Floating-point is a generalization of the scientific notation often used to represent very large and very small numbers, for example, the elementary charge in coulombs, $1.602 \times 10^{-19}$. The decimal point floats; that is, it can correspond to any place in the actual number, as indicated by the exponent. For example, in the number $1.602 \times 10^{-19}$, the decimal point in "1.602" actually represents the position immediately to the right of the nineteenth place after the decimal point when the number is written out in full (0.0000000000000000001^602, where ^ marks that position). Likewise, the decimal point in $2.998 \times 10^{8}$ (the speed of light in metres per second) actually represents the position immediately to the left of the eighth place before the decimal point (2^99800000). An exponent of zero means the decimal point is exactly in its actual location.
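The role of the exponent can be illustrated with a short C sketch that recovers the decimal exponent of a value from its `%e` (scientific notation) rendering. The helper name `decimal_exponent` is hypothetical, and the approach (format with `snprintf`, then parse the digits after the `e`) is just one simple way to do it, not a recommended numerical method.

```c
#include <assert.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the power of ten that printf's %e format
 * shows for x, i.e. how far the decimal point has "floated".
 * Requires x to be finite and nonzero. */
int decimal_exponent(double x) {
    assert(isfinite(x) && x != 0.0);

    char buf[64];
    /* 17 digits after the point is enough that a double just below a
     * power of ten is not rounded up to the next exponent. */
    snprintf(buf, sizeof buf, "%.17e", x);

    /* The text looks like "d.ddd...e-19" or "d.ddd...e+08". */
    const char *e = strchr(buf, 'e');
    assert(e != NULL);
    return (int)strtol(e + 1, NULL, 10);
}
```

With this helper, `decimal_exponent(1.602e-19)` yields -19 and `decimal_exponent(2.998e8)` yields 8, matching the two examples in the text, while a value such as 1.602 has exponent 0.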
A floating-point type represents the number positionally with a variable integer exponent. That is, it represents the number $x$ as $x = m \times b^e$, where $b$ is a fixed integer base and $e$ is an integer exponent. In our decimal scientific notation, we use $b = 10$, but computers usually use $b = 2$ (binary, as usual). Generally, a fixed number of bits of the raw, digital representation of a floating-point type are allocated to representing the exponent $e$, and a fixed number are allocated to the mantissa (also called the significand), $m$, which is usually constrained to lie in the range $1 \le |m| < b$ to ensure (somewhat) unique representation. The consequence of this is that the number of significant figures in a floating-point number is generally independent of its magnitude. This is ideal for scientific computation, in which the magnitudes of quantities often have no real meaning but are simply an artifact of the system of units used to represent them.
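The mantissa-and-exponent decomposition described above can be sketched in C with the standard `frexp` and `ldexp` functions. The type `fp_parts` and the functions `decompose`, `recompose` and `ulp` are hypothetical names chosen for this example, and the sketch assumes a binary `double` (`FLT_RADIX == 2`) and normal (not subnormal) inputs.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Hypothetical type: x == mantissa * 2^exponent with 1 <= |mantissa| < 2. */
typedef struct {
    double mantissa;
    int exponent;
} fp_parts;

/* Split a finite, nonzero double into mantissa and binary exponent. */
fp_parts decompose(double x) {
    assert(isfinite(x) && x != 0.0);
    int e;
    double f = frexp(x, &e);          /* x == f * 2^e with 0.5 <= |f| < 1 */
    fp_parts p = { f * 2.0, e - 1 };  /* rescale so that 1 <= |m| < 2     */
    return p;
}

/* Inverse of decompose: rebuild the value from its parts. */
double recompose(fp_parts p) {
    return ldexp(p.mantissa, p.exponent);
}

/* Spacing between x and the next representable double of larger
 * magnitude, assuming x is a normal number. */
double ulp(double x) {
    assert(isfinite(x) && x != 0.0);
    int e;
    (void)frexp(x, &e);                   /* |x| lies in [2^(e-1), 2^e)   */
    return ldexp(1.0, e - DBL_MANT_DIG);  /* last mantissa bit, 2^(e - p) */
}
```

For instance, `decompose(6.0)` gives a mantissa of 1.5 and an exponent of 2, since 6 = 1.5 × 2². The ratio `ulp(x) / x` stays between `DBL_EPSILON / 2` and `DBL_EPSILON` whether `x` is 1e-20 or 1e20, which is the sense in which the number of significant figures does not depend on magnitude.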
Floating-point data types cannot exactly store irrational numbers like $\pi$ or $\sqrt{2}$. Nor can they exactly represent numbers that have non-terminating expansions in the base $b$ of the representation, such as $1/3$ when $b = 2$ (or even $1/10$, which terminates in base 10 but not in base 2). Thus:
- Tests for equality among floating-point numbers should not be used. The test `sqrt(2.0) * sqrt(2.0) == 2.0` (in C) may evaluate to 0 (false), for example. Always test using absolute or relative tolerances instead (for example, `fabs(x-y) < 1e-8` instead of `x == y`).
- Avoid using floating-point numbers if possible. For example, it might be wise to store currency values in fixed-point format; $12.95 could be represented by the integer 1295, instead of the floating-point number 12.95 (which has no exact binary representation). If `x` and `y` are integers and `x` is non-negative, test `x*x == y` instead of `x == sqrt(y)`. The very common test `(double)a/b == (double)c/d` (with all variables integral) should be replaced with `a*d == b*c`, provided the products cannot overflow.
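The two recommendations above can be sketched in C as follows. The function names (`nearly_equal`, `ratios_equal`, `is_square_of`) and the choice of 32-bit inputs widened to 64 bits are assumptions made for this example. Suitable tolerance values depend on the application and are left to the caller.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if x and y agree to within an absolute
 * tolerance (useful near zero) or a relative tolerance (useful for
 * large magnitudes). Returns false if either argument is NaN. */
bool nearly_equal(double x, double y, double abs_tol, double rel_tol) {
    double diff = fabs(x - y);
    if (diff <= abs_tol) return true;
    double ax = fabs(x), ay = fabs(y);
    double largest = ax > ay ? ax : ay;
    return diff <= rel_tol * largest;
}

/* Exact test for a/b == c/d using integers only. The operands are
 * widened to 64 bits so the products of 32-bit values cannot overflow. */
bool ratios_equal(int32_t a, int32_t b, int32_t c, int32_t d) {
    assert(b != 0 && d != 0);
    return (int64_t)a * d == (int64_t)b * c;
}

/* Exact replacement for x == sqrt(y): x must be non-negative, because
 * sqrt never returns a negative value. */
bool is_square_of(int32_t x, int64_t y) {
    return x >= 0 && (int64_t)x * x == y;
}
```

On a typical IEEE 754 platform `0.1 + 0.2 == 0.3` is false, while `nearly_equal(0.1 + 0.2, 0.3, 1e-12, 1e-9)` is true. A purely absolute tolerance such as `1e-8` would also reject two adjacent doubles near 1e20, which is why the sketch falls back to a relative comparison.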
Floating-point types by language
Most implementations of most programming languages support at least one floating-point type that corresponds to either the binary32 or the binary64 type of IEEE 754 (q.v. for references).
|Language||IEEE 754 binary32 type||IEEE 754 binary64 type||Other floating-point types||Notes|
|C and C++|| Often
|| The standard is completely silent on the ranges of |
|| One is free to define custom floating-point types according to the |
|| See IEEE 754 for discussion. The |
|| The nature of |
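As a rough sketch for C, the characteristics reported by `<float.h>` can be compared with the parameters of binary32 and binary64. The function names below are hypothetical. The sketch assumes, as is common but not guaranteed by the C standard, that `float` and `double` are the candidate types. Matching parameters suggest the IEEE 754 formats but do not by themselves prove full conformance of the arithmetic.

```c
#include <assert.h>
#include <float.h>
#include <limits.h>
#include <stdbool.h>

/* binary32: radix 2, 24-bit significand, 32 bits of storage.
 * The exponent limits follow the <float.h> convention, in which the
 * significand is scaled to lie in [0.5, 1). */
bool float_looks_like_binary32(void) {
    return FLT_RADIX == 2
        && FLT_MANT_DIG == 24
        && FLT_MIN_EXP == -125
        && FLT_MAX_EXP == 128
        && sizeof(float) * CHAR_BIT == 32;
}

/* binary64: radix 2, 53-bit significand, 64 bits of storage. */
bool double_looks_like_binary64(void) {
    return FLT_RADIX == 2
        && DBL_MANT_DIG == 53
        && DBL_MIN_EXP == -1021
        && DBL_MAX_EXP == 1024
        && sizeof(double) * CHAR_BIT == 64;
}

/* Usage sketch: abort early if the platform does not match expectations. */
void require_ieee_like_layout(void) {
    assert(float_looks_like_binary32());
    assert(double_looks_like_binary64());
}
```

A program that depends on these formats could call `require_ieee_like_layout()` once at start-up; on platforms with different floating-point formats the assertions would fail rather than let later code silently misbehave.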
- Some programming languages may also have built-in support for complex floating-point types, but it is worth emphasizing that they invariably represent the complex floating-point number internally as a pair of real floating-point numbers.
- See the C11 draft standard (N1570): http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf