Exposing Floating Point – Bartosz Ciechanowski

ciechanow.ski

Despite everyday use, floating point numbers are often understood in a hand-wavy manner and their behavior raises many eyebrows. Over the course of this article I’d like to show that things aren’t actually that complicated.

This blog post is a companion to my recently launched website – float.exposed. Other than exploiting the absurdity of present day list of top level domains, it’s intended to be a handy tool for inspecting floating point numbers. While I encourage you to play with it, the purpose of many of its elements may be exotic at first. By the time we’ve finished, however, all of them will hopefully become familiar.

On a technical note, by floating point I’m referring to the ubiquitous IEEE 754 binary floating point format. Types half, float, and double are understood to be binary16, binary32, and binary64 respectively. There were other formats back in the day, but whatever device you’re reading this on is pretty much guaranteed to use IEEE 754.

With the formalities out of the way, let’s start at the shallow end of the pool.

Writing Numbers

We’ll begin with the very basics of writing numeric values. The initial steps may seem trivial, but starting from the first principles will help us build a working model of floating point numbers.

Decimal Numbers

Consider the number 327.849. Digits to the left of the decimal point represent increasing powers of ten, while digits to the right of the decimal point represent decreasing powers of ten:

$3 \times 10^{2}$
$2 \times 10^{1}$
$7 \times 10^{0}$
$8 \times 10^{-1}$
$4 \times 10^{-2}$
$9 \times 10^{-3}$

Even though this notation is very natural, it has a few disadvantages:

- small numbers like 0.000000000653 require skimming over many zeros before they start “showing” actually useful digits
- it’s hard to estimate the magnitude of large numbers like 7298345251 at a glance
- at some point the distant digits of a number become increasingly less significant and could often be dropped, yet for big numbers we don’t save any space by replacing them with zeros, e.g. 7298000000

By “small” and “big” numbers I’m referring to their magnitude so −4205 is understood to be bigger than 0.03 even though it’s to the left of it on the real number line.

Scientific notation solves all these problems. It shifts the decimal point to right after the first non-zero digit and sets the exponent accordingly:

$+3.27849 \times 10^{2}$

Scientific notation has three major components: the sign (+), the significand (3.27849), and the exponent (2). For positive values the “+” sign is often omitted, but we’ll keep it around for the sake of verbosity. Note that the “10” simply shows that we’re dealing with base-10 system. The aforementioned disadvantages disappear:

- the 0-heavy small number is presented as $6.53 \times 10^{-10}$ with all the pesky zeros removed
- just by looking at the first digit and the exponent of $7.298345251 \times 10^{9}$ we know that number is roughly 7 billion
- we can drop the unwanted distant digits from the tail to get $7.298 \times 10^{9}$

Continuing with the protagonist of this section, if we’re only interested in 4 most significant digits we can round the number using one of the many rounding rules:

$+3.278 \times 10^{2}$

The number of digits shown describes the precision we’re dealing with. A number with 8 digits of precision could be printed as:

$+3.2784900 \times 10^{2}$

Binary Numbers

With the familiar base-10 out of the way, let’s look at the binary numbers. The rules of the game are exactly the same, it’s just that the base is 2 and not 10. Digits to the left of the binary point represent increasing powers of two, while digits to the right of the binary point represent decreasing powers of two:

$1 \times 2^{3}$
$0 \times 2^{2}$
$0 \times 2^{1}$
$1 \times 2^{0}$
$0 \times 2^{-1}$
$1 \times 2^{-2}$
$0 \times 2^{-3}$
$1 \times 2^{-4}$

When ambiguous I’ll use a subscript 2 to mean the number is in base-2. As such, $1000_2$ is not a thousand, but $2^{3}$, i.e. eight. To get the decimal value of the discussed $1001.0101_2$ we simply sum up the powers of two that have the bit set: 8 + 1 + 0.25 + 0.0625, ending up with the value of 9.3125.

Binary numbers can use scientific notation as well. Since we’re shifting the binary point by three places, the exponent ends up having the value of 3:

$+1.0010101 \times 2^{3}$

Similarly to scientific notation in base-10, we also moved the binary point to right after the first non-zero digit of the original representation. However, since the only non-zero digit in base-2 system is 1, every non-zero binary number in scientific notation starts with a 1.

We can round the number to a shorter form:

$+1.0011 \times 2^{3}$

Or show that we’re more accurate by storing 11 binary digits:

$+1.0010101000 \times 2^{3}$

If you’ve grasped everything that we’ve discussed so far then congratulations – you understand how floating point numbers work.

Floating Point Numbers

Floating point numbers are just numbers in base-2 scientific notation with the following two restrictions:

- limited number of digits in the significand
- limited range of the exponent – it can’t be greater than some maximum limit and also can’t be less than some minimum limit

That’s (almost) all there is to them.

Different floating point types have different numbers of significand digits and different allowed exponent ranges. For example, a float has 24 binary digits (i.e. bits) of significand and the exponent range of [−126, +127], where “[” and “]” denote inclusivity of the range (e.g. +127 is valid, but +128 is not).
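
The recipe from this section is small enough to try out directly. What follows is a minimal Python sketch, not something taken from float.exposed: the function names are made up for illustration, and it assumes a non-negative, non-zero input written with only 0s, 1s, and at most one binary point. It sums the powers of two for every set bit, shifts the binary point to right after the first 1, and checks the two restrictions quoted above for a float.

```python
from fractions import Fraction


def binary_to_fraction(digits):
    """Sum the powers of two for every set bit in a string like '1001.0101'."""
    whole, _, frac = digits.partition(".")
    value = Fraction(0)
    # Digits left of the binary point: increasing powers of two.
    for power, bit in enumerate(reversed(whole)):
        if bit == "1":
            value += Fraction(2) ** power
    # Digits right of the binary point: decreasing powers of two.
    for position, bit in enumerate(frac, start=1):
        if bit == "1":
            value += Fraction(1, 2 ** position)
    return value


def to_binary_scientific(digits):
    """Return (significand, exponent) with the point right after the first 1."""
    whole, _, frac = digits.partition(".")
    bits = whole + frac
    first_one = bits.index("1")  # raises ValueError if the input is zero
    exponent = len(whole) - 1 - first_one
    significand = "1." + (bits[first_one + 1:] or "0")
    return significand, exponent


def fits_in_float(digits):
    """Check the two restrictions: at most 24 significand bits, exponent in [-126, +127]."""
    significand, exponent = to_binary_scientific(digits)
    significant_bits = len(significand.replace(".", "").rstrip("0"))
    return significant_bits <= 24 and -126 <= exponent <= 127


print(float(binary_to_fraction("1001.0101")))  # 9.3125
print(to_binary_scientific("1001.0101"))       # ('1.0010101', 3)
print(fits_in_float("1001.0101"))              # True
```

Fraction is used instead of a Python float so that the sum itself is exact and doesn’t depend on the very format being explained.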

Here’s a number with a decimal value of −616134.5625 that can fit in a float:

$-1.00101100110110001101001 \times 2^{19}$

Unfortunately, the number of bits of significand in a float is limited, so some real values may not be perfectly representable in the floating point form. A decimal number 0.2 has the following base-2 representation:

$+1.\overline{1001} \times 2^{-3}$

The overline (technically known as vinculum) indicates a forever repeating value. The 25th and later significant digits of the perfect base-2 scientific representation of that number won’t fit in a float and have to be accounted for by rounding the remaining bits. The full significand:

$1.100110011001100110011001100\ldots$

Will be rounded to:

$1.10011001100110011001101$

After multiplication by $2^{-3}$ the resulting number has a different decimal value than the perfect 0.2:

0.20000000298023223876953125

If we tried rounding the full significand down:

$1.10011001100110011001100$

The resulting number would be equal to:

0.199999988079071044921875

No matter what we do, the limited number of bits in the significand prevents us from getting the correct result. This explains why some decimal numbers don’t have their exact floating point representation.

Similarly, since the value of the exponent is limited, many huge and many tiny numbers won’t fit in a float: neither $2^{200}$ nor $2^{-300}$ can be represented since they don’t fall into the allowed exponent range of [−126, +127].

Encoding

Knowing the number of bits in the significand and the allowed range of the exponent we can start encoding floating point numbers into their binary representation.
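
The two decimal expansions of “almost 0.2” can be reproduced without any special tooling. Here is a rough Python sketch under a few assumptions: Python’s own float is a binary64, so the struct module’s "f" format is used to round a value to binary32, and Decimal is used to print the exact stored value. The helper names are hypothetical.

```python
import struct
from decimal import Decimal


def float32_bits(x):
    """Round x to the nearest binary32 value and return its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def float32_from_bits(bits):
    """Interpret a 32-bit pattern as a binary32 value (returned as a Python float)."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# The float closest to 0.2: the significand was rounded up.
nearest = float32_from_bits(float32_bits(0.2))
# The float right below it: what rounding the significand down would give.
below = float32_from_bits(float32_bits(0.2) - 1)

print(Decimal(nearest))  # 0.20000000298023223876953125
print(Decimal(below))    # 0.199999988079071044921875
```

Subtracting 1 from the bit pattern works here because both values share the same exponent, so the neighbouring pattern is simply the neighbouring significand.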

We’ll use the number −2343.53125 which has the following representation in base-2 scientific notation:

$-1.0010010011110001 \times 2^{11}$

The Sign

The sign is easy – we just need 1 bit to express whether the number is positive or negative. IEEE 754 uses the value of 0 for the former and 1 for the latter. Since the discussed number is negative we’ll use one:

1

The Significand

For the significand of a float we need 24 bits. However, per what we’ve already discussed, the first digit of the significand in base-2 is always 1, so the format cleverly skips it to save a bit. We just have to remember it’s there when doing calculations. We copy the remaining 23 digits verbatim while filling in the missing bits at the end with 0s:

00100100111100010000000

The leading “1” we skipped is often referred to as an “implicit bit”.

The Exponent

Since the exponent range of [−126, +127] allows 254 possible values, we’ll need 8 bits to store it. To avoid special handling of negative exponent values we’ll add a fixed bias to make sure no encoded exponent is negative.

To obtain a biased exponent we’ll use the bias value of 127. While 126 would work for regular range of exponents, using 127 will let us reserve a biased value of 0 for special purposes. Biasing is just a matter of shifting all values to the right:

[Figure: The bias in a float]

For the discussed number we have to shift its exponent of 11 by 127 to get 138, or $10001010_2$, and that’s what we will encode as the exponent:

10001010

Putting it All Together

To conform with the standard we’ll put the sign bit first, then the exponent bits, and finally, the significand bits. While seemingly arbitrary, the order is part of the standard’s ingenuity.
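
The three steps above are mechanical enough to write down as code. This is only an illustrative Python sketch: encode_float32_fields is a made-up name, and it assumes a normal float given as a sign, an unbiased exponent, and the significand digits that follow the implicit leading 1.

```python
BIAS = 127            # bias of a float (binary32)
FRACTION_BITS = 23    # stored significand bits, the leading 1 is implicit


def encode_float32_fields(sign, exponent, significand_digits):
    """Return the (sign, exponent, significand) bit fields as strings.

    significand_digits holds the digits after the binary point, e.g.
    "0010010011110001" for 1.0010010011110001.
    """
    if sign not in (0, 1):
        raise ValueError("sign must be 0 (positive) or 1 (negative)")
    if not -126 <= exponent <= 127:
        raise ValueError("exponent outside the range of a normal float")
    if len(significand_digits) > FRACTION_BITS:
        raise ValueError("too many significand digits, rounding would be needed")
    sign_field = str(sign)
    # Biasing: shift the exponent by 127 and store it in 8 bits.
    exponent_field = format(exponent + BIAS, "08b")
    # Copy the digits verbatim and fill the missing bits at the end with 0s.
    significand_field = significand_digits.ljust(FRACTION_BITS, "0")
    return sign_field, exponent_field, significand_field


fields = encode_float32_fields(1, 11, "0010010011110001")
print(fields)           # ('1', '10001010', '00100100111100010000000')
print("".join(fields))  # sign first, then exponent, then significand
```

The sketch deliberately refuses inputs that would need rounding or fall outside the exponent range, since those cases aren’t covered by the simple procedure described so far.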

By sticking all the pieces together a float is born:

11000101000100100111100010000000

The entire encoding occupies 32 bits. To verify we did things correctly we can fire up LLDB and let the hacky type punning do its work:

    (lldb) p -2343.53125f
    (float) $0 = -2343.53125
    (lldb) p/t *(uint32_t *)&$0
    (uint32_t) $1 = 0b11000101000100100111100010000000

While neither C nor C++ standards technically require a float or a double to be represented using IEEE 754 format, the rest of this article will sensibly assume so.

The same procedure of encoding a number in base-2 scientific notation can be repeated for almost any number, however, some of them require special handling.

Special Values

The float exponent range allows 254 different values and with a bias of 127 we’re left with two yet unused biased exponent values: 0 and 255. Both are employed for very useful purposes.

A Map of Floats

A dry description doesn’t really paint a picture, so let’s present all the special values visually. In the following plot every dot represents a unique positive float:

[Figure: All the special values]

Notice the necessary truncation of a large part of exponents and of a gigantic part of significand values.

We’ve already discussed all the unmarked dots – the normal floats. It’s time to dive into the remaining values.

Zero

A float number with biased exponent value of 0 and all zeros in significand is interpreted as positive or negative 0.
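
For readers without LLDB at hand, the same check can be approximated in Python. This is a sketch under stated assumptions rather than an equivalent of the debugger session: struct’s "f" format stands in for the type punning, the function names are hypothetical, and the classification only distinguishes what has been covered so far (zero, the two reserved biased exponents, and normal floats).

```python
import struct


def float32_bit_string(x):
    """Return the 32-bit binary32 encoding of x as a string of 0s and 1s."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return format(bits, "032b")


def float32_fields(x):
    """Split the encoding into (sign, biased exponent, stored significand bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF
    significand = bits & 0x7FFFFF
    return sign, biased_exponent, significand


def classify(x):
    """Label a float using only the categories discussed so far."""
    _, biased_exponent, significand = float32_fields(x)
    if biased_exponent == 0 and significand == 0:
        return "zero"
    if biased_exponent in (0, 255):
        return "reserved"  # the two biased exponents set aside for special values
    return "normal"


print(float32_bit_string(-2343.53125))  # 11000101000100100111100010000000
print(float32_fields(-2343.53125))      # (1, 138, 1210368)
print(float32_fields(-0.0))             # (1, 0, 0)
print(classify(0.0), classify(1.0))     # zero normal
```

Note that positive and negative zero differ only in the sign bit, which is why both show up as “zero” here.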