Miscellaneous Notes

February 2026
Aathreya Kadambi

This note feels criminal to my mission of organizing my notes in a clean way, but alas, I think it’s necessary.

Computer Science

Floating Point Representation: IEEE 754

The floating point representation system is crucial to scientific computing. The vibe of this topic shouldn’t actually be new: IEEE 754 is really just glorified and slightly modified scientific notation. The implementation, however, relies on several nuances that are specific to computer representations.

Vibe. Scientific notation (representing real decimal numbers in the form $a.bcde\ldots \times 10^k$) is a useful way to represent a broad range of numbers with consistent relative (as opposed to absolute) precision.

Nuance. Computer representations are binary.

Since computers can’t represent “decimal points”, “times”, and “exponentiation” in an easily human-readable format, we need a sort of standardized way to interpret a collection of bits as a real number. That’s where IEEE-754 comes in.

Check out Lukas Kollmer’s visualizer for a good visual on how to interpret IEEE 754.

Another good resource is the table at this link which I reproduce here for ease:

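| Single precision (32-bit): E (8 bits) | Single precision (32-bit): M (23 bits) | Double precision (64-bit): E (11 bits) | Double precision (64-bit): M (52 bits) | Object represented |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | true zero (±0) |
| 0 | nonzero | 0 | nonzero | denormalized (subnormal) number |
| 1–254 | anything | 1–2046 | anything | normalized floating point number |
| 255 | 0 | 2047 | 0 | infinity (±∞) |
| 255 | nonzero | 2047 | nonzero | not a number (NaN) |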
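
To make the table concrete, here is a minimal Python sketch that pulls the raw S, E, and M fields out of a single precision number and names the row it falls in. The helper names `fields32` and `classify32` are my own for illustration, and the sketch assumes the standard library's `struct` module packs `"f"` as an IEEE 754 32-bit float, which is its documented behavior.

```python
import struct

def fields32(x):
    """Round x to single precision and return its raw (S, E, M) fields as integers."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits
    mantissa = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, mantissa

def classify32(x):
    """Name the table row that x lands in (single precision columns)."""
    _, exponent, mantissa = fields32(x)
    if exponent == 0:
        return "zero" if mantissa == 0 else "denormalized"
    if exponent == 255:
        return "infinity" if mantissa == 0 else "NaN"
    return "normalized"

for value in (0.0, 1e-40, 1.0, float("inf"), float("nan")):
    print(value, fields32(value), classify32(value))
```

Note that `1e-40` is an ordinary number in double precision but is too small to be normalized in single precision, so it shows up in the denormalized row.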

An introductory experiment to try on Kollmer’s visualizer: flip all the exponent bits to zero. Can you use the table above to explain why the green number in the centered formula is still one, despite the exponent bits technically reading zero in binary?

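The purpose of denormalized (subnormal) numbers is to extend the range at small magnitudes. By dropping the implicit leading 1, they fill the gap between zero and the smallest normalized number with evenly spaced values, at the cost of fewer significant bits the closer you get to zero.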
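
A quick way to see this in action is the sketch below, which uses Python's own floats. It assumes Python floats are IEEE 754 doubles (true on essentially every platform, but still an assumption), and the variable names are just for illustration.

```python
import sys

# Double precision: the smallest normalized number is 2**-1022.
smallest_normal = sys.float_info.min
# Multiplying by epsilon (2**-52) lands exactly on 2**-1074, the smallest subnormal.
smallest_subnormal = smallest_normal * sys.float_info.epsilon

# Going below the smallest normalized number does not jump straight to zero...
quarter = smallest_normal / 4
print(quarter > 0.0)

# ...but halving the smallest subnormal finally rounds to zero.
print(smallest_subnormal / 2 == 0.0)
```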

To explain the standard simply, there are two pieces. But first, let us introduce a new type of scientific notation alongside the usual one.

The usual scientific notation, now in base 2, corresponds to normalized representations:

$$\pm 1.m \times 2^e$$

and the new one corresponds to what we call denormalized representations:

$$\pm 0.m \times 2^{e+1}$$
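
As a small worked example (the numbers are my own choice, purely for illustration), here is the same value written both ways. In the normalized form the exponent is forced by the leading 1:

$$0.15625 = 0.00101_2 = 1.01_2 \times 2^{-3} \quad\Rightarrow\quad m = 01,\ e = -3$$

In the denormalized form the leading digit is 0, so we are free to pick the exponent and shift the digits to compensate. Taking $e = -2$, so that the power of two is $2^{e+1} = 2^{-1}$:

$$0.15625 = 0.0101_2 \times 2^{-1} \quad\Rightarrow\quad m = 0101,\ e = -2$$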

For the following, let us say that we are using a floating point representation with a bias of $b$. In our convention $b < 0$ (for single precision, $b = -127$), so that $e = E + b$. This is what I’ve seen at UC Berkeley, but other conventions exist: the IEEE 754 standard itself quotes the bias as a positive number (127 for single precision) and subtracts it from $E$. If your bias is positive, it likely means that you should be swapping all the signs in front of $b$ in the formulas below, or you’ve encountered a very strange exam question. 🪦

From Human Representations to IEEE 754 Floating Point Representations

As you might have noticed above, regardless of whether we use the normalized or denormalized representations, there are three components:

  1. $\pm$: the sign of your number. This takes one bit to represent.
  2. $m$: the “mantissa” or “significand” of your number.
  3. $e$: the “exponent” of your number.

Mantissa is a historically loaded word, so the word significand is often slightly preferred, but to be honest at UC Berkeley people seem to say mantissa more frequently in my experience.

We need to put these three pieces into the three pieces of the IEEE 754 representation, which are conveniently named after the part of the number they store information about. $S$ is a single bit representing the sign, $E$ consists of bits representing the exponent, and $M$ consists of bits representing the mantissa.

Nuance. Since binary strings are most easily understood as nonnegative integers, we use the “bias” to offset the raw number in $E$ to obtain $e$.

In the end, we will try to compute $S$, $E$, and $M$, which are binary strings representing the pieces of information above. Then, our number can be stored in memory as the concatenation $SEM$.

  1. $S$ is simply 0 if your number is positive and 1 if your number is negative.
  2. $E$ is:
    • Normalized case: If $e > b$, $E = e - b$.
    • Denormalized case: If $e \le b$, rearrange your number into the denormalized representation above to make $e = b$, and set $E = 0$.
  3. $M$ is always $m$ (padded with trailing zeros to fill the mantissa field); however, what $m$ means depends on whether you used the normalized or denormalized human representation above.

And boom! If you use these, you should get your IEEE 754 representation.
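
The recipe above can be sketched in Python for single precision. This is only a sketch under stated assumptions: the function names `encode32` and `reference32` are hypothetical, the bias is $b = -127$ in this note's sign convention, and the input must be finite, nonzero, and exactly representable in single precision (so no rounding is needed). The result is checked against the standard library's `struct` module.

```python
import math
import struct

BIAS = -127   # this note's convention: e = E + b, so b is negative
P = 23        # number of mantissa bits in single precision

def encode32(x):
    """Build the S, E, M bit string for a finite nonzero x that single precision holds exactly."""
    S = 0 if x > 0 else 1
    frac, k = math.frexp(abs(x))   # abs(x) == frac * 2**k with 0.5 <= frac < 1
    e = k - 1                      # so abs(x) == (2 * frac) * 2**e, i.e. 1.m * 2**e
    if e > BIAS:
        # Normalized case: E = e - b, and M holds the bits of m in 1.m.
        E = e - BIAS
        M = round((2 * frac - 1) * 2**P)
    else:
        # Denormalized case: rewrite as 0.m * 2**(b + 1) and set E = 0.
        E = 0
        M = round(math.ldexp(abs(x), -(BIAS + 1)) * 2**P)
    return f"{S:01b}{E:08b}{M:023b}"

def reference32(x):
    """The bit string that struct produces, for comparison."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{bits:032b}"

for x in (6.5, -0.15625, 2.0 ** -128):
    print(x, encode32(x), encode32(x) == reference32(x))
```

The last test value, $2^{-128}$, has $e = -128 \le b$, so it goes through the denormalized branch and is stored as $0.01_2 \times 2^{-126}$.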

From IEEE 754 Floating Point Representations

Suppose you are now given an IEEE 754 representation, with binary strings $S$, $E$, and $M$. We now want to recover the sign, $e$, and $m$ of our number.

  1. If $S$ is 0, our number is positive, and if it’s 1, it’s negative.
  2. $e$ is always $E + b$ (as long as $E$ is not all ones, which the table above reserves for infinity and NaN).
    • If $E = 0$, you should convert into the denormalized human representation.
    • If $E > 0$, you should convert into the normalized human representation.
  3. $m$ is always $M$, but be careful to use the normalized or denormalized human representation above.

And that’s it!
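
Going the other way can be sketched just as briefly. Again this is a sketch rather than a definitive implementation: `decode32` is a hypothetical name, it assumes single precision with $b = -127$ in this note's convention, and it refuses the all-ones exponent since that encodes infinity or NaN rather than a number of the forms above.

```python
BIAS = -127   # this note's convention: e = E + b
P = 23        # number of mantissa bits in single precision

def decode32(bits):
    """Recover the value of a 32-character string of 0s and 1s laid out as S, E, M."""
    S = int(bits[0], 2)
    E = int(bits[1:9], 2)
    M = int(bits[9:], 2)
    if E == 255:
        raise ValueError("E is all ones: infinity or NaN, not handled here")
    sign = -1.0 if S else 1.0
    e = E + BIAS
    frac = M / 2**P                            # the binary fraction 0.m
    if E == 0:
        return sign * frac * 2.0 ** (e + 1)    # denormalized: 0.m * 2**(e + 1)
    return sign * (1 + frac) * 2.0 ** e        # normalized:   1.m * 2**e

print(decode32("01000000110100000000000000000000"))
print(decode32("0" + "00000000" + "0" * 22 + "1"))
```

The first string is a normalized number and the second is the smallest positive denormalized number in single precision.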

Some Other Useful Formulas

Another interesting point of discussion is the “step size”, the gap between one representable number and the next. Suppose your representation has $p$ mantissa bits. If your exponent is $e$, then the step size is $2^{e - p}$ in the normalized case, and $2^{1 + e - p}$ in the denormalized case. Alternatively, since $e = E + b$, the step size is also $2^{E + b - p}$ in the normalized case, and $2^{1 + E + b - p}$ in the denormalized case. Since $E = 0$ for denormalized numbers, this is $2^{1 + b - p}$. For single precision ($b = -127$, $p = 23$), that works out to $2^{-149}$.
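
These formulas can be sanity-checked against Python's `math.ulp` (available from Python 3.9), which reports the spacing at a given float. The sketch assumes Python floats are IEEE 754 doubles, so $b = -1023$ in this note's convention and $p = 52$; the helper names are mine.

```python
import math

BIAS = -1023   # assumed: double precision, with this note's convention e = E + b
P = 52         # number of mantissa bits in double precision

def step_normalized(e):
    """Gap between neighbouring normalized numbers with exponent e: 2**(e - p)."""
    return math.ldexp(1.0, e - P)

def step_denormalized():
    """Gap between neighbouring denormalized numbers: 2**(1 + b - p)."""
    return math.ldexp(1.0, 1 + BIAS - P)

def exponent_of(x):
    """The e in 1.m * 2**e for a normalized x."""
    return math.frexp(abs(x))[1] - 1

for x in (1.0, 6.5, 1e300):
    print(x, step_normalized(exponent_of(x)) == math.ulp(x))
print(step_denormalized() == math.ulp(5e-324))
```

`math.ldexp(1.0, n)` is used instead of `2.0 ** n` simply to compute the power of two without going through a general-purpose power routine.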




Copyright © 2026, Aathreya Kadambi
