How many bits do you actually need? Precision as a design knob
Software gives you a fixed menu of number formats: float, double, maybe half. Hardware hands you the whole kitchen. On an FPGA, every width is a design decision, and it's one of the highest-leverage decisions you can make, because bits you don't need cost area, power, and clock speed, while bits you do need are the difference between an algorithm that converges and one that spins forever.
This post puts real numbers on both sides of that trade: area and timing synthesized on our own toolchain, convergence simulated with quantized arithmetic. It ends with a method you can apply to your next datapath.
Anatomy first: what each bit buys
A floating-point number spends its bits on two different products: exponent bits buy dynamic range (how big and how small), mantissa bits buy precision (how fine). They are separate budgets, and the named formats are just different shopping decisions: bfloat16 (8 exponent bits, 7 mantissa bits) keeps FP32's entire range and sacrifices precision; FP16 (5 exponent bits, 10 mantissa bits) does the reverse, giving up range to keep precision; FP8-E4M3 (4 exponent bits, 3 mantissa bits), the AI-inference workhorse, spends almost nothing on either and gets away with it because neural networks are noise-tolerant.
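To make the two budgets concrete, here is a small sketch that turns an (exponent bits, mantissa bits) pair into a largest normal value, a smallest normal value, and a relative step size. It assumes IEEE-754-style conventions (biased exponent, implicit leading one, top exponent code reserved), and the name `format_stats` is ours, not part of any library. FP8-E4M3 is left out because it reuses part of the top exponent code for finite values, so this formula would understate its range.

```python
def format_stats(e_bits, m_bits):
    """Range and precision of an IEEE-754-style format (normal numbers only)."""
    bias = 2 ** (e_bits - 1) - 1
    max_normal = (2 - 2.0 ** -m_bits) * 2.0 ** bias   # all-ones mantissa, top usable exponent
    min_normal = 2.0 ** (1 - bias)                    # smallest value with an implicit leading 1
    epsilon = 2.0 ** -m_bits                          # gap between 1.0 and the next value up
    return max_normal, min_normal, epsilon


for name, e, m in [("FP32", 8, 23), ("bfloat16", 8, 7), ("FP16", 5, 10)]:
    hi, lo, eps = format_stats(e, m)
    print(f"{name:9s} max={hi:.3e}  min_normal={lo:.3e}  epsilon={eps:.3e}")
```

bfloat16 and FP32 come out with the same smallest normal value and nearly the same maximum, while their step sizes differ by a factor of 2¹⁶. FP16 lands in between on precision but tops out at 65504.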
On an FPGA you aren't limited to the named formats. A custom
fp(e=6, m=9) is one parameter away, if you know what the mantissa
width does to your hardware. So let's measure it.
The multiplier bill (real synthesis numbers)
A floating-point multiplier is mostly one thing: an integer multiplier for the mantissas.
The sign is an XOR. The exponents add: one carry chain. But the
mantissas multiply, and an m-bit integer multiplication generates
m partial products of m bits each, on the order of m² bits to sum.
We synthesized bare a * b multipliers with Yosys (generic 4-LUT
mapping, the same estimate the playground's Synth button gives you):
| Significand width | LUT4s | Critical path (LUT levels) |
|---|---|---|
| 5 | 55 | 7 |
| 8 (bfloat16) | 187 | 9 |
| 11 (FP16) | 344 | 11 |
| 14 | 557 | 12 |
| 17 | 867 | 13 |
| 24 (FP32) | 1657 | 15 |
Two readings of that table:
- Area is brutal: FP32's 24-bit significand costs 4.8× the LUTs of FP16's and 8.9× bfloat16's. The area curve is the m² you'd predict.
- Speed follows depth: the combinational path drops from 15 LUT levels to 9 going from 24 to 8 bits, roughly 40% shorter, which is either a faster clock or pipeline stages you no longer need. And on real devices there's a cliff the generic numbers don't show: a DSP48 multiplies up to 18×18 in one hardened block, so a ≤17-bit significand fits a single DSP while FP32's 24 bits force a multi-DSP cascade, more blocks and a longer path through the cascade wiring.
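The m² reading can be checked directly against the table. The sketch below only divides the published LUT4 counts by the square of the significand width; the dictionary is copied from the table, and the helper name `luts_per_m2` is ours.

```python
# (significand width -> LUT4 count), copied from the synthesis table above
luts = {5: 55, 8: 187, 11: 344, 14: 557, 17: 867, 24: 1657}


def luts_per_m2(width):
    """LUT4s per squared significand bit; roughly constant if area scales as m^2."""
    return luts[width] / width ** 2


for width in luts:
    print(f"m={width:2d}  LUT4={luts[width]:4d}  LUT4/m^2={luts_per_m2(width):.2f}")

fp32_vs_fp16 = luts[24] / luts[11]   # area ratio quoted in the first bullet
fp32_vs_bf16 = luts[24] / luts[8]
print(f"FP32/FP16 = {fp32_vs_fp16:.1f}x, FP32/bfloat16 = {fp32_vs_bf16:.1f}x")
```

For widths of 8 and up the ratio stays between about 2.8 and 3.0 LUT4s per m², which is what a quadratic curve looks like. The 5-bit point sits lower, at 2.2.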
Small mantissas aren't a compromise. They're a genuinely different, faster circuit.
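To see where the width-dependent cost lives, here is a software sketch of the three pieces named earlier in this section: an XOR for the sign, an add for the exponents, and an integer multiply for the significands. It is an illustration under loud assumptions, not our RTL: FP32 fields, normal inputs and results only, truncation instead of rounding, and no handling of zero, subnormals, infinities, NaN, or exponent overflow. The function names are ours.

```python
import struct


def unpack_fp32(x):
    """Split a float into FP32 sign, biased exponent, and 23-bit mantissa fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def fp32_mul_sketch(a, b):
    """Multiply two normal FP32 values field by field (truncating, no special cases)."""
    sa, ea, ma = unpack_fp32(a)
    sb, eb, mb = unpack_fp32(b)
    sign = sa ^ sb                           # one XOR
    exp = ea + eb - 127                      # one adder; the bias is counted twice, remove it once
    sig = (ma | 1 << 23) * (mb | 1 << 23)    # 24x24 integer multiply: the expensive part
    if sig >> 47:                            # product landed in [2, 4): shift back into [1, 2)
        sig >>= 1
        exp += 1
    mant = (sig >> 23) & 0x7FFFFF            # keep 23 bits below the leading 1 (truncation)
    bits = (sign << 31) | (exp << 23) | mant
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(fp32_mul_sketch(1.5, 2.0), fp32_mul_sketch(1.5, 1.5), fp32_mul_sketch(-3.0, 0.5))
```

Only the `sig` line grows with the significand width; the sign and exponent lines stay the same size whatever precision you pick.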
The other side: when the algorithm pushes back
So why not always compute in FP8? Because some algorithms consume precision as a resource. We simulated two classics with every arithmetic operation rounded to m mantissa bits (the quantization harness is ten lines of Python, below).
Jacobi iteration on a small linear system, target 10⁻³ relative error:
| Mantissa bits | Iterations to converge |
|---|---|
| 6–9 | stalls, never converges |
| 10 | 8 |
| 12 | 8 |
| 16 | 8 |
| 23 | 8 |
Newton-Raphson reciprocal, target 10⁻⁴:
| Mantissa bits | Iterations |
|---|---|
| 8–13 | precision floor, never reaches target |
| 14 | 3 |
| 23 | 3 |
Here's the insight the smooth "speed vs. accuracy tradeoff" story misses: in both experiments the behavior is a cliff, not a slope. Below the threshold, the rounding error injected each iteration is larger than the progress the iteration makes, so the residual hits a noise floor and parks there forever. Above the threshold, the iteration count is flat: Jacobi takes 8 iterations at 10 bits and still 8 at 23 bits. Those extra 13 bits of mantissa, nearly 5× the multiplier area, buy literally nothing.
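The cliff is easy to reproduce in miniature. The sketch below runs the Newton-Raphson reciprocal update x ← x·(2 − a·x) with every operation rounded to m significant bits, using the same rounding helper as this post's harness. The operand (a = 3), the initial guess, and the error measure |a·x − 1| are our assumptions, not the setup behind the table, so the exact threshold will not necessarily match.

```python
import math


def q(x, m):
    """Round x to m significant bits."""
    if x == 0:
        return 0.0
    f, e = math.frexp(x)
    return math.ldexp(round(f * (1 << m)) / (1 << m), e)


def nr_reciprocal_iters(a, m, x0, target=1e-4, max_iters=50):
    """Iterations of x <- x * (2 - a*x) until |a*x - 1| <= target, else None."""
    x = q(x0, m)
    for i in range(1, max_iters + 1):
        t = q(a * x, m)                      # quantized multiply
        u = q(2.0 - t, m)                    # quantized subtract
        x = q(x * u, m)                      # quantized multiply
        if abs(a * x - 1.0) <= target:       # error itself is measured in full precision
            return i
    return None                              # parked on the noise floor


for m in (4, 8, 12, 16, 24):
    print(m, nr_reciprocal_iters(3.0, m, x0=0.3))
```

At 4 bits the result is `None` for a structural reason: no 4-bit value is within 10⁻⁴ of 1/3, so no amount of iterating can reach the target. Sweep m upward and watch for the point where `None` turns into a small, constant iteration count.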
The gradual version of the trade does exist: methods that accumulate their answer (iterative refinement, conjugate-gradient variants, long-running integrators) gain a roughly fixed number of correct bits per pass, so halving precision genuinely means more passes, and tighter targets pull the required precision up with them. That's where the celebrated mixed-precision recipes come from: iterate cheap and low, then refine the last digits in a few high-precision passes. Today's fastest linear-algebra records are set exactly this way.
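A minimal sketch of that recipe, using iterative refinement: the residual is computed in full double precision, while the correction is solved with arithmetic rounded to m bits. Everything specific here is an assumption for illustration: the 2×2 system is made up, a quantized Jacobi loop stands in for whatever cheap low-precision solver you would really use, and the function names are ours.

```python
import math


def q(x, m):
    """Round x to m significant bits."""
    if x == 0:
        return 0.0
    f, e = math.frexp(x)
    return math.ldexp(round(f * (1 << m)) / (1 << m), e)


A = [[4.0, 1.0], [1.0, 3.0]]          # hypothetical diagonally dominant system
b = [1.0, 2.0]
exact = [1.0 / 11.0, 7.0 / 11.0]      # closed-form solution of A x = b


def jacobi_low(r, m, sweeps=20):
    """Approximately solve A d = r with every operation rounded to m bits."""
    r = [q(v, m) for v in r]
    d = [0.0, 0.0]
    for _ in range(sweeps):
        d = [
            q(q(r[0] - q(A[0][1] * d[1], m), m) / A[0][0], m),
            q(q(r[1] - q(A[1][0] * d[0], m), m) / A[1][1], m),
        ]
    return d


def refine(m, passes):
    """Iterative refinement: full-precision residual, m-bit correction solve."""
    x = [0.0, 0.0]
    for _ in range(passes):
        r = [b[i] - (A[i][0] * x[0] + A[i][1] * x[1]) for i in range(2)]
        d = jacobi_low(r, m)
        x = [x[i] + d[i] for i in range(2)]
    return x


def max_error(x):
    return max(abs(x[i] - exact[i]) for i in range(2))


print("8-bit solve alone:", max_error(jacobi_low(b, 8)))
for passes in (1, 2, 4, 8, 12):
    print(passes, "refinement passes:", max_error(refine(8, passes)))
```

The 8-bit solve on its own is stuck at the accuracy an 8-bit number can represent. Wrapped in refinement, each pass removes a further slice of the remaining error, which is the "fixed number of correct bits per pass" behavior described above.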
The working method
- Range first. Find the min/max magnitudes your signals ever take (simulate!). That sets exponent bits, or tells you the range is narrow enough for fixed point, which deletes the exponent hardware entirely.
- Golden-model the precision. Before any RTL, run your algorithm with quantized arithmetic and find the cliff. Ten lines does it:
  ```python
  import math

  def q(x, m):
      """Round x to m significant bits (m counts the leading 1)."""
      if x == 0:
          return 0.0
      f, e = math.frexp(x)   # x == f * 2**e with 0.5 <= |f| < 1
      return math.ldexp(round(f * (1 << m)) / (1 << m), e)

  # ...wrap every multiply/add in q(...), sweep m, plot convergence
  ```
- Sit one bit above the cliff, then verify with margin. Real data is noisier than your model, so test with representative inputs and keep a guard bit or two.
- Multiply narrow, accumulate wide. The multiplier dominates area, the accumulator doesn't, so an INT8 × INT8 multiply feeding a 32-bit accumulator costs almost nothing extra and eliminates an entire class of rounding trouble. (This is precisely the shape of the MAC units in our library's upcoming fixed-point tier.)
- Measure, don't guess. Paste a bare multiplier into the playground at your candidate widths and press Synth stats: the table above took minutes to produce, and yours will too.
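Two of the steps above, "range first" and "multiply narrow, accumulate wide", reduce to short arithmetic. The sketch below is a starting point under simplifying assumptions: the exponent estimate ignores reserved codes (infinity, NaN, subnormals) and adds no safety margin, the accumulator bound assumes signed two's-complement operands at their worst-case magnitude, and both function names are ours.

```python
import math


def exponent_bits_needed(samples):
    """Exponent field width that covers every nonzero magnitude in samples."""
    mags = [abs(s) for s in samples if s != 0]
    e_hi = math.frexp(max(mags))[1]
    e_lo = math.frexp(min(mags))[1]
    span = e_hi - e_lo + 1                 # number of distinct binary exponents seen
    return max(1, math.ceil(math.log2(span)))


def accumulator_bits(a_bits, b_bits, n_terms):
    """Signed accumulator width that cannot overflow on n_terms signed products."""
    worst = n_terms * (1 << (a_bits - 1)) * (1 << (b_bits - 1))
    return worst.bit_length() + 1          # +1 for the sign bit


print(exponent_bits_needed([1e-3, 0.7, 250.0]))   # signals spanning 1e-3 .. 250
print(accumulator_bits(8, 8, 1024))               # 1024-term INT8 x INT8 dot product
```

Under this bound, a 32-bit accumulator absorbs up to 131,071 worst-case INT8 × INT8 products before overflow becomes possible, which is why the wide accumulator rarely needs more thought than that.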
Bits are the one resource where the right answer is knowable before you build. Find your cliff, stand one step above it, and spend the silicon you saved on something that makes the product better.