FPGAs in high-frequency trading: the anatomy of a nanosecond

High-frequency trading is the one industry where FPGAs aren't the alternative — they're the incumbent. If your strategy's edge is being first to react to a market event, the wire-to-wire reaction time of your system is the product, and the past decade of that arms race has been fought in FPGA fabric. This post is a tour of how those systems actually work, and what the nanoseconds are spent on.

The latency ladder

Reacting to a market-data packet with an order, measured wire to wire:

Implementation                                       Tick-to-trade
Ordinary software, kernel network stack              10–100 µs
Kernel-bypass software (DPDK/Onload, pinned cores)   1–5 µs
FPGA, full parse → strategy → order path             300 ns – 1 µs
FPGA, pre-armed trigger (pattern match → fire)       30–100 ns
Layer-1 switch doing pure fan-out/replication        ~5 ns

Two things to notice. First, each rung is roughly an order of magnitude faster than the one above it, which is why "we rewrote it in C++ with kernel bypass" stopped being competitive for the fastest strategies years ago. Second, and more important for engineering: for a given design, the FPGA numbers are fixed. A software stack has a p50 and a p99.9 separated by scheduler wakeups, cache misses and GC pauses; a synchronous pipeline takes the same number of clock cycles every single time. In a race, variance is just latency you sometimes have.

Where the nanoseconds go

At 10G Ethernet, one byte takes 0.8 ns on the wire; a 64-byte frame is ~51 ns. At a 322 MHz core clock (the natural rate for a 32-bit 10G datapath: 10.3125 Gb/s line rate / 32 bits ≈ 322 MHz), one cycle is ~3.1 ns. A 300 ns tick-to-trade budget is therefore about a hundred clock cycles, total, spent roughly like this:

MAC/PHY ingress (cut-through)        ~30-60 ns
protocol parse + symbol filter       ~10-30 ns
book update / strategy primitive     ~10-50 ns
order template fill + checksum       ~10-30 ns
MAC/PHY egress                       ~30-60 ns
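
The arithmetic behind that budget is easy to check for yourself. Here is a minimal sketch in Python that recomputes the wire times, the cycle count, and the sum of the stage estimates. The constants are the figures quoted above; the variable names are mine, not part of any tool.

```python
# Back-of-envelope model of the latency budget, using the figures quoted above.
LINE_RATE_BPS = 10e9    # 10G Ethernet data rate
CORE_CLOCK_HZ = 322e6   # core clock quoted in the text
BUDGET_NS = 300         # tick-to-trade target

byte_ns = 8 / LINE_RATE_BPS * 1e9      # time for one byte on the wire
frame_ns = 64 * byte_ns                # minimum-size Ethernet frame
cycle_ns = 1e9 / CORE_CLOCK_HZ         # one core clock period
budget_cycles = BUDGET_NS / cycle_ns   # cycles available end to end

# (low, high) estimate per stage in nanoseconds, copied from the table above.
STAGES_NS = {
    "MAC/PHY ingress": (30, 60),
    "protocol parse + symbol filter": (10, 30),
    "book update / strategy primitive": (10, 50),
    "order template fill + checksum": (10, 30),
    "MAC/PHY egress": (30, 60),
}
low_ns = sum(lo for lo, _ in STAGES_NS.values())
high_ns = sum(hi for _, hi in STAGES_NS.values())

print(f"byte: {byte_ns:.1f} ns, 64-byte frame: {frame_ns:.1f} ns")
print(f"cycle: {cycle_ns:.2f} ns, budget: {budget_cycles:.0f} cycles")
print(f"stage estimates sum to {low_ns}-{high_ns} ns")
```

The stage estimates sum to somewhere between 90 and 230 ns, so even the pessimistic end fits inside a 300 ns budget with a little room left for the wire time of the frames themselves.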

Every stage is a pipeline working on the packet as it arrives. Nothing waits for a full frame: by the time an ITCH add-order message's last byte lands, the decision logic has already seen the symbol and price fields. This is cut-through processing, and it's the single biggest philosophical difference from software, which fundamentally works store-and-forward.

The anatomy of the fast path

A representative system:

Market data in (UDP multicast). Exchange feeds like NASDAQ ITCH arrive as binary messages over UDP — friendly territory for hardware. A parser written as a streaming state machine extracts message type, symbol, price, size in the cycles they fly past. A symbol filter (a hash table or CAM in BRAM) discards the 99% of the feed you don't trade within nanoseconds of reading the field.
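
To make "extracts fields in the cycles they fly past" concrete, here is a minimal software model of a streaming parser, one byte per step. It is a sketch under assumptions: the field offsets are my reading of the ITCH 5.0 Add Order layout (check them against the spec before relying on them), the transport framing around the message is ignored, and `parse_stream` and the `watched` set are hypothetical names standing in for the state machine and the symbol filter.

```python
import struct

# Assumed Add Order ('A') layout, 36 bytes, big-endian:
# type(1) locate(2) tracking(2) timestamp(6) order_ref(8)
# side(1) shares(4) stock(8) price(4)
SIDE_OFF, SHARES_OFF, STOCK_OFF, PRICE_OFF, MSG_LEN = 19, 20, 24, 32, 36


def parse_stream(msg, watched):
    """Consume msg one byte at a time, as a hardware pipeline would.

    Returns (fields, last_index): fields is a dict, or None if the message
    was rejected; last_index is the byte at which the decision was made.
    """
    side, shares, price = None, 0, 0
    stock = bytearray()
    for i, b in enumerate(msg):
        if i == 0 and b != ord("A"):
            return None, i                    # not an Add Order: drop at byte 0
        elif i == SIDE_OFF:
            side = chr(b)
        elif SHARES_OFF <= i < STOCK_OFF:
            shares = (shares << 8) | b        # accumulate big-endian integer
        elif STOCK_OFF <= i < PRICE_OFF:
            stock.append(b)
            if i == PRICE_OFF - 1 and bytes(stock) not in watched:
                return None, i                # symbol filter: drop before price
        elif PRICE_OFF <= i < MSG_LEN:
            price = (price << 8) | b
            if i == MSG_LEN - 1:
                fields = {"side": side, "shares": shares,
                          "stock": bytes(stock), "price": price}
                return fields, i
    return None, len(msg) - 1                 # truncated message


# Example message: buy 100 shares of AAPL at 150.1200 (price in 1/10000ths).
msg = struct.pack(">cHH6sQcI8sI", b"A", 1, 0, bytes(6), 42,
                  b"B", 100, b"AAPL    ", 1501200)

hit, hit_at = parse_stream(msg, {b"AAPL    "})
miss, miss_at = parse_stream(msg, {b"MSFT    "})
print(hit, hit_at)
print(miss, miss_at)
```

The point of the model is the second return value: an unwatched symbol is rejected at the last byte of the stock field, before the price bytes have even arrived, and a watched one is fully decoded on the very byte that completes the message.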

Book building. For many strategies you don't need the whole order book in fabric — top-of-book (best bid/ask) per instrument, held in BRAM, is enough for the trigger. Deep books and analytics live in software, off the critical path.

The strategy primitive. Here's the part outsiders find anticlimactic: the in-fabric "strategy" is usually a handful of comparators — if the new ask crosses my resting threshold for symbol X, fire order template Y. The intelligence lives in software, which continuously computes thresholds and templates and writes them into the fabric through a control-plane register block (exactly the kind you'd build with our register-map generator — the fast path reads parameters from registers; the slow path updates them over AXI-Lite).
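
As a rough illustration of that split, here is a Python model of a per-symbol trigger. Everything in it is hypothetical: `TriggerRegs` stands in for a register block, `write_regs` for the slow-path update over the control plane, and `on_ask` for the comparator in the fast path. The one-shot disarm after firing is my assumption about a sensible safety default, not something every design does.

```python
from dataclasses import dataclass


@dataclass
class TriggerRegs:
    """Hypothetical per-symbol parameter block written by the slow path."""
    armed: bool = False
    max_ask: int = 0       # fire when ask <= max_ask (fixed-point price)
    template_id: int = 0   # which pre-built order to send


class FastPath:
    def __init__(self):
        self.regs = {}     # symbol -> TriggerRegs

    def write_regs(self, symbol, **fields):
        """Slow path: software updates thresholds and templates."""
        regs = self.regs.setdefault(symbol, TriggerRegs())
        for name, value in fields.items():
            setattr(regs, name, value)

    def on_ask(self, symbol, ask):
        """Fast path: one lookup and one comparison per market-data update.

        Returns the template id to fire, or None.
        """
        regs = self.regs.get(symbol)
        if regs is not None and regs.armed and ask <= regs.max_ask:
            regs.armed = False          # assumption: disarm until re-armed
            return regs.template_id
        return None


fp = FastPath()
fp.write_regs("AAPL", armed=True, max_ask=1500000, template_id=7)
print(fp.on_ask("AAPL", 1500100))   # ask above threshold: no fire
print(fp.on_ask("AAPL", 1499900))   # ask at or below threshold: fire
print(fp.on_ask("AAPL", 1499900))   # already fired: stays quiet
```

Notice how little the fast path knows. It has no idea why the threshold is what it is; it only compares and fires.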

Order out (TCP, tamed). Order-entry protocols usually ride TCP, a real engineering challenge in hardware — so the fast path uses a hardware TCP engine handling only the hot connections, with sequence numbers, templates and checksums precomputed. The order message sits pre-built in fabric with just price/size/timestamp fields to patch, so "send" means patching a few bytes and updating checksums, not building a packet.
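
"Patching a few bytes and updating checksums" works because the Internet checksum can be updated incrementally: change one 16-bit word and you can fix the checksum from the old value alone, without re-reading the packet. Here is a small Python sketch using the update formula from RFC 1624. The template bytes and the function names are made up for illustration, and a real TCP checksum also covers a pseudo-header, which this ignores.

```python
def fold(total):
    """Fold carries back in until the value fits in 16 bits."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def checksum(data):
    """Full Internet checksum over big-endian 16-bit words (even length)."""
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    return ~fold(total) & 0xFFFF


def patch_word(buf, offset, new_word, old_csum):
    """Overwrite the 16-bit word at an even offset and return the new checksum.

    Uses RFC 1624 equation 3: HC' = ~(~HC + ~m + m'),
    where m is the old word and m' the new one.
    """
    old_word = (buf[offset] << 8) | buf[offset + 1]
    buf[offset:offset + 2] = new_word.to_bytes(2, "big")
    total = (~old_csum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    return ~fold(total) & 0xFFFF


# Hypothetical pre-built order template; offset 10 holds a price field.
template = bytearray(range(40))
csum = checksum(template)                 # computed once, ahead of time

csum = patch_word(template, 10, 0xBEEF, csum)
print(hex(csum), csum == checksum(template))
```

In fabric the same idea is a couple of adders: the cost of the update depends on how many words you patch, not on the length of the frame.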

The dirty trick: speculative transmission. The most famous nanosecond-shaving technique: start clocking the order frame out before the decision is final, and if the trigger evaluates false mid-frame, deliberately corrupt the Ethernet FCS so the exchange's switch drops it. The frame check sequence is just a CRC-32 — and computing it correctly (or deliberately incorrectly) at line rate is one of those places where a verified parallel CRC generator earns its keep.
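
The mechanics of the abort are simple enough to model in software. Below is a minimal sketch using Python's `zlib.crc32`, which implements the same CRC-32 as the Ethernet FCS. The function names are hypothetical, and inverting every FCS byte is just one of many ways to guarantee a mismatch; this models the check a receiver performs, not the timing of a real MAC.

```python
import zlib


def fcs(frame):
    """Ethernet-style FCS: CRC-32 of the frame, least significant byte first."""
    return zlib.crc32(frame).to_bytes(4, "little")


def emit(frame, trigger_ok):
    """Append a good FCS if the trigger held, a deliberately bad one if not."""
    good = fcs(frame)
    if trigger_ok:
        return frame + good
    return frame + bytes(b ^ 0xFF for b in good)   # invert: cannot match


def receiver_accepts(wire):
    """What a receiving MAC checks: recompute the CRC and compare the trailer."""
    body, trailer = wire[:-4], wire[-4:]
    return fcs(body) == trailer


frame = bytes(60)   # placeholder 60-byte frame body (64 bytes with FCS)
print(receiver_accepts(emit(frame, trigger_ok=True)))
print(receiver_accepts(emit(frame, trigger_ok=False)))
```

The FCS is the last thing on the wire, which is exactly why the trick works: the decision can arrive as late as the final four bytes of the frame.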

The unglamorous parts that make it work

The honest caveats

The arms race has brutal economics. Colocation, exchange cross-connects and low-latency market-data licenses dwarf the hardware cost; the engineering iteration loop (hours per build vs. seconds per compile) means a small strategy tweak that takes a quant minutes in Python takes days to land in fabric. That's why the industry converged on the hybrid: software decides what to want, hardware decides when to fire. Firms only push logic into the FPGA when its latency sensitivity justifies the iteration cost — and the fastest firms have pushed further still, into custom ASICs for the most stable parts of the pipeline.

And not every strategy needs any of this. Market making at the top of the ladder does; anything holding positions for minutes doesn't. Plenty of profitable trading runs happily on kernel-bypass software, and knowing which regime you're in is worth more than any nanosecond.

Why it's worth studying even if you never trade

An HFT fast path is digital design distilled: cut-through streaming, single-domain clocking, latency-exact pipelines, hardware/software partitioning with a clean register interface, and verification with real consequences. Every one of those skills transfers to radar, 5G, storage and instrumentation — the other places where the answer must be correct and on time, every time.

If this is the kind of engineering you enjoy, the toolbox on this site is largely made of its building blocks. Start with the FIFO calculator and the CRC generator, and time yourself.