The neural micro-kit: turning a trained network into verified hardware

We've written twice about neural networks and FPGAs. First the idea: the fabric is already shaped like a neural network, LUTs for neurons, routing for synapses. Then the discipline: how many bits you actually need, and why "multiply narrow, accumulate wide" is the golden rule of fixed-point datapaths. Today those posts become code you can clone: libfpga v0.3.0, the neural micro-kit.

From article to artifact

The kit is two things. First, a set of small, verified primitives, the pieces a neural layer is actually made of:

lfpga_mac: a signed multiply-accumulate. This is the atom. Every dot product, every neuron, every convolution is a pile of MACs. It multiplies narrow (8-bit inputs) and accumulates wide (32-bit), so no rounding happens until you deliberately requantize.
lfpga_relu: ReLU, leaky, and clipped activations, each folding the requantize-to-layer-format step in.
lfpga_fix_resize / fix_mult / fix_add: the fixed-point arithmetic underneath. They round half-up and saturate; they never wrap.
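To make "multiply narrow, accumulate wide" concrete, here is a minimal Python sketch of the arithmetic these primitives are described as performing. It is not the library's RTL: the function names mac, fix_resize and wrap_signed, and the choice of dropping 4 fractional bits, are assumptions for illustration.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Interpret value as a two's-complement integer of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value >> (bits - 1) else value


def mac(acc: int, a: int, b: int, acc_bits: int = 32) -> int:
    """Signed multiply-accumulate: 8-bit operands, wide accumulator."""
    assert -128 <= a <= 127 and -128 <= b <= 127
    return wrap_signed(acc + a * b, acc_bits)


def fix_resize(value: int, shift: int = 4, out_bits: int = 8) -> int:
    """Drop `shift` fractional bits with round-half-up, then saturate."""
    # Python's >> floors, so this computes floor(value / 2**shift + 0.5).
    rounded = (value + (1 << (shift - 1))) >> shift
    lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    return max(lo, min(hi, rounded))


# Dot product of two 8-bit vectors, accumulated wide.
xs = [100, 100, 100, 100]
ws = [100, 100, 100, 100]
acc = 0
for x, w in zip(xs, ws):
    acc = mac(acc, x, w)

print(acc)                               # 40000: fits easily in 32 bits
print(wrap_signed(acc, 16))              # -25536: a 16-bit accumulator would have wrapped
print(fix_resize(acc))                   # 127: saturates instead of wrapping
print(fix_resize(24), fix_resize(-24))   # 2 -1: ties round toward +infinity
```

The point of the wide accumulator shows up in the second line of output: four modest 8-bit products already overflow 16 bits, so rounding and saturation are deferred to a single, deliberate resize at the end.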

Each ships the way everything in the library does: a self-checking testbench, Verilator-clean lint, and a Yosys synthesis check with honest LUT/FF numbers. The MAC is 326 LUT4s and 32 flip-flops; ReLU is 129 LUT4s; the whole tier is small enough to read in an afternoon.

The part that's actually new: a generator

Primitives are necessary but not sufficient. The interesting problem is composition: given a trained network, produce the hardware that runs it, and prove the hardware is correct. So the second half of the kit is gen/mlp_gen.py, a code generator.

Feed it a JSON spec, the network shape plus weights already quantized to Q4.4 integers, and it emits two files:

a pipelined Verilog inference core that composes lfpga_mac, lfpga_relu and lfpga_fix_resize into your exact topology, one registered stage per layer;
a self-checking testbench that runs a bit-exact software model of the same network and compares the RTL against it on every test vector you supplied.

python3 gen/mlp_gen.py my_network.json out/
# -> out/my_network.v  and  out/tb_my_network.v (self-checks vs the model)
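The spec expects weights that are already Q4.4 integers. Assuming Q4.4 here means a signed 8-bit value with 4 fractional bits (scale 16, range -8.0 to +7.9375), a float-to-Q4.4 conversion might look like the sketch below. The helper names to_q4_4 and from_q4_4 and the round-half-up choice are illustrative assumptions, and the JSON field names that mlp_gen.py accepts are not shown here; the example in the repo is the authority on those.

```python
import math

SCALE = 1 << 4            # assumed: 4 fractional bits
Q_MIN, Q_MAX = -128, 127  # assumed: signed 8-bit storage


def to_q4_4(x: float) -> int:
    """Quantize a float to a Q4.4 integer: round half-up, then saturate."""
    q = math.floor(x * SCALE + 0.5)
    return max(Q_MIN, min(Q_MAX, q))


def from_q4_4(q: int) -> float:
    """Recover the real value a Q4.4 integer represents."""
    return q / SCALE


weights = [0.5, -1.25, 3.14159, 9.0, -9.0]
quantized = [to_q4_4(w) for w in weights]

print(quantized)                          # [8, -20, 50, 127, -128]
print([from_q4_4(q) for q in quantized])  # [0.5, -1.25, 3.125, 7.9375, -8.0]
```

Values outside the representable range clamp to the nearest end rather than wrapping, and 3.14159 lands on 3.125: the resolution of this format is 1/16.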

The software model isn't approximate. It performs the identical shifts, the identical rounding, the identical saturation as the RTL, because both were written to the same fixed-point contract. When the testbench prints TB PASS, it means the silicon-ready core matched the model bit for bit on every vector tested, not "looks plausible on a waveform."
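As a rough picture of what a bit-exact software model looks like, the sketch below runs a tiny 2-2-1 MLP entirely in integers, applying the same shift, round-half-up, and saturate step at each layer boundary. The weights are hand-picked to compute XOR; they are not the trained weights the kit ships. The function names, the Q4.4 layout, and the Q8.8 bias format are assumptions for illustration, not the generator's actual model.

```python
FRAC = 4          # assumed Q4.4: 4 fractional bits
ONE = 1 << FRAC   # 1.0 in Q4.4


def requantize(acc: int) -> int:
    """Q8.8 accumulator -> Q4.4: round half-up, then saturate to 8 bits."""
    rounded = (acc + (1 << (FRAC - 1))) >> FRAC
    return max(-128, min(127, rounded))


def dense(inputs, weights, biases, relu=True):
    """One layer: wide MAC per neuron, optional ReLU, one requantize at the end."""
    outputs = []
    for row, bias in zip(weights, biases):
        acc = bias                    # bias is pre-scaled to Q8.8
        for x, w in zip(inputs, row):
            acc += x * w              # Q4.4 * Q4.4 -> Q8.8
        if relu:
            acc = max(0, acc)
        outputs.append(requantize(acc))
    return outputs


# Hand-picked XOR: h1 = relu(a + b), h2 = relu(a + b - 1), y = h1 - 2*h2
W1 = [[ONE, ONE], [ONE, ONE]]
B1 = [0, -ONE * ONE]
W2 = [[ONE, -2 * ONE]]
B2 = [0]


def model(a: int, b: int) -> int:
    hidden = dense([a, b], W1, B1)
    return dense(hidden, W2, B2, relu=False)[0]


vectors = [(0, 0), (0, ONE), (ONE, 0), (ONE, ONE)]
expected = [model(a, b) for a, b in vectors]
print(expected)   # [0, 16, 16, 0], i.e. 0.0, 1.0, 1.0, 0.0 in Q4.4
```

A self-checking testbench in this style feeds the same vectors to the RTL and requires each output word to equal the corresponding entry of expected exactly, with no tolerance.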

And because the generator and its example live in the repo, continuous integration regenerates the network and re-verifies it on every commit. The generator can't silently rot. That's the whole pitch of doing this in public: the trust is mechanical.

It's the same idea as fpga-neuron, generalized

If you read our fpga-neuron walkthrough (a single hand-built XOR network, trained in pure Python and synthesized to 832 LUTs of pure combinational logic), this is that, promoted to a tool. fpga-neuron shows you the whole journey by hand so you understand it; once you do, the micro-kit lets you skip the hand-work for any MLP. The example the kit ships is, fittingly, the same XOR network, now produced by the generator instead of typed by a human.

What it is and isn't (yet)

Honesty, as always. This is a micro-kit: it targets small, fully-connected networks, the regime where FPGAs genuinely shine, kHz-to-MHz inference at fixed, tiny latency and milliwatts of power. It is not (today) a convolutional accelerator or an MNIST-in-a-weekend framework; those are more of everything, and they're where the kit grows next. What it is: a correct, verified, readable foundation, and a generator that makes the correctness reproducible.

Try it in five minutes

git clone https://github.com/libfpga/libfpga
cd libfpga && make gen        # regenerate + verify the example

Or paste lfpga_mac straight into the playground and watch a multiply-accumulate do its thing. The building blocks of machine learning are smaller and more knowable than the mystique suggests: they're multipliers, adders, and a comparator, and now they're a library.

Follow @libfpga for what comes next.