Block RAM vs distributed RAM: where your memory actually goes
Write reg [7:0] mem [0:255]; and the synthesizer has a choice to make:
build it from dedicated block RAM, or assemble it from
the LUTs themselves (distributed RAM). The two have
different superpowers, and knowing which you'll get — and how to steer the
choice — is one of those small skills that pays off in every design.
Two kinds of silicon
Block RAM is dedicated SRAM: 36 Kb blocks (AMD), 20 Kb M20K (Intel), 18 Kb EBR (ECP5). True dual-port, optional output registers, byte enables. You have a fixed number of blocks and each is all-or-nothing: a 40-entry scratchpad consumes a whole block, which is why the BRAM estimator exists.
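To make the all-or-nothing cost concrete, here is a rough sketch of the tiling arithmetic in simulation-only Verilog. The names bram_estimate and blocks_needed are hypothetical, and the tile shapes passed in (1024 x 36 and 2048 x 18, two ways of arranging 36 Kb) are assumptions for illustration. A real estimator tries every shape the device supports and also accounts for port modes.

```verilog
// Simulation-only sketch: count blocks for a depth x width memory,
// assuming every block is configured as one tile_depth x tile_width tile.
module bram_estimate;
    // Ceiling division for positive integers.
    function integer ceil_div(input integer a, input integer b);
        ceil_div = (a + b - 1) / b;
    endfunction

    // Hypothetical helper: tiles needed to cover the memory in both dimensions.
    function integer blocks_needed(input integer depth, input integer width,
                                   input integer tile_depth, input integer tile_width);
        blocks_needed = ceil_div(depth, tile_depth) * ceil_div(width, tile_width);
    endfunction

    initial begin
        // A 40 x 8 scratchpad still occupies one whole block.
        $display("40x8 on 1024x36    -> %0d", blocks_needed(40, 8, 1024, 36));   // 1
        // The same 2048 x 16 buffer costs 2 tiles in one shape, 1 in another.
        $display("2048x16 on 1024x36 -> %0d", blocks_needed(2048, 16, 1024, 36)); // 2
        $display("2048x16 on 2048x18 -> %0d", blocks_needed(2048, 16, 2048, 18)); // 1
    end
endmodule
```

The last two lines show why the choice of tile shape matters: the count depends on how depth and width divide into the tile, not only on the total number of bits.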
Distributed RAM exploits a beautiful fact: a LUT is a small memory (that's the glossary definition). Fabrics that support it (AMD's SLICEM, some others) let you write to it at runtime — a 6-LUT becomes a 64x1 RAM. Assemble a few and you have a small, wide, everywhere-available memory that lives inside the logic it serves.
The decisive differences
| | Block RAM | Distributed RAM |
|---|---|---|
| Best size | kilobytes and up | up to a few hundred entries deep |
| Read | synchronous (1-cycle latency) | asynchronous (same-cycle) |
| Location | fixed columns on the die | anywhere, inside your logic |
| Cost model | whole blocks | LUTs you'd otherwise use for logic |
| Ports | true dual-port | 1 write + 1-few reads |
The sleeper issue is the read latency. BRAM reads are registered — address in this cycle, data next cycle. Distributed RAM reads are combinational — data falls out the same cycle, straight into your logic. That single difference decides most close calls:
- A register file needing same-cycle read-after-decode? Distributed.
- A FIFO buffer of 2K samples? BRAM, obviously.
- A 16-entry coefficient table feeding a DSP? Either works; distributed keeps it adjacent to the multiplier.
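The latency difference is easiest to see side by side. The two modules below are a minimal sketch; the module names, parameter names, and default sizes are illustrative assumptions, and what a given tool actually infers still depends on its thresholds and attributes.

```verilog
// Synchronous read: rdata changes on the clock edge after raddr is presented.
// This is the shape block RAM expects.
module ram_sync_read #(parameter AW = 9, parameter DW = 8) (
    input  wire          clk,
    input  wire          we,
    input  wire [AW-1:0] waddr,
    input  wire [AW-1:0] raddr,
    input  wire [DW-1:0] wdata,
    output reg  [DW-1:0] rdata
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    always @(posedge clk) begin
        if (we) mem[waddr] <= wdata;
        rdata <= mem[raddr];        // registered read, one cycle of latency
    end
endmodule

// Asynchronous read: rdata follows raddr combinationally, in the same cycle.
// This is the shape distributed RAM provides.
module ram_async_read #(parameter AW = 4, parameter DW = 8) (
    input  wire          clk,
    input  wire          we,
    input  wire [AW-1:0] waddr,
    input  wire [AW-1:0] raddr,
    input  wire [DW-1:0] wdata,
    output wire [DW-1:0] rdata
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    always @(posedge clk)
        if (we) mem[waddr] <= wdata; // writes are still clocked

    assign rdata = mem[raddr];       // combinational read, no latency
endmodule
```

A register file would use the second form so that the decoded address and the operand are valid in the same cycle; a sample buffer would use the first and budget for the extra cycle.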
What the tools decide (and how to overrule them)
Synthesizers use depth thresholds: tiny memories → distributed, big ones → BRAM, with a gray zone (roughly 64–1024 entries) decided by heuristics. When you disagree:
(* ram_style = "block" *) reg [7:0] mem [0:511]; // AMD/Xilinx
(* ram_style = "distributed" *) reg [7:0] mem [0:63];
// Intel: (* ramstyle = "M20K" *) or "MLAB"; Quartus also accepts "logic" (build the memory from registers)
Two classic interventions:
- Out of BRAMs? Push the small-but-many memories (FIFOs of depth 32, lookup tables) to distributed and reclaim whole blocks — check the tiling math with the estimator first.
- Fmax suffering on an async-read path? That combinational read is in your critical path; move to BRAM and absorb the pipeline stage, or register the distributed read yourself.
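For the second intervention, registering a distributed read yourself might look like the sketch below. The module name and parameters are illustrative assumptions; the ram_style attribute is the AMD/Xilinx form shown earlier, and other tools would need their own attribute.

```verilog
// Distributed RAM with a hand-added output register. The read itself is
// still combinational; the register takes it off the downstream critical
// path at the cost of one cycle of latency.
module dist_ram_registered_read #(parameter AW = 6, parameter DW = 8) (
    input  wire          clk,
    input  wire          we,
    input  wire [AW-1:0] waddr,
    input  wire [AW-1:0] raddr,
    input  wire [DW-1:0] wdata,
    output reg  [DW-1:0] rdata
);
    (* ram_style = "distributed" *) reg [DW-1:0] mem [0:(1<<AW)-1];

    // Combinational read straight out of the LUT RAM.
    wire [DW-1:0] rdata_comb = mem[raddr];

    always @(posedge clk) begin
        if (we) mem[waddr] <= wdata;
        rdata <= rdata_comb;        // pipeline stage added by hand
    end
endmodule
```

Without the attribute this is the same registered-read shape a tool could map to block RAM, so the attribute is what pins it to LUTs here.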
The inference recipe that always works
For BRAM, keep the template simple and registered — the Verilog cheatsheet version:
always @(posedge clk) begin
    if (we) mem[waddr] <= wdata;
    rdata <= mem[raddr]; // registered read → BRAM
end
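With nonblocking assignments, this template returns the old contents
when a read and a write hit the same address in the same cycle
(read-first), whether the two ports share one always block or sit in
separate ones. Write-first behavior has to be coded explicitly, for
example by forwarding wdata to the output when the addresses match. And
if you need a memory with asynchronous read
(assign rdata = mem[raddr];), you've just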
asked for distributed RAM whether you meant to or not. On iCE40, note
there's no distributed RAM at all: small memories cost real EBR blocks
or flops, which changes the math for tiny FPGAs
(board picker has the capacities).
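Collision behavior can be made explicit rather than left to the template. The sketch below is one way to do it, under assumptions: the module name ram_sdp_collision and the WRITE_FIRST parameter are hypothetical, and whether a tool maps the forwarding onto a block RAM's native write mode or builds a small bypass in fabric is tool-dependent.

```verilog
// Simple dual-port RAM with selectable same-address collision behavior.
// WRITE_FIRST = 1: a read that collides with a write returns the new data.
// WRITE_FIRST = 0: it returns the old contents (read-first).
module ram_sdp_collision #(
    parameter AW = 9,
    parameter DW = 8,
    parameter WRITE_FIRST = 1
) (
    input  wire          clk,
    input  wire          we,
    input  wire [AW-1:0] waddr,
    input  wire [AW-1:0] raddr,
    input  wire [DW-1:0] wdata,
    output reg  [DW-1:0] rdata
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    always @(posedge clk) begin
        if (we) mem[waddr] <= wdata;

        if (WRITE_FIRST != 0 && we && waddr == raddr)
            rdata <= wdata;         // forward the value being written
        else
            rdata <= mem[raddr];    // nonblocking read sees the old contents
    end
endmodule
```

Writing 8'h11 and then 8'h22 to one address while reading that same address gives 8'h22 on the second cycle with WRITE_FIRST = 1 and 8'h11 with WRITE_FIRST = 0.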
Rule-of-thumb summary
- < ~64 entries, or need same-cycle read: distributed.
- ≥ ~512 entries, or dual-port, or wide+deep: block RAM, output register on.
- In between: whichever resource you have more of. Make the choice explicit with ram_style, so a tool-version upgrade doesn't quietly rebalance your design.
Memory placement is one of the few areas where a one-line attribute can free 20% of your chip. Worth the five minutes.