Block RAM vs distributed RAM: where your memory actually goes

Write reg [7:0] mem [0:255]; and the synthesizer has a choice to make: build it from dedicated block RAM, or assemble it from the LUTs themselves (distributed RAM). The two have different superpowers, and knowing which you'll get — and how to steer the choice — is one of those small skills that pays off in every design.

Two kinds of silicon

Block RAM is dedicated SRAM: 36 Kb blocks (AMD), 20 Kb M20K (Intel), 18 Kb EBR (ECP5). True dual-port, optional output registers, byte enables. You have a fixed number of blocks and each is all-or-nothing: a 40-entry scratchpad consumes a whole block, which is why the BRAM estimator exists.

Distributed RAM exploits a beautiful fact: a LUT is a small memory (that's the glossary definition). Fabrics that support it (AMD's SLICEM, some others) let you write to it at runtime — a 6-LUT becomes a 64x1 RAM. Assemble a few and you have a small, wide, everywhere-available memory that lives inside the logic it serves.

The decisive differences

	Block RAM	Distributed RAM
Best size	kilobytes and up	up to a few hundred entries deep
Read	synchronous (1-cycle latency)	asynchronous (same-cycle)
Location	fixed columns on the die	anywhere, inside your logic
Cost model	whole blocks	LUTs you'd otherwise use for logic
Ports	true dual-port	1 write port, 1 to a few read ports

The sleeper issue is the read latency. BRAM reads are registered — address in this cycle, data next cycle. Distributed RAM reads are combinational — data falls out the same cycle, straight into your logic. That single difference decides most close calls:

A register file needing same-cycle read-after-decode? Distributed.
A FIFO buffer of 2K samples? BRAM, obviously.
A 16-entry coefficient table feeding a DSP? Either works; distributed keeps it adjacent to the multiplier.
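
To make the same-cycle read concrete, here is a minimal sketch of a small register file written so that a tool with LUT RAM support would typically map it to distributed RAM. The module name, port names, and the 16x8 size are illustrative assumptions, not taken from any vendor template. Tools usually build the second read port by replicating the memory.

```verilog
// Hypothetical 16x8 register file: 1 write port, 2 asynchronous read ports.
// All names and sizes here are illustrative assumptions.
module regfile_dist #(
    parameter WIDTH = 8,
    parameter DEPTH = 16,
    parameter AW    = 4                 // address width, log2(DEPTH)
) (
    input  wire             clk,
    input  wire             we,
    input  wire [AW-1:0]    waddr,
    input  wire [WIDTH-1:0] wdata,
    input  wire [AW-1:0]    raddr_a,
    input  wire [AW-1:0]    raddr_b,
    output wire [WIDTH-1:0] rdata_a,
    output wire [WIDTH-1:0] rdata_b
);
    reg [WIDTH-1:0] mem [0:DEPTH-1];

    // Writes are still synchronous, even in distributed RAM.
    always @(posedge clk) begin
        if (we) mem[waddr] <= wdata;
    end

    // Asynchronous reads: data follows the address in the same cycle.
    assign rdata_a = mem[raddr_a];
    assign rdata_b = mem[raddr_b];
endmodule
```

Because the reads are plain continuous assignments, whatever logic consumes rdata_a and rdata_b sits in the same combinational path as the address decode.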

What the tools decide (and how to overrule them)

Synthesizers use depth thresholds: tiny memories → distributed, big ones → BRAM, with a gray zone (roughly 64–1024 entries) decided by heuristics. When you disagree:

(* ram_style = "block" *)       reg [7:0] big_mem   [0:511];  // AMD/Xilinx: force block RAM
(* ram_style = "distributed" *) reg [7:0] small_mem [0:63];   // AMD/Xilinx: force LUT RAM
// Intel/Quartus: (* ramstyle = "M20K" *) or (* ramstyle = "MLAB" *); "logic" builds it from logic cells

Two classic interventions:

Out of BRAMs? Push the small-but-many memories (FIFOs of depth 32, lookup tables) to distributed and reclaim whole blocks — check the tiling math with the estimator first.
Fmax suffering on an async-read path? That combinational read is in your critical path; move to BRAM and absorb the pipeline stage, or register the distributed read yourself.
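
The "register the distributed read yourself" option can be sketched as follows. This is an assumption-laden example rather than a vendor template: the module name, ports, and 32x8 size are hypothetical, and the ram_style attribute shown is the AMD/Xilinx spelling.

```verilog
// Hypothetical 32x8 LUT RAM with a registered read output.
// Names and sizes are illustrative; ram_style is the AMD/Xilinx attribute.
module lutram_regread #(
    parameter WIDTH = 8,
    parameter DEPTH = 32,
    parameter AW    = 5                 // address width, log2(DEPTH)
) (
    input  wire             clk,
    input  wire             we,
    input  wire [AW-1:0]    waddr,
    input  wire [WIDTH-1:0] wdata,
    input  wire [AW-1:0]    raddr,
    output reg  [WIDTH-1:0] rdata
);
    // Ask for LUT RAM explicitly; without the attribute, a registered
    // read could let the tool pick block RAM instead.
    (* ram_style = "distributed" *) reg [WIDTH-1:0] mem [0:DEPTH-1];

    // The combinational read that distributed RAM provides natively.
    wire [WIDTH-1:0] rdata_async = mem[raddr];

    always @(posedge clk) begin
        if (we) mem[waddr] <= wdata;
        rdata <= rdata_async;           // pipeline register cuts the read path
    end
endmodule
```

The cost is one cycle of read latency, the same as BRAM, but the memory stays in the fabric next to the logic that uses it and no block is consumed.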

The inference recipe that always works

For BRAM, keep the template simple and registered — the Verilog cheatsheet version:

always @(posedge clk) begin
    if (we) mem[waddr] <= wdata;
    rdata <= mem[raddr];        // registered read → BRAM
end

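With nonblocking assignments, the template above is read-first: when a read and a write hit the same address in the same cycle, rdata gets the old contents, and splitting the write and the read into separate always blocks on the same clock does not change that. For write-first behavior, say so explicitly, either by forwarding wdata to rdata on a write or by registering the read address and reading mem through it; vendor inference guides list the exact templates their tools recognize. And if you need a memory with asynchronous read (assign rdata = mem[raddr];), you've just asked for distributed RAM whether you meant to or not. On iCE40, note there's no distributed RAM at all: small memories cost real EBR blocks or flops, which changes the math for tiny FPGAs (board picker has the capacities).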
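
On the collision behavior: one common way to get write-first reads is to spell the forwarding out in the RTL. The sketch below assumes a single-port memory (one shared address); the module name, ports, and 512x8 size are hypothetical, and whether a given tool maps it to the block RAM's native write-first mode should be checked in the synthesis report.

```verilog
// Hypothetical single-port 512x8 memory with explicit write-first behavior.
// Names and sizes are illustrative; ram_style is the AMD/Xilinx attribute.
module bram_write_first #(
    parameter WIDTH = 8,
    parameter DEPTH = 512,
    parameter AW    = 9                 // address width, log2(DEPTH)
) (
    input  wire             clk,
    input  wire             we,
    input  wire [AW-1:0]    addr,
    input  wire [WIDTH-1:0] wdata,
    output reg  [WIDTH-1:0] rdata
);
    (* ram_style = "block" *) reg [WIDTH-1:0] mem [0:DEPTH-1];

    always @(posedge clk) begin
        if (we) begin
            mem[addr] <= wdata;
            rdata     <= wdata;         // write-first: output the new data
        end else begin
            rdata     <= mem[addr];     // normal registered read
        end
    end
endmodule
```

Dropping the forwarding branch and always assigning rdata <= mem[addr] gives the read-first variant, where a write cycle returns the old contents instead.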

Rule-of-thumb summary

< ~64 entries, or need same-cycle read: distributed.
≥ ~512 entries, or dual-port, or wide+deep: block RAM, output register on.
In between: whichever resource you have more of — and make the choice explicit with ram_style, so a tool-version upgrade doesn't quietly rebalance your design.

Memory placement is one of the few areas where a one-line attribute can free a sizable share of your chip. Worth the five minutes.