The timing-closure triage checklist: what to do when slack goes negative

The timing report says -0.42 ns and the demo is Friday. Timing closure rewards systematic triage over heroics — the fastest engineers I know all work the same ladder, in the same order. Here it is.

Step 0: trust, but verify the constraints

Before optimizing anything, confirm the failure is real:

check_timing — any unconstrained endpoints? Unconstrained paths are unchecked paths; the report may be missing worse problems.
report_exceptions -ignored — a constraint that matches zero objects fails silently. Renamed a module lately?
report_clock_interaction — are cross-domain paths being timed that shouldn't be? A missing set_clock_groups makes the tool chase impossible CDC paths while real ones starve.
Is the clock even right? A 100 MHz constraint on a 125 MHz board clock closes beautifully and fails in hardware. (Sanity-check with the timing converter.)

In my experience, roughly a quarter of "timing problems" end here, fixed by making the constraints describe reality. Full command syntax is in the Vivado TCL cheatsheet.
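
To see why a wrong clock constraint hides real failures, it helps to redo the arithmetic by hand. The sketch below is a rough model, not tool output: it assumes a single-cycle path launched and captured by the same clock, so the timing requirement shrinks by exactly the difference between the two periods. The function names are made up for illustration.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency in MHz."""
    if freq_mhz <= 0:
        raise ValueError("frequency must be positive")
    return 1000.0 / freq_mhz


def slack_at_real_clock(reported_slack_ns: float,
                        constrained_mhz: float,
                        actual_mhz: float) -> float:
    """Re-derive slack for the clock the board actually runs.

    Assumption: single-cycle, same-clock path, so the requirement changes
    by exactly the difference between the two clock periods.
    """
    lost_budget_ns = period_ns(constrained_mhz) - period_ns(actual_mhz)
    return reported_slack_ns - lost_budget_ns


# A path that "passes" with +0.5 ns under a 100 MHz constraint,
# re-evaluated for the 125 MHz clock that is really on the board.
real_slack = slack_at_real_clock(0.5, constrained_mhz=100.0, actual_mhz=125.0)
print(f"{real_slack:+.2f} ns")  # prints -1.50 ns: 2 ns of budget never existed
```

Every path with less than 2 ns of reported slack is a hardware failure in waiting, which is why this check comes before any optimization.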

Step 1: read the failing path like a story

Run report_timing -max_paths 10 and actually read the result:

Logic levels vs routing. The delay split tells you the fix. Many logic levels → restructure or pipeline. Mostly routing delay across few levels → a placement/congestion problem, which needs different medicine.
What are the endpoints? A path into a BRAM/DSP may just need the primitive's optional output register enabled — a checkbox, not a redesign.
Is it one path or a family? Ten failures through one fat comparator is one fix. Ten unrelated paths at -0.05 ns is a congestion/seed story.
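
The three questions above can be sketched as a small triage helper. This is an illustration under loud assumptions: the FailingPath record is hypothetical (you would fill it in by hand or from your own report parser), and the thresholds of 8 logic levels and a 60% routing share are placeholders, not tool or vendor guidance.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FailingPath:
    """Hypothetical record of one failing path, copied from a timing report."""
    endpoint: str
    slack_ns: float
    logic_delay_ns: float
    net_delay_ns: float
    logic_levels: int
    through_cells: tuple  # names of cells the path passes through


def diagnose(path, deep_logic_levels=8, route_share_threshold=0.6):
    """Suggest which kind of fix the delay split points to.

    Both thresholds are illustrative placeholders; tune them per device.
    """
    total = path.logic_delay_ns + path.net_delay_ns
    route_share = path.net_delay_ns / total if total > 0 else 0.0
    if path.logic_levels >= deep_logic_levels:
        return "logic-dominated: restructure or pipeline"
    if route_share >= route_share_threshold:
        return "routing-dominated: look at placement/congestion"
    return "mixed: inspect by hand"


def shared_cells(paths, min_share=0.5):
    """Cells that appear on at least min_share of the failing paths."""
    if not paths:
        return []
    counts = Counter(cell for p in paths for cell in set(p.through_cells))
    return [cell for cell, n in counts.most_common()
            if n / len(paths) >= min_share]


paths = [
    FailingPath("acc_reg[63]/D", -0.42, 2.9, 1.0, 12, ("cmp0", "mux1")),
    FailingPath("acc_reg[62]/D", -0.40, 2.8, 1.1, 12, ("cmp0",)),
    FailingPath("led_reg/D", -0.05, 0.6, 2.4, 2, ("sync_x",)),
]
for p in paths:
    print(p.endpoint, "->", diagnose(p))
print("family candidates:", shared_cells(paths))
```

A non-empty family list means one fix probably clears several failures; an empty one with many small violations points back at congestion or seeds.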

Step 2: the cheap wins, in order of cheapness

Enable primitive output registers. BRAMs and DSPs have free pipeline stages; using them is a netlist attribute or one line of RTL.
Retime around the critical path. Move a register across a chunk of logic; total latency is unchanged and the worst path shrinks, by up to half when the logic splits evenly.
Reduce fanout of the hot net. Use max_fanout attributes or manual register duplication for enables/resets that fan out to thousands of loads (see reset strategies; often it's the reset net).
Pipeline the arithmetic. A 64-bit add-compare-select in one cycle at 300 MHz is asking a lot; two stages make it easy. The multipliers lesson shows the area/time dial.
One-hot the hot FSM. Wide state decoders shrink to single-bit tests (FSM generator will re-emit yours one-hot).
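
The retiming and pipelining moves above come down to the same arithmetic: where a register sits in a chain of logic decides the worst stage. The toy model below assumes a straight chain of blocks with known delays in picoseconds and ignores clock-to-Q, setup, and routing changes, so treat it as intuition rather than a prediction of what the tool will report. The function name and the delay numbers are invented for the example.

```python
def best_register_position(delays_ps):
    """Find where one register best splits a chain of logic blocks.

    Returns (index, worst_stage_ps): the register goes after
    delays_ps[:index]. Simplified model: a linear chain, with
    clock-to-Q and setup time ignored.
    """
    if len(delays_ps) < 2:
        raise ValueError("need at least two blocks to split")
    total = sum(delays_ps)
    best_index, best_worst = 0, None
    running = 0
    for index, delay in enumerate(delays_ps[:-1], start=1):
        running += delay
        worst = max(running, total - running)
        if best_worst is None or worst < best_worst:
            best_index, best_worst = index, worst
    return best_index, best_worst


chain = [900, 1100, 800, 700, 500]  # hypothetical block delays, in ps

# Register currently after the fourth block: stages of 3500 ps and 500 ps.
current_worst = max(sum(chain[:4]), sum(chain[4:]))
index, retimed_worst = best_register_position(chain)
print(f"worst stage before: {current_worst} ps")
print(f"worst stage after moving register to position {index}: {retimed_worst} ps")
```

Here the worst stage drops from 3500 ps to 2000 ps with no added latency. If even the best split is too slow, that is the signal to add a stage rather than move one.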

Step 3: the structural moves

Check utilization before blaming the router. Above ~80% LUT or BRAM usage, congestion dominates; the BRAM estimator shows if a different memory shape frees blocks.
Floorplan the stable stuff. Pblocks that hold interface logic near its I/O pins stop the placer from smearing it across the die.
Question the clock itself. Does that block need 250 MHz, or does it just need that throughput, which a 2x-wide datapath delivers at 125 MHz (process two items per cycle)? Wider-and-slower is the FPGA's favorite trade.
Speed grade / part. Sometimes the honest answer is -2 instead of -1 (decode yours) — engineering hours cost more than the part-price delta.
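
The wider-and-slower trade is easy to tabulate before touching any RTL. The sketch below only does the throughput arithmetic: it assumes the work parallelizes cleanly across items and says nothing about the extra area or the wider logic's own delay. The helper name is hypothetical.

```python
def widen_options(required_mitems_per_s, widths=(1, 2, 4)):
    """List (items_per_cycle, clock_mhz, period_ns) for equal throughput.

    Assumes items are independent, so N items per cycle needs 1/N the clock.
    """
    options = []
    for width in widths:
        clock_mhz = required_mitems_per_s / width
        options.append((width, clock_mhz, 1000.0 / clock_mhz))
    return options


# Same 250 M items/s, three different datapath widths.
for width, clock_mhz, period in widen_options(250.0):
    print(f"{width} item(s)/cycle -> {clock_mhz:g} MHz, {period:g} ns per cycle")
```

Going from one to two items per cycle doubles the timing budget from 4 ns to 8 ns, which is often the difference between a fight and a non-event.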

Step 4: the last 50 picoseconds

Multiple place-and-route seeds, phys_opt_design iterations, tool-directive exploration. Legitimate, but if you're here every build, the design has structural debt from steps 2–3. Margin, not luck, is the goal: close with slack to spare (say, 5% of the clock period) and the next RTL change doesn't reopen the hunt.
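
A margin target is simple to turn into a pass/fail check on the worst negative slack (WNS) of each build. This sketch assumes the 5% is measured against the clock period, which is one reasonable reading rather than a standard definition; the function names are illustrative.

```python
def required_margin_ns(period_ns, margin_fraction=0.05):
    """Slack we want left over, as a fraction of the clock period."""
    return period_ns * margin_fraction


def closes_with_margin(wns_ns, period_ns, margin_fraction=0.05):
    """True only if worst slack is positive AND leaves the target margin."""
    return wns_ns >= required_margin_ns(period_ns, margin_fraction)


period = 4.0  # ns, i.e. a 250 MHz clock
for wns in (-0.42, 0.05, 0.25):
    verdict = "ok" if closes_with_margin(wns, period) else "keep working"
    print(f"WNS {wns:+.2f} ns at {period:g} ns period: {verdict}")
```

At 250 MHz this asks for about 0.2 ns of slack, so a build that squeaks through at +0.05 ns still counts as unfinished.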

The habit that prevents the fire drill

Run timing analysis from the first day the RTL elaborates, not the week before the demo. Negative slack on day 2 is a design conversation; negative slack on day 90 is an emergency. Pair it with report_cdc and the constraints cheatsheets, and Friday demos get a lot more relaxing.