This is the KV Cache Engine (KVCE) block of the LonghornSilicon LLM inference accelerator — block 2 of four targeting TSMC 16FFC tape-out. It is a streaming compress-on-write / decompress-on-read engine for transformer KV-cache tensors, sitting between the ACU (attention compute unit) and the memory hierarchy.
The block stays; the codec it implements was replaced and is now fully integrated, synthesizable, and signed off. TurboQuant+ (PolarQuant + QJL + Walsh–Hadamard rotation) was retired 2026-06-22: it reaches ~3.5× compression but with a −0.10 HellaSwag acc_norm collapse on GQA models (0.316 vs 0.420 FP16 on Qwen2-0.5B). Root cause: KV quant error on GQA is dominated by a few fixed high-magnitude key channels, and the rotation step delocalizes that error so no per-token protection catches it.
The successor codec is ChannelQuant — per-channel-key INT4 / per-token-value INT4 / static outlier-channel isolation (the KIVI/KVQuant recipe). The algorithm is prior art (KIVI ICML'24, KVQuant 2024); the contribution of this block is the streaming silicon implementation.
Status (master, 2026-07-03): DONE.
- RTL fully wired into the top (`kv_cache_engine.sv`): keys → grouped per-channel INT4 (cq_key_path), values → per-token INT4 (cq_value_path), outlier lane + unified per-channel SRAM record. All cores serialized (one shared scale / quant / dequant), no `real`, no latches, checker-clean.
- All CI gates green: functional, synthesis (FF-count), formal RTL≡netlist equivalence, reference-model parity, and OpenLane Sky130 sign-off.
- Verified end-to-end on Qwen2 (below): near-FP16 accuracy at ~4 bits/value.
| Resource | Location |
|---|---|
| Retired TurboQuant+ datapath (archived, full history) | branch `legacy/turboquant-plus` |
| Algorithm spec + reference model + golden vectors | `../channelquant/` (frozen contract v0.2) |
| Per-milestone lab notebook | `NOTES.md` |

| | |
|---|---|
| What | Streaming compress/decompress engine for transformer KV-cache tensors |
| Why | Cuts KV-cache DRAM bandwidth ~3.8× (near-lossless), enabling longer context in the same memory budget |
| How | ChannelQuant — per-channel INT4 keys (grouped, G=128) + per-token INT4 values + static top-k FP16 outlier-channel isolation (CQ-4+) |
| K/V asymmetry | K: per-channel scale over a token group (the GQA-critical axis); V: per-token scale |
| Tiers | CQ-8 (per-token INT8 K+V), CQ-4 (per-channel INT4 K / per-token INT4 V), CQ-4+ (CQ-4 with k=2 FP16 outlier channels) |
| Verified | RTL bit-exact vs golden (sim_kpath/sim_top), 3-way Python↔C++↔SV parity, all CI gates green incl. Sky130 sign-off |
| Accuracy | HellaSwag acc_norm within 0.4–1.6 pt of FP16 on Qwen2-0.5B/1.5B (see below) |
| Status | Tape-out target Q3/Q4 2026 via TSMC University Program 16FFC |
The GQA accuracy problem is that a few fixed key channels carry most of the quant error. ChannelQuant scales per channel on the key path (so those channels get their own scale) and isolates the worst top-k as FP16 outliers:
Key path — per-channel INT4 (cq_key_path)
- Buffer a group of G=128 key tokens (`residual_buffer`).
- Take the per-channel max over the group (`amax_unit`, key mode) and freeze D per-channel FP16 scales (scale_bank).
- Quantize each keep-channel to INT4; the top-k outlier channels (CQ-4+, k=2 from a static calibrated ROM mask) are held FP16 instead.
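The key-path steps above can be sketched in a few lines of Python. This is a minimal illustration, not the frozen reference model: it assumes symmetric INT4 codes in [-8, 7] with scale = amax / 7, and the names `compress_key_group`, `decompress_key_group`, and `to_fp16` are hypothetical. The real rounding and scale rules are defined by the ../channelquant contract.

```python
import struct

QMAX = 7  # assumed symmetric INT4 range: codes in [-8, 7], scale = amax / 7


def to_fp16(x):
    """Round a float to the nearest IEEE-754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def compress_key_group(group, outlier_mask=frozenset()):
    """group: G tokens x D channels. Returns (per-channel scales, payload rows)."""
    scales = {}
    for c in range(len(group[0])):
        if c in outlier_mask:
            continue  # outlier channels bypass scaling entirely
        amax = max(abs(token[c]) for token in group)  # per-channel max over the group
        scale = to_fp16(amax / QMAX)
        scales[c] = scale if scale > 0.0 else 1.0  # guard all-zero channels
    payload = []
    for token in group:
        row = []
        for c, x in enumerate(token):
            if c in outlier_mask:
                row.append(to_fp16(x))  # held as FP16, not quantized
            else:
                row.append(max(-8, min(QMAX, round(x / scales[c]))))
        payload.append(row)
    return scales, payload


def decompress_key_group(scales, payload, outlier_mask=frozenset()):
    return [
        [v if c in outlier_mask else v * scales[c] for c, v in enumerate(row)]
        for row in payload
    ]


# Channel 1 plays the role of a fixed high-magnitude key channel.
group = [
    [0.50, -12.0, 0.10, 0.30],
    [-0.25, 10.0, 0.20, -0.60],
    [0.75, -11.0, -0.05, 0.45],
]
mask = frozenset({1})
scales, payload = compress_key_group(group, mask)
restored = decompress_key_group(scales, payload, mask)
```

Because each keep channel owns its scale, the large channel does not inflate the quantization step of the small ones, and the masked channel round-trips through FP16 untouched.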
Value path — per-token INT4 (cq_value_path)
- Per-token amax → FP16 scale → INT4 (INT8 for the CQ-8 tier). No grouping.
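A comparable sketch for the value path, under the same assumptions (symmetric codes, scale = amax / qmax, hypothetical function names). Here the scale is taken across the channels of a single token, and the `bits` argument switches between the INT4 tiers and the INT8 CQ-8 tier.

```python
import struct


def to_fp16(x):
    """Round a float to the nearest IEEE-754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def compress_value_token(token, bits=4):
    qmax = (1 << (bits - 1)) - 1         # 7 for INT4, 127 for INT8
    amax = max(abs(x) for x in token)    # per-token max across channels
    scale = to_fp16(amax / qmax)
    if scale == 0.0:
        scale = 1.0                      # guard all-zero tokens
    codes = [max(-qmax - 1, min(qmax, round(x / scale))) for x in token]
    return scale, codes


def decompress_value_token(scale, codes):
    return [code * scale for code in codes]


def max_abs_error(token, bits):
    scale, codes = compress_value_token(token, bits)
    restored = decompress_value_token(scale, codes)
    return max(abs(a - b) for a, b in zip(token, restored))


token = [0.8, -0.2, 0.05, -1.4]
scale4, codes4 = compress_value_token(token, bits=4)
err_int4 = max_abs_error(token, bits=4)
err_int8 = max_abs_error(token, bits=8)
print(codes4, err_int4, err_int8)
```

No group buffer is needed on this path: each token carries its own scale, so it can be quantized as soon as it arrives.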
Unified per-channel SRAM record {tag, D×FP16 field, D×INT4 code}
- Keep channel → `{group scale, INT4 code}`; outlier channel → `{raw FP16, code +1}`, so decompress `code · field` widens the FP16 exactly, with no separate sidecar region and no read-side mask. Read-back reuses the same per-channel dequant, tag-muxed against the value dequant.
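The `code · field` trick can be illustrated with a toy record. The sketch below models one key token only; the `ChannelRecord` class, the `"K"` tag value, and the way scales are passed in are assumptions for illustration, not the SRAM layout. The point it shows is that one multiply per channel decodes keep and outlier channels alike.

```python
import struct
from dataclasses import dataclass
from typing import List


def to_fp16(x):
    """Round a float to the nearest IEEE-754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


@dataclass
class ChannelRecord:
    tag: str            # hypothetical tag encoding: "K" or "V"
    field: List[float]  # D FP16 fields: group scale (keep) or raw value (outlier)
    code: List[int]     # D INT4 codes; outlier channels carry +1


def pack_key_token(token, scales, outlier_mask):
    field, code = [], []
    for c, x in enumerate(token):
        if c in outlier_mask:
            field.append(to_fp16(x))  # raw FP16 goes in the field slot
            code.append(1)            # so code * field returns it unchanged
        else:
            field.append(scales[c])
            code.append(max(-8, min(7, round(x / scales[c]))))
    return ChannelRecord("K", field, code)


def unpack(record):
    # One multiply per channel, identical for keep and outlier channels.
    return [c * f for c, f in zip(record.code, record.field)]


scales = {0: 0.25, 2: 0.25}  # made-up group scales for the two keep channels
record = pack_key_token([0.6, -12.0, -1.1], scales, outlier_mask={1})
decoded = unpack(record)
print(record.code, decoded)
```

The decoder never consults the outlier mask; the mask is only needed when the record is written.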
Area/timing: each compute core (scale / quant / dequant) carries an fp16 divider, so instead of D parallel units the datapath serializes one shared unit across the D channels (a single divide cone is what stalled place-and-route). This is bit-exact with the behavioral oracle and place-and-routes at a real clock.
HellaSwag acc_norm, n=1000, ChannelQuant K̂/V̂ inserted into the model's KV path
(reproduced in this repo via the frozen ../channelquant reference):
| Model | FP16 | CQ-4 (Δ) | CQ-4+ (Δ) | bits/value (CQ-4 / CQ-4+) |
|---|---|---|---|---|
| Qwen2-0.5B (D=64) | 0.4260 | 0.4170 (−0.009) | 0.4220 (−0.004) | ~4.19 / 4.38 |
| Qwen2-1.5B (D=128) | 0.5210 | 0.5050 (−0.016) | 0.5130 (−0.008) | ~4.13 / 4.22 |
Both tiers clear the ≤0.02 acceptance gate at ~4 bits/value (≈3.8× KV compression); the CQ-4+ outlier lane earns its keep at D=128. Combined with the ACU precision controller (INT8/FP16-routed S·V) the system holds accuracy at FP16 (no measurable loss on Qwen2-0.5B).
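The bits/value figures can be reproduced with a simple storage count. The accounting below is an assumption inferred from the codec description, not taken from the contract: one FP16 scale per key channel per G-token group, one FP16 scale per value token, 12 extra bits for each outlier key channel (FP16 instead of INT4), tag bits ignored, and K and V weighted equally. The function name is hypothetical.

```python
def bits_per_value(head_dim, group=128, outlier_k=0, code_bits=4, fp_bits=16):
    """Average stored bits per KV element under the assumed accounting."""
    # Keys: INT4 code + scale amortized over the group + outlier upgrade cost.
    key_bits = (
        code_bits
        + fp_bits / group
        + outlier_k * (fp_bits - code_bits) / head_dim
    )
    # Values: INT4 code + one scale amortized over the token's channels.
    value_bits = code_bits + fp_bits / head_dim
    return (key_bits + value_bits) / 2


for name, dim in (("Qwen2-0.5B", 64), ("Qwen2-1.5B", 128)):
    cq4 = bits_per_value(dim)
    cq4_plus = bits_per_value(dim, outlier_k=2)
    print(name, cq4, cq4_plus, 16 / cq4)
```

Under these assumptions the model gives 4.1875 and 4.375 bits/value at D=64, and 4.125 and 4.21875 at D=128, which agrees with the table to its stated precision. The corresponding ratio against FP16 is 16 / 4.1875 ≈ 3.82 for CQ-4 at D=64.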
┌──────────────────────────────────────────────────────────────────────┐
│ LonghornSilicon LLM Inference Accelerator (16FFC) │
│ │
│ ┌──────────────────┐ │
│ │ ACU (block 1) │ Q·Kᵀ scores │
│ │ precision │──────────────────┐ │
│ │ controller │ ▼ │
│ │ INT8 vs FP16 │ ┌────────────────────┐ │
│ │ gate per tile │ │ Token Importance │ │
│ │ + INT8/FP16 MAC │ │ Unit (block 3) │ │
│ └────────┬─────────┘ └─────────┬──────────┘ │
│ │ K, V │ tier signals │
│ ▼ ▼ │
│ ┌─────────────────────────┐ │
│ │ KV Cache Engine │ ChannelQuant compress on writes, │
│ │ (this repo) │ decompress on reads: │
│ │ │ K → per-channel INT4 (+outlier FP16) │
│ │ │ V → per-token INT4 │
│ └─────────────┬───────────┘ │
│ ▼ │
│ ┌─────────────────────────┐ ┌──────────────────────┐ │
│ │ Memory Hierarchy Ctrl. │◀─▶│ Off-chip LPDDR5 │ │
│ │ (block 4) │ │ (cold KV + weights) │ │
│ └─────────────────────────┘ └──────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
| Block | This repo? | Role |
|---|---|---|
| ACU (Attention Compute Unit) | no (separate repo) | Decides INT8 vs FP16 per tile, runs the MAC array |
| KV Cache Engine | this repo | ChannelQuant compress on write, decompress on read |
| Token Importance Unit | not yet | Tracks attention weight per cached token → keep / demote / evict |
| Memory Hierarchy Controller | not yet | Routes between L1 SRAM / L2 eDRAM / off-chip LPDDR5 |
The two live blocks coordinate at attention time: KVCE decompresses K/V → the ACU computes Q·Kᵀ scores → the precision controller routes INT8/FP16 → the MAC array runs the matmul.
kv-cache-engine/
├── rtl/
│ ├── kv_cache_engine.sv # Top: AXI-Lite CSR + AXI-Stream, ChannelQuant FSM + SRAM
│ ├── cq_key_path.sv # Grouped per-channel INT4 key codec (serialized)
│ ├── cq_value_path.sv # Per-token INT4/INT8 value codec (serialized)
│ ├── cq_units_syn.sv # Synthesizable fp16 cores: scale / quant / dequant
│ ├── cq_units.sv, cq_fp_pkg.sv # Behavioral `real` oracle (for the parity TBs)
│ ├── amax_unit.sv # Per-token / per-channel max reduction
│ ├── residual_buffer.sv # G-token group hold (key path)
│ ├── scale_bank.sv # D per-channel scale bank (key path)
│ ├── sram_controller.sv # Behavioral SRAM (reg array)
│ ├── tb/ # sim, sim_realdata, sim_cq, sim_amax, sim_vpath,
│ │ # sim_kpath, sim_top, sim_syn (+ vendored golden vectors)
│ ├── constraints/, *.tcl, synth.ys, Makefile
│ └── KEYPATH_HANDOFF.md, TEARDOWN.md, NOTES pointers
├── openlane/kv_cache_engine/ # LibreLane / OpenROAD Sky130 flow (+ src/ symlinks)
├── sw/reference_model/ # channelquant_ref.{hpp,cpp} (ChannelQuant C++ ref) + tests
├── docs/ # ISA spec, reference-model API, sw overview, CI docs
├── NOTES.md # dated lab notebook (every parity/synth result)
└── .github/workflows/ci.yml # thin caller → shared block-ci reusable workflow
The retired TurboQuant+ modules (rotation_unit, qjl_unit, quantizer,
packer, decompressor, norm_unit) live on branch legacy/turboquant-plus.
RTL (this host, iverilog 12.0 / yosys):
- `make sim_top`: per-token INT4 V and grouped CQ-4+ keys bit-exact through the AXI FSM + SRAM (D=64, G=64, k=2).
- `make sim_kpath`: 6/6 bit-exact (serialized key path: scale + INT4 payload + K̂ + sidecar, full and partial groups).
- `make sim sim_realdata sim_vpath sim_amax sim_syn sim_cq`: all green.
- `yosys proc; check` on the top: 0 "conflicting with a constant", 0 latches, 0 CHECK problems, no `real`.
CI gates (all green):
| Gate | What it does | Status |
|---|---|---|
| 1. RTL functional verification | Directed + replay + parity iverilog TBs | ✅ |
| 3. RTL synthesis (Yosys) | Synth + FF-count assertion | ✅ |
| 4. Formal equivalence | RTL ≡ post-synth netlist (Yosys induction) | ✅ |
| 5. Reference model tests | C++ + Python bit-exact (3-way parity) | ✅ |
| 6. OpenLane Sky130 sign-off | Full Sky130 PnR + DRC/LVS | ✅ |
| 2 / 7 / 8 | coverage / paper / Cadence 16FFC | disabled |
The synthesis/formal/OpenLane gates run on a small flop-based proxy built with the
default params (the SRAM and residual buffer are behavioral flip-flop arrays, with no
Sky130 macro); the real head-dim / group / depth are set per-instantiation (every TB
overrides them). See the FF-count note in .github/workflows/ci.yml.
Toolchain: iverilog 12.0 + yosys (CPU-only). On a fresh host, see the
per-host EDA-env notes; sourcing the environment script (`. rtl/eda-env.sh`) puts both on PATH.
cd rtl
make sim_top # top-level ChannelQuant end-to-end (per-token V + grouped keys), bit-exact
make sim_kpath # grouped per-channel INT4 key path, 6/6 bit-exact
make sim_cq # golden-vector parity, all 9 vectors (behavioral oracle)
make sim sim_realdata sim_vpath sim_amax sim_syn # the rest of the board
# reference-model parity (C++ + Python):
cd ../sw/reference_model && make test-all
# synthesis / Sky130 sign-off:
cd ../../rtl && yosys -s synth.ys
cd ../openlane/kv_cache_engine && librelane --docker-no-tty --dockerized config.json

End-to-end accuracy on Qwen2 is reproduced from the frozen ../channelquant
reference (analysis/c23_headline.py, HellaSwag); the algorithm accuracy claims
live in that repo's contract.
| Offset | Name | Access | Description |
|---|---|---|---|
| `0x00` | `CTRL` | RW | bit[0]: soft_reset, bit[1]: enable |
| `0x04` | `STATUS` | R | bit[0]: idle, sram_full |
| `0x08` | `INFO_DIM` | R | head dim D |
| `0x0C` | `INFO_TIER` | R | 0=CQ-8, 1=CQ-4, 2=CQ-4+ |
| `0x10` | `INFO_GROUP` | R | key group size G (contract §3.1) |
| `0x14` | `INFO_SRAM_DEPTH` | R | SRAM entries |
| `0x18` | `INFO_CR_K` | R | key compression ratio (8.8 fixed-point) |
| `0x1C` | `INFO_CR_V` | R | value compression ratio (8.8 fixed-point) |
| `0x20` | `INFO_VERSION` | R | ISA version (0x00020000 = v0.2) |
| `0x24` | `OCCUPANCY` | R | valid SRAM entries |
| `0x28` | `WRITE_ADDR` | RW | target write / group-base address |
| `0x2C` | `READ_ADDR` | RW | target read address (write launches a decompress) |
| `0x30` | `KV_SELECT` | RW | 0=key, 1=value |
| `0x34` | `IRQ_MASK` | RW | interrupt enable mask |
| `0x38` | `IRQ_STATUS` | R/W1C | interrupt pending status |
| `0x3C` | `INFO_OUTLIER_K` | R | top-k FP16 outlier channels (CQ-4+) |
| `0x40` | `INFO_SCALE_DEPTH` | R | per-channel scale-bank depth (= D) |
| `0x44` | `INFO_RESID_DEPTH` | R | residual-buffer depth (= G) |
Full ISA specification: docs/isa/kv_cache_engine_isa.pdf.
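As a host-side illustration of the register map, the sketch below decodes a few of the read-only INFO registers. The helper names, the `read32` callback, and the register values are made up for the example. The version field split (major in bits [31:24], minor in bits [23:16]) is an assumption chosen to match `0x00020000 = v0.2`; the ISA specification is the authority on the actual layout.

```python
CSR_OFFSETS = {
    "CTRL": 0x00, "STATUS": 0x04, "INFO_DIM": 0x08, "INFO_TIER": 0x0C,
    "INFO_GROUP": 0x10, "INFO_SRAM_DEPTH": 0x14, "INFO_CR_K": 0x18,
    "INFO_CR_V": 0x1C, "INFO_VERSION": 0x20, "OCCUPANCY": 0x24,
    "WRITE_ADDR": 0x28, "READ_ADDR": 0x2C, "KV_SELECT": 0x30,
    "IRQ_MASK": 0x34, "IRQ_STATUS": 0x38, "INFO_OUTLIER_K": 0x3C,
    "INFO_SCALE_DEPTH": 0x40, "INFO_RESID_DEPTH": 0x44,
}
TIER_NAMES = {0: "CQ-8", 1: "CQ-4", 2: "CQ-4+"}


def decode_q8_8(raw):
    """8.8 fixed-point: 8 integer bits, 8 fractional bits."""
    return (raw & 0xFFFF) / 256.0


def decode_version(raw):
    """Assumed layout: major in bits [31:24], minor in bits [23:16]."""
    return (raw >> 24) & 0xFF, (raw >> 16) & 0xFF


def describe(read32):
    """read32(offset) -> int is a hypothetical AXI-Lite read callback."""
    major, minor = decode_version(read32(CSR_OFFSETS["INFO_VERSION"]))
    return {
        "version": f"v{major}.{minor}",
        "head_dim": read32(CSR_OFFSETS["INFO_DIM"]),
        "group": read32(CSR_OFFSETS["INFO_GROUP"]),
        "tier": TIER_NAMES[read32(CSR_OFFSETS["INFO_TIER"])],
        "cr_key": decode_q8_8(read32(CSR_OFFSETS["INFO_CR_K"])),
    }


# Made-up register contents standing in for real hardware reads.
fake_regs = {0x08: 64, 0x0C: 2, 0x10: 128, 0x18: 0x0380, 0x20: 0x00020000}
info = describe(lambda offset: fake_regs[offset])
print(info)
```

Passing the bus access in as a callback keeps the decode logic testable without hardware; on a real system `read32` would wrap whatever AXI-Lite access mechanism the host provides.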
- Codec pivot TurboQuant+ → ChannelQuant (algorithm de-risked in ../channelquant)
- Synthesizable fp16 compute cores (scale / quant / dequant), bit-exact vs oracle
- Per-token value path + grouped per-channel INT4 key path (serialized)
- Outlier-channel lane (CQ-4+) + static ROM mask
- Top-level integration (AXI-Lite CSR + AXI-Stream), unified per-channel SRAM record
- Directed / replay / parity / top-stream testbenches — all green, bit-exact
- 3-way Python↔C++↔SV reference parity
- Yosys synthesis + FF-count + formal RTL≡netlist equivalence (CI green)
- OpenLane Sky130 sign-off (CI green)
- End-to-end accuracy on Qwen2-0.5B / 1.5B (near-FP16 at ~4 bits)
- Partial-group flush (g<G) top stream-framing (datapath already supports it)
- TSMC 16FFC sign-off on Cadence (waiting on PDK access)
- ZCU102/104 FPGA prototype (Vivado, when board arrives)
- Integration with Token Importance Unit, Memory Hierarchy Controller
- Full-chip tape-out via TSMC University Program shuttle (target Q3/Q4 2026)
@misc{kv_cache_engine_2026,
title = {KV Cache Engine: A Streaming Silicon Implementation of ChannelQuant
(Per-Channel INT4) KV-Cache Compression},
author = {LonghornSilicon},
year = {2026},
url = {https://github.com/LonghornSilicon/kv-cache-engine}
}

The ChannelQuant codec follows the per-channel-key / per-token-value + outlier recipe of KIVI (Liu et al., ICML 2024) and KVQuant (Hooper et al., 2024); this block contributes the streaming silicon implementation. The open hardware flow uses Yosys, OpenROAD, LibreLane, and the SkyWater Sky130 PDK.