KV Cache Engine

This is the KV Cache Engine (KVCE) block of the LonghornSilicon LLM inference accelerator — block 2 of four targeting TSMC 16FFC tape-out. It is a streaming compress-on-write / decompress-on-read engine for transformer KV-cache tensors, sitting between the ACU (attention compute unit) and the memory hierarchy.

✅ ChannelQuant revamp COMPLETE — codec: TurboQuant+ → ChannelQuant

The block stays; the codec it implements was replaced and is now fully integrated, synthesizable, and signed off. TurboQuant+ (PolarQuant + QJL + Walsh–Hadamard rotation) was retired 2026-06-22: it reaches ~3.5× compression but with a −0.10 HellaSwag acc_norm collapse on GQA models (0.316 vs 0.420 FP16 on Qwen2-0.5B). Root cause: KV quant error on GQA is dominated by a few fixed high-magnitude key channels, and the rotation step delocalizes that error so no per-token protection catches it.

The successor codec is ChannelQuant — per-channel-key INT4 / per-token-value INT4 / static outlier-channel isolation (the KIVI/KVQuant recipe). The algorithm is prior art (KIVI ICML'24, KVQuant 2024); the contribution of this block is the streaming silicon implementation.

Status (master, 2026-07-03): DONE.

RTL fully wired into the top (kv_cache_engine.sv): keys → grouped per-channel INT4 (cq_key_path), values → per-token INT4 (cq_value_path), outlier lane + unified per-channel SRAM record. All cores are serialized (one shared scale / quant / dequant unit across the channels), with no `real` types, no latches, and a clean Yosys check.

All CI gates green — functional, synthesis (FF-count), formal RTL≡netlist equivalence, reference-model parity, and OpenLane Sky130 sign-off.

Verified end-to-end on Qwen2 (below): near-FP16 accuracy at ~4 bits/value.

Retired TurboQuant+ datapath (archived, full history)	branch `legacy/turboquant-plus`
Algorithm spec + reference model + golden vectors	`../channelquant/` (frozen contract v0.2)
Per-milestone lab notebook	`NOTES.md`

TL;DR


What	Streaming compress/decompress engine for transformer KV-cache tensors
Why	Cuts KV-cache DRAM bandwidth ~3.8× (near-lossless), enabling longer context in the same memory budget
How	ChannelQuant — per-channel INT4 keys (grouped, G=128) + per-token INT4 values + static top-k FP16 outlier-channel isolation (CQ-4+)
K/V asymmetry	K: per-channel scale over a token group (the GQA-critical axis); V: per-token scale
Tiers	CQ-8 (per-token INT8 K+V), CQ-4 (per-channel INT4 K / per-token INT4 V), CQ-4+ (CQ-4 with k=2 FP16 outlier channels)
Verified	RTL bit-exact vs golden (`sim_kpath`/`sim_top`), 3-way Python↔C++↔SV parity, all CI gates green incl. Sky130 sign-off
Accuracy	HellaSwag acc_norm within ~0.4–1.6 pt of FP16 on Qwen2-0.5B/1.5B (see below)
Status	Tape-out target Q3/Q4 2026 via TSMC University Program 16FFC

How ChannelQuant works

The GQA accuracy problem is that a few fixed key channels carry most of the quant error. ChannelQuant scales per channel on the key path (so those channels get their own scale) and isolates the worst top-k as FP16 outliers:

Key path — per-channel INT4 (cq_key_path)

Buffer a group of G=128 key tokens (residual_buffer).
Take the per-channel absolute max over the group (amax_unit, key mode) and freeze D per-channel FP16 scales (scale_bank).
Quantize each keep-channel to INT4; the top-k outlier channels (CQ-4+, k=2 from a static calibrated ROM mask) are held FP16 instead.
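The three key-path steps can be sketched in plain Python. This is a minimal illustration, not the RTL or the frozen reference model: the function names are hypothetical, Python floats stand in for FP16, and the symmetric mapping (scale = amax / 7, round to nearest, clamp to [−8, 7]) is an assumption; the contract in ../channelquant defines the actual rounding and clamping.

```python
def quantize_key_group(group, outlier_mask, qmax=7):
    """Quantize G key tokens (each a list of D floats) with one scale per channel."""
    num_channels = len(group[0])
    # Per-channel absolute max over the whole token group.
    amax = [max(abs(tok[c]) for tok in group) for c in range(num_channels)]
    # Freeze one scale per channel; an all-zero channel gets a dummy scale of 1.0.
    scales = [a / qmax if a > 0 else 1.0 for a in amax]
    codes = []
    for tok in group:
        row = []
        for c, x in enumerate(tok):
            if outlier_mask[c]:
                row.append(x)  # outlier channel: value kept as-is (FP16 in hardware)
            else:
                q = round(x / scales[c])
                row.append(max(-qmax - 1, min(qmax, q)))  # clamp to the INT4 range
        codes.append(row)
    return scales, codes


def dequantize_key_group(scales, codes, outlier_mask):
    """Rebuild keys: code * scale for keep-channels, raw value for outliers."""
    return [
        [v if outlier_mask[c] else v * scales[c] for c, v in enumerate(row)]
        for row in codes
    ]


# Toy group: G=4 tokens, D=3 channels; channel 1 is a high-magnitude outlier.
group = [[0.5, 40.0, -0.2], [-1.0, 38.0, 0.7], [0.3, -41.0, 1.4], [0.8, 39.5, -1.1]]
mask = [False, True, False]  # stands in for the static calibrated ROM mask
scales, codes = quantize_key_group(group, mask)
restored = dequantize_key_group(scales, codes, mask)
```

Because each keep-channel has its own scale, the small-magnitude channels 0 and 2 are quantized on their own range rather than on the range set by channel 1, and the outlier channel passes through unchanged.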

Value path — per-token INT4 (cq_value_path)

Per-token amax → FP16 scale → INT4 (INT8 for the CQ-8 tier). No grouping.
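A matching sketch for the value path, under the same assumptions as any software model of this codec: hypothetical function names, Python floats in place of FP16, and an assumed symmetric mapping with scale = amax / qmax. The only difference from the key path is that the scale is computed over one token's D channels, so nothing has to be buffered.

```python
def quantize_value_token(token, bits=4):
    """Quantize one value token (list of D floats) with a single per-token scale."""
    qmax = (1 << (bits - 1)) - 1  # 7 for INT4, 127 for the CQ-8 tier
    amax = max(abs(x) for x in token)
    scale = amax / qmax if amax > 0 else 1.0  # dummy scale for an all-zero token
    codes = [max(-qmax - 1, min(qmax, round(x / scale))) for x in token]
    return scale, codes


def dequantize_value_token(scale, codes):
    """Rebuild the token as code * scale."""
    return [c * scale for c in codes]


token = [0.6, -1.4, 0.2, 0.0]
scale4, codes4 = quantize_value_token(token, bits=4)  # CQ-4 / CQ-4+ tiers
scale8, codes8 = quantize_value_token(token, bits=8)  # CQ-8 tier
```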

Unified per-channel SRAM record {tag, D×FP16 field, D×INT4 code}

Keep channel → {group scale, INT4 code}; outlier channel → {raw FP16, code +1}, so the decompress product code · field returns the raw FP16 value exactly, with no separate sidecar region and no read-side mask. Read-back reuses the same per-channel dequant, tag-muxed against the value dequant.
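The record trick is easiest to see in a few lines of code. The sketch below models one channel slot of the record; `ChannelRecord`, `encode_channel`, and `decode_channel` are hypothetical names, the tag is omitted, and the rounding rule is an assumption. The point it shows is that decode is the same multiply for keep and outlier channels.

```python
from dataclasses import dataclass


@dataclass
class ChannelRecord:
    field: float  # FP16 slot: group scale (keep) or raw value (outlier)
    code: int     # INT4 slot: quantized code (keep) or +1 (outlier)


def encode_channel(x, scale, is_outlier, qmax=7):
    """Pack one channel of one key token into the unified record layout."""
    if is_outlier:
        return ChannelRecord(field=x, code=1)
    code = max(-qmax - 1, min(qmax, round(x / scale)))
    return ChannelRecord(field=scale, code=code)


def decode_channel(rec):
    """One multiply for both cases; the read side never consults the outlier mask."""
    return rec.code * rec.field


keep = encode_channel(1.5, scale=0.5, is_outlier=False)
outlier = encode_channel(-40.25, scale=0.5, is_outlier=True)
```

Multiplying by +1 is exact, so an outlier read back through the shared dequant is identical to the stored value.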

Area/timing: each compute core (scale / quant / dequant) carries an FP16 divider, so instead of D parallel units the datapath serializes one shared unit across the D channels (a single divide cone is what stalled place-and-route). The serialized datapath is bit-exact with the behavioral oracle and completes place-and-route at a realistic clock.

Accuracy — verified end-to-end on Qwen2

HellaSwag acc_norm, n=1000, ChannelQuant K̂/V̂ inserted into the model's KV path (reproduced in this repo via the frozen ../channelquant reference):

Model	FP16	CQ-4 (Δ)	CQ-4+ (Δ)	bits/value (CQ-4 / CQ-4+)
Qwen2-0.5B (D=64)	0.4260	0.4170 (−0.009)	0.4220 (−0.004)	~4.19 / 4.38
Qwen2-1.5B (D=128)	0.5210	0.5050 (−0.016)	0.5130 (−0.008)	~4.13 / 4.22

Both tiers clear the ≤0.02 acceptance gate at ~4 bits/value (≈3.8× KV compression). The CQ-4+ outlier lane roughly halves the gap to FP16 on both models and matters most at D=128, where it brings the CQ-4 delta of −0.016 down to −0.008. Combined with the ACU precision controller (INT8/FP16-routed S·V), the system holds accuracy at FP16 (no measurable loss on Qwen2-0.5B).
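The bits/value column can be reproduced with a short calculation. The accounting below is an assumption rather than something quoted from the contract: keys pay one 16-bit scale per channel per group of G=128 tokens, values pay one 16-bit scale per token, each outlier channel costs 16 bits instead of 4, and K and V are averaged with equal weight. The function name is hypothetical. Under those assumptions the results agree with the table to rounding.

```python
def bits_per_value(head_dim, group=128, outliers=0, code_bits=4, fp_bits=16):
    """Average storage cost per cached element, K and V weighted equally."""
    # Keys: one code per element, one FP16 scale per channel per token group,
    # plus the extra width of each outlier channel held as FP16.
    key_bits = (code_bits
                + fp_bits / group
                + outliers * (fp_bits - code_bits) / head_dim)
    # Values: one code per element plus one FP16 scale per token.
    value_bits = code_bits + fp_bits / head_dim
    return (key_bits + value_bits) / 2


cq4_d64 = bits_per_value(64)                    # 4.1875  (table: ~4.19)
cq4p_d64 = bits_per_value(64, outliers=2)       # 4.375   (table: ~4.38)
cq4_d128 = bits_per_value(128)                  # 4.125   (table: ~4.13)
cq4p_d128 = bits_per_value(128, outliers=2)     # 4.21875 (table: ~4.22)
ratio = 16 / cq4_d64                            # about 3.82x versus FP16
```

The same arithmetic explains why the outlier lane is cheaper at D=128: two FP16 channels are amortized over twice as many elements per token.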

How this fits in LonghornSilicon

┌──────────────────────────────────────────────────────────────────────┐
│              LonghornSilicon LLM Inference Accelerator (16FFC)       │
│                                                                      │
│   ┌──────────────────┐                                               │
│   │  ACU (block 1)   │  Q·Kᵀ scores                                  │
│   │  precision       │──────────────────┐                            │
│   │  controller      │                   ▼                           │
│   │  INT8 vs FP16    │          ┌────────────────────┐               │
│   │  gate per tile   │          │ Token Importance    │               │
│   │  + INT8/FP16 MAC │          │ Unit (block 3)      │               │
│   └────────┬─────────┘          └─────────┬──────────┘               │
│            │  K, V                        │ tier signals              │
│            ▼                              ▼                           │
│   ┌─────────────────────────┐                                        │
│   │  KV Cache Engine        │  ChannelQuant compress on writes,      │
│   │  (this repo)            │  decompress on reads:                  │
│   │                         │  K → per-channel INT4 (+outlier FP16)  │
│   │                         │  V → per-token INT4                     │
│   └─────────────┬───────────┘                                        │
│                 ▼                                                     │
│   ┌─────────────────────────┐   ┌──────────────────────┐             │
│   │ Memory Hierarchy Ctrl.  │◀─▶│ Off-chip LPDDR5       │             │
│   │ (block 4)               │   │ (cold KV + weights)   │             │
│   └─────────────────────────┘   └──────────────────────┘             │
└──────────────────────────────────────────────────────────────────────┘

Block	This repo?	Role
ACU (Attention Compute Unit)	no (separate repo)	Decides INT8 vs FP16 per tile, runs the MAC array
KV Cache Engine	this repo	ChannelQuant compress on write, decompress on read
Token Importance Unit	not yet	Tracks attention weight per cached token → keep / demote / evict
Memory Hierarchy Controller	not yet	Routes between L1 SRAM / L2 eDRAM / off-chip LPDDR5

The two live blocks coordinate at attention time: KVCE decompresses K/V → the ACU computes Q·Kᵀ scores → the precision controller routes INT8/FP16 → the MAC array runs the matmul.

What's in this repo

kv-cache-engine/
├── rtl/
│   ├── kv_cache_engine.sv        # Top: AXI-Lite CSR + AXI-Stream, ChannelQuant FSM + SRAM
│   ├── cq_key_path.sv            # Grouped per-channel INT4 key codec (serialized)
│   ├── cq_value_path.sv          # Per-token INT4/INT8 value codec (serialized)
│   ├── cq_units_syn.sv           # Synthesizable fp16 cores: scale / quant / dequant
│   ├── cq_units.sv, cq_fp_pkg.sv # Behavioral `real` oracle (for the parity TBs)
│   ├── amax_unit.sv              # Per-token / per-channel max reduction
│   ├── residual_buffer.sv        # G-token group hold (key path)
│   ├── scale_bank.sv             # D per-channel scale bank (key path)
│   ├── sram_controller.sv        # Behavioral SRAM (reg array)
│   ├── tb/                       # sim, sim_realdata, sim_cq, sim_amax, sim_vpath,
│   │                             #   sim_kpath, sim_top, sim_syn  (+ vendored golden vectors)
│   ├── constraints/, *.tcl, synth.ys, Makefile
│   └── KEYPATH_HANDOFF.md, TEARDOWN.md, NOTES pointers
├── openlane/kv_cache_engine/     # LibreLane / OpenROAD Sky130 flow (+ src/ symlinks)
├── sw/reference_model/           # channelquant_ref.{hpp,cpp} (ChannelQuant C++ ref) + tests
├── docs/                         # ISA spec, reference-model API, sw overview, CI docs
├── NOTES.md                      # dated lab notebook (every parity/synth result)
└── .github/workflows/ci.yml      # thin caller → shared block-ci reusable workflow

The retired TurboQuant+ modules (rotation_unit, qjl_unit, quantizer, packer, decompressor, norm_unit) live on branch legacy/turboquant-plus.

Verification & results

RTL (this host, iverilog 12.0 / yosys):

make sim_top — per-token INT4 V and grouped CQ-4+ keys bit-exact through the AXI FSM + SRAM (D=64, G=64, k=2).
make sim_kpath — 6/6 bit-exact (serialized key path: scale + INT4 payload + K̂ + sidecar, full and partial groups).
make sim sim_realdata sim_vpath sim_amax sim_syn sim_cq — all green.
yosys proc; check on the top — 0 "conflicting with a constant", 0 latches, 0 CHECK problems, no real.

CI gates (all green):

Gate	What it does	Status
1. RTL functional verification	Directed + replay + parity iverilog TBs	✅
3. RTL synthesis (Yosys)	Synth + FF-count assertion	✅
4. Formal equivalence	RTL ≡ post-synth netlist (Yosys induction)	✅
5. Reference model tests	C++ + Python bit-exact (3-way parity)	✅
6. OpenLane Sky130 sign-off	Full Sky130 PnR + DRC/LVS	✅
2 / 7 / 8	coverage / paper / Cadence 16FFC	disabled

The synthesis, formal, and OpenLane gates run on a small flop-based proxy built from the default parameters (the SRAM and residual buffer are behavioral flip-flop arrays, with no Sky130 macro). The real head dimension, group size, and depth are set per instantiation, and every TB overrides them. See the FF-count note in .github/workflows/ci.yml.

Reproduce

Toolchain: iverilog 12.0 + yosys (CPU-only). On a fresh host, see the per-host EDA-env notes; sourcing the environment script (`. rtl/eda-env.sh`) puts both tools on PATH.

cd rtl
make sim_top      # top-level ChannelQuant end-to-end (per-token V + grouped keys), bit-exact
make sim_kpath    # grouped per-channel INT4 key path, 6/6 bit-exact
make sim_cq       # golden-vector parity, all 9 vectors (behavioral oracle)
make sim sim_realdata sim_vpath sim_amax sim_syn   # the rest of the board

# reference-model parity (C++ + Python):
cd ../sw/reference_model && make test-all

# synthesis / Sky130 sign-off:
cd ../../rtl && yosys -s synth.ys
cd ../openlane/kv_cache_engine && librelane --docker-no-tty --dockerized config.json

End-to-end accuracy on Qwen2 is reproduced from the frozen ../channelquant reference (analysis/c23_headline.py, HellaSwag); the algorithm accuracy claims live in that repo's contract.

Register map (AXI-Lite, ISA v0.2)

Offset	Name	Access	Description
`0x00`	`CTRL`	RW	bit[0]: soft_reset, bit[1]: enable
`0x04`	`STATUS`	R	bit[0]: idle, sram_full
`0x08`	`INFO_DIM`	R	head dim D
`0x0C`	`INFO_TIER`	R	0=CQ-8, 1=CQ-4, 2=CQ-4+
`0x10`	`INFO_GROUP`	R	key group size G (contract §3.1)
`0x14`	`INFO_SRAM_DEPTH`	R	SRAM entries
`0x18`	`INFO_CR_K`	R	key compression ratio (8.8 fixed-point)
`0x1C`	`INFO_CR_V`	R	value compression ratio (8.8 fixed-point)
`0x20`	`INFO_VERSION`	R	ISA version (`0x00020000` = v0.2)
`0x24`	`OCCUPANCY`	R	valid SRAM entries
`0x28`	`WRITE_ADDR`	RW	target write / group-base address
`0x2C`	`READ_ADDR`	RW	target read address (write launches a decompress)
`0x30`	`KV_SELECT`	RW	0=key, 1=value
`0x34`	`IRQ_MASK`	RW	interrupt enable mask
`0x38`	`IRQ_STATUS`	R/W1C	interrupt pending status
`0x3C`	`INFO_OUTLIER_K`	R	top-k FP16 outlier channels (CQ-4+)
`0x40`	`INFO_SCALE_DEPTH`	R	per-channel scale-bank depth (= D)
`0x44`	`INFO_RESID_DEPTH`	R	residual-buffer depth (= G)

Full ISA specification: docs/isa/kv_cache_engine_isa.pdf.
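For driver or testbench code, the register map can be captured as a small table plus a few decode helpers. This is a sketch only: the helper names are hypothetical, the raw words are assumed to come from whatever AXI-Lite read routine the host provides, and the 8.8 fixed-point fields are assumed to be unsigned and held in the low 16 bits. The ISA PDF is the authoritative source for field layouts.

```python
REG_OFFSETS = {
    "CTRL": 0x00, "STATUS": 0x04, "INFO_DIM": 0x08, "INFO_TIER": 0x0C,
    "INFO_GROUP": 0x10, "INFO_SRAM_DEPTH": 0x14, "INFO_CR_K": 0x18,
    "INFO_CR_V": 0x1C, "INFO_VERSION": 0x20, "OCCUPANCY": 0x24,
    "WRITE_ADDR": 0x28, "READ_ADDR": 0x2C, "KV_SELECT": 0x30,
    "IRQ_MASK": 0x34, "IRQ_STATUS": 0x38, "INFO_OUTLIER_K": 0x3C,
    "INFO_SCALE_DEPTH": 0x40, "INFO_RESID_DEPTH": 0x44,
}
TIER_NAMES = {0: "CQ-8", 1: "CQ-4", 2: "CQ-4+"}
ISA_V0_2 = 0x00020000  # INFO_VERSION value listed for v0.2


def ctrl_word(enable, soft_reset=False):
    """Build a CTRL write: bit[0] = soft_reset, bit[1] = enable."""
    return (int(enable) << 1) | int(soft_reset)


def decode_ratio(raw):
    """Convert an unsigned 8.8 fixed-point word (assumed low 16 bits) to a float."""
    return (raw & 0xFFFF) / 256.0


def describe(raw_regs):
    """Summarize a dict of {register name: raw 32-bit word} read from the block."""
    return {
        "head_dim": raw_regs["INFO_DIM"],
        "tier": TIER_NAMES[raw_regs["INFO_TIER"]],
        "group": raw_regs["INFO_GROUP"],
        "cr_k": decode_ratio(raw_regs["INFO_CR_K"]),
        "cr_v": decode_ratio(raw_regs["INFO_CR_V"]),
        "is_v0_2": raw_regs["INFO_VERSION"] == ISA_V0_2,
    }


# Made-up register contents, for illustration only.
example = describe({"INFO_DIM": 64, "INFO_TIER": 2, "INFO_GROUP": 128,
                    "INFO_CR_K": 0x0380, "INFO_CR_V": 0x03C0,
                    "INFO_VERSION": 0x00020000})
```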

Status & roadmap

Citation

@misc{kv_cache_engine_2026,
  title  = {KV Cache Engine: A Streaming Silicon Implementation of ChannelQuant
            (Per-Channel INT4) KV-Cache Compression},
  author = {LonghornSilicon},
  year   = {2026},
  url    = {https://github.com/LonghornSilicon/kv-cache-engine}
}

Acknowledgments

The ChannelQuant codec follows the per-channel-key / per-token-value + outlier recipe of KIVI (Liu et al., ICML 2024) and KVQuant (Hooper et al., 2024); this block contributes the streaming silicon implementation. The open hardware flow uses Yosys, OpenROAD, LibreLane, and the SkyWater Sky130 PDK.
