
Record: Phrase Cache + N-gram Backoff + EMA-GPU (val_bpb=0.2722)#810

Open
Idan3011 wants to merge 5 commits into openai:main from Idan3011:submission

Conversation


@Idan3011 Idan3011 commented Mar 26, 2026

Phrase Cache + N-gram Backoff + EMA-GPU + Pre-Enrichment + XSA

val_bpb: 0.2722 (phrase cache + multi-order n-gram backoff 2-11, per-order adaptive alpha + PE confidence) | 1.1478 (sliding window) | 14.94 MB artifact | 8xH100 SXM, 600s


Progress

| Version | val_bpb | Eval method |
|---|---|---|
| v1 | 1.1855 | sliding |
| v2 | 1.1709 | sliding |
| v3 | 1.1668 | sliding |
| v4 | 1.1629 | sliding |
| v5 | 1.0689 | 5-gram |
| v6 | 0.9784 | 2-7 backoff |
| v7 | 0.9408 | 2-11 backoff |
| v8 | 0.9393 | +PE conf |
| v9 | 0.2995 | shared cache |
| v10 (this) | 0.2722 | +phrase cache |

Key Contributions

Long Phrase Cache (eval-only, -0.027 bpb)

Variable-length suffix matching at lengths [48, 36, 28, 20, 16] catches verbatim repetition (boilerplate,
menus, legal text) that fixed-order n-grams miss. Cascaded ON TOP of n-gram mixing.

  • Multiplicative rolling hash per suffix length, precomputed on GPU (int32)
  • Two tables per length: context counts + pair counts (4M buckets each, GPU)
  • Longest-match-first: try length 48, fall back to 36, 28, 20, 16
  • Entropy-adaptive alpha: longer matches get higher weight, high model entropy increases trust
  • Score-first: tables updated AFTER scoring each chunk
  • ~5s hash precomputation + ~5s eval overhead = negligible

Improvement: 0.2995 → 0.2722 = -0.0273 bpb
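The hashing and longest-match-first logic above can be sketched as follows. This is an illustrative reading of the bullet points, not the PR's actual code: `NUM_BUCKETS` matches the stated 4M buckets, but `MULT` and the dict-based tables are stand-ins for the GPU int32 tensors.

```python
NUM_BUCKETS = 4_000_000   # 4M buckets per table, per the description
MULT = 1_000_003          # illustrative multiplier for the rolling hash
LENGTHS = [48, 36, 28, 20, 16]

def suffix_hash(tokens, length):
    """Multiplicative rolling hash over the last `length` tokens."""
    h = 0
    for t in tokens[-length:]:
        h = (h * MULT + t) % NUM_BUCKETS
    return h

class PhraseCache:
    def __init__(self):
        # Two tables per length: context counts and (context, next-token) pair counts.
        self.ctx = {L: {} for L in LENGTHS}
        self.pair = {L: {} for L in LENGTHS}

    def predict(self, tokens, next_token):
        """Longest-match-first: try length 48, fall back to 36, 28, 20, 16."""
        for L in LENGTHS:
            if len(tokens) < L:
                continue
            h = suffix_hash(tokens, L)
            c = self.ctx[L].get(h, 0)
            if c > 0:
                # Empirical P(next | suffix) plus the matched length,
                # which the alpha schedule can weight.
                return self.pair[L].get((h, next_token), 0) / c, L
        return None, None

    def update(self, tokens, next_token):
        """Score-first discipline: call this only AFTER scoring the chunk."""
        for L in LENGTHS:
            if len(tokens) < L:
                continue
            h = suffix_hash(tokens, L)
            self.ctx[L][h] = self.ctx[L].get(h, 0) + 1
            key = (h, next_token)
            self.pair[L][key] = self.pair[L].get(key, 0) + 1
```

Returning the matched length lets the caller implement the entropy-adaptive alpha (longer match → higher weight) outside the cache.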

Two-Phase Shared N-gram Cache (-0.64 bpb)

Phase 1 (parallel): each GPU scores its share of sliding windows.
Phase 2 (global): all scored data gathered, sorted by position, single global n-gram cache built sequentially.

  • Multi-order backoff: orders 11→10→...→2, first hit with count≥2 wins
  • Per-order entropy centers: high orders trusted at lower entropy
  • Per-order weights: orders 5-11 boosted, 2-3 suppressed
  • Pre-enrichment confidence modulation: PE delta modulates alpha
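The backoff rule above (orders 11 down to 2, first hit with count ≥ 2 wins) can be sketched as below. The tuple-keyed dicts are an assumption for readability; the per-order weights and entropy centers are left to the caller.

```python
ORDERS = range(11, 1, -1)  # 11, 10, ..., 2

class NgramBackoff:
    def __init__(self):
        self.ctx = {n: {} for n in ORDERS}    # context -> count
        self.pair = {n: {} for n in ORDERS}   # (context, next) -> count

    def update(self, tokens, next_token):
        """Score-first: update counts only after the chunk is scored."""
        for n in ORDERS:
            if len(tokens) < n:
                continue
            ctx = tuple(tokens[-n:])
            self.ctx[n][ctx] = self.ctx[n].get(ctx, 0) + 1
            key = (ctx, next_token)
            self.pair[n][key] = self.pair[n].get(key, 0) + 1

    def predict(self, tokens, next_token):
        """Back off from order 11 to 2; first context seen >= 2 times wins."""
        for n in ORDERS:
            if len(tokens) < n:
                continue
            ctx = tuple(tokens[-n:])
            c = self.ctx[n].get(ctx, 0)
            if c >= 2:
                # Return the matched order so per-order alpha/weights can apply.
                return self.pair[n].get((ctx, next_token), 0) / c, n
        return None, None
```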

EMA on GPU (37% faster training)

Step time: 64.7ms (vs 101ms before). 9,268 steps in 600s.
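The EMA update itself is just a per-parameter lerp; the speedup comes from keeping the shadow copy on the device so no host-device transfers happen each step (in PyTorch, roughly `ema.lerp_(param, 1 - decay)` on CUDA tensors). The pure-Python sketch below shows only the arithmetic, with decay=0.997 taken from the description:

```python
DECAY = 0.997  # per the "EMA (decay=0.997)" bullet

def ema_update(shadow, params, decay=DECAY):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params."""
    return [decay * s + (1 - decay) * p for s, p in zip(shadow, params)]
```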

GELU Pre-Enrichment (512→768→512)

Wider nonlinear transformation before transformer blocks.
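Dimension-wise, this is a Linear(512→768), GELU, Linear(768→512) stack applied before the transformer blocks (that reading is an assumption from the "512→768→512" shape). A dependency-free sketch with exact GELU:

```python
import math

def gelu(x):
    """Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def pre_enrich(x, w1, b1, w2, b2):
    """Widen x (d_model) to d_hidden, apply GELU, project back to d_model.
    In the PR: d_model=512, d_hidden=768."""
    h = [gelu(sum(xi * w1[i][j] for i, xi in enumerate(x)) + b1[j])
         for j in range(len(b1))]
    return [sum(hj * w2[j][k] for j, hj in enumerate(h)) + b2[k]
            for k in range(len(b2))]
```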

XSA (Exclusive Self Attention) on Last 4 Layers

Removes self-value bias via orthogonal projection (arXiv:2603.09078; GQA-aware variant from PR #265 by @unnir).
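One way to read "removes self-value bias via orthogonal projection" is: subtract from each position's attention output its component along that position's own value vector. This is my interpretation of the one-line description, not the cited paper's exact formulation:

```python
def remove_self_value(out, v):
    """Project `out` onto the orthogonal complement of this position's own
    value vector `v` (plain lists for illustration)."""
    dot = sum(o * vi for o, vi in zip(out, v))
    norm2 = sum(vi * vi for vi in v) or 1.0  # guard against a zero vector
    coef = dot / norm2
    return [o - coef * vi for o, vi in zip(out, v)]
```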


Additional Techniques

  • SmearGate: Per-dim gate blending each token with previous token.
  • BigramHash (2048x128): Hash-table embedding for token bigrams.
  • EMA (decay=0.997): Quant gap 0.004.
  • Int6 QAT + lzma: 14.94 MB artifact.
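Of these, SmearGate is the easiest to sketch: a learned per-dimension sigmoid gate blends each token's embedding with the previous token's. The function below is illustrative; the gate parameterization is an assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def smear_gate(x_prev, x_cur, gate_logits):
    """Per-dim blend: g * current + (1 - g) * previous, g = sigmoid(logit)."""
    g = [sigmoid(z) for z in gate_logits]
    return [gi * c + (1.0 - gi) * p for gi, c, p in zip(g, x_cur, x_prev)]
```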

Architecture: 10L, 512d, 8H/4KV GQA, MLP 3x, tied embeddings, U-Net skip connections. Training: Muon+AdamW,
WD=0.04, matrix_lr=0.025, warmdown=3500, batch=524K, seq=2048.


What Didn't Work

  • Hedge mixer: online learned weights worse than hand-tuned alpha (0.3265 vs 0.2722).
  • Learned mixer head (Linear 512→11): gate didn't generalize from training to eval data (0.3310).
  • TTT (AdamW, score-first): destroyed quantized weights (0.3528).
  • 11L + int5 MLP: quant gap 0.021 wiped out 11L advantage (0.3108).
  • Log-odds mixing: near-zero n-gram probs create catastrophic logits.
  • SSE post-correction: always pushes predictions toward 1.0.
  • Orders 12-13: no improvement over 2-11.

Compliance

  • Score-first: n-gram + phrase caches updated AFTER scoring each chunk
  • Backward-looking: caches at position p contain only tokens 0..p-1
  • No oracle selection: alpha depends on model entropy and n-gram order, never on ground truth
  • No training data access during eval
  • No two-pass rescoring

Reproduction

All defaults baked in. No env vars needed.

  python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
  torchrun --standalone --nproc_per_node=8 train_gpt.py

8xH100 SXM, 600s training + ~211s eval.


Key Metrics

| Metric | Value |
|---|---|
| val_bpb (phrase + n-gram) | 0.2722 |
| Sliding-window val_bpb | 1.1478 |
| Post-quant val_bpb (standard) | 1.1690 |
| Pre-quant val_bpb | 1.1646 |
| Quant gap | 0.004 |
| Training time | 600,031 ms (9,268 steps at 64.7 ms) |
| Eval time | 211,362 ms |
| Peak memory | 13,058 MiB |
| Artifact size | 14,942,971 bytes (14.94 MB) |
| Model parameters | 25,254,992 |

Update Log

  • v1 (1.1855): int8+zlib, MLP 2x, seq 1024
  • v2 (1.1709): int6 QAT + lzma, MLP 3x, SWA, seq 2048
  • v3 (1.1668): + SmearGate + BigramHash + EMA + wider pre-enrichment
  • v4 (1.1629): + XSA on last 4 layers
  • v5 (1.0689): + EMA on GPU (64ms/step) + 5-gram eval cache
  • v6 (0.9784): + multi-order backoff 2-7 + entropy-adaptive alpha
  • v7 (0.9408): + extended to orders 2-11 + steeper alpha
  • v8 (0.9393): + pre-enrichment confidence modulation
  • v9 (0.2995): + two-phase shared cache + per-order adaptive alpha (3-seed: 0.2995)
  • v10 (0.2722): + long phrase cache (lengths 48, 36, 28, 20, 16)
