
Record: Phrase Cache + N-gram Backoff + EMA-GPU (val_bpb=0.2722)#810

Open
Idan3011 wants to merge 5 commits into openai:main from Idan3011:submission

Conversation


@Idan3011 Idan3011 commented Mar 26, 2026

Phrase Cache + N-gram Backoff + EMA-GPU + Pre-Enrichment + XSA

val_bpb: 0.2722 (phrase cache + multi-order n-gram backoff 2-11, per-order adaptive alpha + PE confidence) | 1.1478 (sliding window) | 14.94 MB artifact | 8xH100 SXM, 600s


Progress

| Version | val_bpb | Eval method |
|---|---|---|
| v1 | 1.1855 | sliding |
| v2 | 1.1709 | sliding |
| v3 | 1.1668 | sliding |
| v4 | 1.1629 | sliding |
| v5 | 1.0689 | 5-gram |
| v6 | 0.9784 | 2-7 backoff |
| v7 | 0.9408 | 2-11 backoff |
| v8 | 0.9393 | +PE conf |
| v9 | 0.2995 | shared cache |
| v10 (this) | 0.2722 | +phrase cache |

Key Contributions

Long Phrase Cache (eval-only, -0.027 bpb)

Variable-length suffix matching at lengths [48, 36, 28, 20, 16] catches verbatim repetition (boilerplate,
menus, legal text) that fixed-order n-grams miss. Cascaded ON TOP of n-gram mixing.

  • Multiplicative rolling hash per suffix length, precomputed on GPU (int32)
  • Two tables per length: context counts + pair counts (4M buckets each, GPU)
  • Longest-match-first: try length 48, fall back to 36, 28, 20, 16
  • Entropy-adaptive alpha: longer matches get higher weight, high model entropy increases trust
  • Score-first: tables updated AFTER scoring each chunk
  • ~5s hash precomputation + ~5s eval overhead = negligible

Improvement: 0.2995 → 0.2722 = -0.0273 bpb
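The hashing and longest-match-first logic above can be sketched as follows. This is an illustrative reading of the bullet points, not the PR's actual code: `NUM_BUCKETS` matches the stated 4M buckets, but `MULT` and the dict-based tables are stand-ins for the GPU int32 tensors.

```python
NUM_BUCKETS = 4_000_000   # 4M buckets per table, per the description
MULT = 1_000_003          # illustrative multiplier for the rolling hash
LENGTHS = [48, 36, 28, 20, 16]

def suffix_hash(tokens, length):
    """Multiplicative rolling hash over the last `length` tokens."""
    h = 0
    for t in tokens[-length:]:
        h = (h * MULT + t) % NUM_BUCKETS
    return h

class PhraseCache:
    def __init__(self):
        # Two tables per length: context counts and (context, next-token) pair counts.
        self.ctx = {L: {} for L in LENGTHS}
        self.pair = {L: {} for L in LENGTHS}

    def predict(self, tokens, next_token):
        """Longest-match-first: try length 48, fall back to 36, 28, 20, 16."""
        for L in LENGTHS:
            if len(tokens) < L:
                continue
            h = suffix_hash(tokens, L)
            c = self.ctx[L].get(h, 0)
            if c > 0:
                # Empirical P(next | suffix) plus the matched length,
                # which the alpha schedule can weight.
                return self.pair[L].get((h, next_token), 0) / c, L
        return None, None

    def update(self, tokens, next_token):
        """Score-first discipline: call this only AFTER scoring the chunk."""
        for L in LENGTHS:
            if len(tokens) < L:
                continue
            h = suffix_hash(tokens, L)
            self.ctx[L][h] = self.ctx[L].get(h, 0) + 1
            key = (h, next_token)
            self.pair[L][key] = self.pair[L].get(key, 0) + 1
```

Returning the matched length lets the caller implement the entropy-adaptive alpha (longer match → higher weight) outside the cache.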

Two-Phase Shared N-gram Cache (-0.64 bpb)

Phase 1 (parallel): each GPU scores its share of sliding windows.
Phase 2 (global): all scored data gathered, sorted by position, single global n-gram cache built sequentially.

  • Multi-order backoff: orders 11→10→...→2, first hit with count≥2 wins
  • Per-order entropy centers: high orders trusted at lower entropy
  • Per-order weights: orders 5-11 boosted, 2-3 suppressed
  • Pre-enrichment confidence modulation: PE delta modulates alpha
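The backoff rule above (orders 11 down to 2, first hit with count ≥ 2 wins) can be sketched as below. The tuple-keyed dicts are an assumption for readability; the per-order weights and entropy centers are left to the caller.

```python
ORDERS = range(11, 1, -1)  # 11, 10, ..., 2

class NgramBackoff:
    def __init__(self):
        self.ctx = {n: {} for n in ORDERS}    # context -> count
        self.pair = {n: {} for n in ORDERS}   # (context, next) -> count

    def update(self, tokens, next_token):
        """Score-first: update counts only after the chunk is scored."""
        for n in ORDERS:
            if len(tokens) < n:
                continue
            ctx = tuple(tokens[-n:])
            self.ctx[n][ctx] = self.ctx[n].get(ctx, 0) + 1
            key = (ctx, next_token)
            self.pair[n][key] = self.pair[n].get(key, 0) + 1

    def predict(self, tokens, next_token):
        """Back off from order 11 to 2; first context seen >= 2 times wins."""
        for n in ORDERS:
            if len(tokens) < n:
                continue
            ctx = tuple(tokens[-n:])
            c = self.ctx[n].get(ctx, 0)
            if c >= 2:
                # Return the matched order so per-order alpha/weights can apply.
                return self.pair[n].get((ctx, next_token), 0) / c, n
        return None, None
```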

EMA on GPU (37% faster training)

Step time: 64.7ms (vs 101ms before). 9,268 steps in 600s.
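The EMA update itself is just a per-parameter lerp; the speedup comes from keeping the shadow copy on the device so no host-device transfers happen each step (in PyTorch, roughly `ema.lerp_(param, 1 - decay)` on CUDA tensors). The pure-Python sketch below shows only the arithmetic, with decay=0.997 taken from the description:

```python
DECAY = 0.997  # per the "EMA (decay=0.997)" bullet

def ema_update(shadow, params, decay=DECAY):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params."""
    return [decay * s + (1 - decay) * p for s, p in zip(shadow, params)]
```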

GELU Pre-Enrichment (512→768→512)

Wider nonlinear transformation before transformer blocks.
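Dimension-wise, this is a Linear(512→768), GELU, Linear(768→512) stack applied before the transformer blocks (that reading is an assumption from the "512→768→512" shape). A dependency-free sketch with exact GELU:

```python
import math

def gelu(x):
    """Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def pre_enrich(x, w1, b1, w2, b2):
    """Widen x (d_model) to d_hidden, apply GELU, project back to d_model.
    In the PR: d_model=512, d_hidden=768."""
    h = [gelu(sum(xi * w1[i][j] for i, xi in enumerate(x)) + b1[j])
         for j in range(len(b1))]
    return [sum(hj * w2[j][k] for j, hj in enumerate(h)) + b2[k]
            for k in range(len(b2))]
```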

XSA (Exclusive Self Attention) on Last 4 Layers

Removes self-value bias via orthogonal projection (arXiv:2603.09078; GQA-aware variant from PR #265 by @unnir).
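One way to read "removes self-value bias via orthogonal projection" is: subtract from each position's attention output its component along that position's own value vector. This is my interpretation of the one-line description, not the cited paper's exact formulation:

```python
def remove_self_value(out, v):
    """Project `out` onto the orthogonal complement of this position's own
    value vector `v` (plain lists for illustration)."""
    dot = sum(o * vi for o, vi in zip(out, v))
    norm2 = sum(vi * vi for vi in v) or 1.0  # guard against a zero vector
    coef = dot / norm2
    return [o - coef * vi for o, vi in zip(out, v)]
```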


Additional Techniques

  • SmearGate: Per-dim gate blending each token with previous token.
  • BigramHash (2048x128): Hash-table embedding for token bigrams.
  • EMA (decay=0.997): Quant gap 0.004.
  • Int6 QAT + lzma: 14.94 MB artifact.
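Of these, SmearGate is the easiest to sketch: a learned per-dimension sigmoid gate blends each token's embedding with the previous token's. The function below is illustrative; the gate parameterization is an assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def smear_gate(x_prev, x_cur, gate_logits):
    """Per-dim blend: g * current + (1 - g) * previous, g = sigmoid(logit)."""
    g = [sigmoid(z) for z in gate_logits]
    return [gi * c + (1.0 - gi) * p for gi, c, p in zip(g, x_cur, x_prev)]
```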

Architecture: 10L, 512d, 8H/4KV GQA, MLP 3x, tied embeddings, U-Net skip connections. Training: Muon+AdamW,
WD=0.04, matrix_lr=0.025, warmdown=3500, batch=524K, seq=2048.


What Didn't Work

  • Hedge mixer: online learned weights worse than hand-tuned alpha (0.3265 vs 0.2722).
  • Learned mixer head (Linear 512→11): gate didn't generalize from training to eval data (0.3310).
  • TTT (AdamW, score-first): destroyed quantized weights (0.3528).
  • 11L + int5 MLP: quant gap 0.021 wiped out 11L advantage (0.3108).
  • Log-odds mixing: near-zero n-gram probs create catastrophic logits.
  • SSE post-correction: always pushes predictions toward 1.0.
  • Orders 12-13: no improvement over 2-11.

Compliance

  • Score-first: n-gram + phrase caches updated AFTER scoring each chunk
  • Backward-looking: caches at position p contain only tokens 0..p-1
  • No oracle selection: alpha depends on model entropy and n-gram order, never on ground truth
  • No training data access during eval
  • No two-pass rescoring

Reproduction

All defaults baked in. No env vars needed.

  python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
  torchrun --standalone --nproc_per_node=8 train_gpt.py

8xH100 SXM, 600s training + ~211s eval.


Key Metrics

| Metric | Value |
|---|---|
| val_bpb (phrase + n-gram) | 0.2722 |
| Sliding-window val_bpb | 1.1478 |
| Post-quant val_bpb (standard) | 1.1690 |
| Pre-quant val_bpb | 1.1646 |
| Quant gap | 0.004 |
| Training time | 600,031 ms (9,268 steps at 64.7 ms) |
| Eval time | 211,362 ms |
| Peak memory | 13,058 MiB |
| Artifact size | 14,942,971 bytes (14.94 MB) |
| Model parameters | 25,254,992 |

Update Log

  • v1 (1.1855): int8+zlib, MLP 2x, seq 1024
  • v2 (1.1709): int6 QAT + lzma, MLP 3x, SWA, seq 2048
  • v3 (1.1668): + SmearGate + BigramHash + EMA + wider pre-enrichment
  • v4 (1.1629): + XSA on last 4 layers
  • v5 (1.0689): + EMA on GPU (64ms/step) + 5-gram eval cache
  • v6 (0.9784): + multi-order backoff 2-7 + entropy-adaptive alpha
  • v7 (0.9408): + extended to orders 2-11 + steeper alpha
  • v8 (0.9393): + pre-enrichment confidence modulation
  • v9 (0.2995): + two-phase shared cache + per-order adaptive alpha (3-seed: 0.2995)
  • v10 (0.2722): + long phrase cache (lengths 48, 36, 28, 20, 16)
