
Record: 0.9076 BPB — 10L + N-gram Backoff + Matrix LR 0.03 #828

Closed
bigbag wants to merge 2 commits into openai:main from bigbag:submission/10L-ngram-lr03-0.9076

Conversation

@bigbag bigbag commented Mar 26, 2026

Summary

val_bpb = 0.9074 (3-seed mean, std 0.0002) | 15.26-15.46 MB | 8xH100 SXM, 600s

Single change from PR #802: MATRIX_LR=0.03 (was 0.02). Discovered through systematic hyperparameter screening (74 experiments across steps 10-12).

Results

| Seed | Steps | ms/step | Pre-quant BPB | N-gram BPB | Artifact (bytes) |
|---|---|---|---|---|---|
| 42 | 6,693 | 89.6 | 1.1528 | 0.9076 | 15,320,749 |
| 1337 | 6,605 | 90.9 | 1.1521 | 0.9072 | 15,261,004 |
| 2024 | 6,607 | 90.8 | 1.1520 | 0.9074 | 15,457,538 |
| **Mean** | | | | **0.9074 ± 0.0002** | |

Key Change

MATRIX_LR=0.03 vs PR #802's default 0.02.
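In speedrun-style training scripts, an env var like this usually selects the learning rate for a dedicated optimizer group holding the 2-D weight matrices. A minimal sketch of how `MATRIX_LR` might be wired in — the grouping rule and the `other_lr` default are guesses for illustration, not taken from `train_gpt.py`:

```python
import os

# The PR's single change: MATRIX_LR=0.03 (PR #802 used 0.02).
MATRIX_LR = float(os.environ.get("MATRIX_LR", "0.02"))

def split_param_groups(named_params, matrix_lr=MATRIX_LR, other_lr=3e-4):
    """Route 2-D weight matrices into their own LR group; everything else
    (biases, gains, 1-D params) keeps the base LR. This grouping rule is
    an assumption, not the PR's actual code."""
    matrix, other = [], []
    for name, p in named_params:
        (matrix if getattr(p, "ndim", 0) == 2 else other).append(name)
    return [{"params": matrix, "lr": matrix_lr},
            {"params": other, "lr": other_lr}]
```

With a split like this, bumping `MATRIX_LR` changes only the matrix group, leaving every other hyperparameter identical to the baseline — which is what makes the comparison to PR #802 a clean single-variable test.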

Architecture

  • 10L, 512d, GQA 8H/4KV, MLP 3x LeakyReLU(0.5)²
  • BigramHash(4096), SmearGate, Value Residual, Gated Attention
  • Mixed int5-MLP/int6-attn + zstd-22, EMA(0.997)
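Of these components, the hashed bigram table is the easiest to sketch: each (previous, current) token pair is hashed into a fixed-size table of learned vectors that are added to the residual stream. Everything below — the hash function, init scale, and lookup shape — is an illustrative guess, not the PR's implementation:

```python
import random

TABLE_SIZE = 4096   # matches BigramHash(4096) in the summary
D_MODEL = 512

# Toy embedding table; the real init scheme is not specified in the PR.
random.seed(0)
bigram_table = [[random.gauss(0.0, 0.02) for _ in range(D_MODEL)]
                for _ in range(TABLE_SIZE)]

def bigram_hash(prev_tok, cur_tok, table_size=TABLE_SIZE):
    # Arbitrary odd-multiplier mixing hash; the PR's exact hash is unknown.
    return ((prev_tok * 1000003) ^ cur_tok) % table_size

def bigram_features(tokens):
    """One hashed-bigram embedding per position (zeros at position 0),
    intended to be added alongside the ordinary token embeddings."""
    feats = [[0.0] * D_MODEL]
    for i in range(1, len(tokens)):
        feats.append(bigram_table[bigram_hash(tokens[i - 1], tokens[i])])
    return feats
```

Hash collisions are accepted by design: 4096 slots cannot represent all bigrams distinctly, but the table stays tiny, which matters under a 16 MB artifact budget.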

Eval: Multi-Order N-gram Backoff (from PR #802)

  • Score-first backward-looking n-gram cache (orders 2-7)
  • Entropy-adaptive alpha mixing
  • 133-156s eval time (well within 600s budget)
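The bullets above can be sketched as follows. This is a hedged reconstruction of the general technique — backoff from the longest matching context, then entropy-gated blending with the LM's distribution — where the backoff rule, order range handling, and alpha schedule are illustrative guesses, not PR #802's actual code:

```python
import math

def ngram_backoff_probs(history, counts, max_order=7, min_order=2):
    """Back off from the longest matching context to shorter ones.
    `counts[n][ctx][tok]` holds n-gram counts over already-scored tokens."""
    for n in range(max_order, min_order - 1, -1):
        ctx = tuple(history[-(n - 1):])
        if len(history) >= n - 1 and ctx in counts.get(n, {}):
            c = counts[n][ctx]
            total = sum(c.values())
            return {t: v / total for t, v in c.items()}
    return None  # no context matched at any order

def entropy_bits(p):
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def adaptive_mix(p_model, p_ngram, vocab_size, max_alpha=0.5):
    """Blend the LM and the cache, trusting the cache more when its
    distribution is low-entropy. The alpha schedule is an assumption."""
    if p_ngram is None:
        return p_model
    h = entropy_bits(p_ngram)
    alpha = max_alpha * max(0.0, 1.0 - h / math.log2(vocab_size))
    toks = set(p_model) | set(p_ngram)
    return {t: (1 - alpha) * p_model.get(t, 0.0) + alpha * p_ngram.get(t, 0.0)
            for t in toks}
```

Note that schemes in this family are exactly what the maintainer response below objects to: unless the mixture is renormalized against the full vocabulary and built strictly from past tokens, it can leak eval information.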

Reproduction

```
MATRIX_LR=0.03 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Test plan

  • 8xH100 SXM, seed 42: 0.9076 BPB
  • 8xH100 SXM, seed 1337: 0.9072 BPB
  • 8xH100 SXM, seed 2024: 0.9074 BPB
  • 3-seed mean: 0.9074 ± 0.0002
  • All artifacts ≤ 16MB (15.26-15.46 MB)
  • Training ≤ 600s
  • Eval ≤ 600s (133-156s)

Based On

PR #802.

🤖 Generated with Claude Code

Single change from PR openai#802: MATRIX_LR=0.03 (was 0.02).
Discovered through systematic screening (74 experiments, steps 10-12).

- 10L, 512d, GQA 8/4, LeakyReLU(0.5)², BigramHash 4096
- Multi-order n-gram backoff eval cache (orders 2-7)
- Entropy-adaptive alpha mixing (score-first, legal)
- 8xH100 SXM, 600s training, 138s eval
- Artifact: 15.32 MB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Nice result — the systematic hyperparameter screening (74 experiments) is a solid approach, and the MATRIX_LR finding is a clean single-variable improvement.

Heads up: the submission currently has 1 seed. The leaderboard requires 3-seed validation with statistical significance for record claims. Totally understand if you're waiting on compute before running the remaining seeds — just flagging so it doesn't get passed over during review.

greqone pushed a commit to greqone/parameter-golf that referenced this pull request Mar 26, 2026
… proxy)

10L + Multi-Order N-gram Backoff with entropy-adaptive alpha.
Validated on 1xH100 SXM (876 steps, 59% eval coverage).
Pending 8xH100 SXM verification for official record submission.

Based on PR openai#828 approach with MATRIX_LR=0.03.
Architecture: 10L, 512d, MLP 3x LeakyReLU(0.5)², XSA-4, VRL, BigramHash, SmearGate.
Artifact: 15.18 MB (under 16 MB limit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seeds 42 (0.9076), 1337 (0.9072), 2024 (0.9074).
All artifacts under 16MB (15.26-15.46 MB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@valerio-oai
Contributor

Thanks for your submission! Unfortunately, it's disallowed due to the use of hashed n-gram caches: they do not correctly renormalize and reweight the LM's token distribution, and they look ahead to the target token when mixing probabilities, which leaks eval tokens. Please refer to the long discussion about this under the issues tab for more details, and please submit more runs in the future!
