
Non-record: 4-Hour Progressive Depth — val_bpb 1.0889 #895

Open

iverbovoy wants to merge 12 commits into openai:main from iverbovoy:4hour-run
Conversation

@iverbovoy

Summary

Depth recurrence scaling study — first data point on how shared-weight recurrence scales with extended compute.

Results

| Eval | val_bpb |
| --- | --- |
| Roundtrip | 1.1613 |
| Sliding window | 1.1271 |
| Sliding + Hedge Mixer | 1.0889 |

vs. existing unlimited-compute entries:

  • Will DePue 4-hour flat baseline: 1.2074 → ours is 0.119 bpb lower
  • Ciprian-Florin Ifrim 2-hour 1-bit quant: 1.1239 → ours is 0.035 bpb lower

Key Finding

Shared-weight recurrence scales differently than flat architectures. At 132K steps with 5 repeats, each of the 3 blocks saw ~660K effective gradient passes. Progressive depth enables 5 repeats (15 effective layers) from 3 physical blocks — impossible to ramp dynamically with unique-layer architectures.
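A minimal sketch of the shared-weight recurrence with a ramped repeat count. Module names like `loop_embed` follow the commit notes, but the block internals and dimensions here are illustrative assumptions, not the run's actual config:

```python
import torch
import torch.nn as nn

class RecurrentStack(nn.Module):
    """3 physical blocks applied n_repeats times -> 3*n_repeats effective layers."""
    def __init__(self, dim=64, n_blocks=3, max_repeats=5):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(n_blocks)
        )
        # one timestep embedding per effective layer, i.e. per (repeat, block) pair
        self.loop_embed = nn.Embedding(n_blocks * max_repeats, dim)

    def forward(self, x, n_repeats):
        for r in range(n_repeats):            # depth can be ramped during training
            for b, block in enumerate(self.blocks):
                step = r * len(self.blocks) + b
                x = x + block(x + self.loop_embed.weight[step])
        return x

x = torch.randn(2, 8, 64)
model = RecurrentStack()
shallow = model(x, n_repeats=2)   # 6 effective layers
deep = model(x, n_repeats=5)      # 15 effective layers, same parameters
```

Because the parameter count is independent of `n_repeats`, the same weights serve every phase of the schedule, which is what makes the dynamic depth ramp possible.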

SWA at scale is massive: 38 checkpoints gave -0.060 bpb — larger than any single architectural change.
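In spirit, SWA with full-precision accumulation reduces to a plain running average over checkpoint weights. A toy sketch, with Python floats standing in for float32 tensors:

```python
def swa_average(checkpoints):
    """Stochastic weight averaging: average a list of weight dicts.
    Accumulation happens in full precision (here, Python floats), mirroring
    the float32-accumulation detail from the commit notes."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}

ckpts = [{"w": 1.0}, {"w": 2.0}, {"w": 3.0}]
avg = swa_average(ckpts)  # {'w': 2.0}
```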

Scaling Curve

| Steps | Phase | val_bpb |
| --- | --- | --- |
| 5K | 2 rep | 1.306 |
| 55K | 2 rep | 1.265 |
| 85K | 3 rep | 1.244 |
| 110K | 4 rep | 1.232 |
| 125K | 5 rep | 1.218 |
| 132K | 5 rep + SWA | 1.158 |

Test plan

  • Full 4-hour run on 8xH100 (14400s wallclock)
  • All 4 phase transitions completed without errors
  • Artifact 15.82MB < 16MB limit
  • Hedge Mixer eval 696s (score-first, no training data access)

Commit messages

- Replace 9 unique blocks with 3 blocks x 4 repeats (12 effective layers)
- Increase dim from 512 to 832, remove U-Net skips
- Add loop_embed for timestep encoding per effective layer
- Add cross-repeat skip: each block mixes in its output from previous repeat
  with per-repeat learned scales (stateful recurrence)
- Add 2 value embedding tables mixed into each layer with learned scales
- 17.14M params, best result: 1.6780 bpb (int8+zlib) on 2000 steps batch 8K
- Add eval_val_ttt: adapts model on each val batch before evaluating
- For each batch: save weights → K gradient steps → evaluate → restore
- Controlled by TTT_STEPS (default 0 = disabled) and TTT_LR (default 1e-4)
- Result: -0.010 bpb improvement on 200-step test (2.4124 → 2.4027)
- TTT eval runs after normal roundtrip eval, reports both scores
- Sliding window eval: window=1024, stride=256, ~-0.034 bpb
- forward_logits() method for sliding window support
- LR x0.3: matrix=0.012, embed=0.015, scalar=0.012 (sweep winner)
- GRAD_CLIP_NORM=0.3 for recurrence stability
- WARMDOWN_ITERS=3000
- train@1024 (not 2048) — better for recurrence (160ms vs 253ms/step)
- Fix grad_accum for non-power-of-2 GPU counts
- Best result: 1.2308 bpb sliding window on 6xH100 (3726 steps)
- Fix quantization clamp_min(1/ql) -> clamp_min(1e-12) preventing
  broken roundtrip on undertrained models
- Add Muon weight decay (0.04) for training stability
- Add SWA with float32 accumulation and final snapshot inclusion
- Remove sweep.sh
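The eval_val_ttt procedure listed above (per batch: save weights → K gradient steps → evaluate → restore, gated by TTT_STEPS and TTT_LR) can be sketched as follows. The `ToyModel` interface here is hypothetical, standing in for the real model:

```python
import copy

def eval_val_ttt(model, batches, ttt_steps=4, ttt_lr=1e-4):
    """Test-time-training eval: adapt the model on each val batch before
    scoring it, then roll the weights back so batches stay independent."""
    scores = []
    for batch in batches:
        snapshot = copy.deepcopy(model.state_dict())   # save weights
        for _ in range(ttt_steps):                     # K adaptation steps
            model.train_step(batch, lr=ttt_lr)
        scores.append(model.score(batch))              # evaluate adapted model
        model.load_state_dict(snapshot)                # restore
    return sum(scores) / len(scores)

class ToyModel:
    """Hypothetical stand-in with the interface the sketch assumes."""
    def __init__(self): self.w = 0.0
    def state_dict(self): return {"w": self.w}
    def load_state_dict(self, d): self.w = d["w"]
    def train_step(self, batch, lr): self.w += lr * (batch - self.w)
    def score(self, batch): return (batch - self.w) ** 2

m = ToyModel()
avg = eval_val_ttt(m, [1.0, 2.0], ttt_steps=2, ttt_lr=0.5)
```

With `ttt_steps=0` (the default in the notes) the adaptation loop is a no-op and the eval degenerates to the normal one.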
Improvements over previous submission (1.2196 → 1.2070, -0.014 bpb):
- XSA (Exclusive Self-Attention) on last 4 effective layers: -0.010 bpb
- LeakyReLU(0.5)² instead of relu²: -0.004 bpb
- GPTQ-lite: per-row best-of-5 clip percentiles for quantization
- zstd-22 compression instead of zlib (saves ~1.85MB artifact)
- SWA tuned to frac=0.4, every=50
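The "GPTQ-lite" per-row best-of-5 clip search could look like the sketch below. The percentile grid and squared-error metric are illustrative assumptions; the floor on the clip value mirrors the clamp_min(1e-12) fix noted earlier:

```python
def quantize_row(row, percentiles=(99.0, 99.5, 99.9, 99.99, 100.0), bits=8):
    """Per-row best-of-5 clipping: try several clip thresholds, symmetric
    int8-style quantize, keep the one with lowest reconstruction error."""
    levels = 2 ** (bits - 1) - 1  # 127 for int8
    best = None
    for p in percentiles:
        srt = sorted(abs(v) for v in row)
        clip = srt[min(len(srt) - 1, int(len(srt) * p / 100))]
        clip = max(clip, 1e-12)   # floor the scale so all-zero rows round-trip
        scale = clip / levels
        q = [max(-levels, min(levels, round(v / scale))) for v in row]
        err = sum((v - qi * scale) ** 2 for v, qi in zip(row, q))
        if best is None or err < best[0]:
            best = (err, q, scale)
    return best[1], best[2]

q, s = quantize_row([0.1, -0.2, 0.05, 1.0])
recon = [qi * s for qi in q]
```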

Tested on 8xH100, 80 train shards, PyTorch 2.5, 4290 steps.
Improvements over previous submission (1.2196 → 1.2065, -0.013 bpb):
- XSA (Exclusive Self-Attention) on last 4 effective layers: -0.010 bpb
- LeakyReLU(0.5)² instead of relu²: -0.004 bpb
- GPTQ-lite: per-row best-of-5 clip percentiles
- zstd-22 compression instead of zlib
- SWA tuned to frac=0.4, every=50

8xH100, 80 train shards, 4300 steps, 140ms/step, 15.87MB artifact.
Dynamic depth scheduling unique to shared-weight recurrence:
- Phase 1 (0-40%): 2 repeats, ~75ms/step — fast base training
- Phase 2 (40-65%): 3 repeats, ~83ms/step — intermediate depth
- Phase 3 (65-100%): 4 repeats, ~100ms/step — full recurrence
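The three-phase schedule above, written as a step-indexed function (the behavior exactly at the 40%/65% boundaries is an assumption):

```python
def repeats_for_step(step, total_steps):
    """Progressive depth schedule: 2 repeats for the first 40% of training,
    3 until 65%, then 4 for the remainder."""
    frac = step / total_steps
    if frac < 0.40:
        return 2
    if frac < 0.65:
        return 3
    return 4
```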

5981 steps vs 4300 without progressive depth (+39%).
SWA collected only at full depth (last phase) to avoid mixing phases.
Removed unused TTT eval code.

8xH100, 80 train shards, sliding 1.1973 (-0.009 vs previous 1.2065).
Progressive depth scheduling (2→3→4 repeats) unique to shared-weight
recurrence. 5861 steps in 600s vs ~4300 at constant depth (+36%).
Fix DDP race condition in phase switching via all_reduce sync.
Systematic tuning on 8xH100 (6 runs):
- WARMDOWN_ITERS 3000→2000: full LR at phase 4 entry (-0.0009)
- MATRIX/SCALAR_LR 0.012→0.018: higher LR for progressive depth (-0.0011)
- Combined: val_bpb 1.1960 sliding (-0.0020 from 1.1980)

Tested and rejected: schedule changes (3-phase optimal), SWA_EVERY=25,
5 repeats, GRAD_CLIP=0.5, VRL, per-repeat LoRA (artifact >16MB).
5-expert online ensemble (neural + unigram + bigram + trigram + entropy)
via Hedge algorithm at eval time. -0.051 bpb over sliding window.
Tuned defaults: LR=0.018, WARMDOWN=2000 (-0.002 from previous).
Total improvement: 1.2244 → 1.1454 (-0.079 from baseline).
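A minimal sketch of the Hedge (exponential-weights) mixing described above. The run's five experts and its learning rate are not reproduced; `eta` and the toy inputs here are illustrative:

```python
import math

def hedge_mix(expert_probs, eta=1.0):
    """Online Hedge mixture of experts at eval time.
    expert_probs[t][e] = expert e's probability for the true token at step t.
    Returns (total mixture log loss in nats, final expert weights)."""
    n_experts = len(expert_probs[0])
    w = [1.0 / n_experts] * n_experts
    total_loss = 0.0
    for probs in expert_probs:
        mix = sum(wi * pi for wi, pi in zip(w, probs))   # mixture probability
        total_loss += -math.log(max(mix, 1e-12))
        # Hedge update: w_e *= exp(-eta * logloss_e), i.e. w_e *= p_e ** eta
        w = [wi * max(pi, 1e-12) ** eta for wi, pi in zip(w, probs)]
        z = sum(w)
        w = [wi / z for wi in w]
    return total_loss, w

# a consistently sharp expert quickly dominates a weak one
loss, w = hedge_mix([[0.9, 0.1]] * 10)
```

Because the update only consumes per-token probabilities at eval time, this mixing needs no access to training data, consistent with the score-first eval noted in the test plan.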
Depth recurrence scaling study: 132K steps, 5 repeats (15 effective layers),
38 SWA checkpoints, Hedge Mixer eval. First data point on how shared-weight
recurrence scales with compute. Beats 4-hour flat baseline (1.2074) by 0.119.
