
Non-record: 4-Hour Progressive Depth — val_bpb 1.0889 #895

Open

iverbovoy wants to merge 12 commits into openai:main from iverbovoy:4hour-run
Conversation

@iverbovoy

Summary

Depth recurrence scaling study — first data point on how shared-weight recurrence scales with extended compute.

Results

| Eval | val_bpb |
| --- | --- |
| Roundtrip | 1.1613 |
| Sliding window | 1.1271 |
| Sliding + Hedge Mixer | 1.0889 |

vs. existing unlimited-compute entries:

  • Will DePue 4-hour flat baseline: 1.2074 → ours is 0.119 bpb lower
  • Ciprian-Florin Ifrim 2-hour 1-bit quant: 1.1239 → ours is 0.035 bpb lower

Key Finding

Shared-weight recurrence scales differently than flat architectures. At 132K steps with 5 repeats, each of the 3 blocks saw ~660K effective gradient passes. Progressive depth enables 5 repeats (15 effective layers) from 3 physical blocks — impossible to ramp dynamically with unique-layer architectures.
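A minimal sketch of the shared-weight recurrence with a ramped repeat count. Module names like `loop_embed` follow the commit notes, but the block internals and dimensions here are illustrative assumptions, not the run's actual config:

```python
import torch
import torch.nn as nn

class RecurrentStack(nn.Module):
    """3 physical blocks applied n_repeats times -> 3*n_repeats effective layers."""
    def __init__(self, dim=64, n_blocks=3, max_repeats=5):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(n_blocks)
        )
        # one timestep embedding per effective layer, i.e. per (repeat, block) pair
        self.loop_embed = nn.Embedding(n_blocks * max_repeats, dim)

    def forward(self, x, n_repeats):
        for r in range(n_repeats):            # depth can be ramped during training
            for b, block in enumerate(self.blocks):
                step = r * len(self.blocks) + b
                x = x + block(x + self.loop_embed.weight[step])
        return x

x = torch.randn(2, 8, 64)
model = RecurrentStack()
shallow = model(x, n_repeats=2)   # 6 effective layers
deep = model(x, n_repeats=5)      # 15 effective layers, same parameters
```

Because the parameter count is independent of `n_repeats`, the same weights serve every phase of the schedule, which is what makes the dynamic depth ramp possible.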

SWA at scale is massive: 38 checkpoints gave -0.060 bpb — larger than any single architectural change.
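In spirit, SWA with full-precision accumulation reduces to a plain running average over checkpoint weights. A toy sketch, with Python floats standing in for float32 tensors:

```python
def swa_average(checkpoints):
    """Stochastic weight averaging: average a list of weight dicts.
    Accumulation happens in full precision (here, Python floats), mirroring
    the float32-accumulation detail from the commit notes."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}

ckpts = [{"w": 1.0}, {"w": 2.0}, {"w": 3.0}]
avg = swa_average(ckpts)  # {'w': 2.0}
```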

Scaling Curve

| Steps | Phase | val_bpb |
| --- | --- | --- |
| 5K | 2 rep | 1.306 |
| 55K | 2 rep | 1.265 |
| 85K | 3 rep | 1.244 |
| 110K | 4 rep | 1.232 |
| 125K | 5 rep | 1.218 |
| 132K | 5 rep + SWA | 1.158 |

Test plan

  • Full 4-hour run on 8xH100 (14400s wallclock)
  • All 4 phase transitions completed without errors
  • Artifact 15.82MB < 16MB limit
  • Hedge Mixer eval 696s (score-first, no training data access)

Commit messages

- Replace 9 unique blocks with 3 blocks x 4 repeats (12 effective layers)
- Increase dim from 512 to 832, remove U-Net skips
- Add loop_embed for timestep encoding per effective layer
- Add cross-repeat skip: each block mixes in its output from previous repeat
  with per-repeat learned scales (stateful recurrence)
- Add 2 value embedding tables mixed into each layer with learned scales
- 17.14M params, best result: 1.6780 bpb (int8+zlib) on 2000 steps batch 8K
- Add eval_val_ttt: adapts model on each val batch before evaluating
- For each batch: save weights → K gradient steps → evaluate → restore
- Controlled by TTT_STEPS (default 0 = disabled) and TTT_LR (default 1e-4)
- Result: -0.010 bpb improvement on 200-step test (2.4124 → 2.4027)
- TTT eval runs after normal roundtrip eval, reports both scores
- Sliding window eval: window=1024, stride=256, ~-0.034 bpb
- forward_logits() method for sliding window support
- LR x0.3: matrix=0.012, embed=0.015, scalar=0.012 (sweep winner)
- GRAD_CLIP_NORM=0.3 for recurrence stability
- WARMDOWN_ITERS=3000
- train@1024 (not 2048) — better for recurrence (160ms vs 253ms/step)
- Fix grad_accum for non-power-of-2 GPU counts
- Best result: 1.2308 bpb sliding window on 6xH100 (3726 steps)
- Fix quantization clamp_min(1/ql) -> clamp_min(1e-12) preventing
  broken roundtrip on undertrained models
- Add Muon weight decay (0.04) for training stability
- Add SWA with float32 accumulation and final snapshot inclusion
- Remove sweep.sh
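The eval_val_ttt procedure listed above (per batch: save weights → K gradient steps → evaluate → restore, gated by TTT_STEPS and TTT_LR) can be sketched as follows. The `ToyModel` interface here is hypothetical, standing in for the real model:

```python
import copy

def eval_val_ttt(model, batches, ttt_steps=4, ttt_lr=1e-4):
    """Test-time-training eval: adapt the model on each val batch before
    scoring it, then roll the weights back so batches stay independent."""
    scores = []
    for batch in batches:
        snapshot = copy.deepcopy(model.state_dict())   # save weights
        for _ in range(ttt_steps):                     # K adaptation steps
            model.train_step(batch, lr=ttt_lr)
        scores.append(model.score(batch))              # evaluate adapted model
        model.load_state_dict(snapshot)                # restore
    return sum(scores) / len(scores)

class ToyModel:
    """Hypothetical stand-in with the interface the sketch assumes."""
    def __init__(self): self.w = 0.0
    def state_dict(self): return {"w": self.w}
    def load_state_dict(self, d): self.w = d["w"]
    def train_step(self, batch, lr): self.w += lr * (batch - self.w)
    def score(self, batch): return (batch - self.w) ** 2

m = ToyModel()
avg = eval_val_ttt(m, [1.0, 2.0], ttt_steps=2, ttt_lr=0.5)
```

With `ttt_steps=0` (the default in the notes) the adaptation loop is a no-op and the eval degenerates to the normal one.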
Improvements over previous submission (1.2196 → 1.2070, -0.014 bpb):
- XSA (Exclusive Self-Attention) on last 4 effective layers: -0.010 bpb
- LeakyReLU(0.5)² instead of relu²: -0.004 bpb
- GPTQ-lite: per-row best-of-5 clip percentiles for quantization
- zstd-22 compression instead of zlib (saves ~1.85MB artifact)
- SWA tuned to frac=0.4, every=50
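The "GPTQ-lite" per-row best-of-5 clip search could look like the sketch below. The percentile grid and squared-error metric are illustrative assumptions; the floor on the clip value mirrors the clamp_min(1e-12) fix noted earlier:

```python
def quantize_row(row, percentiles=(99.0, 99.5, 99.9, 99.99, 100.0), bits=8):
    """Per-row best-of-5 clipping: try several clip thresholds, symmetric
    int8-style quantize, keep the one with lowest reconstruction error."""
    levels = 2 ** (bits - 1) - 1  # 127 for int8
    best = None
    for p in percentiles:
        srt = sorted(abs(v) for v in row)
        clip = srt[min(len(srt) - 1, int(len(srt) * p / 100))]
        clip = max(clip, 1e-12)   # floor the scale so all-zero rows round-trip
        scale = clip / levels
        q = [max(-levels, min(levels, round(v / scale))) for v in row]
        err = sum((v - qi * scale) ** 2 for v, qi in zip(row, q))
        if best is None or err < best[0]:
            best = (err, q, scale)
    return best[1], best[2]

q, s = quantize_row([0.1, -0.2, 0.05, 1.0])
recon = [qi * s for qi in q]
```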

Tested on 8xH100, 80 train shards, PyTorch 2.5, 4290 steps.
Improvements over previous submission (1.2196 → 1.2065, -0.013 bpb):
- XSA (Exclusive Self-Attention) on last 4 effective layers: -0.010 bpb
- LeakyReLU(0.5)² instead of relu²: -0.004 bpb
- GPTQ-lite: per-row best-of-5 clip percentiles
- zstd-22 compression instead of zlib
- SWA tuned to frac=0.4, every=50

8xH100, 80 train shards, 4300 steps, 140ms/step, 15.87MB artifact.
Dynamic depth scheduling unique to shared-weight recurrence:
- Phase 1 (0-40%): 2 repeats, ~75ms/step — fast base training
- Phase 2 (40-65%): 3 repeats, ~83ms/step — intermediate depth
- Phase 3 (65-100%): 4 repeats, ~100ms/step — full recurrence
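The three-phase schedule above, written as a step-indexed function (the behavior exactly at the 40%/65% boundaries is an assumption):

```python
def repeats_for_step(step, total_steps):
    """Progressive depth schedule: 2 repeats for the first 40% of training,
    3 until 65%, then 4 for the remainder."""
    frac = step / total_steps
    if frac < 0.40:
        return 2
    if frac < 0.65:
        return 3
    return 4
```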

5981 steps vs 4300 without progressive depth (+39%).
SWA collected only at full depth (last phase) to avoid mixing phases.
Removed unused TTT eval code.

8xH100, 80 train shards, sliding 1.1973 (-0.009 vs previous 1.2065).
Progressive depth scheduling (2→3→4 repeats) unique to shared-weight
recurrence. 5861 steps in 600s vs ~4300 at constant depth (+36%).
Fix DDP race condition in phase switching via all_reduce sync.
Systematic tuning on 8xH100 (6 runs):
- WARMDOWN_ITERS 3000→2000: full LR at phase 4 entry (-0.0009)
- MATRIX/SCALAR_LR 0.012→0.018: higher LR for progressive depth (-0.0011)
- Combined: val_bpb 1.1960 sliding (-0.0020 from 1.1980)

Tested and rejected: schedule changes (3-phase optimal), SWA_EVERY=25,
5 repeats, GRAD_CLIP=0.5, VRL, per-repeat LoRA (artifact >16MB).
5-expert online ensemble (neural + unigram + bigram + trigram + entropy)
via Hedge algorithm at eval time. -0.051 bpb over sliding window.
Tuned defaults: LR=0.018, WARMDOWN=2000 (-0.002 from previous).
Total improvement: 1.2244 → 1.1454 (-0.079 from baseline).
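A minimal sketch of the Hedge (exponential-weights) mixing described above. The run's five experts and its learning rate are not reproduced; `eta` and the toy inputs here are illustrative:

```python
import math

def hedge_mix(expert_probs, eta=1.0):
    """Online Hedge mixture of experts at eval time.
    expert_probs[t][e] = expert e's probability for the true token at step t.
    Returns (total mixture log loss in nats, final expert weights)."""
    n_experts = len(expert_probs[0])
    w = [1.0 / n_experts] * n_experts
    total_loss = 0.0
    for probs in expert_probs:
        mix = sum(wi * pi for wi, pi in zip(w, probs))   # mixture probability
        total_loss += -math.log(max(mix, 1e-12))
        # Hedge update: w_e *= exp(-eta * logloss_e), i.e. w_e *= p_e ** eta
        w = [wi * max(pi, 1e-12) ** eta for wi, pi in zip(w, probs)]
        z = sum(w)
        w = [wi / z for wi in w]
    return total_loss, w

# a consistently sharp expert quickly dominates a weak one
loss, w = hedge_mix([[0.9, 0.1]] * 10)
```

Because the update only consumes per-token probabilities at eval time, this mixing needs no access to training data, consistent with the score-first eval noted in the test plan.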
Depth recurrence scaling study: 132K steps, 5 repeats (15 effective layers),
38 SWA checkpoints, Hedge Mixer eval. First data point on how shared-weight
recurrence scales with compute. Beats 4-hour flat baseline (1.2074) by 0.119.
