Phase 9: RL-Guided Exploration
Tentative / long-term. DeepSeek-Prover v2 demonstrated that an RL-trained AI system can exploit subtle Lean bugs. This validates our LLM-guided approach and suggests RL-based exploration as a high-leverage future direction.
Status: NOT STARTED (speculative/long-term)
Ref: docs/design-plan.md Phase 9
Reward signal design
Use the tiered corpus classification (Phase 4) as the reward function:
| Outcome |
Reward |
| Prefix doesn't compile |
0.0 |
| Prefix compiles, golden suffix has type error |
0.3 |
| Prefix compiles, golden suffix fails with "proof not found" |
0.7 (Tier 0) |
| Prefix compiles, golden suffix succeeds (soundness bug!) |
1.0 |
Policy architecture
Integration with UCB1 bandits
Prerequisites
Earliest feasible start: after Phases 1-5 are mature and producing corpus data.
Phase 9: RL-Guided Exploration
Status: NOT STARTED (speculative/long-term)
Ref:
docs/design-plan.mdPhase 9Reward signal design
Use the tiered corpus classification (Phase 4) as the reward function:
Policy architecture
Integration with UCB1 bandits
Prerequisites
Earliest feasible start: after Phases 1-5 are mature and producing corpus data.