PromptCoT 2.0

Scaling Prompt Synthesis for LLM Reasoning

📄 Paper • 🤗 Hugging Face

PromptCoT 2.0 Logo


✨ Overview

PromptCoT 2.0 is a principled and scalable framework for prompt synthesis that substantially advances LLM reasoning in both mathematics and programming.

It introduces an EM-style rationale-driven synthesis loop (concept → rationale → problem), enabling the automatic generation of diverse and challenging problems at scale. These synthetic prompts support two complementary training regimes:

Self-Play: the model improves autonomously by learning from verifiable signals (e.g., unit tests for code, boxed answers for math). With this approach, a 30B-A3B self-play model achieves 92.1 on AIME24, 89.8 on AIME25, and 76.7 on HMMT Feb25, as well as 74.2 on LiveCodeBench v5, 71.0 on v6, and 2079 Elo on Codeforces. It surpasses strong open-source baselines (Qwen3-30B-A3B-Thinking) and is competitive with closed-source leaders such as Gemini 2.5 Pro and OpenAI o3 across math and code.

SFT: a 7B model trained 100% on synthetic data—using prompts synthesized by PromptCoT 2.0 and complete reasoning trajectories distilled from GPT-OSS-120B (medium)—reaches 73.1 on AIME24, 65.6 on AIME25, and 1815 Elo on Codeforces, outperforming counterparts trained on human-written prompts.

Unleash the PromptCoT tide of reasoning!


⚡ Main Results

Self-Play @ Qwen3-30B-A3B-2507-Thinking:

PromptCoT 2.0 Self-Play (30B-A3B) vs baselines

PromptCoT 2.0 demonstrates that large-scale self-play with verifiable signals is effective for advancing LLM reasoning. At 30B scale, self-play achieves performance competitive with closed-source leaders (Gemini 2.5 Pro, OpenAI o3) and surpasses strong open-source baselines.

SFT @ Qwen2.5-7B-Instruct:

PromptCoT 2.0 SFT (7B) vs OpenCodeReasoning vs OpenMathReasoning

PromptCoT 2.0 (7B, SFT) is the first model trained entirely on synthetic prompts with trajectories distilled from GPT-OSS-120B. Unlike OpenCodeReasoning and OpenMathReasoning — both built on human-written prompts — PromptCoT 2.0 achieves stronger performance, highlighting the potential of fully synthetic prompt synthesis as a foundation for reasoning models.


🔮 Releases

[2025/10/26] We release the problem generation recipe (problem_generation.sh), enabling full reproduction of PromptCoT 2.0's scalable synthesis pipeline from concept files.

[2025/09/24] We release PromptCoT 2.0:
the first framework to scale prompt synthesis across both math and programming, enabling 30B self-play competitive with Gemini 2.5 Pro / OpenAI o3, and 7B SFT (100% synthetic prompts) surpassing human-written baselines.

📂 Resources

[2025/05/30] We release PromptCoT-Mamba (🤗 PromptCoT-Mamba-7B):
the first attention-free reasoning model, combining PromptCoT with Mamba-2 to achieve strong math & code performance with constant-memory inference.

[2025/04/11] We release PromptCoT-QwQ-32B and PromptCoT-QwQ-Dataset:
self-play of QwQ-32B using PromptCoT synthetic problems, with dedicated datasets for reproducible training.

[2025/03/07] We release PromptCoT 1.0 (🤗 HF Collection):
the first rationale-driven synthesis pipeline for Olympiad-level math problems, releasing problem generation models, distilled models, and datasets.


Quick Start

git clone https://github.com/inclusionAI/PromptCoT
cd PromptCoT
pip install -r requirements.txt

Configuration

Top-level scripts support loading default configuration values from a local .env file.

  1. Copy .env.example to .env
  2. Edit values (for example MODEL_PATH, N_GPUS, DATA_PATH, OUTPUT_PATH)
  3. Validate your setup:
python validate_config.py

Notes:

  • Precedence is CLI args > .env > code defaults.
  • MODEL_PATH / TOKENIZER_PATH can be a local path or a Hugging Face model id; the validator only checks filesystem paths.
  • Empty strings in .env are treated as "unset" (e.g. DATA_PATH= behaves like not set).
  • Prefer namespaced environment variables (e.g. SPLIT_MERGE_OUTPUT_PATH, SELF_PLAY_OUTPUT_PATH) to avoid collisions when you run multiple scripts from the same .env.
  • Some scripts historically used different env var names (e.g. infer_split_merge.py uses N_SPLITS, while infer_self_play.py uses NUM_SPLITS); .env.example documents the mapping and the code includes small fallbacks for these.
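
For reference, a minimal .env might look like this (illustrative values only; the variable names come from the notes above and .env.example):

# Model / tokenizer can be a local path or a Hugging Face model id
MODEL_PATH=Qwen/Qwen3-30B-A3B-Thinking-2507
TOKENIZER_PATH=Qwen/Qwen3-30B-A3B-Thinking-2507
N_GPUS=8

# Empty values behave like "unset" and fall back to per-script defaults
DATA_PATH=
OUTPUT_PATH=

# Namespaced overrides avoid collisions when several scripts share one .env
SPLIT_MERGE_OUTPUT_PATH=qwen_evals/output.jsonl
SELF_PLAY_OUTPUT_PATH=code/selfplay_output.jsonl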

To run the lightweight unit tests in this repo:

python -m unittest discover -s tests -v

🧩 Problem Generation (Concept → Rationale → Problem)

We provide a script to synthesize problems from concept files using the PromptCoT 2.0 pipeline.

  • Concept files: available at xl-zhao/PromptCoT-2.0-Concepts (e.g., PromptCoT-2.0-Concepts/code.jsonl); see the download sketch below.
  • Model: set --model_path in the script to your PromptCoT-2.0-Prompt-Generation-Model (see Releases for links).
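
If the concept files are not already on disk, one way to fetch them (assuming the repository is hosted as a Hugging Face dataset) is:

# Download the code-domain concept file into a local folder
huggingface-cli download xl-zhao/PromptCoT-2.0-Concepts code.jsonl \
  --repo-type dataset \
  --local-dir PromptCoT-2.0-Concepts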

Make the script executable and run:

chmod +x problem_generation.sh
./problem_generation.sh

Self-Play Pipeline (Code Example)

We illustrate the self-play workflow in the code domain, where unit tests provide verifiable reward signals.


Step 1 — Verifiable Reward Generation (test case construction)
The input .jsonl file must include a "problem" field for each instance, specifying the coding task to be solved.
In each run, a new test case is generated and appended to the "completions" field, progressively enriching the specification.
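
For illustration, an input record might look like the following (only the "problem" field is required; the exact shape of the generated entries in "completions" is an assumption here):

{"problem": "Write a function that returns the k-th smallest element of an unsorted array."}

After each run, the generated test case is appended to that record's "completions" list.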

# Generate 4 rounds of test cases with different seeds
for seed in {0..3}; do
  python test_cases_generation.py \
    --seed $seed \
    --data_path code/prompts_test_cases_${seed}.jsonl \
    --output_path code/prompts_test_cases_$((seed+1)).jsonl \
    --model_path Qwen/Qwen3-32B \
    --n_gpus 4 \
    --temperature 0.6 \
    --max_len 16384 \
    --use_chat_template True
done

Post-process the generated test cases into a structured format:

python test_cases_postprocess.py \
  --input_file code/prompts_test_cases_4.jsonl \
  --output_path code/prompts_test_cases_processed.jsonl

Step 2 — Self-Play Trajectory Collection

Using the processed test cases, generate diverse trajectories by sampling across multiple seeds:

for seed in {0..7}; do
  python infer_self_play.py \
    --data_path code/selfplay_${seed}.jsonl \
    --output_path code/selfplay_$((seed+1)).jsonl \
    --model_path Qwen/Qwen3-30B-A3B-Thinking-2507 \
    --trust_remote_code True \
    --n_gpus 8 \
    --num_splits 4 \
    --num_completions 8 \
    --seed $seed \
    --temperature 1.2 \
    --max_len 81920 \
    --use_chat_template True
done
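
The loop reads code/selfplay_${seed}.jsonl and writes code/selfplay_$((seed+1)).jsonl, so it expects an initial code/selfplay_0.jsonl to exist. One way to seed it (an assumption, not a documented step) is to start from the processed prompt file:

# Assumed initialization: seed round 0 from the processed test-case prompts
cp code/prompts_test_cases_processed.jsonl code/selfplay_0.jsonl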

Step 3 — Reward Assignment

Evaluate each trajectory against the constructed test cases and assign reward signals automatically:

python self_play_eval.py \
  --data_path code/selfplay_8.jsonl \
  --output_path code/selfplay_verified.jsonl \
  --eval_type code \
  --num_workers 16
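
The verified file keeps each trajectory together with its automatically assigned reward. Schematically (hypothetical field names, not necessarily the script's actual schema):

{"problem": "...", "completion": "...", "reward": 1.0}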

Step 4 — Pair Construction

Aggregate verified trajectories into chosen vs. rejected pairs for offline self-play training:

python prepare_self_play_data.py \
  --data_path code/selfplay_verified.jsonl \
  --output_path code/selfplay_training.jsonl
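
The resulting training file contains preference pairs. A hypothetical record (illustrative field names) might look like:

{"prompt": "...", "chosen": "<trajectory that passed the test cases>", "rejected": "<trajectory that failed>"}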

SFT Pipeline (Code Example)

We illustrate the SFT workflow in the code domain, using teacher trajectories from GPT-OSS-120B.


Step 1 — Teacher Trajectory Collection

Sample teacher responses for each prompt, with one trajectory per problem:

python infer_self_play.py \
  --data_path code/prompts_test_cases_processed.jsonl \
  --output_path code/prompts_trajectories.jsonl \
  --model_path openai/gpt-oss-120b \
  --trust_remote_code True \
  --n_gpus 8 \
  --num_splits 4 \
  --num_completions 1 \
  --seed 0 \
  --temperature 1.0 \
  --max_len 16384 \
  --use_chat_template True

Step 2 — Data Post-Processing

Filter incomplete or invalid trajectories, and format them into clean prompt–completion pairs for supervised fine-tuning:

python prepare_sft_data_code.py \
  --data_path code/prompts_trajectories.jsonl \
  --output_path code/sft_training.jsonl \
  --tokenizer_path Qwen/Qwen2.5-7B-Instruct
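
The output is a set of prompt–completion pairs ready for supervised fine-tuning. Schematically (hypothetical field names):

{"prompt": "<synthesized coding problem>", "completion": "<teacher reasoning trajectory and final solution>"}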

Benchmark Results Reproduction

We provide scripts to reproduce results for both self-play and SFT models.
For math evaluations, we recommend setting VLLM_USE_V1=0 to ensure reproducibility.
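
For example, set the flag in your shell before launching the math runs below:

export VLLM_USE_V1=0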


Self-Play Models

30B-A3B (Math)

for dataset in aime24 aime25 hmmt25; do
  python infer_split_merge.py \
    --data_path data/promptcot2_${dataset}_test.jsonl \
    --output_path qwen_evals/30b_a3b/${dataset}.jsonl \
    --model_path /path/to/PromptCoT-2.0-SelfPlay-30B-A3B \
    --n_splits 4 \
    --expected_runs 16 \
    --temperature 0.6 \
    --top_p 0.95 \
    --max_len 81920 \
    --factor 1.75 \
    --original_max_position_embeddings 262144
done

30B-A3B (Code)

# Codeforces
python infer_split_merge.py \
  --data_path data/promptcot2_codeforces_test.jsonl \
  --output_path qwen_evals/30b_a3b/codeforces.jsonl \
  --model_path /path/to/PromptCoT-2.0-SelfPlay-30B-A3B \
  --n_splits 1 \
  --expected_runs 8 \
  --temperature 0.6 \
  --top_p 0.95 \
  --max_len 81920 \
  --factor 1.75 \
  --original_max_position_embeddings 262144

# LiveCodeBench v5 / v6
for dataset in lcb_v5 lcb_v6; do
  python infer_split_merge.py \
    --data_path data/promptcot2_${dataset}_test.jsonl \
    --output_path qwen_evals/30b_a3b/${dataset}.jsonl \
    --model_path /path/to/PromptCoT-2.0-SelfPlay-30B-A3B \
    --n_splits 1 \
    --expected_runs 1 \
    --temperature 0.6 \
    --top_p 0.95 \
    --max_len 81920 \
    --factor 1.75 \
    --original_max_position_embeddings 262144
done

4B (Math)

for dataset in aime24 aime25 hmmt25; do
  python infer_split_merge.py \
    --data_path data/promptcot2_${dataset}_test.jsonl \
    --output_path qwen_evals/4b/${dataset}.jsonl \
    --model_path /path/to/PromptCoT-2.0-SelfPlay-4B \
    --n_splits 8 \
    --expected_runs 16 \
    --temperature 0.6 \
    --top_p 0.95 \
    --max_len 81920 \
    --factor 1.75 \
    --original_max_position_embeddings 262144
done

4B (Code)

# Codeforces
python infer_split_merge.py \
  --data_path data/promptcot2_codeforces_test.jsonl \
  --output_path qwen_evals/4b/codeforces.jsonl \
  --model_path /path/to/PromptCoT-2.0-SelfPlay-4B \
  --n_splits 4 \
  --expected_runs 8 \
  --temperature 0.6 \
  --top_p 0.95 \
  --max_len 81920 \
  --factor 1.75 \
  --original_max_position_embeddings 262144

# LiveCodeBench v5 / v6
for dataset in lcb_v5 lcb_v6; do
  python infer_split_merge.py \
    --data_path data/promptcot2_${dataset}_test.jsonl \
    --output_path qwen_evals/4b/${dataset}.jsonl \
    --model_path /path/to/PromptCoT-2.0-SelfPlay-4B \
    --n_splits 8 \
    --expected_runs 1 \
    --temperature 0.6 \
    --top_p 0.95 \
    --max_len 81920 \
    --factor 1.75 \
    --original_max_position_embeddings 262144
done

SFT Models (7B)

Math

for dataset in aime24 aime25 hmmt25; do
  python infer_split_merge.py \
    --data_path data/promptcot2_${dataset}_test.jsonl \
    --output_path qwen_evals/sft/${dataset}.jsonl \
    --model_path /path/to/PromptCoT-2.0-SFT-7B \
    --n_splits 8 \
    --expected_runs 16 \
    --temperature 0.6 \
    --top_p 0.95 \
    --max_len 81920
done

Code

# Codeforces
python infer_split_merge.py \
  --data_path data/promptcot2_codeforces_test.jsonl \
  --output_path qwen_evals/sft/codeforces.jsonl \
  --model_path /path/to/PromptCoT-2.0-SFT-7B \
  --n_splits 8 \
  --expected_runs 8 \
  --temperature 0.6 \
  --top_p 0.95 \
  --max_len 81920

# LiveCodeBench v5 / v6
for dataset in lcb_v5 lcb_v6; do
  python infer_split_merge.py \
    --data_path data/promptcot2_${dataset}_test.jsonl \
    --output_path qwen_evals/sft/${dataset}.jsonl \
    --model_path /path/to/PromptCoT-2.0-SFT-7B \
    --n_splits 8 \
    --expected_runs 1 \
    --temperature 0.6 \
    --top_p 0.95 \
    --max_len 81920
done

📜 Citation

If you find the PromptCoT series useful, please consider citing our work:

@article{zhao2025promptcot2,
  title     = {PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning},
  author    = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Gong, Zhuocheng and Kong, Lingpeng},
  journal   = {arXiv preprint arXiv:2509.19894},
  year      = {2025},
  url       = {https://arxiv.org/abs/2509.19894}
}

@article{zhao2025scaling,
  title     = {Scaling Reasoning without Attention},
  author    = {Zhao, Xueliang and Wu, Wei and Kong, Lingpeng},
  journal   = {arXiv preprint arXiv:2505.22425},
  year      = {2025},
  url       = {https://arxiv.org/abs/2505.22425}
}

@article{zhao2025promptcot,
  title     = {PromptCoT: Synthesizing Olympiad-Level Problems for Mathematical Reasoning in Large Language Models},
  author    = {Zhao, Xueliang and Wu, Wei and Guan, Jian and Kong, Lingpeng},
  journal   = {arXiv preprint arXiv:2503.02324},
  year      = {2025},
  url       = {https://arxiv.org/abs/2503.02324}
}
