
Conversation

@JYMiracle305 (Contributor)

No description provided.

@JYMiracle305 force-pushed the add_1F1B branch 3 times, most recently from 496bbfd to 7108a12 on December 16, 2025 at 14:54
@JYMiracle305 force-pushed the add_1F1B branch 2 times, most recently from 3726518 to 9af4751 on December 22, 2025 at 09:04
@JYMiracle305 (Contributor, Author) commented on Dec 22, 2025

Added the hyperparameter virtual_pipeline_parallel (vpp_size), the number of virtual chunks each pipeline stage is split into under PP; with PP the model is partitioned into pp_size * vpp_size chunks, each assigned to its corresponding device. After the refactor, the different scheduling strategies expose a unified interface to the upper layer: when the PipelineParallelScheduler is constructed, it fills a Task table according to the chosen strategy, where each entry records a sub-task (its associated chunk, microbatch, and whether it is a forward or backward pass). During training the upper layer calls StepMicroBatches, which internally iterates over the task table.

When virtual_pipeline_parallel = 1, the schedule looks as follows:
[image: pipeline schedule with virtual_pipeline_parallel = 1]

When virtual_pipeline_parallel > 1, the schedule looks as follows:
[image: pipeline schedule with virtual_pipeline_parallel > 1]
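
To make the task-table idea concrete, here is a minimal, self-contained C++ sketch. Only the names Task, PipelineParallelScheduler, and StepMicroBatches come from this PR; the struct fields, the fill order (a plain all-forward-then-all-backward order rather than the actual 1F1B/interleaved schedules), and RunSubTask are illustrative assumptions, not the implementation in this repository.

```cpp
// Illustrative sketch only; not the PipelineParallelScheduler from this PR.
#include <cstdio>
#include <vector>

namespace sketch {

// One entry of the task table: which model chunk, which microbatch,
// and whether this sub-task is a forward or a backward pass.
struct Task {
    int chunk_id;       // local chunk index in [0, vpp_size)
    int micro_batch_id; // microbatch index in [0, num_micro_batches)
    bool is_forward;    // true = forward pass, false = backward pass
};

// Hypothetical scheduler: fills the task table at construction time and
// replays it each time StepMicroBatches is called.
class PipelineParallelScheduler {
public:
    PipelineParallelScheduler(int vpp_size, int num_micro_batches) {
        // Simplest possible fill order (all forwards, then all backwards);
        // a real 1F1B / interleaved strategy would interleave these entries.
        for (int chunk = 0; chunk < vpp_size; ++chunk) {
            for (int mb = 0; mb < num_micro_batches; ++mb) {
                tasks_.push_back({chunk, mb, /*is_forward=*/true});
            }
        }
        for (int chunk = vpp_size - 1; chunk >= 0; --chunk) {
            for (int mb = 0; mb < num_micro_batches; ++mb) {
                tasks_.push_back({chunk, mb, /*is_forward=*/false});
            }
        }
    }

    // The upper layer calls this once per training step; internally it just
    // walks the precomputed task table.
    void StepMicroBatches() const {
        for (const Task &task : tasks_) {
            RunSubTask(task);
        }
    }

private:
    static void RunSubTask(const Task &task) {
        std::printf("%s chunk=%d microbatch=%d\n",
                    task.is_forward ? "F" : "B", task.chunk_id, task.micro_batch_id);
    }

    std::vector<Task> tasks_;
};

} // namespace sketch

int main() {
    // e.g. PP=2, VPP=2 gives 2 local chunks on this rank; 4 microbatches.
    sketch::PipelineParallelScheduler scheduler(/*vpp_size=*/2, /*num_micro_batches=*/4);
    scheduler.StepMicroBatches();
    return 0;
}
```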

Multi-node training of gpt2 with DDP=2, TP=2 (SP=ON), PP=2 (VPP=2):
[image: gpt2 multi-node training result]

Multi-node training of LLaMA3 with DDP=2, TP=2 (SP=ON), PP=2 (VPP=2):
[image: LLaMA3 multi-node training result]

Diff context for the review comment below:

std::vector<std::shared_ptr<infini_train::Tensor>>
GPT2::Forward(const std::vector<std::shared_ptr<infini_train::Tensor>> &x) {
    int pp_rank = nn::parallel::pp_rank;
    // ...
void GPT2::BuildChunks() {
    // ...
Review comment (Contributor):
For Transformer models, BuildChunks could also be merged: gpt2/llama differ only in a pos_emb, so a single if check would cover it (see the sketch below).
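
A minimal sketch of that suggestion, assuming hypothetical names (TransformerModel, has_pos_emb_, token_emb_, layers_, chunks_) that are not taken from this repository; the actual BuildChunks signature and chunk layout may differ.

```cpp
// Sketch of the reviewer's suggestion; all type and member names here are
// placeholders, not the actual classes in this repository.
#include <memory>
#include <vector>

struct Module {};                        // stand-in for the framework's module type
using ModulePtr = std::shared_ptr<Module>;

class TransformerModel {
public:
    // A single BuildChunks shared by gpt2 and llama: the only structural
    // difference is whether a positional-embedding module is prepended to
    // the first chunk, which a simple if covers.
    void BuildChunks(int num_chunks) {
        chunks_.assign(num_chunks, {});
        chunks_.front().push_back(token_emb_);
        if (has_pos_emb_) {              // gpt2: true; llama: false (RoPE lives inside the layers)
            chunks_.front().push_back(pos_emb_);
        }
        // Spread the transformer layers evenly across the chunks.
        for (size_t i = 0; i < layers_.size(); ++i) {
            chunks_[i * num_chunks / layers_.size()].push_back(layers_[i]);
        }
        chunks_.back().push_back(lm_head_);
    }

private:
    bool has_pos_emb_ = true;
    ModulePtr token_emb_, pos_emb_, lm_head_;
    std::vector<ModulePtr> layers_;
    std::vector<std::vector<ModulePtr>> chunks_;
};

int main() {
    TransformerModel model;
    model.BuildChunks(/*num_chunks=*/2);
    return 0;
}
```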

@JYMiracle305 force-pushed the add_1F1B branch 3 times, most recently from 0f5628b to aeb8ee0 on December 25, 2025 at 04:59
@JYMiracle305 force-pushed the add_1F1B branch 2 times, most recently from f8b086c to c22da40 on December 26, 2025 at 03:21
@kilinchange merged commit 83d11cc into master on Jan 5, 2026
2 checks passed
@kilinchange deleted the add_1F1B branch on January 5, 2026 at 02:47