
Fix runtime issues for FSDP/DeepSpeed training #1017

Open

SajanGhimire1 wants to merge 2 commits into linkedin:main from SajanGhimire1:patch-1
Conversation

@SajanGhimire1

  • Removed unsafe usage of _MISSING_TYPE in parse_args.
  • Fixed KeyError in DataModule by correcting dataset field access.
  • Replaced set-based FSDP auto_wrap_policy with transformer_auto_wrap_policy.
  • Corrected Lightning precision strings to valid values (bf16-mixed).
  • Fixed devices argument to safely detect available GPUs.
  • Added safe get() for labels in training/validation steps to avoid KeyError.
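The auto-wrap change (third bullet) follows this shape. FSDP expects `auto_wrap_policy` to be a callable, not a raw set of module classes; the real callable is `torch.distributed.fsdp.wrap.transformer_auto_wrap_policy`. This is a minimal, torch-free sketch in which the policy function and `LlamaDecoderLayer` are stand-ins that mirror the real signatures:

```python
from functools import partial

# Stand-in mirroring torch.distributed.fsdp.wrap.transformer_auto_wrap_policy:
# always recurse into children; wrap a module iff it is a transformer block.
def transformer_auto_wrap_policy(module, recurse, nonwrapped_numel, transformer_layer_cls):
    return recurse or isinstance(module, tuple(transformer_layer_cls))

class LlamaDecoderLayer:
    """Stand-in for transformers.models.llama.modeling_llama.LlamaDecoderLayer."""

# The fix: bind the layer classes into a callable policy via functools.partial,
# instead of passing the bare set {LlamaDecoderLayer} as the policy.
auto_wrap_policy = partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)
```

In the actual code the same `partial(...)` is handed to the FSDP strategy in place of the set, with the real decoder-layer classes from transformers.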

Summary

Ensures stable and correct training across multi-GPU setups with FSDP/DeepSpeed by fixing dataset handling, auto-wrap policy, precision settings, and device detection.
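The precision and device fixes can be sketched as below. The keys match `pytorch_lightning.Trainer` arguments; the CPU-fallback logic is an assumption about the PR's intent, not its exact diff:

```python
# Detect available GPUs safely; fall back to CPU when torch or CUDA is absent.
try:
    import torch
    num_gpus = torch.cuda.device_count() if torch.cuda.is_available() else 0
except ImportError:  # torch not installed in this environment
    num_gpus = 0

trainer_kwargs = {
    "precision": "bf16-mixed",  # valid Lightning 2.x precision string
    "accelerator": "gpu" if num_gpus > 0 else "cpu",
    "devices": num_gpus if num_gpus > 0 else 1,  # Trainer requires devices >= 1
}
```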

Testing Done

  • Hardware Type:
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

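The safe label access from the change list (last bullet) can be sketched as follows. Falling back to `input_ids` for a causal-LM loss is an assumption, not necessarily the PR's exact choice:

```python
# Use dict.get so a batch without a "labels" key does not raise KeyError
# in training_step/validation_step.
def extract_labels(batch):
    labels = batch.get("labels")
    if labels is None:
        # Assumed fallback: causal-LM training often reuses input_ids as labels.
        labels = batch.get("input_ids")
    return labels
```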
from torch.utils.data import DataLoader
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
from transformers.models.qwen2.modeling_qwen2 import Qwen2DecoderLayer
from trl import DataCollatorForCompletionOnlyLM
Collaborator

What trl version should I use? I couldn't import DataCollatorForCompletionOnlyLM

Author

The DataCollatorForCompletionOnlyLM class requires trl >= 0.8.0; please make sure that version or higher is installed to avoid import errors.

Collaborator

trl==0.26.2 doesn't work

Author

trl==0.26.2 no longer exposes DataCollatorForCompletionOnlyLM in the same way. This example is intended to work with older TRL releases where the collator exists, specifically trl>=0.8.0,<0.21.0. In newer TRL versions (including 0.26.x), the collator was refactored/removed, which causes the import error.
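A requirements-file pin matching that range (a suggestion, not part of the PR itself):

```
trl>=0.8.0,<0.21.0
```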

data: str = "cais/mmlu"
output_dir: str = "mmlu_finetuning"
max_length: int = 2048
# for llama3 8B model, deepspeed will OOM with 16 on 8XA100 80G and 8 will OOM on 8XA100 40G
Collaborator

why removing comments?

Restore removed comments without functional changes