Fix DeepSeek-V3 H100 large scale timeout issue #2401

Open

scsudhakaran wants to merge 1 commit into main from scsudhakaran/dsv3

Conversation

@scsudhakaran
Contributor

@scsudhakaran scsudhakaran commented Feb 17, 2026

Summary by CodeRabbit

  • Chores
    • Updated backend configuration parameters to optimize performance handling for large-scale model operations.

Signed-off-by: Sanju C Sudhakaran <scsudhakaran@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Feb 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@scsudhakaran
Contributor Author

/ok to test 1a8e1d0

@scsudhakaran scsudhakaran added this to the 26.02 milestone Feb 17, 2026
@scsudhakaran scsudhakaran added the r0.3.0 Cherry-pick label for r0.3.0 release branch label Feb 17, 2026
@coderabbitai
Contributor

coderabbitai bot commented Feb 17, 2026

No actionable comments were generated in the recent review. 🎉


📝 Walkthrough

This PR modifies the DeepSeek V3 H100 FP8 SC Large Scale pretrain configuration by adding two parameters: virtual_pipeline_model_parallel_size set to 2 and pp_layout set to None. These parameters configure the virtual pipeline model parallel degree and pipeline layout strategy for the large-scale H100 FP8 SC variant.

Changes

Cohort / File(s): DeepSeek Performance Configuration — scripts/performance/configs/deepseek/deepseek_workload_base_configs.py
Summary: Added virtual_pipeline_model_parallel_size=2 and pp_layout=None parameters to the DeepSeek V3 H100 FP8 SC Large Scale pretrain configuration.
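Since the touched file is a plain Python config module, the change likely amounts to setting two extra fields on the large-scale recipe. A minimal sketch of the pattern, assuming a hypothetical dataclass-style config (only `virtual_pipeline_model_parallel_size` and `pp_layout` come from the PR; the `PretrainParallelConfig` class and its other field names are invented for illustration):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PretrainParallelConfig:
    # Hypothetical container for the parallelism knobs this PR touches.
    tensor_model_parallel_size: int = 1
    pipeline_model_parallel_size: int = 1
    # Added in this PR: interleave the pipeline so each pipeline rank holds
    # two virtual stages, which shrinks the pipeline bubble at large scale.
    virtual_pipeline_model_parallel_size: Optional[int] = 2
    # Added in this PR: None lets the framework derive the per-stage layer
    # layout instead of pinning an explicit layer-count list per stage.
    pp_layout: Optional[List[int]] = None

# The large-scale H100 FP8 SC variant would then carry these values.
deepseek_v3_h100_fp8_sc_large_scale = PretrainParallelConfig()
```

Setting `virtual_pipeline_model_parallel_size` above 1 is the usual lever for the interleaved pipeline schedule; leaving `pp_layout=None` keeps the layer partitioning automatic so the two settings do not conflict.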

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

  • DeepSeek-V3 recipes for H100 #2312 — Modifies the same DeepSeek H100 pretrain base configs file by adding/setting identical virtual pipeline parallelism and pipeline layout parameters.

Suggested labels

performance

Suggested reviewers

  • ko3n1g
  • erhoo82
🚥 Pre-merge checks: 3 passed ✅, 2 failed ❌

❌ Failed checks (2 warnings)

  • Title check ⚠️ Warning — The title claims to fix a timeout issue, but the changes only modify parallelism parameters (virtual_pipeline_model_parallel_size and pp_layout) without evidence of addressing timeouts. Resolution: revise the title to reflect the actual changes, such as 'Add virtual pipeline and pp layout parameters to DeepSeek-V3 H100 large scale config', or provide context explaining how these parameter changes resolve the timeout.
  • Test Results For Major Changes ⚠️ Warning — The PR claims to fix the DeepSeek-V3 H100 timeout issue, but known-issues.md, created in the same commit, still lists it as an active problem, and no evidence is provided that the fix resolves the timeout. Resolution: add test results proving H100 training succeeds without timing out, update known-issues.md to reflect the fix status, document why the parameter changes resolve the timeout, and provide convergence/performance validation.
✅ Passed checks (3 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Docstring Coverage ✅ Passed — No functions found in the changed files to evaluate, so the docstring coverage check was skipped.
  • Merge Conflict Detection ✅ Passed — No merge conflicts detected when merging into main.

