fix: combined projection by ZhiyuLi-Nvidia · Pull Request #1324 · NVIDIA-NeMo/Automodel

ZhiyuLi-Nvidia · 2026-02-18T17:41:48Z

What does this PR do?

Problem

The custom model's combined projections (qkv_proj, gate_up_proj) use ColwiseParallel for tensor parallelism, but the weight layout was a naive concatenation:

QKV: [all Q rows | all K rows | all V rows]
gate_up: [all gate rows | all up rows]

ColwiseParallel shards dim 0 evenly, so with this layout each TP rank receives the wrong mix of Q/K/V heads. For example, rank 0 may get all of Q and part of K, while rank 1 gets the rest of K and all of V. The results are silently incorrect, especially under GQA, where the Q and KV head counts differ.
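
To make the failure mode concrete, here is a minimal sketch that uses string labels in place of weight rows (one label per head). The helper names `naive_qkv_rows` and `shard_dim0` are hypothetical and not part of the PR; `shard_dim0` only imitates an even, contiguous split along dim 0.

```python
def naive_qkv_rows(num_q_heads, num_kv_heads):
    """Head labels for the naive [all Q | all K | all V] concatenation."""
    q = [f"Q{i}" for i in range(num_q_heads)]
    k = [f"K{i}" for i in range(num_kv_heads)]
    v = [f"V{i}" for i in range(num_kv_heads)]
    return q + k + v


def shard_dim0(rows, tp_size):
    """Split rows into tp_size equal contiguous chunks (an even dim-0 shard)."""
    assert len(rows) % tp_size == 0, "rows must divide evenly across ranks"
    per_rank = len(rows) // tp_size
    return [rows[r * per_rank:(r + 1) * per_rank] for r in range(tp_size)]


# GQA-style example: 4 Q heads share 2 KV heads, split across 2 TP ranks.
shards = shard_dim0(naive_qkv_rows(num_q_heads=4, num_kv_heads=2), tp_size=2)
for rank, rows in enumerate(shards):
    print(f"rank {rank}: {rows}")
```

In this toy configuration rank 0 holds only Q heads and rank 1 holds only K and V heads, so neither rank can compute attention for a complete head group on its own.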

Changelog

Interleave the weight layout so that ColwiseParallel sharding naturally gives each rank complete, matched groups:

QKV: KV-head-grouped layout [Q_group_0 | K_0 | V_0 | Q_group_1 | K_1 | V_1 | ...], so each TP rank gets whole KV-head groups together with their corresponding Q heads.
gate_up: Row-interleaved layout [gate_0, up_0, gate_1, up_1, ...], so each TP rank gets matched gate/up pairs.
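
The two layouts above can be sketched with string labels standing in for weight rows. This is an illustration only, not the PR's implementation: the helper names are hypothetical, and the sketch assumes each KV head serves a contiguous block of Q heads and that the number of KV heads is divisible by the TP size.

```python
def grouped_qkv_rows(num_q_heads, num_kv_heads):
    """Head labels in [Q_group_0 | K_0 | V_0 | Q_group_1 | K_1 | V_1 | ...] order."""
    assert num_q_heads % num_kv_heads == 0
    q_per_kv = num_q_heads // num_kv_heads
    rows = []
    for g in range(num_kv_heads):
        # Q heads served by KV head g (assumed contiguous), then K_g and V_g.
        rows += [f"Q{g * q_per_kv + j}" for j in range(q_per_kv)]
        rows += [f"K{g}", f"V{g}"]
    return rows


def interleaved_gate_up_rows(intermediate_size):
    """Row labels in [gate_0, up_0, gate_1, up_1, ...] order."""
    rows = []
    for i in range(intermediate_size):
        rows += [f"gate{i}", f"up{i}"]
    return rows


def shard_dim0(rows, tp_size):
    """Split rows into tp_size equal contiguous chunks (an even dim-0 shard)."""
    assert len(rows) % tp_size == 0, "rows must divide evenly across ranks"
    per_rank = len(rows) // tp_size
    return [rows[r * per_rank:(r + 1) * per_rank] for r in range(tp_size)]


qkv_shards = shard_dim0(grouped_qkv_rows(num_q_heads=4, num_kv_heads=2), tp_size=2)
gate_up_shards = shard_dim0(interleaved_gate_up_rows(intermediate_size=4), tp_size=2)
print(qkv_shards)
print(gate_up_shards)
```

With the same even split along dim 0, every rank now receives one complete KV-head group with its Q heads, and every gate row arrives with its matching up row.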

Also included: DCP-based base model loading option

Tests

The loss curves of the HF and custom implementations match:

llama3-8b TP4FSDP2: https://wandb.ai/nvidia/automodel-dev-zhiyul/workspace?nw=62e6rt6s4kv
llama3-70b: https://wandb.ai/nvidia/automodel-dev-zhiyul?nw=43ugbs5opby


Additional Information

Related to # (issue)

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

copy-pr-bot · 2026-02-18T17:41:52Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ZhiyuLi-Nvidia · 2026-02-18T19:36:45Z

/ok to test 4b3ac35

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

ZhiyuLi-Nvidia · 2026-02-18T21:07:09Z

/ok to test 0687630

fix: compbined projection

4b3ac35

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

ZhiyuLi-Nvidia requested review from HuiyingLi, adil-a, akoumpa and hemildesai as code owners February 18, 2026 17:41

copy-pr-bot bot had a problem deploying to test February 18, 2026 19:37 Error

copy-pr-bot bot temporarily deployed to nemo-ci February 18, 2026 19:37 Inactive

lint

0687630

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

copy-pr-bot bot temporarily deployed to nemo-ci February 18, 2026 21:07 Inactive

copy-pr-bot bot temporarily deployed to test February 18, 2026 21:07 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci February 18, 2026 23:44 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci February 18, 2026 23:54 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci February 19, 2026 00:10 Inactive

copy-pr-bot bot had a problem deploying to nemo-ci February 19, 2026 00:10 Failure

copy-pr-bot bot temporarily deployed to nemo-ci February 19, 2026 00:10 Inactive

copy-pr-bot bot temporarily deployed to nemo-ci February 19, 2026 03:16 Inactive

ZhiyuLi-Nvidia wants to merge 2 commits into main from zhiyul/fix_combined_projection
