Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support to sync params when train_tp divides inference_tp. #92

Merged
merged 13 commits into from
Sep 26, 2024

Conversation

charles9304
Copy link
Collaborator

No description provided.

chatlearn/models/base_module.py Outdated Show resolved Hide resolved
chatlearn/runtime/parameter_sync.py Outdated Show resolved Hide resolved
chatlearn/models/base_module.py Show resolved Hide resolved
chatlearn/runtime/parameter_sync.py Show resolved Hide resolved
chatlearn/runtime/parameter_sync.py Show resolved Hide resolved
Copy link
Collaborator

@haolin-nju haolin-nju left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway,我觉得我这边review的意见可以分成如下几点:

  1. TP不对等情况下broadcast是不是在做两阶段的通信?如果追求性能,是不是也需要像TP对等一样实现P2P的通信,再对比P2P和两阶段broadcast的最终效果呢?
  2. 现在直接上两阶段broadcast的话整体看下来代码比较复杂,如果业务方需求dst_tp < src_tp,或者有一个新的模型定义,可能要对代码进行二次改动,维护成本比较高
  3. 代码的封装和解耦可能需要再细化一下
  4. 2 stage的param sync缺少UT,Regression Test成本较高且很难暴露corner case

chatlearn/models/base_module.py Show resolved Hide resolved
chatlearn/models/base_module.py Show resolved Hide resolved
chatlearn/runtime/parameter_sync.py Show resolved Hide resolved
@alibaba alibaba deleted a comment from haolin-nju Sep 25, 2024
@haolin-nju
Copy link
Collaborator

Anyway,我觉得我这边review的意见可以分成如下几点:

  1. TP不对等情况下broadcast是不是在做两阶段的通信?如果追求性能,是不是也需要像TP对等一样实现P2P的通信,再对比P2P和两阶段broadcast的最终效果呢?
  2. 现在直接上两阶段broadcast的话整体看下来代码比较复杂,如果业务方需求dst_tp < src_tp,或者有一个新的模型定义,可能要对代码进行二次改动,维护成本比较高
  3. 代码的封装和解耦可能需要再细化一下
  4. 2 stage的param sync缺少UT,Regression Test成本较高且很难暴露corner case

Resolved:
问题1/2: TP不对等的场景下,p2p不可行,会出现同个GPU send/recv的操作,这个在nccl通信是不支持的。

问题3:之前沟通的时候,建议将tp对等和非对等场景代码分开。也是担心改动影响tp对等场景。所以这版一些能共用的代码我也没有分开,暂时先不做改动,后面再慢慢合并两个场景的支持吧

问题4 ut/ft正准备专门提一个CI

@charles9304
Copy link
Collaborator Author

support for mcore model: #105

@charles9304 charles9304 merged commit 87961e2 into main Sep 26, 2024
3 checks passed
@charles9304 charles9304 deleted the feat_ps_new branch September 26, 2024 09:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants