Fix param input order for cudagraph #1138
base: main
Conversation
This fix seems plausible. It seems that make_graphed_callables expects sample_args to be ordered first by layer number, then by microbatch, then by model chunk:
TransformerEngine/transformer_engine/pytorch/graph.py
Lines 236 to 238 in 0ece13e

per_callable_fwd_idx = (m_chunk * num_microbatches * num_layers) + (
    fwd_idx[m_chunk] * num_layers + l_no
)
However, I see some of our MLPerf wrappers order by microbatch, then layer number, then model chunk: https://gitlab-master.nvidia.com/dl/mlperf/optimized/-/blob/main/large_language_model/pytorch/custom_callbacks.py#L249-L254
Pinging @ksivaman.
Also, can you sign your commit to pass the DCO check?
Signed-off-by: yifeis-nv <[email protected]>
Force-pushed from 0ece13e to 4235bfe.
Thanks for your reminder! I have signed my commit.
transformer_engine/pytorch/graph.py
Outdated
@@ -171,8 +171,8 @@ def _make_graphed_callables(
         ]
     else:
         per_callable_module_params = []
-        for c in callables:
-            for i in range(num_microbatches):
+        for i in range(num_microbatches):
The change doesn't appear to fully solve the bug. For example, this fix will work only when the number of model chunks (num_model_chunks) is 1. The correct solution would be:

for m_chunk in range(num_model_chunks):
    for idx in range(num_microbatches):
        for l_no in range(num_layers):
            c = callables[m_chunk * num_layers + l_no]
            per_callable_module_params.append(
                tuple(c.parameters()) if isinstance(c, torch.nn.Module) else ()
            )
Can you test if this fix works?
Thanks for your input! This works for my situation.
Signed-off-by: yifeis-nv <[email protected]>
for more information, see https://pre-commit.ci
/te-ci pytorch
Description
I discovered that when I attempt to use cudagraph during pipeline parallelism, the gradients become incorrect, ultimately leading to a NaN issue. After debugging, I identified a small bug in TE's graph.py.
Fixes # (issue)
Since the make_graphed_callables function in TE implements the backward graph through the torch.autograd.grad function, the weights are also passed into torch.autograd.grad through the inputs. This requires that the order of inputs to torch.autograd.grad matches the order used in the forward graph; otherwise, it will lead to errors in the backward pass.

Type of change
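To make the ordering requirement described above concrete, here is a small, self-contained illustration (not TE code; the tensors and shapes are made up) of how torch.autograd.grad returns gradients positionally, in the same order as its inputs:

import torch

w1 = torch.randn(3, requires_grad=True)
w2 = torch.randn(3, requires_grad=True)
x = torch.randn(3)
out = (w1 * x).sum() + (2.0 * w2 * x).sum()

# Gradients come back in the same order as the inputs tuple.
grad_w1, grad_w2 = torch.autograd.grad(out, (w1, w2))

# If a caller later zips these gradients against a parameter list that was
# built in a different order (e.g. (w2, w1)), each weight is paired with the
# other weight's gradient; this is the kind of mismatch the PR addresses for
# the graphed callables.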
Changes
Modify the input order of the weights inside the cudagraph-related module.
Checklist: