
Fix DeepSeek-V2 expert-parallelism failure due to indexing error #1765

Open
wants to merge 1 commit into main
Conversation

skavulya
Contributor

What does this PR do?

Fixes an indexing error during multi-card inference with expert parallelism on DeepSeek-V2. The indexing issue causes the following run to fail:

python ../gaudi_spawn.py --world_size=2 run_generation.py --model_name_or_path deepseek-ai/DeepSeek-V2-Lite --use_kv_cache --max_new_tokens 100 --batch_size 1 --bf16 --use_hpu_graphs --parallel_strategy "ep" --prompt "DeepSpeed is a machine learning framework"

Stack trace:

Warming up iteration 1/3
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:601: UserWarning: do_sample is set to False. However, temperature is set to 0.3 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:606: UserWarning: do_sample is set to False. However, top_p is set to 0.95 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset top_p.
  warnings.warn(
Setting pad_token_id to eos_token_id:None for open-end generation.
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/optimum-habana/examples/text-generation/run_generation.py", line 801, in <module>
[rank0]: main()
[rank0]: File "/root/optimum-habana/examples/text-generation/run_generation.py", line 563, in main
[rank0]: generate(None, args.reduce_recompile)
[rank0]: File "/root/optimum-habana/examples/text-generation/run_generation.py", line 534, in generate
[rank0]: outputs = model.generate(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/root/optimum-habana/optimum/habana/transformers/generation/utils.py", line 1477, in generate
[rank0]: result = self._sample(
[rank0]: File "/root/optimum-habana/optimum/habana/transformers/generation/utils.py", line 2458, in _sample
[rank0]: outputs = self(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 745, in forward
[rank0]: return wrapped_hpugraph_forward(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 610, in wrapped_hpugraph_forward
[rank0]: outputs = orig_fwd(*args, **kwargs)

[rank0]: File "/root/optimum-habana/optimum/habana/transformers/models/deepseek_v2/modeling_deepseek_v2.py", line 1918, in forward
[rank0]: outputs = self.model(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1847, in _call_impl
[rank0]: return inner()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1793, in inner
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/root/optimum-habana/optimum/habana/transformers/models/deepseek_v2/modeling_deepseek_v2.py", line 1714, in forward
[rank0]: layer_outputs = decoder_layer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1847, in _call_impl
[rank0]: return inner()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1793, in inner
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/root/optimum-habana/optimum/habana/transformers/models/deepseek_v2/modeling_deepseek_v2.py", line 1411, in forward
[rank0]: hidden_states = self.mlp(hidden_states)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1847, in _call_impl
[rank0]: return inner()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1793, in inner
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/root/optimum-habana/optimum/habana/transformers/models/deepseek_v2/modeling_deepseek_v2.py", line 700, in forward
[rank0]: htcore.mark_step()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/utils/internal.py", line 36, in lazy_wrapper
[rank0]: func(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/step_closure.py", line 71, in mark_step
[rank0]: htcore._mark_step(device_str, sync)
[rank0]: RuntimeError: synNodeCreateWithId failed for node: moe_bf16 with synStatus 26 [Generic failure]. .
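
For context, this failure pattern is typical of routing tokens with global expert indices on a card that only holds a slice of the experts. Below is a minimal sketch of the global-to-local expert index remapping that expert parallelism requires; the names (`route_to_local_experts`, `experts_per_rank`, and so on) are illustrative assumptions, not the actual code in `modeling_deepseek_v2.py`.

```python
import torch

def route_to_local_experts(expert_ids: torch.Tensor, ep_rank: int, num_experts: int, ep_size: int):
    """Map the router's global expert indices to this rank's local expert slots.

    Illustrative sketch: under expert parallelism each rank holds only
    num_experts // ep_size experts, so global indices must be shifted into
    the local range, and tokens routed to other ranks' experts must be
    masked out rather than indexed with out-of-range values.
    """
    experts_per_rank = num_experts // ep_size
    first_expert = ep_rank * experts_per_rank

    # True where the routed expert lives on this rank.
    local_mask = (expert_ids >= first_expert) & (expert_ids < first_expert + experts_per_rank)

    # Shift global ids into [0, experts_per_rank); clamp non-local ids to 0
    # so gather/index ops never see an out-of-range index (their results are
    # discarded via the mask afterwards).
    local_ids = torch.where(local_mask, expert_ids - first_expert, torch.zeros_like(expert_ids))
    return local_ids, local_mask

# Example: 64 routed experts split across 2 cards (--world_size=2), seen from rank 1.
ids = torch.tensor([[3, 40], [63, 12]])  # router output for 2 tokens, top_k=2
local_ids, mask = route_to_local_experts(ids, ep_rank=1, num_experts=64, ep_size=2)
# local_ids == [[0, 8], [31, 0]], mask == [[False, True], [True, False]]
```

Indexing a rank's local expert list with unshifted global ids produces out-of-range indices, which on HPU can surface not as an immediate Python IndexError but as an opaque graph-compile failure like the moe_bf16 error above.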

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

skavulya requested a review from regisss as a code owner on February 11, 2025.