[BUG] torchrun subprocess received Signal 8 (SIGFPE) #547

Open
ZeXin-Wang opened this issue Feb 3, 2025 · 1 comment

ZeXin-Wang commented Feb 3, 2025

Describe the bug

[Two screenshots attached; their contents are not available as text]

To Reproduce

conda env remove -n deepseek
conda create --name deepseek python=3.10
conda activate deepseek
pip install -r requirements.txt
python3 convert.py --hf-ckpt-path DeepSeek-V3 --save-path DeepSeek-V3-Demo --n-experts 256 --model-parallel 8
# run on a single machine with 8x H20 GPUs
torchrun --nnodes 1 --nproc-per-node 8 --node-rank 0 --master-addr 127.0.0.1 generate.py --ckpt-path DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200
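
Signal 8 (SIGFPE) kills the worker process natively, so the usual Python traceback may never be printed. One possible way to get a Python-level stack for each rank is the standard-library `faulthandler` module. The sketch below is not part of the repository: the helper name `enable_crash_tracebacks` and the `crash_rank<N>.log` file naming are assumptions made for illustration, and it assumes the helper is called early in the script that torchrun launches. `RANK` is the environment variable torchrun sets for each worker.

```python
import faulthandler
import os


def enable_crash_tracebacks(log_dir="."):
    """Dump Python tracebacks to a per-rank file on fatal signals.

    faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT,
    SIGBUS and SIGILL. The helper name and the file naming scheme
    are hypothetical, chosen for this sketch only.
    """
    # torchrun exports RANK for every worker; fall back to "0" otherwise.
    rank = os.environ.get("RANK", "0")
    path = os.path.join(log_dir, f"crash_rank{rank}.log")
    log_file = open(path, "w")
    # all_threads=True also dumps the stacks of non-main threads.
    faulthandler.enable(file=log_file, all_threads=True)
    # The file must stay open while the handler is active, so return it.
    return log_file
```

Setting the environment variable `PYTHONFAULTHANDLER=1` before launching has a similar effect without code changes, with the tracebacks going to stderr instead of a file.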

GeeeekExplorer (Contributor) commented

There should be more detailed Python error messages printed before the torchrun error. Also, which versions of PyTorch and Triton are you using?
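
A minimal sketch of one way to collect the requested version information from inside the `deepseek` conda environment. It reads the installed package metadata rather than importing the packages, so it also works when an import itself would crash. The function names are illustrative, and the distribution names `torch` and `triton` are assumptions: some builds ship Triton under a different distribution name, in which case it would be reported as not installed.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_versions(packages=("torch", "triton")):
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```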
