@SrijanUpadhyay
Contributor

Issue #41720: CUDA assertion failures during multi-GPU generation with Qwen3 models, caused by NaN/Inf values in hidden states.

Changes:

  • Enhanced InfNanRemoveLogitsProcessor to handle hidden state stabilization
  • Automatically set remove_invalid_values=True for sharded models
  • Removed direct NaN handling from the Qwen3 model for a cleaner architecture
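
For readers unfamiliar with the sanitization that remove_invalid_values=True turns on, here is a minimal framework-free sketch of the idea: NaN entries become 0 and infinities are clamped to the largest finite values. This is not the code from this PR. The helper name `sanitize_logits` is hypothetical, and the float32 bounds are an assumption made for illustration (the real processor operates on tensors and uses the bounds of their dtype).

```python
import math

# Largest finite float32 value. Assumption: logits are treated as float32 here.
FLOAT32_MAX = 3.4028234663852886e38


def sanitize_logits(scores):
    """Return a copy of `scores` with NaN -> 0.0 and +/-inf clamped to finite bounds.

    Hypothetical helper for illustration only; not part of transformers.
    """
    cleaned = []
    for value in scores:
        if math.isnan(value):
            # NaN carries no usable information, so replace it with a neutral 0.0.
            cleaned.append(0.0)
        elif math.isinf(value):
            # Clamp infinities so later softmax/sampling steps see finite numbers.
            cleaned.append(FLOAT32_MAX if value > 0 else -FLOAT32_MAX)
        else:
            cleaned.append(value)
    return cleaned


if __name__ == "__main__":
    print(sanitize_logits([1.5, float("nan"), float("inf"), float("-inf")]))
```

Cleaning values at the logits-processing stage, rather than inside a single model's forward pass, keeps the safeguard model-agnostic, which appears to be the motivation for the third change above.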

Fixes #41720

@SrijanUpadhyay
Contributor Author

Hey @vasqu, I have made these changes. Please take a look and give me feedback on this PR.

Development

Successfully merging this pull request may close these issues.

Qwen3 with auto device mapping fails due to cudaErrorAssert on A800
