Conversation

@ansschh ansschh commented Aug 15, 2025

  • torch/utils.py: Added error handling to the init_distributed() function
    • Check CUDA availability before initialization
    • Validate the rank against the available device count
    • Test device accessibility and raise clear error messages
    • Handle distributed communication setup failures
    • Clean up failed process groups

ansschh added 2 commits August 9, 2025 11:25
- Add return type hint to get_tokenizer()
- Add type hints and checkpoint validation to generate.py main()
- Add parameter type hints to suppress_output() in torch/utils.py

Improves IDE support and catches potential bugs early.
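The repository's actual signatures aren't shown in this thread, so here is a hedged sketch of the kind of annotation the commit describes, using a hypothetical suppress_output() context manager with a parameter type hint and an annotated return type:

```python
# Illustrative only: the real suppress_output() in torch/utils.py may differ.
import contextlib
import io
import sys
from typing import Iterator


@contextlib.contextmanager
def suppress_output(suppress: bool = True) -> Iterator[None]:
    """Redirect stdout to a throwaway buffer while the block runs."""
    if not suppress:
        yield
        return
    saved = sys.stdout
    sys.stdout = io.StringIO()
    try:
        yield
    finally:
        # Always restore stdout, even if the body raised.
        sys.stdout = saved
```

Annotations like `suppress: bool` and `-> Iterator[None]` let IDEs and type checkers flag misuse (e.g. passing a string) before runtime, which is the "catches potential bugs early" benefit the commit mentions.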
- Add CUDA availability check before device initialization
- Validate rank against available CUDA device count
- Add device accessibility testing with clear error messages
- Add error handling for distributed communication setup
- Add cleanup for failed distributed process group initialization
- Provide helpful error messages with troubleshooting guidance

This prevents cryptic CUDA errors and provides clear feedback when:
- CUDA is not available
- Invalid device rank is specified
- Device access fails
- Distributed communication fails
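The checks above can be sketched as follows. This is a minimal reconstruction, not the PR's actual code: the helper name validate_device_setup(), the error-message wording, and the init_distributed() signature are all assumptions. The pure-Python validation is separated out so the failure modes are easy to see; the torch calls follow the standard torch.distributed API.

```python
# Hypothetical sketch of the validation described in the commit message.
def validate_device_setup(cuda_available: bool, rank: int, device_count: int) -> None:
    """Raise a clear RuntimeError instead of letting CUDA fail cryptically."""
    if not cuda_available:
        raise RuntimeError(
            "CUDA is not available. Check your driver installation and that "
            "PyTorch was built with CUDA support."
        )
    if not (0 <= rank < device_count):
        raise RuntimeError(
            f"Invalid rank {rank}: only {device_count} CUDA device(s) are "
            "visible. Check CUDA_VISIBLE_DEVICES and the launcher's rank "
            "assignment."
        )


def init_distributed(rank: int, world_size: int) -> None:
    import torch
    import torch.distributed as dist

    validate_device_setup(torch.cuda.is_available(), rank, torch.cuda.device_count())
    try:
        # Probe device accessibility before initializing communication.
        torch.cuda.set_device(rank)
        torch.zeros(1, device=f"cuda:{rank}")
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
    except Exception:
        # Clean up a partially initialized process group before re-raising.
        if dist.is_initialized():
            dist.destroy_process_group()
        raise
```

Separating the checks from the torch calls keeps the error paths testable without a GPU, and the try/except around init_process_group ensures a half-initialized group is torn down rather than left to poison later retries.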