Conversation

@ansschh ansschh commented Aug 15, 2025

  • torch/utils.py: Added error handling to the init_distributed() function
    • Check CUDA availability before initialization
    • Validate the rank against the available device count
    • Test device accessibility and raise clear error messages
    • Handle distributed communication setup failures
    • Clean up failed process groups

ansschh added 2 commits August 9, 2025 11:25
- Add return type hint to get_tokenizer()
- Add type hints and checkpoint validation to generate.py main()
- Add parameter type hints to suppress_output() in torch/utils.py

Improves IDE support and catches potential bugs early.
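The repository's actual signatures aren't shown in this thread, so here is a hedged sketch of the kind of annotation the commit describes, using a hypothetical suppress_output() context manager with a parameter type hint and an annotated return type:

```python
# Illustrative only: the real suppress_output() in torch/utils.py may differ.
import contextlib
import io
import sys
from typing import Iterator


@contextlib.contextmanager
def suppress_output(suppress: bool = True) -> Iterator[None]:
    """Redirect stdout to a throwaway buffer while the block runs."""
    if not suppress:
        yield
        return
    saved = sys.stdout
    sys.stdout = io.StringIO()
    try:
        yield
    finally:
        # Always restore stdout, even if the body raised.
        sys.stdout = saved
```

Annotations like `suppress: bool` and `-> Iterator[None]` let IDEs and type checkers flag misuse (e.g. passing a string) before runtime, which is the "catches potential bugs early" benefit the commit mentions.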
- Add CUDA availability check before device initialization
- Validate rank against available CUDA device count
- Add device accessibility testing with clear error messages
- Add error handling for distributed communication setup
- Add cleanup for failed distributed process group initialization
- Provide helpful error messages with troubleshooting guidance

This prevents cryptic CUDA errors and provides clear feedback when:
- CUDA is not available
- Invalid device rank is specified
- Device access fails
- Distributed communication fails
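The checks above can be sketched as follows. This is a minimal reconstruction, not the PR's actual code: the helper name validate_device_setup(), the error-message wording, and the init_distributed() signature are all assumptions. The pure-Python validation is separated out so the failure modes are easy to see; the torch calls follow the standard torch.distributed API.

```python
# Hypothetical sketch of the validation described in the commit message.
def validate_device_setup(cuda_available: bool, rank: int, device_count: int) -> None:
    """Raise a clear RuntimeError instead of letting CUDA fail cryptically."""
    if not cuda_available:
        raise RuntimeError(
            "CUDA is not available. Check your driver installation and that "
            "PyTorch was built with CUDA support."
        )
    if not (0 <= rank < device_count):
        raise RuntimeError(
            f"Invalid rank {rank}: only {device_count} CUDA device(s) are "
            "visible. Check CUDA_VISIBLE_DEVICES and the launcher's rank "
            "assignment."
        )


def init_distributed(rank: int, world_size: int) -> None:
    import torch
    import torch.distributed as dist

    validate_device_setup(torch.cuda.is_available(), rank, torch.cuda.device_count())
    try:
        # Probe device accessibility before initializing communication.
        torch.cuda.set_device(rank)
        torch.zeros(1, device=f"cuda:{rank}")
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
    except Exception:
        # Clean up a partially initialized process group before re-raising.
        if dist.is_initialized():
            dist.destroy_process_group()
        raise
```

Separating the checks from the torch calls keeps the error paths testable without a GPU, and the try/except around init_process_group ensures a half-initialized group is torn down rather than left to poison later retries.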