v0.13.0 Launcher update (multinode and GPU selection) and multiple bug fixes
Better multinode support in the launcher
The accelerate launch command did not work well for distributed training across several machines. This is fixed in this version.
- Use torchrun for multinode by @muellerzr in #631
- Fix multi-node issues from launch by @muellerzr in #672
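As a rough illustration of what a torchrun-based multinode launch involves, the sketch below assembles one torchrun command line per machine. The helper build_torchrun_command, the script name train.py, and the IP address are hypothetical and only serve the example; this is not Accelerate's internal code, which may build its arguments differently. The torchrun flags themselves (--nnodes, --node_rank, --nproc_per_node, --master_addr, --master_port) are standard.

```python
import shlex


def build_torchrun_command(script, num_machines, machine_rank,
                           procs_per_machine, main_ip, main_port=29500):
    """Hypothetical helper: return the torchrun argv for one machine."""
    return [
        "torchrun",
        f"--nnodes={num_machines}",               # total number of machines
        f"--node_rank={machine_rank}",            # index of this machine
        f"--nproc_per_node={procs_per_machine}",  # usually one process per GPU
        f"--master_addr={main_ip}",               # address of the rank-0 machine
        f"--master_port={main_port}",
        script,
    ]


# Every machine runs the same command, differing only in its node rank.
for rank in range(2):
    cmd = build_torchrun_command("train.py", 2, rank, 4, "10.0.0.1")
    print(shlex.join(cmd))
```

Each machine has to be able to reach the main machine's address and port for the processes to rendezvous.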
Launch training on specific GPUs only
Instead of prefixing your launch command with CUDA_VISIBLE_DEVICES=xxx, you can now specify the GPUs you want to use in your Accelerate config.
- Allow for GPU-ID specification on CLI by @muellerzr in #732
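For context, the environment-variable approach that this feature replaces can be sketched as follows. The helper env_with_gpus and the script name train.py are hypothetical; the sketch only shows how CUDA_VISIBLE_DEVICES restricts which GPUs a child process can see, and makes no claim about how Accelerate applies the configured GPU IDs internally.

```python
import os


def env_with_gpus(gpu_ids, base_env=None):
    """Hypothetical helper: copy the environment, exposing only `gpu_ids` to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


env = env_with_gpus([0, 2], base_env={"PATH": "/usr/bin"})
print(env["CUDA_VISIBLE_DEVICES"])

# The resulting mapping could then be passed to a child process, for example:
#   subprocess.run(["accelerate", "launch", "train.py"], env=env_with_gpus([0, 2]))
```

Inside the child process the selected devices are renumbered from zero, so physical GPUs 0 and 2 appear as cuda:0 and cuda:1.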
Better tracebacks and rich support
Tracebacks are now cleaned up so that the same error is no longer printed several times, and rich is integrated as an optional dependency.
- Integrate Rich into Accelerate by @muellerzr in #613
- Make rich an optional dep by @muellerzr in #673
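The optional-dependency pattern can be sketched roughly as below: rich is used for tracebacks only when it is installed, and the program falls back to the default tracebacks otherwise. The function names rich_available and enable_pretty_tracebacks are hypothetical and this is not Accelerate's actual implementation; rich.traceback.install is the documented rich entry point.

```python
import importlib.util


def rich_available():
    """Return True if the optional rich package can be imported."""
    return importlib.util.find_spec("rich") is not None


def enable_pretty_tracebacks():
    """Install rich's traceback handler when rich is present.

    Returns True if the handler was installed, False if rich is missing.
    """
    if not rich_available():
        return False
    # Imported lazily so that rich stays optional.
    from rich.traceback import install

    install(show_locals=False)
    return True


if __name__ == "__main__":
    print("rich tracebacks enabled:", enable_pretty_tracebacks())
```

Keeping the import inside the function means the module still loads on machines where rich is not installed.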
What's new?
- Fix typo in docs/index.mdx by @mishig25 in #610
- Fix DeepSpeed CI by @muellerzr in #612
- Added GANs example to examples by @EyalMichaeli in #619
- Fix example by @muellerzr in #620
- Update README.md by @ezhang7423 in #622
- Fully remove subprocess from the multi-gpu launcher by @muellerzr in #623
- M1 mps fixes by @pacman100 in #625
- Fix multi-node issues and simplify param logic by @muellerzr in #627
- update MPS support docs by @pacman100 in #629
- minor tracker fixes for complete* examples by @pacman100 in #630
- Put back in place the guard by @muellerzr in #634
- make init_trackers to launch on main process by @Gladiator07 in #642
- remove check for main process for trackers initialization by @Gladiator07 in #643
- fix link by @philschmid in #645
- Add static_graph arg to DistributedDataParallelKwargs. by @rom1504 in #637
- Small nits to grad accum docs by @muellerzr in #656
- Saving hyperparams in yaml file for Tensorboard for #521 by @Shreyz-max in #657
- Use debug for loggers by @muellerzr in #655
- Improve docstrings more by @muellerzr in #666
- accelerate bibtex by @pacman100 in #660
- Cache torch_tpu check by @muellerzr in #670
- Manim animation of big model inference by @muellerzr in #671
- Add aim tracker for accelerate by @muellerzr in #649
- Specify local network on multinode by @muellerzr in #674
- Test for min torch version + fix all issues by @muellerzr in #638
- deepspeed enhancements and fixes by @pacman100 in #676
- DeepSpeed launcher related changes by @pacman100 in #626
- adding torchrun elastic params by @pacman100 in #680
- 🐛 fix by @pacman100 in #683
- Fix skip in dispatch dataloaders by @sgugger in #682
- Clean up DispatchDataloader a bit more by @sgugger in #686
- rng state sync for FSDP by @pacman100 in #688
- Fix DataLoader with samplers that are batch samplers by @sgugger in #687
- fixing support for Apple Silicon GPU in notebook_launcher by @pacman100 in #695
- fixing rng sync when using custom sampler and batch_sampler by @pacman100 in #696
- Improve init_empty_weights to override tensor constructor by @thomasw21 in #699
- override DeepSpeed grad_acc_steps from accelerator obj by @pacman100 in #698
- [doc] Fix 404'd link in memory usage guides by @tomaarsen in #702
- Add in report generation for test failures and make fail-fast false by @muellerzr in #703
- Update runners with report structure, adjust env variable by @muellerzr in #704
- docs: examples readability improvements by @ryanrussell in #709
- docs: utils readability fixups by @ryanrussell in #711
- refactor(test_tracking): key_occurrence readability fixup by @ryanrussell in #710
- docs: hooks readability improvements by @ryanrussell in #712
- sagemaker fixes and improvements by @pacman100 in #708
- refactor(accelerate): readability improvements by @ryanrussell in #713
- More docstring nits by @muellerzr in #715
- Allow custom device placements for different objects by @sgugger in #716
- Specify gradients in model preparation by @muellerzr in #722
- Fix regression issue by @muellerzr in #724
- Fix default for num processes by @sgugger in #726
- Build and Release docker images on a release by @muellerzr in #725
- Make running tests more efficient by @muellerzr in #611
- Fix old naming by @muellerzr in #727
- Fix issue with one-cycle logic by @muellerzr in #728
- Remove auto-bug label in issue template by @sgugger in #735
- Add a tutorial on proper benchmarking by @muellerzr in #734
- Add an example zoo to the documentation by @muellerzr in #737
- trlx by @muellerzr in #738
- Fix memory leak by @muellerzr in #739
- Include examples for CI by @muellerzr in #740
- Auto grad accum example by @muellerzr in #742