Enable use of IterableDataset when training with DDP #681
Comments
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed, please comment on this thread. Thank you!
Same issue: DDP breaks when using an IterableDataset instead of a regular Dataset and passing it to the NeuronTrainer.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.
Feature request
Enable the use of IterableDataset when training with NeuronTrainer and DDP, or is there a design limitation that prevents this?
I can't share the project code, but a simpler case below reproduces the same issue: DistributedSampler expects a dataset with a known length, which an IterableDataset doesn't have by design.
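A minimal sketch of the underlying incompatibility (the dataset name here is a placeholder, not from the original report): DistributedSampler computes its per-rank shard size from len(dataset) in its constructor, so it fails immediately on an IterableDataset.

```python
from torch.utils.data.distributed import DistributedSampler
from datasets import load_dataset

# streaming=True returns an IterableDataset, which has no __len__ by design.
stream_ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

# DistributedSampler calls len(dataset) in its constructor to split indices
# across ranks, so this raises TypeError for any dataset without a length.
try:
    DistributedSampler(stream_ds, num_replicas=2, rank=0)
except TypeError as err:
    print(f"DistributedSampler rejected the IterableDataset: {err}")
```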
Setup
OS: Ubuntu 22.04.4 LTS (kernel 6.5.0-1023-aws)
apt packages
pip packages
Command:
torchrun --nproc_per_node=2 issue.py
Code (issue.py)
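The original issue.py is collapsed in this view; a hypothetical reconstruction of the kind of script that triggers the failure (model, dataset, and preprocessing are illustrative guesses, not the reporter's code) might look like:

```python
# issue.py (hypothetical reconstruction): train a small causal LM on a
# streaming corpus with NeuronTrainer under DDP via torchrun.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)
from optimum.neuron import NeuronTrainer, NeuronTrainingArguments

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# streaming=True yields an IterableDataset with no __len__.
ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

ds = ds.map(tokenize, batched=True,
            remove_columns=["text", "timestamp", "url"])

args = NeuronTrainingArguments(output_dir="out",
                               per_device_train_batch_size=2,
                               max_steps=10)

# Under DDP the trainer wraps the train dataset in a DistributedSampler,
# which needs len(dataset) and therefore fails on the IterableDataset.
trainer = NeuronTrainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```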
Issue
Motivation
I have a project for distributed training on Trainium with DDP that requires HuggingFace's IterableDataset (returned when streaming=True is passed to load_dataset() in load.py of the datasets==2.19.0 package).
Your contribution
N/A. However, I noticed that on Nvidia A100 GPUs the transformers Trainer uses accelerate.data_loader.DataLoaderDispatcher and does not rely on DistributedSampler.
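For comparison, here is a sketch of the sampler-free pattern that works for streaming data, using datasets.distributed.split_dataset_by_node (available since datasets 2.8). This is a workaround sketch under those assumptions, not optimum-neuron's implementation:

```python
import os
from torch.utils.data import DataLoader
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

# torchrun exports RANK and WORLD_SIZE for each DDP worker.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

stream_ds = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Shard the stream per rank instead of using a length-based sampler:
# each worker reads a disjoint subset of shards (or a strided slice).
rank_ds = split_dataset_by_node(stream_ds, rank=rank, world_size=world_size)

# A plain DataLoader works here because no DistributedSampler is involved.
loader = DataLoader(rank_ds, batch_size=8)
```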