Conversation

@jsleep (Contributor) commented Oct 11, 2021

No description provided.

@jsleep changed the title from "backend fix and documentation changes" to "ddp backend fix and documentation changes" on Oct 11, 2021
gather_frequency = n_samples

gathered = []
n_chunks = n_samples // self.gather_frequency + 1
Collaborator
@krishansubudhi For my understanding, why did we have to chunk before? I assumed it was to avoid exceeding the GPU memory limit, but it looks like we only move tensors to the GPU in this loop and never out of it.

Contributor Author (@jsleep)

Adding @aminsaied, who also initially created the DDP Trainer backend and chunking logic.

Contributor

The loop first moves the tensors to the GPU, then does the all-gather op, then moves the gathered tensors back to the CPU. I believe this was at the request of @gshruti95 at the time, for a specific workload that was being tested (keep me honest, Shruti).
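
For context, a minimal sketch of that pattern (illustrative only, not the actual Trainer code; the function name and arguments are assumed, and it presumes every rank holds the same number of samples so corresponding chunk sizes match across ranks):

import torch
import torch.distributed as dist

def all_gather_chunks(flat_cpu_tensor, n_chunks, device):
    # Illustrative sketch: move each CPU chunk to the GPU, all-gather it
    # across ranks, then move the gathered result back to the CPU.
    world_size = dist.get_world_size()
    gathered = []
    for chunk in flat_cpu_tensor.chunk(n_chunks):
        chunk_gpu = chunk.to(device)
        buckets = [torch.empty_like(chunk_gpu) for _ in range(world_size)]
        dist.all_gather(buckets, chunk_gpu)
        gathered.append(torch.cat(buckets).cpu())
    return torch.cat(gathered)

Each iteration pays a host-to-device copy, a collective, and a device-to-host copy, which is why the chunk count matters for speed.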

@gshruti95 (Contributor) commented Oct 20, 2021

We decided to introduce chunking to guard against potential memory or timeout issues when all-gathering for pretraining workloads.

Contributor Author (@jsleep)

I believe this logic will have to change back then: chunking needs to be implemented correctly rather than hard-coded to 1. It can stay that way for the time being, but I expect it will be slow.
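
As a hypothetical illustration of what a correct implementation could look like (not the code in this PR; gather_frequency is assumed here to be a configurable chunk size in samples), the chunk count could be derived by ceiling division instead of being hard-coded:

import math
from typing import Optional

def num_chunks(n_samples: int, gather_frequency: Optional[int]) -> int:
    # Hypothetical helper: fall back to a single all-gather when no
    # frequency is configured or it already covers all samples.
    if not gather_frequency or gather_frequency >= n_samples:
        return 1
    return math.ceil(n_samples / gather_frequency)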

Development

Successfully merging this pull request may close these issues.

DDP validation: All gather for flattened 1D tensors taking long time to complete
