The shuffle_after_epoch parameter in fn.readers.numpy is independent of the seed. #5827

Open
acecchini opened this issue Feb 21, 2025 · 2 comments
@acecchini

Hi,

As expressed in issue #4319, which was closed without being resolved (I'm not sure why), I would kindly like to know whether it is possible to make shuffle_after_epoch in fn.readers.numpy depend on the seed parameter. Currently the dataset is shuffled differently after every epoch, but if I restart the same iterator with a different seed I still get the same shuffled datasets, in the same order, after every epoch.

fn.readers.numpy is, as far as I know, the only data reader that provides GPUDirect Storage support, which I think gives the NVIDIA DALI data loader an edge. I really think this is a major flaw in the design of the numpy reader; it would be great if you could solve this issue.
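
For illustration, here is a minimal sketch of the setup I am describing (the file_root path, batch size, and seeds are placeholders, not my actual configuration):

```python
from nvidia.dali import pipeline_def, fn


@pipeline_def(batch_size=32, num_threads=4, device_id=0)
def numpy_pipeline(reader_seed):
    # shuffle_after_epoch reshuffles the file list at every epoch boundary,
    # but the resulting sequence of permutations does not depend on the seed.
    data = fn.readers.numpy(
        device="gpu",             # GPUDirect Storage code path
        file_root="data_dir",     # placeholder directory of .npy files
        shuffle_after_epoch=True,
        seed=reader_seed,         # changing this does not change the epoch permutations
    )
    return data


# Two pipelines built with different seeds still replay the same
# per-epoch shuffles, in the same order.
pipe_a = numpy_pipeline(reader_seed=1234)
pipe_b = numpy_pipeline(reader_seed=5678)
```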

Thanks :)

@JanuszL
Contributor

JanuszL commented Feb 24, 2025

Hi @acecchini,

Thank you for reaching out.

> I really think this is a major flaw in the design of the numpy reader; it would be great if you could solve this issue.

This was a conscious design decision. The rationale is to ensure that the shuffling follows the same pattern across the DALI instances running on different GPUs, so that shards don't overlap. Another option we considered was asking the user to provide the same seed across all DALI pipeline instances.
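
To illustrate the rationale, here is a small, purely conceptual sketch (not DALI's internal implementation): shards stay disjoint only because every pipeline instance applies the same per-epoch permutation.

```python
import numpy as np

files = np.arange(12)      # stand-in for the dataset file list
num_shards = 2


def shard(perm, shard_id):
    # contiguous sharding of a shared permutation
    per_shard = len(perm) // num_shards
    return perm[shard_id * per_shard:(shard_id + 1) * per_shard]


# a shuffle that every instance computes identically, independently of any user seed
epoch_rng = np.random.default_rng(0)
perm = epoch_rng.permutation(files)

shard_0 = shard(perm, 0)
shard_1 = shard(perm, 1)
assert set(shard_0).isdisjoint(set(shard_1))  # holds because both GPUs use the same perm

# If each instance shuffled with its own, possibly different seed instead,
# the permutations could differ and the shards could overlap.
```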

> fn.readers.numpy is, as far as I know, the only data reader that provides GPUDirect Storage support

Can you tell us more about your use case? Why doesn't the default shuffling mode fit your needs?

@acecchini
Author

acecchini commented Feb 24, 2025

Hi @JanuszL,

Thank you for your prompt response.

> Another option we considered was asking the user to provide the same seed across all DALI pipeline instances.

I do believe that is a much better design choice. In fact, I use DALI together with JAX, and DALI's JAX data_iterator does exactly this: it passes the same seed to all the pipelines (but, of course, different shard_id and device_id values).
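
For reference, here is a sketch of that pattern written with the plain DALI API (the values are illustrative, and this is my understanding of what the JAX iterator does rather than its actual code):

```python
from nvidia.dali import pipeline_def, fn


@pipeline_def(batch_size=32, num_threads=4)
def sharded_pipeline(shard_id, num_shards):
    return fn.readers.numpy(
        device="gpu",
        file_root="data_dir",        # placeholder path
        shard_id=shard_id,
        num_shards=num_shards,
        shuffle_after_epoch=True,
    )


shared_seed = 1234                   # one seed shared by every instance
pipes = [
    sharded_pipeline(shard_id=i, num_shards=2, device_id=i, seed=shared_seed)
    for i in range(2)
]
```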

> Can you tell us more about your use case? Why doesn't the default shuffling mode fit your needs?

Well, first of all, from a theoretical standpoint, it violates some of the hypotheses about the sampling process. In machine learning, and in statistics more generally, we first design a theoretical model assuming that we can sample from the true distribution the data belongs to. In practice, however, we only have access to a finite dataset, which we assume is large enough to be in the law-of-large-numbers regime. We then associate a discrete uniform distribution with this dataset and construct an estimator of our loss by sampling from that distribution. In most use cases, because of how data loaders are designed, instead of sampling uniformly we sample without replacement from the dataset until exhaustion (which corresponds to one epoch) and repeat this for a sufficiently large number of cycles, constructing a new estimator of the loss from this sampling scheme. However, for this estimator to converge to the true loss, the sampling process has to actually be random; otherwise convergence is not theoretically guaranteed. Even if the shuffling is different for each epoch, the sequence of permutations and their order remain the same for every new training run, which violates the randomness assumption.
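
A toy illustration of that last point (plain NumPy, not DALI): if the epoch shuffle ignores the user-provided seed, every training run replays an identical sequence of permutations.

```python
import numpy as np


def epoch_permutations(user_seed, num_epochs=3, n=8):
    # stand-in for the reader's internal shuffle; note that it ignores user_seed
    internal_rng = np.random.default_rng(0)
    return [internal_rng.permutation(n) for _ in range(num_epochs)]


run_a = epoch_permutations(user_seed=1)
run_b = epoch_permutations(user_seed=2)
assert all((p_a == p_b).all() for p_a, p_b in zip(run_a, run_b))
# Both runs see the exact same data order D across all epochs, so the only
# remaining source of randomness in the trajectory is the weight initialization.
```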

From a more intuitive perspective, this means that during training you will always encounter the same data points in the same order, so the search space is necessarily restricted. Suppose the neural network is a function f of weights theta and data input x, and the loss L is a function of f(x, theta) and the data output y. Since the order D = ((x_i, y_i))_i is fixed (read here: a data sequence with multiple epochs concatenated), the sequence of gradients involved in the gradient descent is always a function of D and of the weights theta, the latter being the only source of randomness since it is sampled randomly for every new training run. That means the search space, and the local minima into which the model's weights can fall, is mechanically restricted. Gradients applied at the beginning of training do not have the same influence on the training dynamics as those applied at intermediate steps or towards the end. Furthermore, we generally apply a learning-rate schedule, which makes the training dynamics depend even more on the order of the gradient sequence.
