The `shuffle_after_epoch` parameter in `fn.readers.numpy` is independent of the `seed` #5827
Hi @acecchini, thank you for reaching out.

This was a conscious design decision. The rationale is to ensure that the shuffling follows the same pattern across the DALI instances running on different GPUs, so that shards don't overlap. Another option we considered was to ask the user to provide the same seed across all DALI pipeline instances.

Can you tell us more about your use case? Why doesn't the default shuffling mode fit your needs?
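A minimal sketch of the sharded setup this design targets (the `file_root` path and GPU count are illustrative, not from the thread): because every pipeline instance reshuffles the global index list with the same pattern, each sample stays assigned to the same shard across epochs.

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def


@pipeline_def(batch_size=32, num_threads=4)
def sharded_pipe(shard_id, num_shards):
    # Each GPU reads a disjoint shard; shuffle_after_epoch reshuffles the
    # global index list identically in every instance, so shards never overlap.
    return fn.readers.numpy(
        file_root="/data/arrays",  # illustrative path to a directory of .npy files
        shard_id=shard_id,
        num_shards=num_shards,
        shuffle_after_epoch=True,
    )


num_gpus = 2  # illustrative
pipes = [
    sharded_pipe(device_id=gpu, shard_id=gpu, num_shards=num_gpus)
    for gpu in range(num_gpus)
]
for p in pipes:
    p.build()
```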
Hi @JanuszL, thank you for your prompt response.

I do believe that is a much better design choice. In fact, I use DALI together with JAX and the DALI JAX's
Well, first of all, from a theoretical standpoint, it violates some hypotheses about the sampling process. In machine learning, and statistics more generally, we first design a theoretical model assuming that we can sample from the true distribution to which the data belongs. In practice, however, we only have access to a finite dataset, assumed sufficiently large for the law of large numbers to apply. We then associate a discrete uniform distribution with this dataset and construct an estimator of our loss by sampling from that distribution.

In most use cases, because of how data loaders are designed, instead of sampling uniformly we sample without replacement from the dataset until exhaustion (which corresponds to one epoch) and repeat this for a sufficiently large number of cycles. We then construct a new estimator of our loss with this new distribution. However, in order for our estimator to converge to the true loss, we assume that the sampling process is actually random; otherwise the estimator is not theoretically guaranteed to converge. Even if the shuffling is different for each epoch, the sequence of permutations and their order will remain the same for every new training run, which violates the true-randomness assumption. From a more intuitive perspective, that means during training you will always encounter the same data points in the same order, and therefore the search space will necessarily be restricted. Suppose the neural network is a function
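To make the convergence argument concrete, here is a formal sketch (my notation, not from the original comment): the multi-epoch empirical loss can be written as

$$
\hat{L} = \frac{1}{E N} \sum_{e=1}^{E} \sum_{i=1}^{N} \ell\!\left(f_\theta\!\left(x_{\pi_e(i)}\right)\right),
$$

where $N$ is the dataset size, $E$ the number of epochs, and $\pi_e$ the permutation used in epoch $e$. If the sequence $(\pi_1, \dots, \pi_E)$ is fixed independently of the user-supplied `seed`, then every training run traverses the identical ordered stream $x_{\pi_1(1)}, x_{\pi_1(2)}, \dots$, so the randomness assumption behind the estimator's convergence holds only with respect to that single frozen ordering.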
Hi,

As expressed in issue #4319, and since it hasn't been solved yet (I am not sure why that issue was closed), I kindly would like to know whether it is possible to make `shuffle_after_epoch` in `fn.readers.numpy` dependent on the `seed` parameter. Currently the dataset is shuffled in a different way after every epoch, but if I restart the same iterator with a different seed I will still get the same shuffled datasets after every epoch, in the same order.

`fn.readers.numpy` is the only data reader (as far as I know) which provides GPUDirect Storage support, which I do think gives the NVIDIA DALI data loader an edge. I really think this is a major flaw in the design of the numpy reader; it would be great if you could solve this issue. Thanks :)
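A minimal sketch reproducing the reported behavior (the `file_root` path, the `epoch_order` helper, and the sample counts are placeholders I introduced, assuming a small directory of `.npy` files):

```python
import numpy as np
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def


@pipeline_def(batch_size=1, num_threads=2, device_id=0)
def reader_pipe(shuffle_seed):
    # shuffle_after_epoch reshuffles between epochs, but per this report the
    # resulting pattern does not depend on the reader's `seed` argument.
    return fn.readers.numpy(
        file_root="/data/arrays",  # placeholder directory of .npy files
        shuffle_after_epoch=True,
        seed=shuffle_seed,
    )


def epoch_order(shuffle_seed, epochs=2, samples_per_epoch=4):
    # Collect the sample order seen in each epoch.
    pipe = reader_pipe(shuffle_seed=shuffle_seed)
    pipe.build()
    orders = []
    for _ in range(epochs):
        epoch = [np.array(pipe.run()[0][0]).tolist() for _ in range(samples_per_epoch)]
        orders.append(epoch)
    return orders


# Despite different seeds, the per-epoch orderings come out identical,
# which is the behavior this issue describes.
print(epoch_order(shuffle_seed=0) == epoch_order(shuffle_seed=123))
```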