Performance considerations for Dataset.iter #7511
-
Other frameworks such as PyTorch let you specify the number of workers for a DataLoader in order to preload batches and keep GPU utilization high. Does anyone have experience with how this works with Hugging Face Datasets and the `iter` method? I am currently using Hugging Face Datasets with JAX output and see alternating GPU utilization, with a lot of time spent accessing memory even when I load the dataset completely into memory. I suspect data loading may be one issue at play here. There is a similar issue from two years ago, #6341, but I am curious whether anything has changed since then.
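For context, here is a minimal sketch of the kind of loop described above, using `Dataset.iter()` with the JAX format. The dataset name and batch size are placeholders, not from the original post:

```python
from datasets import load_dataset

# Placeholder dataset; any image/label dataset would do.
ds = load_dataset("mnist", split="train")
ds = ds.with_format("jax")  # batches come back as jax.numpy arrays

for batch in ds.iter(batch_size=256):
    images = batch["image"]  # already a JAX array on the default device
    # ... training step here ...
```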
Replies: 2 comments
-
Just for reference: it seems that using the JAX format for the loader interferes with JAX's async dispatch. Using the NumPy loader instead resulted in almost twice the iteration speed and much less wait time for the GPU.
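A hedged sketch of that workaround: keep the loader in NumPy format and convert the batches to JAX arrays yourself inside the loop, so the jitted step can dispatch asynchronously while the next batch is being prepared. The dataset name, batch size, and `train_step` are placeholders, not from the original post:

```python
import jax
import jax.numpy as jnp
from datasets import load_dataset

# NumPy format avoids the per-batch JAX conversion in the loader itself.
ds = load_dataset("mnist", split="train").with_format("numpy")

@jax.jit
def train_step(params, batch):
    # Dummy step; replace with your real loss/update.
    return params

params = {}
for batch in ds.iter(batch_size=256):
    # Convert on the host just before the step; the jitted call returns
    # immediately (async dispatch) while the next batch is loaded.
    batch = {k: jnp.asarray(v) for k, v in batch.items()}
    params = train_step(params, batch)
```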
-
Thank you. I had the same problem, and thanks to your answer it is now solved.