Performance considerations for Dataset.iter #7511
-
Other frameworks such as PyTorch let you specify the number of workers for a DataLoader in order to preload batches and keep GPU utilization high. Does anyone have experience with how this works with Hugging Face Datasets and the `iter` method? I am currently using Hugging Face Datasets with JAX output and see alternating GPU utilization, with a lot of time spent accessing memory even when I load the dataset completely into memory. I suspect data loading may be one issue at play here. There is a similar issue from two years ago, #6341, but I am curious whether anything has changed since then.
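For context, here is a minimal sketch of the kind of loop described above, using `Dataset.iter()` with the JAX format. The dataset name and batch size are placeholders, not from the original post:

```python
from datasets import load_dataset

# Placeholder dataset; any image/label dataset would do.
ds = load_dataset("mnist", split="train")
ds = ds.with_format("jax")  # batches come back as jax.numpy arrays

for batch in ds.iter(batch_size=256):
    images = batch["image"]  # already a JAX array on the default device
    # ... training step here ...
```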
Replies: 2 comments
-
Just for reference: it seems that using the JAX format for the loader interferes with JAX's async dispatch. Using the NumPy loader instead resulted in almost twice the iteration speed and much less wait time for the GPU.
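A hedged sketch of that workaround: keep the loader in NumPy format and convert the batches to JAX arrays yourself inside the loop, so the jitted step can dispatch asynchronously while the next batch is being prepared. The dataset name, batch size, and `train_step` are placeholders, not from the original post:

```python
import jax
import jax.numpy as jnp
from datasets import load_dataset

# NumPy format avoids the per-batch JAX conversion in the loader itself.
ds = load_dataset("mnist", split="train").with_format("numpy")

@jax.jit
def train_step(params, batch):
    # Dummy step; replace with your real loss/update.
    return params

params = {}
for batch in ds.iter(batch_size=256):
    # Convert on the host just before the step; the jitted call returns
    # immediately (async dispatch) while the next batch is loaded.
    batch = {k: jnp.asarray(v) for k, v in batch.items()}
    params = train_step(params, batch)
```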
-
Thank you. I had the same problem, and thanks to your answer it is now solved.