added option for use_consecutive_dataloader to speed up IO #69
abhinadduri wants to merge 11 commits into main
Summary of Changes (from Gemini Code Assist)
This pull request introduces a substantial optimization for data loading.
Code Review
This pull request introduces a new use_consecutive_loading option to optimize IO by loading consecutive cells, which is a valuable addition for performance. The implementation is well-integrated across the PerturbationDataModule, PerturbationBatchSampler, and both BatchMappingStrategy and RandomMappingStrategy. The logic for handling consecutive loading, including filling partial sentences and sequentially assigning control cells, appears sound and correctly implemented. Backward compatibility for loading previous states is also handled appropriately. Overall, this is a solid and well-thought-out feature.
d0fce82 to 1ac3c96
…n, and with persistent workers, for 12x speedup
… the number of validation batches used
this adds a new kwarg to load consecutive cells for faster IO when cells_per_set > 1
requires that the files are sorted by (context, perturbation), i.e., cells with the same context and the same perturbation group appear consecutively in the file
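
The ordering requirement above can be sanity-checked before enabling the option. The snippet below is only a sketch: `is_grouped_consecutively` is a hypothetical helper, not part of this PR or the state repository, and it assumes the per-cell context and perturbation labels have already been read into two equal-length sequences.

```python
from itertools import groupby


def is_grouped_consecutively(contexts, perturbations):
    """Return True if every (context, perturbation) pair forms one contiguous run.

    Hypothetical helper for illustration only; the inputs are assumed to be
    per-cell labels listed in file order.
    """
    seen = set()
    # groupby yields one key per run of identical adjacent (context, perturbation) pairs
    for key, _ in groupby(zip(contexts, perturbations)):
        if key in seen:
            # The same pair started a second run, so its cells are not consecutive
            return False
        seen.add(key)
    return True


# Cells of each (context, perturbation) group sit next to each other
grouped_ok = is_grouped_consecutively(
    ["ctxA", "ctxA", "ctxA", "ctxB"],
    ["ctrl", "ctrl", "pert1", "ctrl"],
)

# ("ctxA", "ctrl") appears in two separate runs
grouped_bad = is_grouped_consecutively(
    ["ctxA", "ctxA", "ctxA"],
    ["ctrl", "pert1", "ctrl"],
)
```

A file that fails this check would need to be re-sorted before consecutive loading is turned on.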
to achieve this, use the state tx sort utility from the state repository: