init commit for testing memory streaming, honestly this is all vibe c… #601

jhnwu3 · 2025-11-07T17:59:10Z

…oded, will need to test and refine extensively for actual release
This pull request introduces a comprehensive streaming mode for large-scale EHR datasets, significantly reducing memory usage by loading patient data from disk on demand. The changes add streaming support to the dataset infrastructure, implement disk-backed patient-level caching with efficient indexing, and provide new APIs for memory-efficient iteration. Additionally, a benchmarking script is added to measure streaming performance and memory efficiency.

Streaming Mode Infrastructure:

Added stream and cache_dir parameters to dataset initialization, enabling disk-backed streaming mode for large datasets. Streaming mode uses a cache directory to store and retrieve patient data efficiently, keeping peak memory usage under 2GB regardless of dataset size. [1] [2] [3]
Implemented _setup_streaming_cache and _build_patient_cache methods in BaseDataset to create a patient-level Parquet cache and an index for fast lookups, using Polars' streaming execution for minimal memory footprint.
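
To make the cache-plus-index idea concrete, here is a minimal standard-library sketch. It is not the PR's implementation: the PR describes a Parquet cache built with Polars, whereas this stand-in writes one JSON line per patient and records byte offsets. The names `build_patient_cache`, `load_patient`, `patients.jsonl`, and `index.json` are hypothetical.

```python
import json
import os
import tempfile
from itertools import groupby
from operator import itemgetter


def build_patient_cache(events, cache_dir):
    """Write one JSON line per patient; return {patient_id: (offset, length)}."""
    os.makedirs(cache_dir, exist_ok=True)
    data_path = os.path.join(cache_dir, "patients.jsonl")  # hypothetical file name
    index = {}
    # Sorting makes each patient's events contiguous. This sketch sorts in
    # memory; a real implementation would need an out-of-core sort.
    ordered = sorted(events, key=itemgetter("patient_id"))
    with open(data_path, "wb") as f:
        for pid, group in groupby(ordered, key=itemgetter("patient_id")):
            payload = json.dumps(list(group)).encode("utf-8") + b"\n"
            index[pid] = (f.tell(), len(payload))  # where this patient lives on disk
            f.write(payload)
    # Persist the index so later runs can skip the rebuild.
    with open(os.path.join(cache_dir, "index.json"), "w") as f:
        json.dump(index, f)
    return index


def load_patient(cache_dir, index, patient_id):
    """Read a single patient's events without touching the rest of the file."""
    offset, length = index[patient_id]
    with open(os.path.join(cache_dir, "patients.jsonl"), "rb") as f:
        f.seek(offset)
        return json.loads(f.read(length))


if __name__ == "__main__":
    demo_events = [
        {"patient_id": "p2", "code": "A"},
        {"patient_id": "p1", "code": "B"},
        {"patient_id": "p2", "code": "C"},
    ]
    demo_dir = tempfile.mkdtemp()
    demo_index = build_patient_cache(demo_events, demo_dir)
    print(load_patient(demo_dir, demo_index, "p2"))
```

The point of the index is that a lookup costs one seek and one bounded read, so memory use depends on the size of a single patient rather than on the size of the dataset.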

Streaming Patient Iteration:

Added iter_patients_streaming() method to BaseDataset, allowing memory-efficient iteration over patients by loading one patient at a time from disk. Supports optional patient filtering and background preloading for reduced latency.
Updated iter_patients() to raise an error if called in streaming mode, ensuring users use the correct API for memory efficiency.
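
The iteration behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `TinyStreamingDataset`, its constructor arguments, and the `preload` parameter are hypothetical, and the background preloading is modelled with a bounded queue fed by a worker thread.

```python
import queue
import threading


class TinyStreamingDataset:
    """Hypothetical stand-in for a dataset with a streaming flag."""

    def __init__(self, loader, patient_ids, stream=False):
        self._loader = loader  # callable: patient_id -> patient record
        self._patient_ids = list(patient_ids)
        self.stream = stream

    def iter_patients(self):
        # In streaming mode the eager API is refused so nobody loads
        # every patient into memory by accident.
        if self.stream:
            raise RuntimeError(
                "iter_patients() is unavailable in streaming mode; "
                "use iter_patients_streaming() instead."
            )
        return iter([self._loader(pid) for pid in self._patient_ids])

    def iter_patients_streaming(self, patient_ids=None, preload=2):
        keep = None if patient_ids is None else set(patient_ids)
        ids = [p for p in self._patient_ids if keep is None or p in keep]
        buffer = queue.Queue(maxsize=preload)  # bounds how far ahead we load
        done = object()  # sentinel marking the end of the stream

        def producer():
            try:
                for pid in ids:
                    buffer.put(self._loader(pid))  # blocks when the buffer is full
            finally:
                # Always signal the end, even if a load fails, so the
                # consumer never waits forever. A fuller version would
                # also forward the exception.
                buffer.put(done)

        threading.Thread(target=producer, daemon=True).start()
        while True:
            item = buffer.get()
            if item is done:
                return
            yield item


if __name__ == "__main__":
    store = {"p1": {"id": "p1"}, "p2": {"id": "p2"}, "p3": {"id": "p3"}}
    ds = TinyStreamingDataset(store.__getitem__, store, stream=True)
    for patient in ds.iter_patients_streaming(patient_ids=["p1", "p3"]):
        print(patient["id"])
```

With `preload=2`, at most two loaded patients wait in the queue while the consumer works on the current one, which is one way to hide disk latency without giving up the memory bound.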

API and Import Updates:

Exposed IterableSampleDataset in the pyhealth.datasets module for compatibility with streaming workflows.

Benchmarking and Documentation:

Added examples/benchmark_streaming.py, a script to benchmark streaming mode performance on the MIMIC-IV StageNet mortality prediction task, reporting memory usage, processing time, cache size, and speedup over normal mode.
Provided extensive docstrings and usage examples for new streaming APIs, including error handling and recommendations for best practices. [1] [2] [3]
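
For readers who want a feel for what such a benchmark measures, the following is a small standard-library sketch, not the contents of examples/benchmark_streaming.py. The helper `measure` and the two toy workloads are hypothetical, and `tracemalloc` only tracks allocations made through Python's allocator, so the numbers are not comparable to process-level memory figures.

```python
import time
import tracemalloc


def measure(fn):
    """Run fn() and return (result, elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


def eager_sum(n):
    # Materialises every value first, like loading a whole dataset.
    return sum(list(range(n)))


def streaming_sum(n):
    # Consumes one value at a time, like streaming patients from disk.
    return sum(range(n))


if __name__ == "__main__":
    n = 200_000
    _, eager_time, eager_peak = measure(lambda: eager_sum(n))
    _, stream_time, stream_peak = measure(lambda: streaming_sum(n))
    print(f"eager:     {eager_time:.4f}s, peak {eager_peak / 1e6:.2f} MB")
    print(f"streaming: {stream_time:.4f}s, peak {stream_peak / 1e6:.2f} MB")
```

Timing and peak memory are reported side by side because streaming typically trades some speed for a lower memory ceiling, and a benchmark should make that trade visible.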

jhnwu3 added 13 commits November 7, 2025 11:40

init commit for testing memory streaming, honestly this is all vibe c…

fbff5e9

…oded, will need to test and refine extensively for actual release

more refactors

7cc6842

more changes to make it work

e24a6e2

testing if this passes CI

dbd9c08

commit big update

49dee08

Merge branch 'master' into add/iterable_dataset_memory_management

216ffa8

more changes to api to ensure logical consistency here

94f17e5

minor updates to test training with iterable dataset

4562532

merge conflicts

1470076

new commits to fix bugs with training batch-wise with IterableSampleD…

38ec291

…ataset

Merge branch 'master' into add/iterable_dataset_memory_management

73e34a1

fix for mockdataset not really having a real streamA

b53acbf

more bug fixes and optimizations

78c9fa8

jhnwu3 requested review from LogicFan and zzachw November 18, 2025 18:04

LogicFan mentioned this pull request Nov 19, 2025

[Memory] Fix large memory usage during __init__ call. #620

Merged
