@jhnwu3 (Collaborator) commented Nov 7, 2025

…oded, will need to test and refine extensively for actual release
This pull request introduces a comprehensive streaming mode for large-scale EHR datasets, significantly reducing memory usage by loading patient data from disk on demand. The changes add streaming support to the dataset infrastructure, implement disk-backed patient-level caching with efficient indexing, and provide new APIs for memory-efficient iteration. Additionally, a benchmarking script is added to measure streaming performance and memory efficiency.

Streaming Mode Infrastructure:

  • Added stream and cache_dir parameters to dataset initialization, enabling disk-backed streaming mode for large datasets. Streaming mode uses a cache directory to store and retrieve patient data efficiently, keeping peak memory usage under 2 GB regardless of dataset size. [1] [2] [3]
  • Implemented _setup_streaming_cache and _build_patient_cache methods in BaseDataset to create a patient-level Parquet cache and an index for fast lookups, using Polars' streaming execution for minimal memory footprint.
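To illustrate the idea of a patient-level cache plus a lookup index, here is a minimal standard-library sketch. It is not the code in this PR: it swaps the Parquet cache and Polars streaming execution for a JSON-lines file with byte offsets, and the names `build_patient_cache`, `load_patient`, `patients.jsonl`, and `index.json` are hypothetical. The toy grouping step also happens in memory, which the real cache build is described as avoiding.

```python
import json
import tempfile
from pathlib import Path


def build_patient_cache(events, cache_dir):
    """Write one JSON line per patient and return an index.

    The index maps patient_id -> (byte offset, byte length) in the cache
    file. Hypothetical stand-in for a Parquet cache; not the PR's code.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Toy in-memory grouping; a streaming engine would do this out of core.
    grouped = {}
    for event in events:
        grouped.setdefault(event["patient_id"], []).append(event)

    index = {}
    with open(cache_dir / "patients.jsonl", "wb") as f:
        for patient_id, patient_events in grouped.items():
            line = (json.dumps(patient_events) + "\n").encode("utf-8")
            index[patient_id] = (f.tell(), len(line))
            f.write(line)

    # Persist the index so later runs can skip the rebuild.
    (cache_dir / "index.json").write_text(json.dumps(index))
    return index


def load_patient(cache_dir, index, patient_id):
    """Read a single patient's events by seeking to the indexed offset."""
    offset, length = index[patient_id]
    with open(Path(cache_dir) / "patients.jsonl", "rb") as f:
        f.seek(offset)
        return json.loads(f.read(length))


events = [
    {"patient_id": "p1", "code": "A"},
    {"patient_id": "p2", "code": "B"},
    {"patient_id": "p1", "code": "C"},
]

with tempfile.TemporaryDirectory() as cache_dir:
    index = build_patient_cache(events, cache_dir)
    p1_events = load_patient(cache_dir, index, "p1")
```

The point of the index is that a lookup touches only the bytes belonging to one patient, so the cost of loading a patient does not grow with the size of the whole cache.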

Streaming Patient Iteration:

  • Added iter_patients_streaming() method to BaseDataset, allowing memory-efficient iteration over patients by loading one patient at a time from disk. Supports optional patient filtering and background preloading for reduced latency.
  • Updated iter_patients() to raise an error when called in streaming mode, so that users rely on iter_patients_streaming() instead and keep memory usage low.
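The iteration behaviour described above could look roughly like the sketch below. It is an assumption-laden illustration and not the BaseDataset implementation: the class `StreamingDatasetSketch`, the parameters `patient_ids` and `preload`, and the choice of `RuntimeError` are all hypothetical, and an in-memory dict stands in for the on-disk cache. Background preloading is modelled with a bounded queue filled by a worker thread.

```python
import queue
import threading


class StreamingDatasetSketch:
    """Hypothetical stand-in for a dataset with a streaming mode."""

    def __init__(self, patients, stream=True):
        # `patients` maps patient_id -> events and stands in for the disk cache.
        self._patients = patients
        self.stream = stream

    def _load_patient(self, patient_id):
        # In a real streaming dataset this would read one patient from disk.
        return {"patient_id": patient_id, "events": self._patients[patient_id]}

    def iter_patients(self):
        if self.stream:
            # Assumed error type; the PR only says that an error is raised.
            raise RuntimeError(
                "iter_patients() is unavailable in streaming mode; "
                "use iter_patients_streaming()"
            )
        return [self._load_patient(pid) for pid in self._patients]

    def iter_patients_streaming(self, patient_ids=None, preload=2):
        """Yield patients one at a time, preloading up to `preload` ahead."""
        ids = list(self._patients) if patient_ids is None else list(patient_ids)
        buffer = queue.Queue(maxsize=preload)
        done = object()  # sentinel marking the end of the stream

        def producer():
            try:
                for pid in ids:
                    buffer.put(self._load_patient(pid))
            except Exception as exc:
                buffer.put(exc)  # hand load errors to the consumer
            else:
                buffer.put(done)

        # Daemon thread: if the consumer stops early, the worker may stay
        # blocked on put() but will not keep the process alive.
        threading.Thread(target=producer, daemon=True).start()

        while True:
            item = buffer.get()
            if item is done:
                return
            if isinstance(item, Exception):
                raise item
            yield item


dataset = StreamingDatasetSketch({"p1": [1], "p2": [2], "p3": [3]})
all_ids = [p["patient_id"] for p in dataset.iter_patients_streaming()]
some_ids = [
    p["patient_id"]
    for p in dataset.iter_patients_streaming(patient_ids=["p3", "p1"])
]
```

The bounded queue is what keeps memory flat in this sketch: the worker can run at most `preload` patients ahead of the consumer before it blocks.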

API and Import Updates:

  • Exposed IterableSampleDataset in the pyhealth.datasets module for compatibility with streaming workflows.

Benchmarking and Documentation:

  • Added examples/benchmark_streaming.py, a script to benchmark streaming mode performance on the MIMIC-IV StageNet mortality prediction task, reporting memory usage, processing time, cache size, and speedup over normal mode.
  • Provided extensive docstrings and usage examples for new streaming APIs, including error handling and recommendations for best practices. [1] [2] [3]
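As a rough idea of how such a comparison can be measured, the sketch below times an eager and a streaming pass over synthetic records and records peak Python-level allocations with `tracemalloc`. It is not `examples/benchmark_streaming.py`: the helper names and the synthetic byte-array records are hypothetical, no MIMIC-IV data is involved, and `tracemalloc` only tracks allocations made through Python's allocator, so it will differ from process-level memory figures.

```python
import time
import tracemalloc


def measure(fn):
    """Run fn() and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


def make_records(n, size=100_000):
    """Yield n synthetic records, each a zero-filled byte array."""
    for _ in range(n):
        yield bytearray(size)


def process_eager(n):
    # Materialize every record first, as a non-streaming mode would.
    records = list(make_records(n))
    return sum(len(r) for r in records)


def process_streaming(n):
    # Consume records one at a time; earlier records can be freed.
    return sum(len(r) for r in make_records(n))


total_eager, time_eager, peak_eager = measure(lambda: process_eager(50))
total_stream, time_stream, peak_stream = measure(lambda: process_streaming(50))

print(f"eager:     {time_eager:.4f}s, peak {peak_eager / 1e6:.2f} MB")
print(f"streaming: {time_stream:.4f}s, peak {peak_stream / 1e6:.2f} MB")
```

Timings from a toy like this say little about real workloads; the useful part is the measurement pattern, which applies unchanged to any two callables being compared.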
