init commit for testing memory streaming, honestly this is all vibe c… #601
…oded, will need to test and refine extensively for actual release
This pull request introduces a comprehensive streaming mode for large-scale EHR datasets, significantly reducing memory usage by loading patient data from disk on demand. The changes add streaming support to the dataset infrastructure, implement disk-backed patient-level caching with efficient indexing, and provide new APIs for memory-efficient iteration. Additionally, a benchmarking script is added to measure streaming performance and memory efficiency.
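To make the idea of a disk-backed, patient-level cache with an index concrete, here is a minimal sketch. It is not the code in this PR: the PR describes a Parquet cache built with Polars, while this illustration swaps in a single JSON-lines file plus a byte-offset index so that it runs with the standard library alone. The names `build_patient_cache` and `iter_patients_from_cache`, the file names, and the event layout are all hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def build_patient_cache(events: Iterable[dict], cache_dir: str) -> Dict[str, List[int]]:
    """Write one record per patient and return {patient_id: [offset, length]}."""
    # Assumption for brevity: events are grouped in memory here. A real
    # streaming build would sort/group on disk instead of holding everything.
    grouped: Dict[str, List[dict]] = {}
    for event in events:
        grouped.setdefault(event["patient_id"], []).append(event)

    os.makedirs(cache_dir, exist_ok=True)
    index: Dict[str, List[int]] = {}
    with open(os.path.join(cache_dir, "patients.jsonl"), "wb") as f:
        for patient_id, patient_events in grouped.items():
            record = json.dumps(patient_events).encode("utf-8") + b"\n"
            # Remember where this patient's record starts and how long it is.
            index[patient_id] = [f.tell(), len(record)]
            f.write(record)
    with open(os.path.join(cache_dir, "index.json"), "w") as f:
        json.dump(index, f)
    return index


def iter_patients_from_cache(
    cache_dir: str, patient_ids: Optional[Iterable[str]] = None
) -> Iterator[Tuple[str, List[dict]]]:
    """Yield (patient_id, events) one patient at a time, optionally filtered."""
    with open(os.path.join(cache_dir, "index.json")) as f:
        index = json.load(f)
    # Unknown patient IDs in the filter are skipped silently.
    ids = list(index) if patient_ids is None else [p for p in patient_ids if p in index]
    with open(os.path.join(cache_dir, "patients.jsonl"), "rb") as f:
        for patient_id in ids:
            offset, length = index[patient_id]
            f.seek(offset)  # jump straight to this patient's record
            yield patient_id, json.loads(f.read(length))


if __name__ == "__main__":
    demo_events = [
        {"patient_id": "p1", "code": "A"},
        {"patient_id": "p2", "code": "B"},
        {"patient_id": "p1", "code": "C"},
    ]
    with tempfile.TemporaryDirectory() as cache_dir:
        build_patient_cache(demo_events, cache_dir)
        for pid, patient_events in iter_patients_from_cache(cache_dir, ["p1"]):
            print(pid, [e["code"] for e in patient_events])
```

Only the index and the patient currently being yielded are held in memory during iteration, which is the property the streaming mode is after; a columnar format such as Parquet would add compression and typed columns on top of the same lookup pattern.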
Streaming Mode Infrastructure:
* Added `stream` and `cache_dir` parameters to dataset initialization, enabling disk-backed streaming mode for large datasets. Streaming mode uses a cache directory to store and retrieve patient data efficiently, keeping peak memory usage under 2 GB regardless of dataset size. [1] [2] [3]
* Added `_setup_streaming_cache` and `_build_patient_cache` methods in `BaseDataset` to create a patient-level Parquet cache and an index for fast lookups, using Polars' streaming execution for a minimal memory footprint.

Streaming Patient Iteration:
* Added an `iter_patients_streaming()` method to `BaseDataset`, allowing memory-efficient iteration over patients by loading one patient at a time from disk. Supports optional patient filtering and background preloading for reduced latency.
* Updated `iter_patients()` to raise an error if called in streaming mode, so that users switch to the streaming API.

API and Import Updates:
* Exposed `IterableSampleDataset` in the `pyhealth.datasets` module for compatibility with streaming workflows.

Benchmarking and Documentation:
* Added `examples/benchmark_streaming.py`, a script to benchmark streaming mode performance on the MIMIC-IV StageNet mortality prediction task, reporting memory usage, processing time, cache size, and speedup over normal mode.
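As a rough illustration of the kind of comparison such a benchmark makes, the sketch below measures peak Python heap usage and wall time for an eager pass versus a one-at-a-time streaming pass. It is not `examples/benchmark_streaming.py`: the data is synthetic rather than MIMIC-IV, `tracemalloc` is an assumed measurement tool, and `make_patient`, `process_eager`, `process_streaming`, and `measure` are hypothetical names.

```python
import time
import tracemalloc
from typing import Callable, List, Tuple


def make_patient(i: int) -> List[int]:
    """Stand-in for loading one patient's events (synthetic data)."""
    return list(range(i, i + 1000))


def process_eager(n: int) -> int:
    """Load every patient into memory first, then process."""
    patients = [make_patient(i) for i in range(n)]
    return sum(len(p) for p in patients)


def process_streaming(n: int) -> int:
    """Process patients one at a time; each is freed before the next loads."""
    return sum(len(p) for p in (make_patient(i) for i in range(n)))


def measure(fn: Callable[[int], int], n: int) -> Tuple[int, float, int]:
    """Return (result, elapsed seconds, peak traced bytes) for fn(n)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(n)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    for name, fn in [("eager", process_eager), ("streaming", process_streaming)]:
        result, elapsed, peak = measure(fn, 500)
        print(f"{name}: events={result} time={elapsed:.3f}s peak={peak / 1e6:.2f} MB")
```

`tracemalloc` only sees allocations made through Python's allocator, so for a real dataset backed by native libraries a process-level measure such as resident set size would likely be more representative.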