@rjzamora
Member

Description

Updates cudf-polars row-group sampling to improve the ParquetSourceInfo.storage_size estimate used to make partitioning decisions.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora self-assigned this Nov 10, 2025
@rjzamora rjzamora added the labels 2 - In Progress (Currently a work in progress), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) on Nov 10, 2025
@copy-pr-bot

copy-pr-bot bot commented Nov 10, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the labels Python (Affects Python cuDF API) and cudf-polars (Issues specific to cudf-polars) on Nov 10, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Nov 10, 2025
@rjzamora rjzamora marked this pull request as ready for review November 12, 2025 13:29
@rjzamora rjzamora requested a review from a team as a code owner November 12, 2025 13:29
@rjzamora rjzamora requested a review from wence- November 12, 2025 13:29
@rjzamora rjzamora requested a review from a team as a code owner November 12, 2025 13:29
@rjzamora rjzamora requested a review from Matt711 November 12, 2025 13:29
@rjzamora rjzamora added the label 3 - Ready for Review (Ready for review by team) and removed the label 2 - In Progress (Currently a work in progress) on Nov 12, 2025
@Matt711
Contributor

Matt711 commented Nov 12, 2025

For example, when I use --blocksize 1_000_000_000 for query 1 of PDS-H, the in-memory size of each partition is actually ~3_000_000_000 bytes (3x larger than the intended partition size). This difference was even larger (3.7x) before #20193 was merged. That PR set a minimum bound on the storage-size estimate. That change improved OOM stability a bit, but the ParquetSourceInfo.storage_size estimate is still very low for query 1 of PDS-H.

How do things look after this PR?
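To make the arithmetic in the comment above concrete, here is a minimal sketch of how an underestimated size turns into oversized partitions. It assumes a simplified planner that divides the estimated size by the blocksize; the function names and the 30 GB table size are hypothetical and are not taken from cudf-polars.

```python
import math


def planned_partition_count(estimated_bytes: int, blocksize: int) -> int:
    """Partitions a simplified planner would create if it trusted the estimate."""
    return max(1, math.ceil(estimated_bytes / blocksize))


def actual_bytes_per_partition(
    true_bytes: int, estimated_bytes: int, blocksize: int
) -> float:
    """In-memory bytes each partition really holds, given the planned count."""
    return true_bytes / planned_partition_count(estimated_bytes, blocksize)


blocksize = 1_000_000_000          # the --blocksize value quoted above
true_bytes = 30_000_000_000        # hypothetical true in-memory size of the table
estimated_bytes = true_bytes // 3  # hypothetical estimate that is 3x too low

count = planned_partition_count(estimated_bytes, blocksize)
per_partition = actual_bytes_per_partition(true_bytes, estimated_bytes, blocksize)
print(count, per_partition)
```

Under these assumptions the planner creates 10 partitions instead of 30, so each one holds about 3_000_000_000 bytes, matching the 3x overshoot described above.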

@rjzamora
Member Author

> How do things look after this PR?

This makes the partition size very accurate, because we sample a real row-group.
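As a rough illustration of the idea of sampling a real row-group, the sketch below scales the measured in-memory size of one decoded row-group by the total row count. This is not the cudf-polars implementation: `RowGroup`, `estimate_in_memory_bytes`, and the `measure_decoded_bytes` callback are hypothetical names, and the 40 bytes per row figure is made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class RowGroup:
    """Hypothetical stand-in for one Parquet row-group's metadata."""

    num_rows: int
    stored_bytes: int  # size reported by file metadata (can be small when dictionary-encoded)


def estimate_in_memory_bytes(
    row_groups: Sequence[RowGroup],
    measure_decoded_bytes: Callable[[RowGroup], int],
    sample_index: int = 0,
) -> int:
    """Estimate total in-memory size by decoding one row-group and scaling by rows."""
    if not row_groups:
        return 0
    sample = row_groups[sample_index]
    if sample.num_rows == 0:
        # Nothing to measure; fall back to the metadata-based size.
        return sum(rg.stored_bytes for rg in row_groups)
    bytes_per_row = measure_decoded_bytes(sample) / sample.num_rows
    total_rows = sum(rg.num_rows for rg in row_groups)
    return round(bytes_per_row * total_rows)


# Hypothetical file: metadata reports 10 bytes per row, but decoded rows take 40 bytes.
row_groups = [RowGroup(100, 1_000), RowGroup(100, 1_000), RowGroup(50, 500)]
metadata_estimate = sum(rg.stored_bytes for rg in row_groups)
sampled_estimate = estimate_in_memory_bytes(row_groups, lambda rg: rg.num_rows * 40)
print(metadata_estimate, sampled_estimate)
```

In a real reader, the callback would decode the sampled row-group and report its in-memory size; here it is faked so the example stays self-contained.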

@rjzamora rjzamora added the label 5 - Ready to Merge (Testing and reviews complete, ready to merge) and removed the label 3 - Ready for Review (Ready for review by team) on Nov 22, 2025
@rjzamora
Member Author

/merge

@rapids-bot rapids-bot bot merged commit f8d8619 into rapidsai:main Nov 22, 2025
139 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python Nov 22, 2025
@rjzamora rjzamora deleted the sample-rg-size branch November 22, 2025 01:33

Labels

  • 5 - Ready to Merge: Testing and reviews complete, ready to merge
  • cudf-polars: Issues specific to cudf-polars
  • improvement: Improvement / enhancement to an existing function
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[BUG] CuDF-Polars produces oversized partitions for dictionary-encoded Parquet data

3 participants