
Conversation

@rjzamora (Contributor) commented May 3, 2022

WARNING: This PR is still very rough, but I am sharing it now in case others wish to experiment and/or share feedback.

May address parts of NVIDIA-Merlin/NVTabular#1340

This PR currently adds a balance_partitions= option to the Dataset API, which routes engine="parquet" reads through a new BalancedParquetEngine implementation. The primary goal of this engine is to produce an underlying Dask DataFrame collection with the same row count in every partition. The desired row count can be specified directly (with rows_per_partition), or derived from the usual part_size or part_mem_fraction arguments.
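
As a rough usage sketch (assuming the option names land exactly as proposed here, and using merlin.io.Dataset as the entry point):

```python
from merlin.io import Dataset

# Ask the new balanced Parquet engine for partitions with a fixed row count.
# NOTE: balance_partitions= and rows_per_partition= are the options proposed
# in this PR and may still change.
ds = Dataset(
    "/path/to/data/*.parquet",
    engine="parquet",
    balance_partitions=True,
    rows_per_partition=1_000_000,
)

# Alternatively, derive the per-partition row count from a byte budget,
# using the existing part_size / part_mem_fraction arguments.
ds_from_bytes = Dataset(
    "/path/to/data/*.parquet",
    engine="parquet",
    balance_partitions=True,
    part_size="512MB",
)
```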

An additional batch_size argument can be used to align partition sizes with a data-loader batch size: when batch_size is specified, every output partition's row count will be divisible by that number.
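
For example (again, only a sketch of the proposed behavior; batch_size here is the alignment argument described above, not a data-loader setting):

```python
# Every output partition will contain a multiple of 4096 rows, so no
# training batch needs to straddle a partition boundary.
ds_aligned = Dataset(
    "/path/to/data/*.parquet",
    engine="parquet",
    balance_partitions=True,
    part_mem_fraction=0.125,
    batch_size=4096,
)
```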

TODO:

  • Handle hive/directory-partitioned datasets
  • Add an option to specify/guide the total number of desired partitions (for multi-gpu workflows)
  • More testing, iteration, and general cleanup
  • Does a standalone engine make sense for this?

@github-actions bot commented May 3, 2022

Documentation preview

https://nvidia-merlin.github.io/core/review/pr-80
