Added S3 offloading technique for RDD #51042
What changes were proposed in this pull request?
This PR introduces a new feature to offload the intermediate results of `RDD.collect()` to S3-compatible storage (e.g., MinIO) to reduce driver memory pressure during large result collection. The implementation introduces the following new configuration flags:

- `spark.rdd.collect.offloadToS3.enabled` – enables/disables the offloading logic.
- `spark.rdd.collect.s3.path` – sets the target S3 path for temporary data.
- `spark.rdd.collect.s3.cleanup` – controls whether S3 offload data should be cleaned up after collection.
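For illustration, enabling the feature might look like the following sketch (the `s3a://` bucket path is a placeholder, not taken from the PR):

```scala
import org.apache.spark.SparkConf

// Hypothetical usage of the new flags; the s3a path is a placeholder.
val conf = new SparkConf()
  .set("spark.rdd.collect.offloadToS3.enabled", "true")
  .set("spark.rdd.collect.s3.path", "s3a://my-bucket/tmp/rdd-offload")
  .set("spark.rdd.collect.s3.cleanup", "true")
```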
Fallback logic to the default `collect()` is implemented for error scenarios (e.g., an S3 write failure), ensuring reliability; a minimal sketch of the idea follows.
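This sketch is illustrative only, not the PR's actual code; the `offload` parameter is a hypothetical stand-in for the S3 offloading path:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical wrapper: try the S3 offload path; on any failure
// (e.g. an S3 write error), fall back to the default collect().
def collectWithFallback[T](rdd: RDD[T])(offload: RDD[T] => Array[T]): Array[T] =
  try offload(rdd)
  catch {
    case e: Exception =>
      System.err.println(s"S3 offload failed (${e.getMessage}); falling back to collect()")
      rdd.collect()
  }
```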
Why are the changes needed?

The default `RDD.collect()` behavior places the burden of materializing all partition results on the driver, which may lead to OOM errors when collecting large datasets. By offloading partition results to S3 during task execution and streaming them back in the driver, we significantly reduce the driver's memory footprint. This is especially helpful for jobs that collect very large result sets on memory-constrained drivers; a rough sketch of the mechanism appears below.
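As one way such a mechanism could work (again illustrative, not the PR's code), each task writes its partition to a separate object and the driver streams the objects back lazily. The name `collectViaS3` is invented here, and `s3a` credentials are assumed to be available via the Hadoop configuration:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

// Hypothetical helper: executors write each partition to its own S3
// object; the driver then streams the objects back one at a time
// instead of holding every partition in memory at once.
def collectViaS3(rdd: RDD[String], s3Dir: String): Iterator[String] = {
  val indices = rdd.mapPartitionsWithIndex { (idx, rows) =>
    val fs = FileSystem.get(URI.create(s3Dir), new Configuration())
    val out = fs.create(new Path(s"$s3Dir/part-$idx"))
    rows.foreach(r => out.write((r + "\n").getBytes("UTF-8")))
    out.close()
    Iterator.single(idx) // ship only the partition index to the driver
  }.collect() // tiny: one Int per partition, not the data itself

  indices.iterator.flatMap { idx =>
    val fs = FileSystem.get(URI.create(s3Dir), new Configuration())
    scala.io.Source.fromInputStream(fs.open(new Path(s"$s3Dir/part-$idx"))).getLines()
  }
}
```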
Does this PR introduce any user-facing change?
Yes.
This PR introduces three new user-facing Spark configuration properties:
- `spark.rdd.collect.offloadToS3.enabled` (default: `false`)
- `spark.rdd.collect.s3.path` (no default; must be explicitly set)
- `spark.rdd.collect.s3.cleanup` (default: `true`)

If offloading is enabled and properly configured, the driver no longer receives all partitions' data in memory directly from the executors.
How was this patch tested?
This feature was tested via a custom multi-phase validation suite run in `spark-shell`, covering the new configuration flags (offload enabled/disabled, cleanup on/off) and the fallback `collect()` logic. Testing included checking results with `assert()`s on the output; a sketch of that style of check follows. A full testing guide is included in the project documentation for reproducibility.
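For instance, a check of this style in spark-shell might look like the following (dataset size and values are placeholders, with offloading assumed to be enabled via the flags above):

```scala
// Hypothetical spark-shell check: collect a large RDD and assert on
// the result, with the offloading flags assumed to be enabled.
val data = sc.parallelize(1 to 1000000, numSlices = 100)
val result = data.collect()
assert(result.length == 1000000)
assert(result.sorted.sameElements(1 to 1000000))
```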
Was this patch authored or co-authored using generative AI tooling?
No.