feat: Parquet Modular Encryption with Spark KMS for native readers #2447
Conversation
…ark side accessed via JNI.
Codecov Report ❌ — Patch coverage is …; additional details and impacted files:
@@ Coverage Diff @@
## main #2447 +/- ##
============================================
+ Coverage     56.12%   58.28%   +2.16%
- Complexity      976     1436     +460
============================================
  Files           119      147      +28
  Lines         11743    13567    +1824
  Branches       2251     2360     +109
============================================
+ Hits           6591     7908    +1317
- Misses         4012     4428     +416
- Partials       1140     1231      +91
…yption factory registration in parquet_exec.rs.
Also look at https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/crypto/TestPropertiesDrivenEncryption.java to see if there are any tests that might be relevant here.
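For context, a minimal sketch of the properties-driven setup that test exercises. The property names come from parquet-java's key tools; the key material and column names here are purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;

/**
 * Minimal sketch of a properties-driven encryption config, modeled on
 * parquet-java's TestPropertiesDrivenEncryption. Key values are illustrative.
 */
public final class EncryptionConfSketch {
  public static Configuration encryptedWriteConf() {
    Configuration conf = new Configuration();
    // The properties-driven crypto factory handles both encryption and decryption.
    conf.set("parquet.crypto.factory.class",
        "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");
    // The test uses an in-memory mock KMS; production would supply a real KMS client.
    conf.set("parquet.encryption.kms.client.class",
        "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS");
    // Master keys, base64-encoded (illustrative values).
    conf.set("parquet.encryption.key.list",
        "keyA: AAECAwQFBgcICQoLDA0ODw==, keyB: AAECAAECAAECAAECAAECAA==");
    // Footer key plus per-column keys (illustrative column names).
    conf.set("parquet.encryption.footer.key", "keyA");
    conf.set("parquet.encryption.column.keys", "keyB: ssn, credit_card");
    return conf;
  }
}
```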
# Conflicts:
#	spark/src/main/scala/org/apache/comet/CometExecIterator.scala
// Each hadoopConf yields a unique DecryptionPropertiesFactory. While it's unlikely that
// this Comet plan contains more than one hadoopConf, we don't want to assume that, so we
// provide the ability to cache more than one factory with a map.
private final ConcurrentHashMap<Configuration, DecryptionPropertiesFactory> factoryCache =
    new ConcurrentHashMap<>();
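As a usage sketch of how such a cache might be consulted: parquet-java's `DecryptionPropertiesFactory.loadFactory` is a real API, but the `getOrLoad` helper name is hypothetical:

```java
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.crypto.DecryptionPropertiesFactory;

// Hypothetical helper showing one way the cache above could be used: one
// factory per distinct hadoopConf, loaded lazily via parquet-java's loader.
public final class FactoryCacheSketch {
  private final ConcurrentHashMap<Configuration, DecryptionPropertiesFactory> factoryCache =
      new ConcurrentHashMap<>();

  DecryptionPropertiesFactory getOrLoad(Configuration hadoopConf) {
    // computeIfAbsent keeps loading thread-safe. Note that Hadoop's Configuration
    // does not override equals/hashCode, so each distinct conf object gets its
    // own cache entry (identity-based keying).
    return factoryCache.computeIfAbsent(hadoopConf, DecryptionPropertiesFactory::loadFactory);
  }
}
```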
There is only one Hadoop conf in a Spark session, so this may be overkill.
The session hadoopConf is not what the scans use, though. They add all the relation options (Parquet options like encryption keys) to the hadoopConf, so each scan can have a unique hadoopConf. Whether we could have a Comet plan with multiple Parquet scans is the real question.
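To illustrate why two scans can end up with distinct Configuration objects (the property name is parquet-java's; which options a given relation actually carries is an assumption):

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative only: each scan copies the session conf and layers its own
// relation options on top, so the resulting Configuration objects differ.
public final class PerScanConfSketch {
  static Configuration confForScan(Configuration sessionConf, String keyList) {
    Configuration scanConf = new Configuration(sessionConf);
    // A relation option such as an encryption key list makes this conf unique.
    scanConf.set("parquet.encryption.key.list", keyList);
    return scanConf;
  }
}
```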
Whether we could have a Comet plan with multiple Parquet scans is the real question.
I don't know what you mean by this. What exactly are you calling a Parquet scan?
A scan node in a plan tree, specifically a stage that gets converted to a Comet native plan.
I'll simplify this with a couple of assertions that a Comet plan should only have one scan node in it.
I see what you mean now. What about a plan with a union?
@mbutrovich I had this one open question about a plan with a union that may have more than one scan. Can you verify this will not be an issue?
Which issue does this PR close?
Closes #.
Rationale for this change
We want to add Parquet Modular Encryption support for the native readers when using a Spark KMS. We use the encryption factory features added in DataFusion 50 to register an encryption factory that uses JNI to obtain decryption keys from Spark.
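As a rough sketch of the JVM-side piece of that flow: the class name CometFileKeyUnwrapper appears in this PR's file list, but the method shape below is an assumption; only parquet-java's DecryptionPropertiesFactory API is taken as given:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.crypto.DecryptionPropertiesFactory;
import org.apache.parquet.crypto.FileDecryptionProperties;

// Assumed shape of the JVM-side helper the native reader reaches over JNI:
// given a file path and its hadoopConf, resolve the file's decryption
// properties through the configured DecryptionPropertiesFactory.
public final class KeyUnwrapperSketch {
  FileDecryptionProperties decryptionPropertiesFor(Configuration hadoopConf, String filePath) {
    DecryptionPropertiesFactory factory = DecryptionPropertiesFactory.loadFactory(hadoopConf);
    // The returned properties expose the key-retrieval callbacks the native
    // side needs in order to unwrap footer and column keys.
    return factory.getFileDecryptionProperties(hadoopConf, new Path(filePath));
  }
}
```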
What changes are included in this PR?
How are these changes tested?