feat: Parquet Modular Encryption with Spark KMS for native readers #2447

mbutrovich · 2025-09-23T19:34:34Z

Which issue does this PR close?

Closes #.

Rationale for this change

We want to add Parquet Modular Encryption (PME) support for the native readers when using a Spark KMS. This PR uses the encryption factory features added in DataFusion 50 to register an encryption factory that calls back into Spark over JNI to fetch decryption keys.
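To make the callback direction concrete, here is a minimal sketch of what a JVM-side entry point for the native reader could look like. It is not the code in this PR: `KeyRetriever`, `FileKeyUnwrapperSketch`, `register`, and `getKey` are hypothetical names chosen for illustration, and the JNI plumbing itself is left out. The only thing it tries to show is the shape of the exchange, with key metadata going in and raw key bytes coming out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a per-file key retriever (one method: metadata in, key out).
interface KeyRetriever {
  byte[] getKey(byte[] keyMetadata);
}

// Hypothetical JVM-side object that native code could call through JNI.
final class FileKeyUnwrapperSketch {
  // One retriever per file path, populated on the JVM side before the native scan runs.
  private final Map<String, KeyRetriever> retrievers = new ConcurrentHashMap<>();

  void register(String filePath, KeyRetriever retriever) {
    retrievers.put(filePath, retriever);
  }

  // The method the native side would invoke when it needs a decryption key.
  byte[] getKey(String filePath, byte[] keyMetadata) {
    KeyRetriever retriever = retrievers.get(filePath);
    if (retriever == null) {
      throw new IllegalStateException("No key retriever registered for " + filePath);
    }
    return retriever.getKey(keyMetadata);
  }
}
```

Keeping the KMS interaction on the Spark side means the native reader only ever sees key bytes, never KMS credentials or configuration.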

What changes are included in this PR?

How are these changes tested?

Existing PME tests with new readers added.
New tests that exercise PME options like plaintext footer, etc.
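For readers unfamiliar with the PME options mentioned above, the sketch below builds the kind of property map such a test might set. The property names are the standard parquet-java ones; the key IDs (`footerKey`, `columnKey`), the column name `ssn`, the demo key material, and the class name `PmeOptionsSketch` are assumptions made for illustration and are not taken from this PR's tests.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

final class PmeOptionsSketch {
  // Builds PME write options; plaintextFooter toggles the plaintext-footer mode.
  static Map<String, String> options(boolean plaintextFooter) {
    // Demo-only 128-bit keys. Real keys live in a KMS, never in source code.
    String footerKey = base64("0123456789012345");
    String columnKey = base64("1234567890123450");

    Map<String, String> opts = new LinkedHashMap<>();
    opts.put("parquet.crypto.factory.class",
        "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory");
    // InMemoryKMS is a mock KMS client shipped with parquet-java's test artifacts.
    opts.put("parquet.encryption.kms.client.class",
        "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS");
    opts.put("parquet.encryption.key.list",
        "footerKey:" + footerKey + ", columnKey:" + columnKey);
    opts.put("parquet.encryption.footer.key", "footerKey");
    opts.put("parquet.encryption.column.keys", "columnKey:ssn");
    opts.put("parquet.encryption.plaintext.footer", Boolean.toString(plaintextFooter));
    return opts;
  }

  private static String base64(String sixteenAsciiChars) {
    return Base64.getEncoder()
        .encodeToString(sixteenAsciiChars.getBytes(StandardCharsets.US_ASCII));
  }
}
```

In a Spark test these entries would typically be applied to the Hadoop configuration or passed as writer/reader options before writing and reading the file back.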

codecov-commenter · 2025-09-23T19:58:41Z

Codecov Report

❌ Patch coverage is 36.78161% with 55 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.28%. Comparing base (f09f8af) to head (e9fcca7).
⚠️ Report is 562 commits behind head on main.

Files with missing lines	Patch %	Lines
...rg/apache/comet/parquet/CometFileKeyUnwrapper.java	0.00%	18 Missing ⚠️
...a/org/apache/comet/parquet/CometParquetUtils.scala	0.00%	15 Missing ⚠️
...ain/scala/org/apache/comet/CometExecIterator.scala	33.33%	7 Missing and 1 partial ⚠️
...va/org/apache/comet/parquet/NativeBatchReader.java	0.00%	5 Missing ⚠️
...n/scala/org/apache/spark/sql/comet/operators.scala	80.76%	3 Missing and 2 partials ⚠️
...n/scala/org/apache/comet/rules/CometScanRule.scala	42.85%	3 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2447      +/-   ##
============================================
+ Coverage     56.12%   58.28%   +2.16%     
- Complexity      976     1436     +460     
============================================
  Files           119      147      +28     
  Lines         11743    13567    +1824     
  Branches       2251     2360     +109     
============================================
+ Hits           6591     7908    +1317     
- Misses         4012     4428     +416     
- Partials       1140     1231      +91

parthchandra · 2025-09-24T22:07:15Z

Also look at https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/crypto/TestPropertiesDrivenEncryption.java to see if there are any tests that might be relevant here.

parthchandra · 2025-09-29T17:34:38Z

common/src/main/java/org/apache/comet/parquet/CometFileKeyUnwrapper.java

+  // Each hadoopConf yields a unique DecryptionPropertiesFactory. While it's unlikely that
+  // this Comet plan contains more than one hadoopConf, we don't want to assume that. So we'll
+  // provide the ability to cache more than one Factory with a map.
+  private final ConcurrentHashMap<Configuration, DecryptionPropertiesFactory> factoryCache =


parthchandra (Sep 29, 2025): There is only one Hadoop conf in a Spark session, so this may be overkill.

mbutrovich (Sep 30, 2025): The session hadoopConf is not what the scans use, though. They add all the relation options (Parquet options like encryption keys) to the hadoopConf, so each scan can have a unique hadoopConf. Whether we could have a Comet plan with multiple Parquet scans is the real question.

parthchandra (Sep 30, 2025):

> Whether we could have a Comet plan with multiple Parquet scans is the real question.

I don't know what you mean by this. What exactly are you calling a Parquet scan?

mbutrovich (Oct 1, 2025): A scan node in a plan tree, specifically in a stage that gets converted to a Comet native plan.

mbutrovich (Oct 1, 2025): I'll simplify this with a couple of assertions that a Comet plan should only have one scan node in it.

parthchandra (Oct 1, 2025): I see what you mean now. What about a plan with a union?

parthchandra (Oct 3, 2025): @mbutrovich I had this one open question about a plan with a union, which may have more than one scan. Can you verify that this will not be an issue?
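The caching idea debated in this thread can be sketched as follows. This is an illustration only, not the PR's implementation: `FactoryCacheSketch`, `factoryFor`, and `loadCount` are hypothetical names, and the generic parameters `C` and `F` stand in for the `Configuration` and `DecryptionPropertiesFactory` types seen in the diff.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Caches one factory per configuration object so repeated lookups do not rebuild it.
final class FactoryCacheSketch<C, F> {
  private final ConcurrentHashMap<C, F> cache = new ConcurrentHashMap<>();
  private final Function<C, F> loader;
  private final AtomicInteger loads = new AtomicInteger();

  FactoryCacheSketch(Function<C, F> loader) {
    this.loader = loader;
  }

  // Returns the cached factory for this conf, creating it on first use.
  F factoryFor(C conf) {
    return cache.computeIfAbsent(conf, c -> {
      loads.incrementAndGet();
      return loader.apply(c);
    });
  }

  // Number of times a factory was actually constructed.
  int loadCount() {
    return loads.get();
  }
}
```

With a map like this, a plan that contains two scans with different configurations (the union case raised above) would simply get two cache entries. Whether two scans share an entry depends on how the key type defines equality.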

mbutrovich · 2025-09-30T17:02:27Z

Results attached from the benchmark I added to CometReadBenchmark, and a small chart with highlights to see what the overhead of encryption is for the various readers.

benchmark_decryption.txt

parthchandra · 2025-09-30T22:52:27Z

Parquet Modular Encryption support for native readers using KMS on Spark side accessed via JNI.

dcc882b

mbutrovich changed the title ~~feat: Parquet Modular Encryption support for native_datafusion and native_iceberg_compat readers~~ feat: Parquet Modular Encryption with Spark KMS for native_datafusion and native_iceberg_compat readers Sep 23, 2025

mbutrovich changed the title ~~feat: Parquet Modular Encryption with Spark KMS for native_datafusion and native_iceberg_compat readers~~ feat: Parquet Modular Encryption with Spark KMS for native readers Sep 23, 2025

Fix unused import.

8bac76a

hsiang-c reviewed Sep 23, 2025

common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java (outdated, resolved)

hsiang-c reviewed Sep 23, 2025

native/core/src/parquet/parquet_exec.rs (outdated, resolved)

hsiang-c reviewed Sep 23, 2025

native/core/src/parquet/parquet_exec.rs (outdated, resolved)

mbutrovich added 4 commits September 23, 2025 18:06

Fix encryptionEnabled check in NativeBatchReader.java, and guard encryption factory registration in parquet_exec.rs.

40935df

Fix NPE when checking encryptedEnabled.

7cbfb1b

Merge branch 'main' into decryption

1e1fa2f

Minor refactor for encryptionEnabled.

090497b

mbutrovich added 7 commits September 26, 2025 10:38

Merge branch 'main' into decryption

992a4e1

More tests.

c9dfdd5

Cleanup Seq loop that wasn't doing anything.

bf0bec4

Merge branch 'main' into decryption

a0e2d9a

Docs.

271e940

Docs.

571c881

Refactor out of parquet_exec.rs.

4dde7fb

mbutrovich marked this pull request as ready for review September 26, 2025 20:31

mbutrovich added 2 commits September 29, 2025 10:13

Merge branch 'main' into decryption

ac566f5

# Conflicts: # spark/src/main/scala/org/apache/comet/CometExecIterator.scala

Add uniform encryption test.

9bc24fd

parthchandra reviewed Sep 29, 2025

mbutrovich added 3 commits September 30, 2025 07:47

Merge branch 'main' into decryption

1dfb252

Address PR feedback.

bf6ad03

Add benchmark.

7d1bf39

parthchandra reviewed Sep 30, 2025

martin-g reviewed Oct 1, 2025

common/src/main/java/org/apache/comet/parquet/CometFileKeyUnwrapper.java (outdated, resolved)

native/core/src/parquet/encryption_support.rs (resolved)

native/core/src/parquet/encryption_support.rs (outdated, resolved)

mbutrovich added 2 commits October 1, 2025 10:54

Address PR feedback related to number of hadoopConfs in a Comet plan and Factory caching.

8ba2680

Adjust error handling.

e9fcca7

mbutrovich requested a review from parthchandra October 3, 2025 21:25
