
fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans #1757


Closed · andygrove wants to merge 3 commits

Conversation

@andygrove (Member) commented May 20, 2025

Which issue does this PR close?

Part of #1758

Rationale for this change

Fix some Spark SQL test failures when the new native scans are enabled

What changes are included in this PR?

CometScanRule now falls back to Spark when spark.sql.parquet.fieldId.read.enabled is set and the selected scan implementation is not native_comet.
How are these changes tested?

I manually tested with the following test:

testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetFieldIdIOSuite -- -z "read parquet file without ids"
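
As a complementary check, the fallback could also be asserted in a unit test. The sketch below is illustrative only: the CometTestBase suite class, the SCAN_NATIVE_DATAFUSION constant, the parquet path, and the plan-string check are assumptions, not part of this PR.

import org.apache.comet.CometConf
import org.apache.spark.sql.internal.SQLConf

// Hypothetical sketch: with field id reads enabled and a new native scan
// selected, CometScanRule should leave the Spark scan in place.
class ParquetFieldIdFallbackSuite extends CometTestBase {
  test("new native scans fall back when field id reads are enabled") {
    withSQLConf(
      SQLConf.PARQUET_FIELD_ID_READ_ENABLED.key -> "true",
      CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION) {
      val df = spark.read.parquet("/tmp/parquet-without-field-ids")
      // No Comet scan node is expected in the physical plan after the rule runs.
      assert(!df.queryExecution.executedPlan.toString.contains("CometScan"))
    }
  }
}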

Comment on lines +57 to +61
if (SQLConf.get.getConf(
    SQLConf.PARQUET_FIELD_ID_READ_ENABLED) && scanImpl != CometConf.SCAN_NATIVE_COMET) {
  withInfo(plan, s"Comet $scanImpl scan does not support PARQUET_FIELD_ID_READ_ENABLED")
  return plan
}
@andygrove (Member Author) commented:

This is the change. The rest is refactoring.

@andygrove andygrove marked this pull request as draft May 20, 2025 16:03
@andygrove andygrove marked this pull request as ready for review May 20, 2025 16:05
@andygrove andygrove marked this pull request as draft May 20, 2025 16:12
@andygrove (Member Author) commented:

This may make too many tests fall back because Spark may be enabling this by default in all tests ... investigating

@codecov-commenter commented May 20, 2025

Codecov Report

Attention: Patch coverage is 65.21739% with 8 lines in your changes missing coverage. Please review.

Project coverage is 58.56%. Comparing base (f09f8af) to head (4a84257).
Report is 199 commits behind head on main.

Files with missing lines | Patch % | Lines
...n/scala/org/apache/comet/rules/CometScanRule.scala | 65.21% | 3 Missing and 5 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1757      +/-   ##
============================================
+ Coverage     56.12%   58.56%   +2.43%     
- Complexity      976     1133     +157     
============================================
  Files           119      130      +11     
  Lines         11743    12686     +943     
  Branches       2251     2369     +118     
============================================
+ Hits           6591     7429     +838     
- Misses         4012     4065      +53     
- Partials       1140     1192      +52     


@parthchandra (Contributor) commented:

Pre-emptively approving this. We can defer field_id support for native readers for the time being.

@andygrove (Member Author) commented May 20, 2025

SQLConf.PARQUET_FIELD_ID_READ_ENABLED is enabled in all Spark tests, so not sure what to do about this now.

private[sql] object TestSQLContext {

  /**
   * A map used to store all confs that need to be overridden in sql/core unit tests.
   */
  val overrideConfs: Map[String, String] =
    Map(
      // Fewer shuffle partitions to speed up testing.
      SQLConf.SHUFFLE_PARTITIONS.key -> "5",
      // Enable parquet read field id for tests to ensure correctness
      // By default, if Spark schema doesn't contain the `parquet.field.id` metadata,
      // the underlying matching mechanism should behave exactly like name matching
      // which is the existing behavior. Therefore, turning this on ensures that we didn't
      // introduce any regression for such mixed matching mode.
      SQLConf.PARQUET_FIELD_ID_READ_ENABLED.key -> "true")
}
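
For reference, a session that wants to keep exercising the new scans could turn the override back off explicitly. This is only a sketch of a possible workaround, not something this PR adds:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

// Sketch: restore the non-test default so CometScanRule does not force a
// fallback because of TestSQLContext's override.
val spark = SparkSession.builder()
  .master("local[1]")
  .config(SQLConf.PARQUET_FIELD_ID_READ_ENABLED.key, "false")
  .getOrCreate()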

@parthchandra (Contributor) commented:

Ouch.

val scanImpl: String = COMET_NATIVE_SCAN_IMPL.get()
if (SQLConf.get.getConf(
    SQLConf.PARQUET_FIELD_ID_READ_ENABLED) && scanImpl != CometConf.SCAN_NATIVE_COMET) {
  withInfo(plan, s"Comet $scanImpl scan does not support PARQUET_FIELD_ID_READ_ENABLED")
Suggested change:
- withInfo(plan, s"Comet $scanImpl scan does not support PARQUET_FIELD_ID_READ_ENABLED")
+ withInfo(plan, s"Comet $scanImpl scan does not support with enabled `spark.sql.parquet.fieldId.read.enabled`")
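
A further variation on this message (a sketch only, not what was proposed in the review) would interpolate the config key constant so the string cannot drift from the actual key name:

withInfo(plan,
  s"Comet $scanImpl scan does not support ${SQLConf.PARQUET_FIELD_ID_READ_ENABLED.key}=true")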

@comphead (Contributor) left a comment:

lgtm thanks @andygrove

@andygrove (Member Author) commented:

I will close this since it will make all Spark SQL tests fall back to Spark. We seem to mostly support this feature in native_iceberg_compat already. I guess we have no choice but to add support in native_datafusion.

@andygrove closed this May 21, 2025