
fix: Fallback to Spark when PARQUET_FIELD_ID_READ_ENABLED=true for new native scans #1757


Closed · andygrove wants to merge 3 commits

Conversation

@andygrove (Member) commented May 20, 2025

Which issue does this PR close?

Part of #1758

Rationale for this change

Fix some Spark SQL test failures when the new native scans are enabled

What changes are included in this PR?

CometScanRule now falls back to Spark when spark.sql.parquet.fieldId.read.enabled is set and the selected scan implementation is not native_comet.
How are these changes tested?

I manually tested with the following test:

testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetFieldIdIOSuite -- -z "read parquet file without ids"
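
As a complementary check, the fallback could also be asserted in a unit test. The sketch below is illustrative only: the CometTestBase suite class, the SCAN_NATIVE_DATAFUSION constant, the parquet path, and the plan-string check are assumptions, not part of this PR.

import org.apache.comet.CometConf
import org.apache.spark.sql.internal.SQLConf

// Hypothetical sketch: with field id reads enabled and a new native scan
// selected, CometScanRule should leave the Spark scan in place.
class ParquetFieldIdFallbackSuite extends CometTestBase {
  test("new native scans fall back when field id reads are enabled") {
    withSQLConf(
      SQLConf.PARQUET_FIELD_ID_READ_ENABLED.key -> "true",
      CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION) {
      val df = spark.read.parquet("/tmp/parquet-without-field-ids")
      // No Comet scan node is expected in the physical plan after the rule runs.
      assert(!df.queryExecution.executedPlan.toString.contains("CometScan"))
    }
  }
}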

Comment on lines +57 to +61
if (SQLConf.get.getConf(
    SQLConf.PARQUET_FIELD_ID_READ_ENABLED) && scanImpl != CometConf.SCAN_NATIVE_COMET) {
  withInfo(plan, s"Comet $scanImpl scan does not support PARQUET_FIELD_ID_READ_ENABLED")
  return plan
}
@andygrove (Member Author) commented:

This is the change. The rest is refactoring.

@andygrove andygrove marked this pull request as draft May 20, 2025 16:03
@andygrove andygrove marked this pull request as ready for review May 20, 2025 16:05
@andygrove andygrove marked this pull request as draft May 20, 2025 16:12
@andygrove (Member Author) commented:

This may make too many tests fall back because Spark may be enabling this by default in all tests ... investigating

@codecov-commenter commented May 20, 2025

Codecov Report

Attention: Patch coverage is 65.21739% with 8 lines in your changes missing coverage. Please review.

Project coverage is 58.56%. Comparing base (f09f8af) to head (4a84257).
Report is 199 commits behind head on main.

Files with missing lines | Patch % | Lines
...n/scala/org/apache/comet/rules/CometScanRule.scala | 65.21% | 3 Missing and 5 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1757      +/-   ##
============================================
+ Coverage     56.12%   58.56%   +2.43%     
- Complexity      976     1133     +157     
============================================
  Files           119      130      +11     
  Lines         11743    12686     +943     
  Branches       2251     2369     +118     
============================================
+ Hits           6591     7429     +838     
- Misses         4012     4065      +53     
- Partials       1140     1192      +52     


@parthchandra (Contributor) commented:

Pre-emptively approving this. We can defer field_id support for native readers for the time being.

@andygrove (Member Author) commented May 20, 2025

SQLConf.PARQUET_FIELD_ID_READ_ENABLED is enabled in all Spark tests, so not sure what to do about this now.

private[sql] object TestSQLContext {

  /**
   * A map used to store all confs that need to be overridden in sql/core unit tests.
   */
  val overrideConfs: Map[String, String] =
    Map(
      // Fewer shuffle partitions to speed up testing.
      SQLConf.SHUFFLE_PARTITIONS.key -> "5",
      // Enable parquet read field id for tests to ensure correctness
      // By default, if Spark schema doesn't contain the `parquet.field.id` metadata,
      // the underlying matching mechanism should behave exactly like name matching
      // which is the existing behavior. Therefore, turning this on ensures that we didn't
      // introduce any regression for such mixed matching mode.
      SQLConf.PARQUET_FIELD_ID_READ_ENABLED.key -> "true")
}
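
For reference, a session that wants to keep exercising the new scans could turn the override back off explicitly. This is only a sketch of a possible workaround, not something this PR adds:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

// Sketch: restore the non-test default so CometScanRule does not force a
// fallback because of TestSQLContext's override.
val spark = SparkSession.builder()
  .master("local[1]")
  .config(SQLConf.PARQUET_FIELD_ID_READ_ENABLED.key, "false")
  .getOrCreate()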

@parthchandra (Contributor) commented:

Ouch.

val scanImpl: String = COMET_NATIVE_SCAN_IMPL.get()
if (SQLConf.get.getConf(
    SQLConf.PARQUET_FIELD_ID_READ_ENABLED) && scanImpl != CometConf.SCAN_NATIVE_COMET) {
  withInfo(plan, s"Comet $scanImpl scan does not support PARQUET_FIELD_ID_READ_ENABLED")
Suggested change:
- withInfo(plan, s"Comet $scanImpl scan does not support PARQUET_FIELD_ID_READ_ENABLED")
+ withInfo(plan, s"Comet $scanImpl scan does not support with enabled `spark.sql.parquet.fieldId.read.enabled`")
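
A further variation on this message (a sketch only, not what was proposed in the review) would interpolate the config key constant so the string cannot drift from the actual key name:

withInfo(plan,
  s"Comet $scanImpl scan does not support ${SQLConf.PARQUET_FIELD_ID_READ_ENABLED.key}=true")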

@comphead (Contributor) left a comment:

lgtm thanks @andygrove

@andygrove (Member Author) commented:

I will close this since it will make all Spark SQL tests fall back to Spark. We seem to mostly support this feature in native_iceberg_compat already. I guess we have no choice but to add support in native_datafusion.

@andygrove closed this May 21, 2025