
fix: default values for native_datafusion scan #1756


Merged: 18 commits, May 23, 2025

Conversation

@mbutrovich (Contributor) commented May 20, 2025

This change is only needed for native_datafusion, since native_iceberg_compat works at column granularity and already seems to populate default values correctly.

Which issue does this PR close?

Closes #1750.

Rationale for this change

What changes are included in this PR?

  • Serialize default values from the query plan, deserialize them on the native side, and embed them in the SchemaMapper (see the sketch after this list).
  • Default null columns now use a compact scalar representation for the column.
  • Refactored some old comments and unused fields in SparkParquetOptions.
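
The missing-column path can be sketched roughly as follows, assuming DataFusion's ScalarValue and arrow's new_null_array; the helper name fill_missing_column and its signature are illustrative, not the PR's actual API:

use arrow::array::{new_null_array, ArrayRef};
use arrow::datatypes::Field;
use datafusion::common::{Result, ScalarValue};

// Produce the column for a field that exists in the table schema but not in
// the Parquet file: repeat the recorded default when one exists, otherwise
// emit a typed all-null column.
fn fill_missing_column(
    field: &Field,
    default: Option<&ScalarValue>,
    num_rows: usize,
) -> Result<ArrayRef> {
    match default {
        // Expand the constant default to one value per row in the batch.
        Some(value) => value.to_array_of_size(num_rows),
        // No default recorded: all pre-existing rows read as null.
        None => Ok(new_null_array(field.data_type(), num_rows)),
    }
}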

How are these changes tested?

@andygrove's simple unit test, plus a new fuzz test that exercises default values for all primitive types.

@mbutrovich changed the title from "fix: default values for experimental native scans" to "fix: default values for experimental native_datafusion scan" on May 20, 2025
@mbutrovich mbutrovich marked this pull request as ready for review May 20, 2025 17:31
@@ -715,6 +715,7 @@ pub unsafe extern "system" fn Java_org_apache_comet_parquet_Native_initRecordBat
file_groups,
None,
data_filters,
None,
@mbutrovich (author):

As far as I can tell, missing columns for native_iceberg_compat are handled elsewhere and the DataSourceExec will never know about them.

Contributor:

It's handled in the ConstantColumnReader which is shared between native_comet and native_iceberg_compat.
Also see ResolveDefaultColumns.getExistenceDefaultValues. Not quite sure what the difference between ExistenceDefaultValues and simply default values is.

Contributor:

From the Spark javadoc:

org.apache.spark.sql.catalyst.util.ResolveDefaultColumns

def constantFoldCurrentDefaultsToExistDefaults(tableSchema: StructType, statementType: String): StructType

Finds "current default" expressions in CREATE/REPLACE TABLE columns and constant-folds them.

The results are stored in the "exists default" metadata of the same columns. For example, in the event of this statement:

CREATE TABLE T(a INT, b INT DEFAULT 5 + 5)

This method constant-folds the "current default" value, stored in the CURRENT_DEFAULT metadata of the "b" column, to "10", storing the result in the "exists default" value within the EXISTS_DEFAULT metadata of that same column. Meanwhile the "current default" metadata of this "b" column retains its original value of "5 + 5".

The reason for constant-folding the EXISTS_DEFAULT is to make the end-user visible behavior the same, after executing an ALTER TABLE ADD COLUMNS command with a DEFAULT value, as if the system had performed an exhaustive backfill of the provided value to all previously existing rows in the table instead.

We choose to avoid doing such a backfill because it would be a time-consuming and costly operation. Instead, we elect to store the EXISTS_DEFAULT in the column metadata for future reference when querying data out of the data source. In turn, each data source then takes responsibility to provide the constant-folded value in the EXISTS_DEFAULT metadata for such columns where the value is not present in storage.

I'll assume that the default values you get are the 'existence' defaults

);
let mut parquet_source =
    ParquetSource::new(table_parquet_options).with_schema_adapter_factory(Arc::new(
        SparkSchemaAdapterFactory::new(spark_parquet_options, default_values),
@mbutrovich (author):

We can discuss if it makes more sense to stick default_values inside of the SparkParquetOptions struct.

Contributor:

I don't think it makes sense to do that, even though it might make the code a little bit simpler. default_values are not exactly options. But I'm not going to argue if you choose to do it that way.

@@ -60,9 +60,6 @@ pub struct SparkParquetOptions {
pub allow_incompat: bool,
/// Support casting unsigned ints to signed ints (used by Parquet SchemaAdapter)
pub allow_cast_unsigned_ints: bool,
/// We also use the cast logic for adapting Parquet schemas, so this flag is used
/// for that use case
pub is_adapting_schema: bool,
@mbutrovich (author):

This is dead code from when we used the cast logic (and CastOptions) to handle Parquet type conversion.

file_idx.map_or_else(
// If this field only exists in the table, and not in the file, then we know
// that it's null, so just return that.
|| Ok(new_null_array(field.data_type(), batch_rows)),
@mbutrovich (author):

Got rid of instantiating an entire null array in favor of a single null value for the column; see the sketch below.
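
As a rough sketch of the difference, assuming DataFusion's ScalarValue API: ScalarValue::try_from(&DataType) yields a typed null scalar, which only needs to be expanded into a full column when a batch is actually assembled:

use arrow::datatypes::DataType;
use datafusion::common::{Result, ScalarValue};

fn main() -> Result<()> {
    // One typed null scalar (here ScalarValue::Int32(None)) stands in for the
    // whole column instead of an eagerly allocated null array.
    let null = ScalarValue::try_from(&DataType::Int32)?;
    assert!(null.is_null());
    // Materialize only at batch-build time (4 rows here).
    let column = null.to_array_of_size(4)?;
    assert_eq!(column.len(), 4);
    Ok(())
}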

@@ -2327,18 +2346,18 @@ object QueryPlanSerde extends Logging with CometExprShim {
val requiredSchema = schema2Proto(scan.requiredSchema.fields)
val dataSchema = schema2Proto(scan.relation.dataSchema.fields)

val data_schema_idxs = scan.requiredSchema.fields.map(field => {
val dataSchemaIndexes = scan.requiredSchema.fields.map(field => {
@mbutrovich (author):

Just fixing incorrectly formatted variable names as I find them.

@mbutrovich mbutrovich requested a review from andygrove May 20, 2025 19:41
@mbutrovich (author) commented:

Something else to look at for Spark 3.4...

select column with default value (native_comet, native shuffle) *** FAILED *** (449 milliseconds)
  org.apache.spark.sql.AnalysisException: Failed to execute ALTER TABLE ADD COLUMNS command because the destination table column col2 has a DEFAULT value with type ByteType, but the statement provided a value of incompatible type IntegerType

@codecov-commenter commented May 21, 2025

Codecov Report

Attention: Patch coverage is 85.71429% with 2 lines in your changes missing coverage. Please review.

Project coverage is 58.62%. Comparing base (f09f8af) to head (916b43b).
Report is 203 commits behind head on main.

Files with missing lines                                 Patch %   Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala    85.71%    0 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1756      +/-   ##
============================================
+ Coverage     56.12%   58.62%   +2.49%     
- Complexity      976     1131     +155     
============================================
  Files           119      130      +11     
  Lines         11743    12673     +930     
  Branches       2251     2367     +116     
============================================
+ Hits           6591     7429     +838     
- Misses         4012     4058      +46     
- Partials       1140     1186      +46     


@andygrove (Member) left a review:

LGTM. Thanks @mbutrovich!

@mbutrovich changed the title from "fix: default values for experimental native_datafusion scan" to "fix: default values for native_datafusion scan" on May 21, 2025
@andygrove andygrove merged commit 9da11c5 into apache:main May 23, 2025
79 checks passed
Development

Successfully merging this pull request may close these issues.

[native_datafusion] No support for default values for Parquet columns