Spark 4.1: Simplify handling of metadata columns by aokolnychyi · Pull Request #15297 · apache/iceberg

aokolnychyi · 2026-02-11T17:02:37Z

This PR contains a subset of the changes from #15240 to simplify handling of metadata columns.

aokolnychyi · 2026-02-11T17:05:25Z

spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java

  private final InMemoryMetricsReporter metricsReporter;

-  private Schema schema;
+  private Schema projection;


This field is a projection, which initially defaults to the schema.
I am going to keep the entire schema here as well later.

aokolnychyi · 2026-02-11T17:06:28Z

spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java

-  }
-
-  private Schema calculateMetadataSchema(List<Types.NestedField> metaColumnFields) {
-    Optional<Types.NestedField> partitionField =


A lot of this logic is not specific to Spark and it was a bit harder to navigate.

aokolnychyi · 2026-02-11T17:07:56Z

spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java

+  }
+
+  // collects used data field IDs across all known table schemas
+  private Set<Integer> allUsedFieldIds() {


We don't need to track used metadata column IDs here. They start from the end of the INT range and can't conflict by definition. If they do, something is fundamentally wrong.
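To make the argument above concrete, here is a minimal sketch of why IDs reserved at the top of the int range stay clear of data field IDs. It is not Iceberg code: the class `ReservedIdRanges`, its two constants, and the assumption that data field IDs are assigned upward from 1 are all hypothetical and only mimic the convention described in the comment.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not Iceberg code: the constants below only mimic the
// convention of reserving metadata column IDs at the top of the int range.
final class ReservedIdRanges {
  // assumed reserved IDs, counting down from Integer.MAX_VALUE
  static final int FILE_PATH_ID = Integer.MAX_VALUE - 1;
  static final int ROW_POSITION_ID = Integer.MAX_VALUE - 2;

  private ReservedIdRanges() {}

  // assumption: data field IDs are assigned upward from 1
  static Set<Integer> dataFieldIds(int fieldCount) {
    Set<Integer> ids = new TreeSet<>();
    for (int id = 1; id <= fieldCount; id++) {
      ids.add(id);
    }
    return ids;
  }

  // true if none of the reserved IDs appears among the data field IDs
  static boolean disjointFromReserved(Set<Integer> dataIds) {
    return !dataIds.contains(FILE_PATH_ID) && !dataIds.contains(ROW_POSITION_ID);
  }

  public static void main(String[] args) {
    // even a very wide table stays far below the reserved range
    Set<Integer> dataIds = dataFieldIds(1_000);
    System.out.println(disjointFromReserved(dataIds));
  }
}
```

Under these assumptions a collision would require a table to use almost the entire int range for data fields, which is the "something is fundamentally wrong" case.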

aokolnychyi · 2026-02-11T22:38:00Z

@szehon-ho @dramaticlly, can you check this one, please?

szehon-ho · 2026-02-13T21:58:24Z

api/src/main/java/org/apache/iceberg/types/TypeUtil.java

+   * @param usedIds the set of field IDs that are already in use and cannot be reused
+   * @return a function that maps old IDs to new IDs while resolving conflicts
+   */
+  public static GetID reassignConflictingIds(Set<Integer> conflictingIds, Set<Integer> usedIds) {


I think 'conflictingIds' and 'usedIds' are confusing together.

How about 'conflictingIds' and 'allIds'? Usually conflictingIds is a subset of allIds?

aokolnychyi · 2026-02-16

I went for allUsedIds in this case.

szehon-ho · 2026-02-13T21:58:43Z

api/src/main/java/org/apache/iceberg/types/TypeUtil.java

+    private ReassignConflictingIds(Set<Integer> conflictingIds, Set<Integer> usedIds) {
+      this.conflictingIds = conflictingIds;
+      this.usedIds = usedIds;
+      this.nextId = new AtomicInteger(usedIds.size()); // assume sequential assignment


I see this is different from the old code? Maybe it works in most cases, but it is a small behavior change. It also assumes that we use it in this pattern.

aokolnychyi · 2026-02-16

I think it should be fine, but you are right, it is better to move it into a separate change. Reverted.
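The "assume sequential assignment" concern can be illustrated with a small sketch. It only contrasts two ways of seeding a counter; whether the reverted code also skipped IDs that were already in use is not visible in the excerpt above, so this is not a reproduction of it. The class `NextIdSeeding` and both methods are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: contrasts seeding a counter by set size vs. by max ID.
final class NextIdSeeding {
  private NextIdSeeding() {}

  // seeds the counter with the number of used IDs, then hands out size + 1
  static int firstIdSeededBySize(Set<Integer> usedIds) {
    AtomicInteger nextId = new AtomicInteger(usedIds.size());
    return nextId.incrementAndGet();
  }

  // seeds the counter with the highest used ID, then hands out max + 1
  static int firstIdSeededByMax(Set<Integer> usedIds) {
    int max = 0;
    for (int id : usedIds) {
      max = Math.max(max, id);
    }
    AtomicInteger nextId = new AtomicInteger(max);
    return nextId.incrementAndGet();
  }

  public static void main(String[] args) {
    Set<Integer> sequential = Set.of(1, 2, 3);
    Set<Integer> withGap = Set.of(2, 3, 4); // e.g. ID 1 is no longer present

    System.out.println(firstIdSeededBySize(sequential));
    System.out.println(firstIdSeededBySize(withGap));
    System.out.println(firstIdSeededByMax(withGap));
  }
}
```

For used IDs 1, 2, 3 both seeds yield 4. For 2, 3, 4 the size-based seed also yields 4, which is already taken, while the max-based seed yields 5. That is the kind of small behavior change that is easier to review in a separate PR.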

szehon-ho · 2026-02-13T22:09:07Z

spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java

    this.spark = spark;
    this.table = table;
-    this.schema = schema;
+    this.projection = schema;


I like the rename.

szehon-ho · 2026-02-13T22:12:47Z

api/src/main/java/org/apache/iceberg/types/TypeUtil.java

+   *
+   * @param conflictingIds the set of conflicting field IDs that should be reassigned
+   * @param usedIds the set of field IDs that are already in use and cannot be reused
+   * @return a function that maps old IDs to new IDs while resolving conflicts


Maybe 'new' and 'old' lack context. How about something like:

a function that returns the original ID unless it is in conflictingIds, in which case it returns the ID it has been reassigned to.

aokolnychyi · 2026-02-16

Agreed, updated.
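The contract in the suggested Javadoc wording can be sketched as follows. This is an illustration under stated assumptions, not the `TypeUtil` implementation: the class `ConflictingIdReassigner` is hypothetical, `IntUnaryOperator` stands in for the `GetID` return type shown in the diff, and probing upward from 1 for a free ID is an arbitrary choice made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of the documented contract, not the TypeUtil code.
final class ConflictingIdReassigner {
  private ConflictingIdReassigner() {}

  // returns a function that keeps an ID unless it is in conflictingIds,
  // in which case it returns the ID that was reassigned to it
  static IntUnaryOperator reassign(Set<Integer> conflictingIds, Set<Integer> allUsedIds) {
    Map<Integer, Integer> reassigned = new HashMap<>();
    int[] nextId = {0}; // assumption: probe upward from 1 for a free ID

    return id -> {
      if (!conflictingIds.contains(id)) {
        return id; // not conflicting: keep the original ID
      }
      // conflicting: pick the next ID not in use and remember the mapping
      return reassigned.computeIfAbsent(
          id,
          ignored -> {
            do {
              nextId[0]++;
            } while (allUsedIds.contains(nextId[0]));
            return nextId[0];
          });
    };
  }

  public static void main(String[] args) {
    IntUnaryOperator getId = reassign(Set.of(2), Set.of(1, 2, 3));
    System.out.println(getId.applyAsInt(1));
    System.out.println(getId.applyAsInt(2));
  }
}
```

With conflictingIds = {2} and allUsedIds = {1, 2, 3}, ID 1 is returned unchanged and ID 2 maps to 4 on every call, since the mapping is cached after the first lookup.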

dramaticlly · 2026-02-17T05:09:55Z

spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java

+  }
+
+  // schema of rows that must be returned by readers
+  protected Schema projectionWithMetadataColumns() {


Sorry, I was OOTO last week.

Curious: we changed the visibility from private schemaWithMetadataColumns to protected projectionWithMetadataColumns. Is it due to the bigger refactoring in #15240?

github-actions bot added the API and spark labels on Feb 11, 2026

Spark 4.1: Simplify handling of metadata columns

a32124f

aokolnychyi force-pushed the spark-metadata-schema-calculation branch from f09135a to a32124f Compare February 11, 2026 17:30

szehon-ho reviewed Feb 13, 2026

szehon-ho approved these changes Feb 13, 2026

Feedback

e586915

aokolnychyi merged commit 4c2e60d into apache:main Feb 17, 2026
33 checks passed

dramaticlly reviewed Feb 17, 2026

aokolnychyi merged 2 commits into apache:main from aokolnychyi:spark-metadata-schema-calculation

3 participants