Spark 4.1: Simplify handling of metadata columns #15297

Merged
aokolnychyi merged 2 commits into apache:main from aokolnychyi:spark-metadata-schema-calculation
Feb 17, 2026

Conversation

@aokolnychyi
Contributor

This PR contains a subset of the changes from #15240 to simplify handling of metadata columns.

private final InMemoryMetricsReporter metricsReporter;

private Schema schema;
private Schema projection;
Contributor Author

I plan to keep the entire schema here as well in a later change.
This field is a projection that initially defaults to the schema.

}

private Schema calculateMetadataSchema(List<Types.NestedField> metaColumnFields) {
Optional<Types.NestedField> partitionField =
Contributor Author

A lot of this logic is not specific to Spark, and it was a bit hard to navigate.

}

// collects used data field IDs across all known table schemas
private Set<Integer> allUsedFieldIds() {
Contributor Author

We don't need to track used metadata column IDs here. They start from the end of the INT range and can't conflict with data field IDs by definition. If they do, something is fundamentally wrong.
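To illustrate the point about ID ranges, here is a minimal sketch. It assumes metadata columns occupy a block at the top of the int range; the class name `ReservedIdCheck`, the constant `LOWEST_RESERVED_ID`, and the block size of 200 are hypothetical and not taken from this PR.

```java
import java.util.Set;

class ReservedIdCheck {
  // Assumption: metadata columns occupy a block at the top of the int range.
  // The block size of 200 is illustrative, not taken from the PR.
  static final int LOWEST_RESERVED_ID = Integer.MAX_VALUE - 200;

  // Returns true if the ID falls inside the assumed reserved block.
  static boolean isReserved(int fieldId) {
    return fieldId >= LOWEST_RESERVED_ID;
  }

  // Returns true if any data field ID falls inside the reserved block.
  static boolean overlapsReserved(Set<Integer> dataFieldIds) {
    return dataFieldIds.stream().anyMatch(ReservedIdCheck::isReserved);
  }

  public static void main(String[] args) {
    // Data field IDs are assigned upward from small values in this sketch.
    Set<Integer> dataFieldIds = Set.of(1, 2, 3, 1000);
    System.out.println(overlapsReserved(dataFieldIds)); // prints "false"
  }
}
```

Under that assumption, data field IDs would have to grow into the last few hundred values of the int range before a collision could occur, which is why tracking metadata column IDs separately looks unnecessary.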

@aokolnychyi force-pushed the spark-metadata-schema-calculation branch from f09135a to a32124f on February 11, 2026 17:30
@aokolnychyi
Contributor Author

@szehon-ho @dramaticlly, can you check this one, please?

* @param usedIds the set of field IDs that are already in use and cannot be reused
* @return a function that maps old IDs to new IDs while resolving conflicts
*/
public static GetID reassignConflictingIds(Set<Integer> conflictingIds, Set<Integer> usedIds) {
Member

@szehon-ho Feb 13, 2026

I think 'conflictingIds' and 'usedIds' are confusing together.

How about 'conflictingIds' and 'allIds', since conflictingIds is usually a subset of allIds?

Contributor Author

I went for allUsedIds in this case.

private ReassignConflictingIds(Set<Integer> conflictingIds, Set<Integer> usedIds) {
  this.conflictingIds = conflictingIds;
  this.usedIds = usedIds;
  this.nextId = new AtomicInteger(usedIds.size()); // assume sequential assignment
Member

@szehon-ho Feb 13, 2026

I see this is different from the old code. It may work in most cases, but it is a small behavior change. It also assumes that we always use it in this pattern.

Contributor Author

I think it should be fine but you are right, better to move it into a separate change. Reverted.
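The concern about seeding the counter from `usedIds.size()` can be made concrete with a small sketch. It is not the code from this PR: the class `NextIdSeed` and the helper `nextFreeIdAfter` are hypothetical, and the skip-over-used-IDs loop is an assumption about how an allocator might stay safe when IDs are not sequential.

```java
import java.util.Set;

class NextIdSeed {
  // Seeds the counter from the set size, which equals the highest ID
  // only when IDs were assigned sequentially as 1..N.
  static int seedFromSize(Set<Integer> usedIds) {
    return usedIds.size();
  }

  // Returns the first ID greater than the seed that is not already used.
  static int nextFreeIdAfter(Set<Integer> usedIds, int seed) {
    int candidate = seed + 1;
    while (usedIds.contains(candidate)) {
      candidate++;
    }
    return candidate;
  }

  public static void main(String[] args) {
    Set<Integer> sequential = Set.of(1, 2, 3);
    Set<Integer> gapped = Set.of(1, 2, 5, 6);

    // Sequential IDs: seed 3, the next free ID is 4.
    System.out.println(nextFreeIdAfter(sequential, seedFromSize(sequential))); // prints "4"

    // Gapped IDs: seed 4, but 5 and 6 are taken, so the loop must skip to 7.
    System.out.println(nextFreeIdAfter(gapped, seedFromSize(gapped))); // prints "7"
  }
}
```

With gapped IDs the size-based seed lands in the middle of the used range, so correctness depends entirely on the allocator checking each candidate against the used set.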

this.spark = spark;
this.table = table;
this.schema = schema;
this.projection = schema;
Member

I like the rename.

*
* @param conflictingIds the set of conflicting field IDs that should be reassigned
* @param usedIds the set of field IDs that are already in use and cannot be reused
* @return a function that maps old IDs to new IDs while resolving conflicts
Member

@szehon-ho Feb 13, 2026

Maybe 'new' and 'old' lack context. How about something like:

a function that returns the original ID unless it is in conflictingIds, in which case it returns the ID it has been reassigned to.

Contributor Author

Agreed, updated.
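The contract proposed in this thread can be sketched roughly as follows. This is an illustration only: it uses the JDK's `IntUnaryOperator` rather than the project's `GetID` type, and the class `ReassignSketch`, the `reassign` helper, and the counting-up allocation strategy are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.IntUnaryOperator;

class ReassignSketch {
  // Returns a function that passes non-conflicting IDs through unchanged
  // and maps each conflicting ID to a stable, previously unused ID.
  static IntUnaryOperator reassign(Set<Integer> conflictingIds, Set<Integer> allUsedIds) {
    Map<Integer, Integer> reassigned = new HashMap<>();
    int[] next = {0}; // monotonically increasing candidate counter

    return id -> {
      if (!conflictingIds.contains(id)) {
        return id; // original ID is kept
      }
      // Allocate once per conflicting ID, skipping every ID already in use.
      return reassigned.computeIfAbsent(id, ignored -> {
        do {
          next[0]++;
        } while (allUsedIds.contains(next[0]));
        return next[0];
      });
    };
  }

  public static void main(String[] args) {
    IntUnaryOperator getId = reassign(Set.of(2), Set.of(1, 2, 3));
    System.out.println(getId.applyAsInt(1)); // prints "1"
    System.out.println(getId.applyAsInt(2)); // prints "4"
  }
}
```

Because the counter only moves forward, two conflicting IDs never receive the same replacement, and asking for the same conflicting ID twice returns the same reassigned value.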

@aokolnychyi merged commit 4c2e60d into apache:main on Feb 17, 2026
33 checks passed
}

// schema of rows that must be returned by readers
protected Schema projectionWithMetadataColumns() {
Contributor

@dramaticlly Feb 17, 2026

Sorry, I was OOTO last week.

Curious: we changed the visibility from private schemaWithMetadataColumns to protected projectionWithMetadataColumns. Is that due to the bigger refactoring in #15240?
