Spark 4.1: Simplify handling of metadata columns#15297
Spark 4.1: Simplify handling of metadata columns#15297aokolnychyi merged 2 commits intoapache:mainfrom
Conversation
| private final InMemoryMetricsReporter metricsReporter; | ||
|
|
||
| private Schema schema; | ||
| private Schema projection; |
There was a problem hiding this comment.
I am later going to keep the entire schema here as well.
This is a projection, defaulted to the schema initially.
| } | ||
|
|
||
| private Schema calculateMetadataSchema(List<Types.NestedField> metaColumnFields) { | ||
| Optional<Types.NestedField> partitionField = |
There was a problem hiding this comment.
A lot of this logic is not specific to Spark and it was a bit harder to navigate.
| } | ||
|
|
||
| // collects used data field IDs across all known table schemas | ||
| private Set<Integer> allUsedFieldIds() { |
There was a problem hiding this comment.
We don't need to track used metadata column IDs here. They start from end of INT range and can't conflict by definition. If they do, something is fundamentally wrong.
f09135a to
a32124f
Compare
|
@szehon-ho @dramaticlly, can you check this one, please? |
| * @param usedIds the set of field IDs that are already in use and cannot be reused | ||
| * @return a function that maps old IDs to new IDs while resolving conflicts | ||
| */ | ||
| public static GetID reassignConflictingIds(Set<Integer> conflictingIds, Set<Integer> usedIds) { |
There was a problem hiding this comment.
i think 'conflictingIds' and 'usedIds' is confusing together.
How about 'conflictingIds' and 'allIds'. Usually conflictingIds is a subset of allIds?
There was a problem hiding this comment.
I went for allUsedIds in this case.
| private ReassignConflictingIds(Set<Integer> conflictingIds, Set<Integer> usedIds) { | ||
| this.conflictingIds = conflictingIds; | ||
| this.usedIds = usedIds; | ||
| this.nextId = new AtomicInteger(usedIds.size()); // assume sequential assignment |
There was a problem hiding this comment.
i see this is different than old code? Maybe it works in most cases, but it is a small behave change. Also makes an assumption we use it in this pattern.
There was a problem hiding this comment.
I think it should be fine but you are right, better to move it into a separate change. Reverted.
| this.spark = spark; | ||
| this.table = table; | ||
| this.schema = schema; | ||
| this.projection = schema; |
| * | ||
| * @param conflictingIds the set of conflicting field IDs that should be reassigned | ||
| * @param usedIds the set of field IDs that are already in use and cannot be reused | ||
| * @return a function that maps old IDs to new IDs while resolving conflicts |
There was a problem hiding this comment.
maybe 'new' and 'old' lack context, how about something like:
a function that returns the original ID unless it is in conflictingIds, in which case returns the ID it has been reassigned to.
There was a problem hiding this comment.
Agreed, updated.
| } | ||
|
|
||
| // schema of rows that must be returned by readers | ||
| protected Schema projectionWithMetadataColumns() { |
There was a problem hiding this comment.
Sorry, I was OOTO last week.
Curious, we changed visibility from private schemaWithMetadataColumns to protected projectionWithMetadataColumns, is it due to the bigger refactoring in #15240?
This PR contains a subset of the changes from #15240 to simplify handling of metadata columns.