@asl3 asl3 commented Oct 20, 2025

What changes were proposed in this pull request?

Add a numSourceRows metric to MergeIntoExec, taken from the source node's numOutputRows.

This assumes that every child node exposes numOutputRows. If no such metric is found, numSourceRows is reported as -1.

Why are the changes needed?

Improve completeness and debuggability of Merge Into metrics.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test cases for the numSourceRows metric.

Was this patch authored or co-authored using generative AI tooling?

No.

@asl3 asl3 changed the title [SPARK-52578][SQL] Add numSourceRows metric for MergeIntoExec [SPARK-52578][SQL] Add numSourceRows metric for MergeIntoExec Oct 20, 2025

@szehon-ho szehon-ho left a comment

Thanks @asl3, I left some initial style comments.

None
}

sourceChild.flatMap { child =>

@szehon-ho szehon-ho Oct 22, 2025

Actually, why do we need to traverse again here? I thought join.left and join.right are already the children, so we can check those nodes directly. We don't want to traverse, because each node without numOutputRows risks giving wrong information (that node may change the row count relative to its child's numOutputRows).

Contributor Author

I renamed it to findSourceSide, as we still need a step to find the source node that has numOutputRows.

For example, with:

+- *(2) BroadcastHashJoin ...
   :- *(2) Project ...
   :  +- BatchScan ...
   +- BroadcastQueryStage ...
      +- BroadcastExchange ...
         +- *(1) Project ...
            +- *(1) LocalTableScan ...

we find that BroadcastQueryStage contains the source table (after checking isTargetTableScan), but we still need a step to traverse down to LocalTableScan. Since it uses collectFirst, I don't think we need to worry about traversing too far.
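
To make the lookup concrete, here is a rough, self-contained sketch using a toy plan tree rather than Spark's SparkPlan. PlanNode, its collectFirst, and the numSourceRows helper are illustrative names for this sketch only, not the code in this PR, and the toy assumes the wrapper nodes report no row count:

```scala
// Toy stand-in for a physical plan node; NOT Spark's SparkPlan.
final case class PlanNode(
    name: String,
    numOutputRows: Option[Long] = None,
    children: Seq[PlanNode] = Nil) {

  // Pre-order search that stops at the first match, in the spirit of
  // TreeNode.collectFirst: check this node, then each child subtree in order.
  def collectFirst[B](pf: PartialFunction[PlanNode, B]): Option[B] =
    pf.lift(this).orElse(
      children.iterator.map(_.collectFirst(pf)).collectFirst { case Some(b) => b })
}

object NumSourceRowsSketch {
  // Hypothetical helper: the first row count found under the source side,
  // or -1 when no node reports one (the fallback from the PR description).
  def numSourceRows(sourceSide: PlanNode): Long =
    sourceSide.collectFirst { case PlanNode(_, Some(rows), _) => rows }.getOrElse(-1L)

  def main(args: Array[String]): Unit = {
    // Mirrors the source side of the plan above; only the scan has a count here.
    val scan = PlanNode("LocalTableScan", numOutputRows = Some(3L))
    val project = PlanNode("Project", children = Seq(scan))
    val exchange = PlanNode("BroadcastExchange", children = Seq(project))
    val sourceSide = PlanNode("BroadcastQueryStage", children = Seq(exchange))

    println(numSourceRows(sourceSide))          // 3
    println(numSourceRows(PlanNode("Unknown"))) // -1
  }
}
```

Because the search returns at the first node that reports a count, the result depends on which node that is. If an intermediate node in the real plan also exposed numOutputRows, it would be picked up before the scan.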


@szehon-ho szehon-ho left a comment

Some more comments on the tests
