[SPARK-52578][SQL] Add numSourceRows metric for MergeIntoExec
#52669
base: master
Thanks @asl3, I left some initial style comments.
...e/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala
    None
  }

  sourceChild.flatMap { child =>
Actually, why do we need to traverse again here? I thought join.left and join.right are already the children, so we can check those nodes directly. We don't want to traverse, because every node without numOutputRows risks reporting wrong information (that node may change the row count relative to its child's numOutputRows).
I renamed it to findSourceSide, as we still need a step to find the source node with numOutputRows.
For example, with:
+- *(2) BroadcastHashJoin ...
   :- *(2) Project ...
   :  +- BatchScan ...
   +- BroadcastQueryStage ...
      +- BroadcastExchange ...
         +- *(1) Project ...
            +- *(1) LocalTableScan ...
we find that BroadcastQueryStage is the side holding the source table (after checking isTargetTableScan), but we still need a step that traverses down to LocalTableScan. Since it is a collectFirst, I don't think we need to worry about traversing too far.
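The two steps discussed in this thread (pick the join side that does not contain the target scan, then take the first node on that side that exposes numOutputRows) can be sketched roughly as follows. This is a simplified, self-contained model and not the code in this PR: PlanNode, its fields, containsTarget, and the row counts are hypothetical stand-ins for SparkPlan, its metrics, and the real isTargetTableScan check. It also assumes that only the scan nodes carry the metric. The -1 fallback follows the PR description.

```scala
// Hypothetical stand-in for a physical plan node; NOT Spark's SparkPlan.
final case class PlanNode(
    name: String,
    children: Seq[PlanNode] = Nil,
    numOutputRows: Option[Long] = None, // Some(n) only if the node exposes the metric
    isTargetTableScan: Boolean = false) {

  // Pre-order search that stops at the first match,
  // similar in spirit to collectFirst on Spark plan trees.
  def collectFirst[A](pf: PartialFunction[PlanNode, A]): Option[A] =
    pf.lift(this).orElse(
      children.iterator.map(_.collectFirst(pf)).collectFirst { case Some(a) => a })
}

object NumSourceRowsSketch {
  // True if the subtree contains the scan of the MERGE target table.
  def containsTarget(node: PlanNode): Boolean =
    node.collectFirst { case n if n.isTargetTableScan => n }.isDefined

  // The source side is the join child that does NOT contain the target scan.
  def findSourceSide(left: PlanNode, right: PlanNode): Option[PlanNode] =
    (containsTarget(left), containsTarget(right)) match {
      case (true, false) => Some(right)
      case (false, true) => Some(left)
      case _             => None // ambiguous or not found
    }

  // First numOutputRows found on the source side, or -1 if none is found.
  def numSourceRows(left: PlanNode, right: PlanNode): Long =
    findSourceSide(left, right)
      .flatMap(_.collectFirst { case PlanNode(_, _, Some(rows), _) => rows })
      .getOrElse(-1L)

  def main(args: Array[String]): Unit = {
    // Mirrors the shape of the plan quoted above; row counts are made up.
    val targetSide = PlanNode("Project", Seq(
      PlanNode("BatchScan", numOutputRows = Some(100L), isTargetTableScan = true)))
    val sourceSide = PlanNode("BroadcastQueryStage", Seq(
      PlanNode("BroadcastExchange", Seq(
        PlanNode("Project", Seq(
          PlanNode("LocalTableScan", numOutputRows = Some(3L))))))))
    println(numSourceRows(targetSide, sourceSide))
  }
}
```

Note that the sketch does not address the concern raised earlier in the thread: if an intermediate node without numOutputRows changes the row count, the value taken from a deeper node may not match what actually reaches the join.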
Some more comments on the tests
What changes were proposed in this pull request?
Add numSourceRows metric for MergeIntoExec, from source node's numOutputRows.
Assumption is that all child nodes have numOutputRows. If not found, numSourceRows would be -1.
Why are the changes needed?
Improve completeness and debuggability of Merge Into metrics.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit test cases for the numSourceRows metric.
Was this patch authored or co-authored using generative AI tooling?
No.