Skip to content

Conversation

wForget
Copy link
Member

@wForget wForget commented Sep 26, 2025

Which issue does this PR close?

Closes #2457.

Rationale for this change

In the failed test case of #2444, I found that the struct type with duplicate keys will be deduplicated when converted to arrowType, which causes RowToColumanr to lose some columns.

What changes are included in this PR?

How are these changes tested?

This PR does not avoid the issue, but makes the error message more obvious. It was difficult to create a proper test case for this change until #2444 was merged.

Before this, the failing test case error in #2444 looked like:

[info] - Struct Star Expansion *** FAILED *** (2 seconds, 840 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1230.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1230.0 (TID 1549) (96a252fb2e73 executor driver): java.lang.IndexOutOfBoundsException: Index 2 out of bounds for length 2
[info] 	at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
[info] 	at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
[info] 	at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:248)
[info] 	at java.base/java.util.Objects.checkIndex(Objects.java:374)
[info] 	at java.base/java.util.ArrayList.get(ArrayList.java:459)
[info] 	at org.apache.comet.vector.CometStructVector.getChild(CometStructVector.java:55)
[info] 	at org.apache.spark.sql.vectorized.ColumnarRow.getInt(ColumnarRow.java:113)
[info] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
[info] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
[info] 	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
[info] 	at scala.collection.Iterator$$anon$11.next(Iterator.scala:496)

after this:

[info] - Struct Star Expansion *** FAILED *** (2 seconds, 882 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1325.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1325.0 (TID 1477) (6a0022271dbb executor driver): org.apache.spark.SparkException: Duplicated field names in Arrow Struct are not allowed, got [a, a, b, b].
[info] 	at org.apache.spark.sql.comet.util.Utils$.toArrowField(Utils.scala:161)
[info] 	at org.apache.spark.sql.comet.util.Utils$.$anonfun$toArrowSchema$1(Utils.scala:199)
[info] 	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
[info] 	at scala.collection.Iterator.foreach(Iterator.scala:943)
[info] 	at scala.collection.Iterator.foreach$(Iterator.scala:943)
[info] 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
[info] 	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
[info] 	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)

@codecov-commenter
Copy link

codecov-commenter commented Sep 26, 2025

Codecov Report

❌ Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.47%. Comparing base (f09f8af) to head (b10d422).
⚠️ Report is 544 commits behind head on main.

Files with missing lines Patch % Lines
.../scala/org/apache/spark/sql/comet/util/Utils.scala 0.00% 5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2459      +/-   ##
============================================
+ Coverage     56.12%   58.47%   +2.34%     
- Complexity      976     1440     +464     
============================================
  Files           119      146      +27     
  Lines         11743    13511    +1768     
  Branches       2251     2350      +99     
============================================
+ Hits           6591     7900    +1309     
- Misses         4012     4379     +367     
- Partials       1140     1232      +92     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

@parthchandra parthchandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@comphead
Copy link
Contributor

comphead commented Oct 1, 2025

@wForget will this PR close also #2256

@wForget
Copy link
Member Author

wForget commented Oct 2, 2025

@wForget will this PR close also #2256

No, this PR only fails prematurely when the schema has duplicate keys during r2c, and does not support struct with duplicate keys for comet.

fieldType,
Seq(toArrowField("element", elementType, containsNull, timeZoneId)).asJava)
case StructType(fields) =>
case st @ StructType(fields) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a test case to avoid regression?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Check if struct type has duplicate keys when converting to ArrowSchema
4 participants