chore: Comet + Iceberg (1.8.1) CI #1715
base: main
Conversation
run: |
  cd apache-iceberg
  rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet cache requires cleanups
  ENABLE_COMET=true ENABLE_COMET_SHUFFLE=true ../../gradlew -DsparkVersions=${{ matrix.spark-version.short }} -DscalaVersion=${{ matrix.scala-version }} -DflinkVersions= -DkafkaVersions= \
Copied from Iceberg Spark's CI: https://github.com/apache/iceberg/blob/apache-iceberg-1.8.1/.github/workflows/spark-ci.yml#L102-L106
with:
  repository: apache/iceberg
  path: apache-iceberg
  ref: apache-iceberg-${{inputs.iceberg-version}}
Based on Iceberg's release tag: https://github.com/apache/iceberg/tags
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

@@ Coverage Diff @@
##               main    #1715      +/-   ##
============================================
+ Coverage     56.12%   58.66%     +2.53%
- Complexity      976     1135       +159
============================================
  Files           119      129        +10
  Lines         11743    12640       +897
  Branches       2251     2363       +112
============================================
+ Hits           6591     7415       +824
- Misses         4012     4049        +37
- Partials       1140     1176        +36

☔ View full report in Codecov by Sentry.
@@ -986,6 +986,7 @@ under the License.
   <exclude>**/build/**</exclude>
   <exclude>**/target/**</exclude>
   <exclude>**/apache-spark/**</exclude>
+  <exclude>**/apache-iceberg/**</exclude>
I forgot to exclude the Iceberg repo.
Thanks @hsiang-c
run: |
  cd apache-iceberg
  rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet cache requires cleanups
  ENABLE_COMET=true ENABLE_COMET_SHUFFLE=true COMET_PARQUET_SCAN_IMPL=native_datafusion ./gradlew -DsparkVersions=${{ matrix.spark-version.short }} -DscalaVersion=${{ matrix.scala-version }} -DflinkVersions= -DkafkaVersions= \
Hmmm, do we expect native_datafusion to work for Iceberg?
Native execution doesn't work for Iceberg yet. Iceberg uses ParquetReaderType to control whether to use Comet or not. The default is ParquetReaderType.ICEBERG; we need to set it to ParquetReaderType.COMET to turn on Comet, but currently that covers the Comet reader only, not native execution.
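For illustration, a minimal sketch of that switch, assuming a running SparkSession: the property key and the COMET value come from this PR's diff; everything else (class name, app name, local master) is illustrative.

import org.apache.spark.sql.SparkSession;

public class ReaderTypeSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("reader-type-sketch")
        .master("local[*]") // local run for illustration only
        .getOrCreate();
    // Selects Iceberg's Parquet reader implementation for this session.
    // "COMET" swaps in the Comet reader only; execution still goes through Spark.
    spark.conf().set("spark.sql.iceberg.parquet.reader-type", "COMET");
    spark.stop();
  }
}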
Thanks @kazuyukitanimura @huaxingao, I will remove the native_datafusion and native_iceberg_compact builds for now.
but currently it's for Comet reader only, not for native execution yet.
Correct. I plan to work with @huaxingao once the Spark SQL tests pass for native_iceberg_compat.
@parthchandra Please feel free to involve me, happy to help here.
That's great @hsiang-c ! We can discuss this offline.
  // Controls which Parquet reader implementation to use
  public static final String PARQUET_READER_TYPE = "spark.sql.iceberg.parquet.reader-type";
- public static final ParquetReaderType PARQUET_READER_TYPE_DEFAULT = ParquetReaderType.ICEBERG;
+ public static final ParquetReaderType PARQUET_READER_TYPE_DEFAULT = ParquetReaderType.COMET;
@huaxingao I changed the default to COMET.
  }

- @TestTemplate
+ @Disabled
@huaxingao Disabled 2 unit tests because they fail with the Comet reader.
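As a side note, JUnit 5's @Disabled accepts an optional reason string that shows up in test reports; a small self-contained sketch (class, test name, and reason text are hypothetical, not Iceberg's code):

import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;

class DisabledReasonExampleTest {
  // The reason string is surfaced by the test runner when the test is skipped.
  @Disabled("Fails with the Comet Parquet reader; re-enable once supported")
  @Test
  void mergeSchemaCastingExample() {
    // body never runs while the test is disabled
  }
}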
  integrationImplementation project(path: ':iceberg-hive-metastore', configuration: 'testArtifacts')
  integrationImplementation project(path: ":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}", configuration: 'testArtifacts')
  integrationImplementation project(path: ":iceberg-spark:iceberg-spark-extensions-${sparkMajorVersion}_${scalaVersion}", configuration: 'testArtifacts')
+ integrationImplementation project(path: ':iceberg-parquet')
Only in Spark 3.4, I need to include iceberg-parquet; otherwise the iceberg-spark-runtime-3.4 tests throw the following errors:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 4) (17.115.161.202 executor driver): java.lang.NoSuchMethodError: 'org.apache.parquet.column.ParquetProperties$Builder org.apache.parquet.column.ParquetProperties$Builder.withBloomFilterFPP(java.lang.String, double)'
at org.apache.iceberg.parquet.Parquet$WriteBuilder.build(Parquet.java:389)
at org.apache.iceberg.parquet.Parquet$DataWriteBuilder.build(Parquet.java:787)
at org.apache.iceberg.data.BaseFileWriterFactory.newDataWriter(BaseFileWriterFactory.java:131)
at org.apache.iceberg.io.RollingDataWriter.newWriter(RollingDataWriter.java:52)
at org.apache.iceberg.io.RollingDataWriter.newWriter(RollingDataWriter.java:32)
at org.apache.iceberg.io.RollingFileWriter.openCurrentWriter(RollingFileWriter.java:108)
at org.apache.iceberg.io.RollingDataWriter.<init>(RollingDataWriter.java:47)
at org.apache.iceberg.spark.source.SparkWrite$UnpartitionedDataWriter.<init>(SparkWrite.java:701)
at org.apache.iceberg.spark.source.SparkWrite$WriterFactory.createWriter(SparkWrite.java:675)
at org.apache.iceberg.spark.source.SparkWrite$WriterFactory.createWriter(SparkWrite.java:652)
at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run(WriteToDataSourceV2Exec.scala:459)
at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run$(WriteToDataSourceV2Exec.scala:448)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:514)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:411)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
at org.apache.spark.scheduler.Task.run(Task.scala:139)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
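For context, the failing call from the stack trace boils down to the sketch below (column name and FPP value are made up). It compiles and runs only against a parquet-column version that ships Builder#withBloomFilterFPP(String, double); with a stale jar in the local Maven cache, the JVM throws NoSuchMethodError at the call site, as in the trace above.

import org.apache.parquet.column.ParquetProperties;

public class BloomFilterFppProbe {
  public static void main(String[] args) {
    // This builder call is exactly what Parquet$WriteBuilder.build invokes;
    // it resolves only against a new-enough parquet-column on the classpath.
    ParquetProperties props = ParquetProperties.builder()
        .withBloomFilterFPP("id", 0.01)
        .build();
    System.out.println("withBloomFilterFPP resolved: " + props);
  }
}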
run: |
  cd apache-iceberg
  rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet cache requires cleanups
  ENABLE_COMET=true ENABLE_COMET_SHUFFLE=true ./gradlew -DsparkVersions=${{ matrix.spark-version.short }} -DscalaVersion=${{ matrix.scala-version }} -DflinkVersions= -DkafkaVersions= \
ENABLE_COMET is available only for the patched Spark (with the diff) through setup-spark-builder.
Which Spark is combined with this Iceberg test?
Also, I just realized that ENABLE_COMET_SHUFFLE doesn't seem to be used at all.
@kazuyukitanimura You're right, the diff I made doesn't read either env var.
Do we still need to update the Spark version referenced in Iceberg?
Sorry, I don't get it. Do you mean the build command -DsparkVersions=${{ matrix.spark-version.short }} or something in the diff? In the diff, I modified the Iceberg Spark Gradle module according to the doc.
Hmm, the Spark version provided by -DsparkVersions= is OSS Spark. Do we need to let it load the Comet library? I'm not sure this test automatically loads the Comet library in the Spark referenced by Iceberg... @huaxingao ?
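For reference, the usual way OSS Spark picks up Comet is via the plugin conf. A minimal sketch, assuming the Comet jar is already on the classpath; the conf keys and class names follow the Comet docs, and whether this CI job actually needs them is exactly the open question above.

import org.apache.spark.sql.SparkSession;

public class CometOnOssSparkSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("comet-on-oss-spark")
        .master("local[*]") // local run for illustration only
        // Loads Comet as a Spark plugin; requires the Comet jar on the classpath.
        .config("spark.plugins", "org.apache.spark.CometPlugin")
        // Comet shuffle support, per the Comet docs; optional for plain scans.
        .config("spark.shuffle.manager",
            "org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager")
        .config("spark.comet.exec.shuffle.enabled", "true")
        .getOrCreate();
    spark.sql("SELECT 1").show();
    spark.stop();
  }
}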
Which issue does this PR close?
Closes #1685.
Rationale for this change
Run Iceberg Spark's tests as part of Comet CI.
What changes are included in this PR?
- Changed the default ParquetReaderType from ICEBERG to COMET
- Disabled testMergeSchemaIgnoreCastingLongToInt and testMergeSchemaIgnoreCastingDoubleToFloat in TestDataFrameWriterV2 for both Iceberg Spark 3.4 and Iceberg Spark 3.5
How are these changes tested?
At the moment, locally: