chore: Comet + Iceberg (1.8.1) CI #1715

hsiang-c · 2025-05-05T00:33:41Z

Which issue does this PR close?

Closes #1685.

Rationale for this change

Run Iceberg Spark's tests as part of Comet CI.

What changes are included in this PR?

Produce a git diff based on Iceberg version 1.8.1 (other Iceberg versions, e.g. 1.9.x, will follow later)
Change the default value of Parquet Reader Type from ICEBERG to COMET
Disable testMergeSchemaIgnoreCastingLongToInt and testMergeSchemaIgnoreCastingDoubleToFloat in TestDataFrameWriterV2 for both Iceberg Spark 3.4 and Iceberg Spark 3.5
Run Iceberg Spark's tests, based on Iceberg's GitHub workflow: https://github.com/apache/iceberg/blob/main/.github/workflows/spark-ci.yml

How are these changes tested?

At the moment, locally:

# Spark 3.5
./gradlew -DsparkVersions=3.5 -DscalaVersion=2.12 -DflinkVersions= -DkafkaVersions= :iceberg-spark:iceberg-spark-3.5_2.12:check -Pquick=true -x javadoc

BUILD SUCCESSFUL in 26m 10s
46 actionable tasks: 7 executed, 39 up-to-date

./gradlew -DsparkVersions=3.5 -DscalaVersion=2.12 -DflinkVersions= -DkafkaVersions= :iceberg-spark:iceberg-spark-extensions-3.5_2.12:check -Pquick=true -x javadoc

BUILD SUCCESSFUL in 23m 44s
52 actionable tasks: 9 executed, 4 from cache, 39 up-to-date

./gradlew -DsparkVersions=3.5 -DscalaVersion=2.12 -DflinkVersions= -DkafkaVersions= :iceberg-spark:iceberg-spark-runtime-3.5_2.12:check -Pquick=true -x javadoc

BUILD SUCCESSFUL in 15s
65 actionable tasks: 4 executed, 61 up-to-date

# Spark 3.4
./gradlew -DsparkVersions=3.4 -DscalaVersion=2.12 -DflinkVersions= -DkafkaVersions= :iceberg-spark:iceberg-spark-3.4_2.12:check -Pquick=true -x javadoc

BUILD SUCCESSFUL in 21m 32s
45 actionable tasks: 7 executed, 1 from cache, 37 up-to-date

./gradlew -DsparkVersions=3.4 -DscalaVersion=2.12 -DflinkVersions= -DkafkaVersions= :iceberg-spark:iceberg-spark-extensions-3.4_2.12:check -Pquick=true -x javadoc

BUILD SUCCESSFUL in 22m 6s
51 actionable tasks: 5 executed, 2 from cache, 44 up-to-date

./gradlew -DsparkVersions=3.4 -DscalaVersion=2.12 -DflinkVersions= -DkafkaVersions= :iceberg-spark:iceberg-spark-runtime-3.4_2.12:check -Pquick=true -x javadoc

BUILD SUCCESSFUL in 18s
64 actionable tasks: 4 executed, 1 from cache, 59 up-to-date
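
The six local runs above differ only in the Spark version and the module name. A minimal sketch that generates them, assuming Scala 2.12 throughout as in the commands above; `gradle_check_cmd` and `print_all_checks` are hypothetical helper names, not part of this PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the Gradle invocation for one Spark version and one module suffix.
# The suffix is "", "-extensions" or "-runtime", matching the modules above.
gradle_check_cmd() {
  local spark="$1" suffix="$2" scala="${3:-2.12}"
  printf './gradlew -DsparkVersions=%s -DscalaVersion=%s -DflinkVersions= -DkafkaVersions= :iceberg-spark:iceberg-spark%s-%s_%s:check -Pquick=true -x javadoc\n' \
    "$spark" "$scala" "$suffix" "$spark" "$scala"
}

# Print all six commands in the order they were run locally.
print_all_checks() {
  local spark suffix
  for spark in 3.5 3.4; do
    for suffix in "" "-extensions" "-runtime"; do
      gradle_check_cmd "$spark" "$suffix"
    done
  done
}
```

Running `print_all_checks` only prints the commands; piping the output to `bash -e` from the patched Iceberg checkout would execute them in sequence and stop at the first failure.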

hsiang-c · 2025-05-05T00:34:54Z

.github/workflows/iceberg_spark_test.yml

+        run: |
+          cd apache-iceberg
+          rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet cache requires cleanups
+          ENABLE_COMET=true ENABLE_COMET_SHUFFLE=true ../../gradlew -DsparkVersions=${{ matrix.spark-version.short }} -DscalaVersion=${{ matrix.scala-version }} -DflinkVersions= -DkafkaVersions= \


Copied from Iceberg Spark's CI: https://github.com/apache/iceberg/blob/apache-iceberg-1.8.1/.github/workflows/spark-ci.yml#L102-L106

hsiang-c · 2025-05-05T00:35:30Z

.github/actions/setup-iceberg-builder/action.yaml

+      with:
+        repository: apache/iceberg
+        path: apache-iceberg
+        ref: apache-iceberg-${{inputs.iceberg-version}}


Based on Iceberg's release tag: https://github.com/apache/iceberg/tags
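
For local reproduction, the checkout step can be approximated by hand. A minimal sketch, assuming the Iceberg repository is cloned into `apache-iceberg` at the root of the Comet repository and that the diff at `dev/diffs/iceberg/<version>.diff` is applied with `git apply`; the helper names and the `DRY_RUN` switch are hypothetical and not part of the action.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Map an Iceberg version to its release tag, mirroring
# `ref: apache-iceberg-${{inputs.iceberg-version}}` in the action.
iceberg_tag() {
  printf 'apache-iceberg-%s\n' "$1"
}

# Run a command, or only print it when DRY_RUN=1 (illustrative switch).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Clone Iceberg at the release tag and apply the Comet diff for that version.
# Assumes the current directory is the root of the Comet repository.
checkout_iceberg() {
  local version="$1" tag
  tag="$(iceberg_tag "$version")"
  run git clone --depth 1 --branch "$tag" https://github.com/apache/iceberg.git apache-iceberg
  # With -C, the diff path is resolved relative to apache-iceberg.
  run git -C apache-iceberg apply "../dev/diffs/iceberg/${version}.diff"
}
```

`DRY_RUN=1 checkout_iceberg 1.8.1` prints the two git commands without touching the network, which is a convenient way to check the tag and diff path first.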

codecov-commenter · 2025-05-05T18:01:28Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.66%. Comparing base (f09f8af) to head (aaf62c5).
Report is 183 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #1715      +/-   ##
============================================
+ Coverage     56.12%   58.66%   +2.53%     
- Complexity      976     1135     +159     
============================================
  Files           119      129      +10     
  Lines         11743    12640     +897     
  Branches       2251     2363     +112     
============================================
+ Hits           6591     7415     +824     
- Misses         4012     4049      +37     
- Partials       1140     1176      +36

hsiang-c · 2025-05-05T21:23:48Z

pom.xml

@@ -986,6 +986,7 @@ under the License.
            <exclude>**/build/**</exclude>
            <exclude>**/target/**</exclude>
            <exclude>**/apache-spark/**</exclude>
+            <exclude>**/apache-iceberg/**</exclude>


I forgot to exclude the iceberg repo

kazuyukitanimura

Thanks @hsiang-c

kazuyukitanimura · 2025-05-07T00:36:34Z

.github/workflows/iceberg_spark_test_native_datafusion.yml

+        run: |
+          cd apache-iceberg
+          rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet cache requires cleanups
+          ENABLE_COMET=true ENABLE_COMET_SHUFFLE=true COMET_PARQUET_SCAN_IMPL=native_datafusion ./gradlew -DsparkVersions=${{ matrix.spark-version.short }} -DscalaVersion=${{ matrix.scala-version }} -DflinkVersions= -DkafkaVersions= \


Hmm, do we expect native_datafusion to work for Iceberg?

Native execution doesn't work for Iceberg yet. Iceberg uses ParquetReaderType to control whether Comet is used. The default is ParquetReaderType.ICEBERG. We need to set it to ParquetReaderType.COMET to turn Comet on, but currently that only enables the Comet reader, not native execution.

Thanks @kazuyukitanimura @huaxingao, I will remove the native_datafusion and native_iceberg_compat builds for now.

"but currently it's for Comet reader only, not for native execution yet."

Correct. I plan to work with @huaxingao once the Spark SQL tests pass for native_iceberg_compat.

@parthchandra Please feel free to involve me, happy to help here.

That's great @hsiang-c ! We can discuss this offline.

hsiang-c · 2025-05-07T23:45:09Z

dev/diffs/iceberg/1.8.1.diff

+   // Controls which Parquet reader implementation to use
+   public static final String PARQUET_READER_TYPE = "spark.sql.iceberg.parquet.reader-type";
+-  public static final ParquetReaderType PARQUET_READER_TYPE_DEFAULT = ParquetReaderType.ICEBERG;
+  public static final ParquetReaderType PARQUET_READER_TYPE_DEFAULT = ParquetReaderType.COMET;


@huaxingao I changed the default to COMET.
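
For experiments outside CI, the same reader could in principle be selected per session instead of by patching the default. A minimal sketch, assuming the property in the diff above is honored as a Spark session configuration and that the Comet jars are already on the classpath; `reader_type_conf` is a hypothetical helper, not an Iceberg or Comet command.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Property name taken from the diff above.
READER_TYPE_KEY="spark.sql.iceberg.parquet.reader-type"

# Emit a --conf flag for one of the two reader types mentioned in this thread.
# Any other value is rejected with a non-zero status.
reader_type_conf() {
  case "$1" in
    ICEBERG|COMET)
      printf -- '--conf %s=%s\n' "$READER_TYPE_KEY" "$1"
      ;;
    *)
      echo "unknown reader type: $1" >&2
      return 1
      ;;
  esac
}
```

The output is meant to be spliced unquoted into a Spark command line, for example `spark-sql $(reader_type_conf COMET)`, so that it expands to two arguments.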

hsiang-c · 2025-05-07T23:45:50Z

dev/diffs/iceberg/1.8.1.diff

+   }
+
+-  @TestTemplate
+  @Disabled


@huaxingao I disabled 2 unit tests because they fail with the Comet reader.

hsiang-c · 2025-05-07T23:51:58Z

dev/diffs/iceberg/1.8.1.diff

+     integrationImplementation project(path: ':iceberg-hive-metastore', configuration: 'testArtifacts')
+     integrationImplementation project(path: ":iceberg-spark:iceberg-spark-${sparkMajorVersion}_${scalaVersion}", configuration: 'testArtifacts')
+     integrationImplementation project(path: ":iceberg-spark:iceberg-spark-extensions-${sparkMajorVersion}_${scalaVersion}", configuration: 'testArtifacts')
+    integrationImplementation project(path: ':iceberg-parquet')


Only for Spark 3.4, I need to include iceberg-parquet; otherwise the iceberg-spark-runtime-3.4 tests throw the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 4) (17.115.161.202 executor driver): java.lang.NoSuchMethodError: 'org.apache.parquet.column.ParquetProperties$Builder org.apache.parquet.column.ParquetProperties$Builder.withBloomFilterFPP(java.lang.String, double)'
    at org.apache.iceberg.parquet.Parquet$WriteBuilder.build(Parquet.java:389)
    at org.apache.iceberg.parquet.Parquet$DataWriteBuilder.build(Parquet.java:787)
    at org.apache.iceberg.data.BaseFileWriterFactory.newDataWriter(BaseFileWriterFactory.java:131)
    at org.apache.iceberg.io.RollingDataWriter.newWriter(RollingDataWriter.java:52)
    at org.apache.iceberg.io.RollingDataWriter.newWriter(RollingDataWriter.java:32)
    at org.apache.iceberg.io.RollingFileWriter.openCurrentWriter(RollingFileWriter.java:108)
    at org.apache.iceberg.io.RollingDataWriter.<init>(RollingDataWriter.java:47)
    at org.apache.iceberg.spark.source.SparkWrite$UnpartitionedDataWriter.<init>(SparkWrite.java:701)
    at org.apache.iceberg.spark.source.SparkWrite$WriterFactory.createWriter(SparkWrite.java:675)
    at org.apache.iceberg.spark.source.SparkWrite$WriterFactory.createWriter(SparkWrite.java:652)
    at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run(WriteToDataSourceV2Exec.scala:459)
    at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run$(WriteToDataSourceV2Exec.scala:448)
    at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:514)
    at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:411)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:139)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)

kazuyukitanimura · 2025-05-08T06:17:17Z

.github/workflows/iceberg_spark_test.yml

+        run: |
+          cd apache-iceberg
+          rm -rf /root/.m2/repository/org/apache/parquet # somehow parquet cache requires cleanups
+          ENABLE_COMET=true ENABLE_COMET_SHUFFLE=true ./gradlew -DsparkVersions=${{ matrix.spark-version.short }} -DscalaVersion=${{ matrix.scala-version }} -DflinkVersions= -DkafkaVersions= \


ENABLE_COMET is only available in the patched Spark (the one with the diff applied), which is set up through setup-spark-builder.
Which Spark is combined with this Iceberg test?
Also, I just realized that ENABLE_COMET_SHUFFLE does not appear to be used at all.

@kazuyukitanimura You're right, the diff I made doesn't read either env var.

Do we still need to update the Spark that Iceberg refers to?

@kazuyukitanimura

Sorry, I don't get it.

Do you mean the build command -DsparkVersions=${{ matrix.spark-version.short }} or the diff?

In the diff, I modified the Iceberg Spark Gradle module according to the doc.

Hmm, the Spark version provided by -DsparkVersions= is OSS Spark. Do we need to make it load the Comet library?
I'm not sure whether this test automatically loads the Comet library in the Spark that Iceberg refers to...
@huaxingao ?

chore: Comet + Iceberg (1.8.1) CI

e255a07

fix: exclude iceberg repo

d8ac05a

hsiang-c added 2 commits May 6, 2025 09:30

Don't modify /etc/hosts

8e9cb8b

Fix Gradle wrapper path

227e413

hsiang-c added 2 commits May 7, 2025 16:35

Remove tests on native scans

28fe45a

Default ParquetReader type to Comet; disable a few tests

435f34f

huaxingao mentioned this pull request May 8, 2025

fix: Support Schema Evolution in iceberg #1723

Closed

Remove unused env vars

aaf62c5
