Spark 4.1: Refactor SparkMicroBatchStream to SyncPlanner #15298
bryanck merged 9 commits into apache:main
Conversation
cc @bryanck

LGTM!

I know the original didn't have unit tests, but it might be nice to add a few, for some of the utility methods at least.
Fix Spotless
fce3955 to 4c0c940
@bryanck Added a short unit test in the new test file TestMicroBatchPlanningUtils. Most functionality is already tested in TestStructuredStreaming3; these are additional sanity checks, plus tests for UnpackedLimits, which is somewhat new.
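For illustration, a sanity check along these lines is plausible (a sketch only: the UnpackedLimits constructor and accessor names are assumptions, not the PR's actual test code):

// Sketch of a sanity check for UnpackedLimits; accessor names are assumed.
import static org.assertj.core.api.Assertions.assertThat;

import org.apache.spark.sql.connector.read.streaming.ReadLimit;
import org.junit.jupiter.api.Test;

public class TestMicroBatchPlanningUtils {

  @Test
  public void testUnpackedLimitsFromCompositeLimit() {
    // a composite read limit of at most 10 files and 100 rows
    ReadLimit composite =
        ReadLimit.compositeLimit(new ReadLimit[] {ReadLimit.maxFiles(10), ReadLimit.maxRows(100)});

    // assumed: UnpackedLimits flattens the composite into per-dimension bounds
    UnpackedLimits limits = new UnpackedLimits(composite);
    assertThat(limits.getMaxFiles()).isEqualTo(10);
    assertThat(limits.getMaxRows()).isEqualTo(100);
  }
}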
* Get the next snapshot, skipping over rewrite and delete snapshots. For async this handles
* nulls; sync will never have nulls.
There is no async planning yet; let's include this when we add support for async?

To clarify the ask: I should revert nextValidSnapshot to its moved state without the async changes and the mention of async in comments, essentially removing this check here:
if (curSnapshot == null) {
  StreamingOffset startingOffset =
      MicroBatchUtils.determineStartingOffset(table, readConf.streamFromTimestamp());
  LOG.debug("determineStartingOffset picked startingOffset: {}", startingOffset);
  if (StreamingOffset.START_OFFSET.equals(startingOffset)) {
    return null;
  }
  nextSnapshot = table.snapshot(startingOffset.snapshotId());
} else {
  if (curSnapshot.snapshotId() == table.currentSnapshot().snapshotId()) {
    return null;
  }
}

// skip over rewrite and delete snapshots
while (!shouldProcess(nextSnapshot)) {
  LOG.debug("Skipping snapshot: {}", nextSnapshot);
minor: it would be nice to log the snapshot ops type

Actually, I don't think there's a need to log the ops type since it should be logged already: we are logging the entire snapshot, and so far BaseSnapshot is the one that implements Snapshot with a toString method that includes the operation type:
public String toString() {
  return MoreObjects.toStringHelper(this)
      .add("id", snapshotId)
      .add("timestamp_ms", timestampMillis)
      .add("operation", operation)
      .add("summary", summary)
      .add("manifest-list", manifestListLocation)
      .add("schema-id", schemaId)
      .toString();
}
In that case it's nice :)
// skip over rewrite and delete snapshots
while (!shouldProcess(nextSnapshot)) {
  LOG.debug("Skipping snapshot: {}", nextSnapshot);
  // if the currentSnapShot was also the mostRecentSnapshot then break
It would be nice to add a comment explaining why.

This is old code, but I will enhance the comment: we break to avoid snapshotAfter throwing an exception, since there are no more snapshots to process.
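For concreteness, the loop with the enhanced comment could read roughly like this (a sketch; the loop body is the existing code, only the comment wording is new):

// skip over rewrite and delete snapshots
while (!shouldProcess(nextSnapshot)) {
  LOG.debug("Skipping snapshot: {}", nextSnapshot);
  // if nextSnapshot is already the most recent snapshot, stop here: there are
  // no more snapshots to process, and calling snapshotAfter on the latest
  // snapshot would throw an exception
  if (nextSnapshot.snapshotId() == table.currentSnapshot().snapshotId()) {
    return null;
  }
  nextSnapshot = SnapshotUtil.snapshotAfter(table, nextSnapshot.snapshotId());
}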
}
if (fromTimestamp == Long.MIN_VALUE) {
  // match existing behavior and start from the oldest snapshot
Explain why?

Suggested change:
- // match existing behavior and start from the oldest snapshot
+ // start from the oldest snapshot
@singhpk234 to clarify, are you asking about the change from a null check to Long.MIN_VALUE and why determineStartingOffset now takes a primitive long?
I saw that readConf().streamFromTimestamp() returns a primitive long, so it didn't make sense to me why we needed to autobox it. Or was your comment about something else (elaborating on the old comment more)?

Added a comment explaining why we return the oldest snapshot.
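To make the sentinel concrete, here is a minimal sketch of how determineStartingOffset might treat Long.MIN_VALUE (assumed structure, not the PR's exact code; the helpers are from org.apache.iceberg.util.SnapshotUtil):

// Sketch only; assumes Long.MIN_VALUE is the "no timestamp configured" sentinel
// that replaces the previous null check on a boxed Long.
static StreamingOffset determineStartingOffset(Table table, long fromTimestamp) {
  if (table.currentSnapshot() == null) {
    return StreamingOffset.START_OFFSET;
  }

  if (fromTimestamp == Long.MIN_VALUE) {
    // start from the oldest snapshot: without a configured timestamp there is
    // no lower bound, so the stream replays the table's full history
    return new StreamingOffset(SnapshotUtil.oldestAncestor(table).snapshotId(), 0, false);
  }

  // otherwise start from the first snapshot at or after fromTimestamp
  Snapshot snapshot = SnapshotUtil.oldestAncestorAfter(table, fromTimestamp);
  return snapshot != null
      ? new StreamingOffset(snapshot.snapshotId(), 0, false)
      : StreamingOffset.START_OFFSET;
}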
// If snapshotSummary doesn't have SnapshotSummary.ADDED_FILES_PROP,
// iterate through addedFiles iterator to find addedFilesCount.

Let's remove this comment, it's obvious; I understand it's from old code :)
List<FileScanTask> planFiles(StreamingOffset startOffset, StreamingOffset endOffset);

StreamingOffset latestOffset(StreamingOffset startOffset, ReadLimit limit);

void stop();

Can you please add Javadocs for these?
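For example, Javadocs along these lines would address it (a sketch; the interface name is inferred from the PR description and the wording is illustrative):

interface SparkMicroBatchPlanner {
  /** Plans the file scan tasks for the micro-batch between startOffset and endOffset. */
  List<FileScanTask> planFiles(StreamingOffset startOffset, StreamingOffset endOffset);

  /** Computes the furthest offset reachable from startOffset without exceeding the read limit. */
  StreamingOffset latestOffset(StreamingOffset startOffset, ReadLimit limit);

  /** Releases any resources held by the planner. */
  void stop();
}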
@singhpk234 addressed comments, ptal, ty!

Thanks for the contribution @RjLi13 and for the review @singhpk234!
This is to prepare for the changes made to introduce the async planner: #15059. The full context of the feature is in there.

This first phase focuses on just moving SparkMicroBatchStream logic to SyncSparkMicroBatchPlanner and having SparkMicroBatchStream rely on SyncSparkMicroBatchPlanner. Besides the sync planner and its interface, I also introduce two new classes: MicroBatchUtils, which shares static methods between the planners and SparkMicroBatchStream, and BaseSparkMicroBatchPlanner, which holds duplicated code that will be reused by the future async planner.

The Phase 2 PR is here: #15299. For reference on the changes in phase 2, this is what that diff looks like: https://github.com/RjLi13/iceberg/pull/6/changes

No regressions should be expected. Unfortunately, git diff can't show the moves, but this PR should be mostly moving code around.
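As a rough sketch of the resulting shape (class names from the description above; the constructor arguments and the toInputPartitions helper are assumptions, not the PR's actual code):

// Illustrative only: SparkMicroBatchStream delegating planning to the sync planner.
public class SparkMicroBatchStream implements MicroBatchStream {
  private final SparkMicroBatchPlanner planner;

  SparkMicroBatchStream(Table table, SparkReadConf readConf) {
    // phase 1 always uses the sync planner; phase 2 (#15299) can swap in an async one
    this.planner = new SyncSparkMicroBatchPlanner(table, readConf);
  }

  @Override
  public InputPartition[] planInputPartitions(Offset start, Offset end) {
    List<FileScanTask> tasks =
        planner.planFiles((StreamingOffset) start, (StreamingOffset) end);
    return toInputPartitions(tasks); // existing conversion logic, unchanged (assumed helper)
  }

  @Override
  public void stop() {
    planner.stop();
  }

  // remaining MicroBatchStream methods omitted for brevity
}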