Spark 4.1: Refactor SparkMicroBatchStream to SyncPlanner#15298

Merged
bryanck merged 9 commits into apache:main from RjLi13:refactor-sync-planner
Feb 15, 2026

Conversation

@RjLi13
Contributor

@RjLi13 RjLi13 commented Feb 11, 2026

This prepares for the changes that introduce the async planner: #15059. The full context of the feature is there.

This first phase focuses on moving SparkMicroBatchStream logic into SyncSparkMicroBatchPlanner and having SparkMicroBatchStream rely on SyncSparkMicroBatchPlanner. Besides the Sync planner and its interface, I also introduce two new classes:

  • MicroBatchUtils, which holds static methods shared between the planners and SparkMicroBatchStream
  • BaseSparkMicroBatchPlanner, which holds previously duplicated code that will be reused by the future async planner
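As a rough, self-contained sketch of the class layout described above (all names besides the class names mentioned in the PR, and all method signatures, are illustrative stand-ins, not the actual Iceberg code):

```java
// Illustrative stand-in for the planner interface; signatures are assumptions.
interface SparkMicroBatchPlanner {
  String planFiles(String startOffset, String endOffset);
}

// Shared code intended for reuse by the future async planner.
abstract class BaseSparkMicroBatchPlanner implements SparkMicroBatchPlanner {
  protected String describe(String startOffset, String endOffset) {
    return startOffset + ".." + endOffset;
  }
}

// Static helpers shared between the planners and SparkMicroBatchStream.
final class MicroBatchUtils {
  private MicroBatchUtils() {}

  // Skip rewrite ("replace") and delete snapshots (simplified stand-in).
  static boolean shouldProcess(String operation) {
    return !operation.equals("replace") && !operation.equals("delete");
  }
}

// The sync planner that SparkMicroBatchStream now delegates to.
class SyncSparkMicroBatchPlanner extends BaseSparkMicroBatchPlanner {
  @Override
  public String planFiles(String startOffset, String endOffset) {
    return "plan:" + describe(startOffset, endOffset);
  }
}

public class PlannerLayoutSketch {
  public static void main(String[] args) {
    SparkMicroBatchPlanner planner = new SyncSparkMicroBatchPlanner();
    System.out.println(planner.planFiles("s1", "s2")); // prints plan:s1..s2
    System.out.println(MicroBatchUtils.shouldProcess("append")); // prints true
  }
}
```

The point of the split is that SparkMicroBatchStream only sees the interface, so the async planner can later be swapped in behind it.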

Phase 2 PR is here: #15299. For reference, this is what the phase 2 diff looks like: https://github.com/RjLi13/iceberg/pull/6/changes

No regressions are expected. Unfortunately git diff can't show the moves, but this PR is mostly moving code around.

@RjLi13
Contributor Author

RjLi13 commented Feb 11, 2026

cc @bryanck

@RjLi13 RjLi13 changed the title Spark: Refactor SparkMicroBatchStream to SyncSparkMicroBatchPlanner Spark 4.1: Refactor SparkMicroBatchStream to SyncPlanner Feb 11, 2026
@bryanck
Contributor

bryanck commented Feb 11, 2026

LGTM!

@bryanck bryanck self-requested a review February 11, 2026 19:58
@bryanck
Contributor

bryanck commented Feb 12, 2026

I know the original didn't have unit tests, but it might be nice to add a few, at least for some of the utility methods.

@RjLi13 RjLi13 force-pushed the refactor-sync-planner branch from fce3955 to 4c0c940 Compare February 12, 2026 21:14
@RjLi13
Contributor Author

RjLi13 commented Feb 12, 2026

@bryanck Added short unit tests in the new test file TestMicroBatchPlanningUtils. Most functionality is already covered in TestStructuredStreaming3; these are additional sanity checks, plus tests for UnpackedLimits, which is somewhat new.

Comment on lines 84 to 85
* Get the next snapshot, skipping over rewrite and delete snapshots. Async handles nulls; sync
* will never have nulls.
Contributor
There is no async planning yet; let's include this when we add async support?

Contributor Author
To clarify the ask: I should revert nextValidSnapshot to its moved state, without the async changes and mentions of async in comments, essentially removing this check here:

if (curSnapshot == null) {
  StreamingOffset startingOffset =
      MicroBatchUtils.determineStartingOffset(table, readConf.streamFromTimestamp());
  LOG.debug("determineStartingOffset picked startingOffset: {}", startingOffset);
  if (StreamingOffset.START_OFFSET.equals(startingOffset)) {
    return null;
  }
  nextSnapshot = table.snapshot(startingOffset.snapshotId());
} else {
  if (curSnapshot.snapshotId() == table.currentSnapshot().snapshotId()) {
    return null;
  }

}
// skip over rewrite and delete snapshots
while (!shouldProcess(nextSnapshot)) {
LOG.debug("Skipping snapshot: {}", nextSnapshot);
Contributor
minor: would be nice to log the snapshot's operation type

Contributor Author
Actually, I don't think there's a need to log the operation type, since it should be logged already: we log the entire snapshot, and so far BaseSnapshot is the Snapshot implementation whose toString includes the operation type:

  public String toString() {
    return MoreObjects.toStringHelper(this)
        .add("id", snapshotId)
        .add("timestamp_ms", timestampMillis)
        .add("operation", operation)
        .add("summary", summary)
        .add("manifest-list", manifestListLocation)
        .add("schema-id", schemaId)
        .toString();
  }

Contributor
In that case it's nice :)

// skip over rewrite and delete snapshots
while (!shouldProcess(nextSnapshot)) {
LOG.debug("Skipping snapshot: {}", nextSnapshot);
// if the currentSnapShot was also the mostRecentSnapshot then break
Contributor
It would be nice to add a comment explaining why.

Contributor Author
This is old code, but I will enhance the comment to note that we break to avoid snapshotAfter throwing an exception, since there are no more snapshots to process.
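The break-to-avoid-exception behavior can be sketched with a self-contained stand-in; Snapshot here is a minimal record rather than Iceberg's org.apache.iceberg.Snapshot, and shouldProcess is simplified:

```java
import java.util.List;

public class SkipLoopSketch {
  // Minimal stand-in for an Iceberg snapshot (id + operation type).
  record Snapshot(long id, String operation) {}

  // Skip rewrite ("replace") and delete snapshots (simplified).
  static boolean shouldProcess(Snapshot s) {
    return !s.operation().equals("replace") && !s.operation().equals("delete");
  }

  // Returns the next snapshot to process at or after index start, or null if
  // only skippable snapshots remain; the early return mirrors the break that
  // avoids asking for the snapshot after the most recent one.
  static Snapshot nextValidSnapshot(List<Snapshot> history, int start) {
    int i = start;
    Snapshot next = history.get(i);
    while (!shouldProcess(next)) {
      if (i == history.size() - 1) {
        // current snapshot is also the most recent one: nothing comes after
        // it, so a snapshotAfter-style lookup would throw
        return null;
      }
      next = history.get(++i);
    }
    return next;
  }

  public static void main(String[] args) {
    List<Snapshot> history =
        List.of(new Snapshot(1, "append"), new Snapshot(2, "replace"), new Snapshot(3, "append"));
    System.out.println(nextValidSnapshot(history, 1).id()); // prints 3
  }
}
```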

}

if (fromTimestamp == Long.MIN_VALUE) {
// match existing behavior and start from the oldest snapshot
Contributor
Explain why?

Suggested change
// match existing behavior and start from the oldest snapshot
// start from the oldest snapshot

Contributor Author
@singhpk234 to clarify, are you asking about the change from the null check to Long.MIN_VALUE, and why determineStartingOffset now takes a primitive long?
I saw that readConf().streamFromTimestamp() returns a primitive long, so it didn't make sense to me to autobox it. Or was your comment about something else (elaborating on the old comment more)?
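The primitive-vs-boxed distinction under discussion can be sketched as follows; the Long.MIN_VALUE sentinel mirrors the diff, but the method names and return values are illustrative stand-ins:

```java
public class SentinelSketch {
  // With a boxed Long, "unset" is modeled as null and must be null-checked,
  // and every call autoboxes the primitive coming out of the config accessor.
  static String startBoxed(Long fromTimestamp) {
    if (fromTimestamp == null) {
      return "oldest";
    }
    return "from-" + fromTimestamp;
  }

  // With a primitive long, Long.MIN_VALUE serves as the "unset" sentinel,
  // avoiding autoboxing and any NullPointerException risk.
  static String startPrimitive(long fromTimestamp) {
    if (fromTimestamp == Long.MIN_VALUE) {
      return "oldest";
    }
    return "from-" + fromTimestamp;
  }

  public static void main(String[] args) {
    System.out.println(startBoxed(null)); // prints oldest
    System.out.println(startPrimitive(Long.MIN_VALUE)); // prints oldest
    System.out.println(startPrimitive(42L)); // prints from-42
  }
}
```

The trade-off is that a sentinel value must never be a legitimate timestamp, which holds here since Long.MIN_VALUE is not a valid stream-from timestamp.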

Contributor Author

Added a comment explaining why we return the oldest snapshot.

Comment on lines 62 to 63
// If snapshotSummary doesn't have SnapshotSummary.ADDED_FILES_PROP,
// iterate through addedFiles iterator to find addedFilesCount.
Contributor

@singhpk234 singhpk234 Feb 13, 2026

Let's remove this comment, it's obvious; I understand it's from old code :)

Comment on lines 26 to 30
List<FileScanTask> planFiles(StreamingOffset startOffset, StreamingOffset endOffset);

StreamingOffset latestOffset(StreamingOffset startOffset, ReadLimit limit);

void stop();
Contributor
Can you please add javadocs for these?
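One possible shape for the requested javadoc; StreamingOffset and FileScanTask below are minimal stand-ins rather than the real Iceberg/Spark types, and the doc wording is only a suggestion, not the merged text:

```java
import java.util.List;

public class PlannerJavadocSketch {
  // Stand-in for the streaming offset type (the real one wraps a snapshot id).
  record StreamingOffset(long snapshotId) {}

  // Stand-in for org.apache.iceberg.FileScanTask.
  record FileScanTask(String file) {}

  interface SparkMicroBatchPlanner {
    /**
     * Plans the file scan tasks contained in the micro-batch between
     * {@code startOffset} and {@code endOffset}.
     */
    List<FileScanTask> planFiles(StreamingOffset startOffset, StreamingOffset endOffset);

    /**
     * Returns the furthest offset the next micro-batch may advance to from
     * {@code startOffset} without exceeding the supplied read limit.
     */
    StreamingOffset latestOffset(StreamingOffset startOffset, int maxFiles);

    /** Releases any resources held by the planner. */
    void stop();
  }

  // A toy implementation, only to show the interface is usable as documented.
  static class ToyPlanner implements SparkMicroBatchPlanner {
    @Override
    public List<FileScanTask> planFiles(StreamingOffset start, StreamingOffset end) {
      return List.of(new FileScanTask("file-" + end.snapshotId()));
    }

    @Override
    public StreamingOffset latestOffset(StreamingOffset start, int maxFiles) {
      return new StreamingOffset(start.snapshotId() + maxFiles);
    }

    @Override
    public void stop() {}
  }

  public static void main(String[] args) {
    ToyPlanner planner = new ToyPlanner();
    System.out.println(planner.latestOffset(new StreamingOffset(1), 4).snapshotId()); // prints 5
  }
}
```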

@RjLi13
Contributor Author

RjLi13 commented Feb 13, 2026

@singhpk234 addressed comments, ptal, ty!

@bryanck
Contributor

bryanck commented Feb 15, 2026

Thanks for the contribution @RjLi13 and for the review @singhpk234 !

@bryanck bryanck merged commit b6de7ac into apache:main Feb 15, 2026
22 checks passed