[WIP] FLIP-547: Support checkpoint during recovery #27639

Draft
1996fanrui wants to merge 18 commits into apache:master from 1996fanrui:38544/checkpointing-during-recovery
Conversation

@1996fanrui
Member

@1996fanrui 1996fanrui commented Feb 20, 2026

Please ignore this PR for now; it is not ready yet.

What is the purpose of the change

(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)

Brief change log

(for example:)

  • The TaskInfo is stored in the blob store at job creation time as a persistent artifact
  • The deployment RPC transmits only the blob storage reference
  • TaskManagers retrieve the TaskInfo from the blob cache

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions for tests defined in our code quality guide.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (100MB)
  • Extended integration test for recovery after master (JobManager) failure
  • Added test that validates that TaskInfo is transferred only once across recoveries
  • Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@1996fanrui 1996fanrui changed the title from "FLIP-547: Support checkpoint during recovery" to "[WIP] FLIP-547: Support checkpoint during recovery" on Feb 20, 2026
@flinkbot
Collaborator

flinkbot commented Feb 20, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

1996fanrui and others added 9 commits February 21, 2026 16:27
…spilling strategy

Core filtering mechanism for recovered channel state buffers:
- ChannelStateFilteringHandler with per-gate GateFilterHandler
- RecordFilterContext with VirtualChannelRecordFilterFactory
- Partial data check in SequentialChannelStateReaderImpl
- Fix RecordFilterContext for Union downscale scenario
…ot for recovered buffers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…r availability for recovered buffers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@1996fanrui 1996fanrui force-pushed the 38544/checkpointing-during-recovery branch from a32c5b8 to 7241b01 on February 21, 2026 15:53
1996fanrui and others added 7 commits February 21, 2026 17:08
…y finished

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… earlier RUNNING state transition

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ools

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

[doc] Add manual review findings: overturn FullyFilledBuffer-related fix commits

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[doc] Update SUMMARY_BY_COMMIT.md with final fix status for all commits

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[doc] Generate final review summary report sorted by importance

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[doc] Add verified review findings with adoption decisions

Filter 29 review points down to 6 actionable items after
code-level verification. Key finding: double persist bug in
LocalInputChannel checkpoint path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[update doc for commit 812481f] Mark review items as fixed in summary doc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[update doc for commit 36ab9a1] Mark review items as fixed in summary doc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[update doc for commit 12df3a8] Mark review items as fixed in summary doc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[update doc for commit f805466] Mark review items as fixed in summary doc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[update doc for commit fa5323e] Mark review items as fixed in summary doc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[update doc for commit c42a98f] Mark review items as fixed in summary doc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[update doc for commit 165c4ee] Mark review items as fixed in summary doc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[update doc for commit 6638b14] Mark review items as fixed and sync design doc API references

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add code review reports for checkpointing during recovery commits

Review 13 commits from 6638b14 to 812481f covering the
checkpoint during recovery feature implementation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@1996fanrui 1996fanrui force-pushed the 38544/checkpointing-during-recovery branch from 7241b01 to ccde5b1 on February 21, 2026 16:10
2 participants