Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-37008] [runtime-web] Flink UI should show the type of checkpoint (full vs incremental) #25899

Closed

Conversation

ryanvanhuuksloot
Copy link
Contributor

@ryanvanhuuksloot ryanvanhuuksloot commented Jan 6, 2025

JIRA Ticket: https://issues.apache.org/jira/browse/FLINK-37008

What is the purpose of the change

It would be useful for the UI to show if a checkpoint is full or incremental. I'm curious how others would like to expose this in the UI / API but I wanted to do a first pass to get the conversation rolling.
This PR is meant to be throwaway when we decided on a path forward.
There are currently no tests and likely fails CI

image
image


UPDATED IMAGE
image

Brief change log

Adds "full checkpoint flag" to checkpoints to be displayed in the UI / API

Verifying this change

N/A for this draft PR

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (kind of)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (not documented)

@flinkbot
Copy link
Collaborator

flinkbot commented Jan 6, 2025

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@ryanvanhuuksloot ryanvanhuuksloot force-pushed the rvh.ui-full-checkpoints branch from ffa5245 to c362d7a Compare January 7, 2025 22:03
Copy link
Contributor

@Zakelly Zakelly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! overall LGTM

@ryanvanhuuksloot ryanvanhuuksloot marked this pull request as ready for review January 13, 2025 22:26
@ryanvanhuuksloot
Copy link
Contributor Author

I added one test. I looked at other flags on Checkpoints and Savepoints and couldn't tests.

I also couldn't find ui tests but maybe I'm blind.

Let me know if I missed anything!

@ryanvanhuuksloot
Copy link
Contributor Author

@gaborgsomogyi If you have a chance to review given you post in the mailing list

@ryanvanhuuksloot
Copy link
Contributor Author

@flinkbot run azure

Copy link
Contributor

@Zakelly Zakelly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! Please find my comments below:

Comment on lines +296 to +328
private static final CheckpointProperties FULL_CHECKPOINT_NEVER_RETAINED =
new CheckpointProperties(
false,
CheckpointType.FULL_CHECKPOINT,
true,
true, // Delete on success
true, // Delete on cancellation
true, // Delete on failure
true, // Delete on suspension
false);

private static final CheckpointProperties FULL_CHECKPOINT_RETAINED_ON_FAILURE =
new CheckpointProperties(
false,
CheckpointType.FULL_CHECKPOINT,
true,
true, // Delete on success
true, // Delete on cancellation
false, // Retain on failure
true, // Delete on suspension
false);

private static final CheckpointProperties FULL_CHECKPOINT_RETAINED_ON_CANCELLATION =
new CheckpointProperties(
false,
CheckpointType.FULL_CHECKPOINT,
true,
true, // Delete on success
false, // Retain on cancellation
false, // Retain on failure
false, // Retain on suspension
false);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why those are introduced?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are only used for tests
We don't really have an test suite for testing this kind of thing

I'm open to ideas for where to add tests
I've thought about
SnapshotUtilsTest
and
CheckpointPropertiesTest

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see.

I read through the CheckpointCoordinator and it seems the JM will always propagate incremental checkpoints periodically (not the ones triggered by Rest API). But actually the TM will do incremental or full ones by configuration it reads (create RocksNativeFullSnapshotStrategy or RocksIncrementalSnapshotStrategy when state backend build). Meaning that in UI will always show incremental cps even if we disable that. Am I right? If so, we should make some change in CheckpointCoordinator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree looking at the CheckpointCoordinator.

this.checkpointProperties =
CheckpointProperties.forCheckpoint(chkConfig.getCheckpointRetentionPolicy());

Specifically, the issue is that Checkpoint can be used for both full or incremental.

/** A checkpoint, full or incremental. */
public static final CheckpointType CHECKPOINT =
new CheckpointType("Checkpoint", SharingFilesStrategy.FORWARD_BACKWARD);

We'd have to move the evaluation of execution.checkpointing.incremental or change how we are determining if a Checkpoint is Full 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought a possible solution is to evaluate execution.checkpointing.incremental in CheckpointCoordinator and set checkpointProperties properly. WDTY?

@ryanvanhuuksloot
Copy link
Contributor Author

@flinkbot run azure

@ryanvanhuuksloot
Copy link
Contributor Author

@flinkbot run azure

@ryanvanhuuksloot
Copy link
Contributor Author

The flink tests seem to fail on a different flaky test. Restarted to make sure.

@ryanvanhuuksloot
Copy link
Contributor Author

@Zakelly I'd love to get this in as part of Flink 2.0. I know you're working on a bunch but if you had some time 🙏
Happy to commit to fast code changes based on PR comments.

@Zakelly
Copy link
Contributor

Zakelly commented Jan 16, 2025

@Zakelly I'd love to get this in as part of Flink 2.0. I know you're working on a bunch but if you had some time 🙏 Happy to commit to fast code changes based on PR comments.

No worries. I will spare some time and we'll make it.

@ryanvanhuuksloot
Copy link
Contributor Author

@flinkbot run azure

@ryanvanhuuksloot
Copy link
Contributor Author

We realized that it is quite difficult to determine at the correct layer if a checkpoint is full when it isn't explicitly marked as full. This needs a more in-depth dive into the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants