Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[server] Fix Offset Lag Short Circuit #1472

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

KaiSernLim
Copy link
Contributor

Summary

This PR ensures that LeaderFollowerStoreIngestionTask#reportIfCatchUpVersionTopicOffset() does not prematurely measure and report offset lag by adding a check for latch creation, which only happens when ingestion begins for a current version. It previously only checked for !isLatchReleased(), which is true by default, even if the latch was not created.

The following scenario should be fixed:

  1. Two replicas are ready-to-serve, but the third replica is ingesting very slowly.
  2. The push job completes with two RTS replicas, and the future version becomes the current version.
  3. Replica 3 has stale information of the version topic end offset. After EndOfPush, it reports CATCH_UP_BASE_TOPIC_OFFSET_LAG and quickly reports COMPLETED (even before TOPIC_SWITCH_RECEIVED) due to the logic in LeaderFollowerStoreIngestionTask#reportIfCatchUpVersionTopicOffset().
  4. The router thinks replica 3 is also ready-to-serve while having a tremendous follower offset lag for this partition. Even deploying new versions with a modified CATCH_UP_BASE_TOPIC_OFFSET_LAG value would not prevent an alert.

Since the scenario did not start with the replica ingesting the current version (the replica started ingesting while it was future version, and became the current version), the latch wouldn't have been created, and thus wouldn't need to be released.

  • [functional] Added a function to record when the latch is created in onBecomeStandbyFromOffline() of LeaderFollowerPartitionStateModel. This only happens when ingestion starts on a version that is the current version. reportIfCatchUpVersionTopicOffset() of LeaderFollowerStoreIngestionTask must check that the latch was created before checking if it was released.
  • [log] Added warning message in TopicMetadataFetcher when the cached value fails to update. This is to help detect a situation where the VT end offset is stale.
  • [misc] Removed unused function removeIngestionCompleteFlag() in StateModelIngestionProgressNotifier.
  • [misc] Minor variable name fix for systemStoreResourceName in AbstractVenicePartitionStateModelTest.java.

How was this PR tested?

  • [unit test] Added unit test testReportIfCatchUpVersionTopicOffset() in StoreIngestionTaskTest to
  • [unit test] Added check in testOnBecomeFollowerFromOffline() of VeniceLeaderFollowerStateModelTest to verify that the latch creation is being recorded.

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant