[server] Fix Offset Lag Short Circuit #1472
Open
Summary
This PR ensures that LeaderFollowerStoreIngestionTask#reportIfCatchUpVersionTopicOffset() does not prematurely measure and report offset lag by adding a check for latch creation, which only happens when ingestion begins for a current version. It previously only checked for !isLatchReleased(), which is true by default, even if the latch was not created.
The following scenario should be fixed:
EndOfPush, it reports CATCH_UP_BASE_TOPIC_OFFSET_LAG and quickly reports COMPLETED (even before TOPIC_SWITCH_RECEIVED) due to the logic in LeaderFollowerStoreIngestionTask#reportIfCatchUpVersionTopicOffset().
CATCH_UP_BASE_TOPIC_OFFSET_LAG value would not prevent an alert.
Since the scenario did not start with the replica ingesting the current version (the replica started ingesting while it was a future version, and later became the current version), the latch wouldn't have been created, and thus wouldn't need to be released.
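The difference between the old and new checks can be illustrated with a minimal sketch. It is not the Venice implementation: PartitionStateSketch, LatchCheckSketch, isLatchCreated(), and both shouldReportLag methods are hypothetical names introduced here, and only isLatchReleased() is named in this PR. The sketch assumes the latch state is tracked as two boolean flags.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for per-partition latch state; not the Venice class. */
class PartitionStateSketch {
  private final AtomicBoolean latchCreated = new AtomicBoolean(false);
  private final AtomicBoolean latchReleased = new AtomicBoolean(false);

  // Assumed to be called only when ingestion starts on the current version.
  void createLatch() { latchCreated.set(true); }
  void releaseLatch() { latchReleased.set(true); }

  boolean isLatchCreated() { return latchCreated.get(); }   // hypothetical accessor
  boolean isLatchReleased() { return latchReleased.get(); }
}

class LatchCheckSketch {
  // Old behavior: passes for every partition whose latch is not released,
  // including partitions where no latch was ever created.
  static boolean shouldReportLagOld(PartitionStateSketch state) {
    return !state.isLatchReleased();
  }

  // New behavior: the latch must exist before its release state matters.
  static boolean shouldReportLagNew(PartitionStateSketch state) {
    return state.isLatchCreated() && !state.isLatchReleased();
  }
}
```

For a replica that began ingesting as a future version, no latch is created, so the old check passes while the new check does not. Once a latch is created and not yet released, both checks pass.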
onBecomeStandbyFromOffline() of LeaderFollowerPartitionStateModel. This only happens when ingestion starts on a version that is the current version.
reportIfCatchUpVersionTopicOffset() of LeaderFollowerStoreIngestionTask must check that the latch was created before checking if it was released.
TopicMetadataFetcher when the cached value fails to update. This is to help detect a situation where the VT end offset is stale.
removeIngestionCompleteFlag() in StateModelIngestionProgressNotifier.
systemStoreResourceName in AbstractVenicePartitionStateModelTest.java.
How was this PR tested?
testReportIfCatchUpVersionTopicOffset() in StoreIngestionTaskTest
to testOnBecomeFollowerFromOffline() of VeniceLeaderFollowerStateModelTest to verify that the latch creation is being recorded.
Does this PR introduce any user-facing changes?