feat: retry failed transaction commit#576

Open
linguoxuan wants to merge 1 commit into apache:main from linguoxuan:main

linguoxuan commented Feb 26, 2026

This commit implements retry for transaction commits. It introduces a generic RetryRunner utility with exponential backoff and error-kind filtering, and integrates it into Transaction::Commit() so that a commit conflict automatically refreshes the table metadata and retries the commit.
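
For readers who have not opened the diff, the general shape of such a runner can be sketched as follows. This is a minimal illustration, not the code in this PR: `RetryConfig`, `RunWithRetry`, the `ErrorKind` values, and the `before_retry` hook (standing in for the metadata refresh) are all hypothetical names chosen for the example.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical error kinds, for illustration only.
enum class ErrorKind { kOk, kCommitFailed, kValidationFailed };

// Hypothetical configuration; the field names are assumptions.
struct RetryConfig {
  int max_attempts = 4;
  int64_t min_wait_ms = 1;
  int64_t max_wait_ms = 8;
  double scale_factor = 2.0;
  // Only these error kinds trigger another attempt.
  std::vector<ErrorKind> retry_on = {ErrorKind::kCommitFailed};
};

// Runs `task` until it succeeds, fails with a non-retryable kind,
// or exhausts `max_attempts`. `before_retry` runs before each retry,
// e.g. to refresh table metadata after a commit conflict.
inline ErrorKind RunWithRetry(const RetryConfig& config,
                              const std::function<ErrorKind()>& task,
                              const std::function<void()>& before_retry) {
  int64_t delay_ms = config.min_wait_ms;
  for (int attempt = 1;; ++attempt) {
    ErrorKind result = task();
    if (result == ErrorKind::kOk) return result;

    bool retryable = std::find(config.retry_on.begin(), config.retry_on.end(),
                               result) != config.retry_on.end();
    if (!retryable || attempt >= config.max_attempts) return result;

    // Exponential backoff, capped at max_wait_ms.
    std::this_thread::sleep_for(std::chrono::milliseconds(delay_ms));
    delay_ms = std::min<int64_t>(
        config.max_wait_ms,
        static_cast<int64_t>(delay_ms * config.scale_factor));
    before_retry();
  }
}
```

With the defaults above, a task that fails twice with `kCommitFailed` and then succeeds runs three times and triggers two refreshes, while a `kValidationFailed` result is returned after a single attempt.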

linguoxuan force-pushed the main branch 2 times, most recently from 82ada96 to ff6c292 (February 26, 2026 11:28)
wgtmac

This comment was marked as outdated.

linguoxuan force-pushed the main branch 5 times, most recently from 2b8fcbb to 2053566 (March 2, 2026 02:35)
wgtmac

This comment was marked as outdated.

wgtmac (Member) left a comment

I've carefully reviewed the retry mechanism and found a few parity issues and a structural data-loss concern regarding how pending updates are held during retries. Please see the inline comments.

bool timed_out = config_.total_timeout_ms > 0 &&
                 elapsed > config_.total_timeout_ms && attempt > 1;
if (attempt >= max_attempts || timed_out) {
  return result;
Member

The timed_out check requires attempt > 1. If the first execution takes longer than total_timeout_ms to fail, timed_out evaluates to false, so the runner sleeps and executes a second attempt even though the time budget is already spent. Java's Tasks.java strictly validates durationMs > maxDurationMs unconditionally and aborts immediately without attempting a retry. Remove the && attempt > 1 condition.
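
To make the behavioral difference concrete, the two stop conditions can be compared side by side. This is only a sketch: `ShouldStopCurrent` and `ShouldStopProposed` are hypothetical free functions that mirror the quoted predicate with and without the `attempt > 1` clause; they are not functions in the PR.

```cpp
#include <cstdint>

// Hypothetical standalone version of the predicate quoted above.
// Returns true when the runner should stop and return the last result.
inline bool ShouldStopCurrent(int attempt, int max_attempts,
                              int64_t elapsed_ms, int64_t total_timeout_ms) {
  bool timed_out = total_timeout_ms > 0 && elapsed_ms > total_timeout_ms &&
                   attempt > 1;
  return attempt >= max_attempts || timed_out;
}

// Same predicate with the `attempt > 1` clause removed, as suggested.
inline bool ShouldStopProposed(int attempt, int max_attempts,
                               int64_t elapsed_ms, int64_t total_timeout_ms) {
  bool timed_out = total_timeout_ms > 0 && elapsed_ms > total_timeout_ms;
  return attempt >= max_attempts || timed_out;
}
```

For a first attempt that fails after 5000 ms against a 1000 ms budget, the current form keeps going and the proposed form stops; from the second attempt onward the two agree.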

return std::max(1, delay_ms);
}

/// \brief Sleep for the specified duration
Member

The C++ jitter calculation uses a bidirectional spread [-jitter_range, jitter_range]. Java's Tasks.java specifically adds a strictly positive jitter: [0, delayMs * 0.1). Consider generating a strictly positive random value [0, jitter_range] to align precisely with Java.
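
A strictly non-negative jitter along those lines might look like the sketch below. The helper name `DelayWithPositiveJitter` is hypothetical, and it assumes a non-negative `delay_ms` small enough that adding about 10% cannot overflow; the random engine is passed in so that the result is reproducible in tests.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>

// Hypothetical helper: returns delay_ms plus a jitter drawn uniformly
// from [0, max(1, delay_ms / 10)), so the jitter never shortens the delay.
inline int32_t DelayWithPositiveJitter(int32_t delay_ms, std::mt19937& rng) {
  int32_t jitter_range = std::max<int32_t>(1, delay_ms / 10);
  // uniform_int_distribution is inclusive on both ends, hence the -1.
  std::uniform_int_distribution<int32_t> dist(0, jitter_range - 1);
  return delay_ms + dist(rng);
}
```

Because the upper bound is exclusive, a 100 ms base delay always yields a value in [100, 110), and very small delays (under 10 ms) are returned unchanged.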


Kind kind() const final { return Kind::kUpdateSnapshotReference; }

bool IsRetryable() const override { return false; }
Member

Overriding IsRetryable() to explicitly return false causes Transaction::CanRetry() to reject a retry for any transaction that contains branch or tag updates when a commit conflict occurs. In Java, SnapshotManager.commit() uses transaction.commitTransaction(), which safely retries UpdateSnapshotReferencesOperation. Branch and tag creations should be retryable. Consider removing this override or returning true.
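
The effect described here can be illustrated with a simplified model. It assumes that `CanRetry()` requires every pending update in the transaction to be retryable, which is an assumption about the PR rather than something shown in the quoted diff; the class names below are hypothetical stand-ins for the real hierarchy.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Hypothetical, simplified stand-ins for the PendingUpdate hierarchy.
struct PendingUpdate {
  virtual ~PendingUpdate() = default;
  virtual bool IsRetryable() const { return true; }
};

struct AppendUpdate : PendingUpdate {};

struct SnapshotRefUpdate : PendingUpdate {
  // Mirrors the override under review.
  bool IsRetryable() const override { return false; }
};

// Assumed aggregation rule: one non-retryable update blocks the retry.
inline bool CanRetry(
    const std::vector<std::shared_ptr<PendingUpdate>>& updates) {
  return std::all_of(updates.begin(), updates.end(),
                     [](const auto& update) { return update->IsRetryable(); });
}
```

Under that rule, a transaction holding only an append can retry, but adding a single branch or tag update makes the whole transaction non-retryable.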

wgtmac (Member) commented Mar 13, 2026

I just recalled a design flaw in the interaction between PendingUpdate and Transaction and created a fix: #591. Without this fix, users have to keep every pending update instance they create alive themselves; otherwise the updates cannot be retried, because the transaction instance holds them only as weak_ptr.
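
The ownership problem can be reproduced in a few lines. The sketch below is a hypothetical model (`WeakTransaction`, `NewUpdate`, and `RetryableCount` are invented for illustration) and says nothing about how #591 resolves the issue; it only shows why a weakly held update disappears once the caller drops its handle.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical minimal pending update.
struct PendingUpdate {
  int id = 0;
};

// Hypothetical transaction that tracks its updates only through weak_ptr.
class WeakTransaction {
 public:
  std::shared_ptr<PendingUpdate> NewUpdate() {
    auto update = std::make_shared<PendingUpdate>();
    updates_.push_back(update);  // stored as weak_ptr: no ownership taken
    return update;
  }

  // Number of updates that are still alive and could be re-applied on retry.
  std::size_t RetryableCount() const {
    std::size_t alive = 0;
    for (const auto& weak : updates_) {
      if (!weak.expired()) ++alive;
    }
    return alive;
  }

 private:
  std::vector<std::weak_ptr<PendingUpdate>> updates_;
};
```

An update whose `shared_ptr` goes out of scope in the caller is no longer counted, so a retry would silently skip it unless the caller cached the handle.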

Comment on lines +167 to +168
std::optional<std::vector<ErrorKind>> only_retry_on_;
std::optional<std::vector<ErrorKind>> stop_retry_on_;
Collaborator

Can we just use vector here?
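
One way the plain-vector variant could work is to treat an empty vector as "no filter configured". The sketch below assumes that convention; `RetryFilter`, `ShouldRetry`, and the `ErrorKind` values are hypothetical and not taken from the PR.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical error kinds, for illustration only.
enum class ErrorKind { kCommitFailed, kValidationFailed, kIOError };

// Hypothetical filter using plain vectors instead of optional<vector>.
struct RetryFilter {
  std::vector<ErrorKind> only_retry_on;  // empty: no allow-list configured
  std::vector<ErrorKind> stop_retry_on;  // empty: no deny-list configured

  bool ShouldRetry(ErrorKind kind) const {
    auto contains = [kind](const std::vector<ErrorKind>& kinds) {
      return std::find(kinds.begin(), kinds.end(), kind) != kinds.end();
    };
    if (contains(stop_retry_on)) return false;
    // An empty allow-list means every kind not denied is retryable.
    return only_retry_on.empty() || contains(only_retry_on);
  }
};
```

The trade-off is that a plain vector cannot distinguish "not set" from "explicitly set to nothing"; if the second case never needs to mean "retry on no error kind", the optional wrapper adds no information.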
