[fix] retry producer creation upon error after successful topic lookup #1139
Open
zzzming wants to merge 4 commits into apache:master
Member
Great work @zzzming! I'll review again after you reply to the question.
nodece
reviewed
Jan 2, 2024
pulsar/producer_partition.go
Outdated
}
p.log.WithError(err).Error("Failed to create producer at newPartitionProducer")
errMsg := err.Error()
if strings.Contains(errMsg, errTopicNotFount) {
Member
Suggested change
- if strings.Contains(errMsg, errTopicNotFount) {
+ if errors.Is(err, ErrTopicNotfound) {
Contributor
Author
Rebased onto the latest master and fixed the error evaluation per your review comment.
nodece
reviewed
Jan 2, 2024
pulsar/producer_partition.go
Outdated
break
}

if strings.Contains(errMsg, "TopicTerminatedError") {
Member
Suggested change
- if strings.Contains(errMsg, "TopicTerminatedError") {
+ if errors.Is(err, ErrTopicTerminated) {
c676c7b to a84c97d
Contributor
Author
@nodece I fixed based on your review comments. CI does not seem to run. Does it require any approval to run CI?
CI triggered
Member
Ping @zzzming
Fixes #1138
Motivation
The newPartitionProducer() function should retry grabCnx(), similar to the grabCnx() retry logic in reconnectToBroker.
The Java producer has this retry logic.
During producer creation, after a successful topic lookup in grabCnx() in producer_partition.go, if a network issue occurs before the command to create the producer is sent, grabCnx() exits without retrying.
This implementation follows the same reconnectToBroker retry logic.
We had frequent failures upon the initial producer creation under unstable network conditions.
The problem is tricky to reproduce, but we observe it more frequently during the initialization stage of Azure pods. After implementing the grabCnx() retry in newPartitionProducer(), the problem has gone away. The error often shows a connection closed (EOF) by the other side, but based on the logs on the Pulsar side, the connection was not closed by the broker (or Pulsar). It may be caused by network issues between the producer pod and the Pulsar cluster. That's why a grabCnx() retry is much needed.
System configuration
Pulsar version: 2.10
Modifications
In the newPartitionProducer() function, add a retry of grabCnx() using the same retry logic as reconnectToBroker's grabCnx() retry.
Verifying this change
This change is already covered by existing tests, such as (please describe tests).
Does this pull request potentially affect one of the following parts:
If yes was chosen, please highlight the changes
Documentation