Don't require minimal for failpoint injection period #17825
Conversation
Force-pushed from ff638c0 to c9b166e
triggerTimeout = time.Minute
waitBetweenFailpointTriggers = time.Second
failpointInjectionsCount = 1
failpointInjectionsRetries = 3
Why were the retries removed? Were they ever useful?
It was introduced as a band-aid solution to flakes. It reduced the problem a little, but there were still some failpoint issues that retries didn't help with. Over time we improved, especially after we discovered the issue with process execution.
Hard to say if removing it will expose some issues; if it does, we should see them in CI.
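For context, a rough Go sketch of the difference (names are illustrative, not the actual tests/robustness/failpoint API): the removed failpointInjectionsRetries logic wrapped the whole injection in a retry loop, while after this change the injection is attempted once and its error fails the test.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// injectFailpoint stands in for the real failpoint injection call; the signature is illustrative.
type injectFailpoint func(ctx context.Context) error

// injectWithRetries is roughly what the removed failpointInjectionsRetries logic did:
// retry the whole injection a fixed number of times before giving up.
func injectWithRetries(ctx context.Context, inject injectFailpoint, retries int) error {
	var lastErr error
	for i := 0; i < retries; i++ {
		if err := inject(ctx); err != nil {
			lastErr = err
			continue
		}
		return nil
	}
	return fmt.Errorf("failpoint injection failed after %d attempts: %w", retries, lastErr)
}

// injectOnce matches the behaviour after this PR: a single attempt whose error fails the test.
func injectOnce(ctx context.Context, inject injectFailpoint) error {
	return inject(ctx)
}

func main() {
	alwaysFails := func(ctx context.Context) error { return errors.New("failpoint not reached") }
	fmt.Println(injectWithRetries(context.Background(), alwaysFails, 3))
	fmt.Println(injectOnce(context.Background(), alwaysFails))
}
```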
Checked logs from one of the robustness test runs https://github.com/etcd-io/etcd/actions/runs/8781749612. Didn't find any "Failed to trigger failpoint" logs.
@@ -110,27 +110,30 @@ func (s testScenario) run(ctx context.Context, t *testing.T, lg *zap.Logger, clu
defer cancel()
g := errgroup.Group{}
var operationReport, watchReport []report.ClientReport
finishTraffic := make(chan struct{})
failpointInjected := make(chan failpoint.InjectionReport, 1)

// using baseTime time-measuring operation to get monotonic clock reading
// see https://github.com/golang/go/blob/master/src/time/time.go#L17
baseTime := time.Now()
ids := identity.NewIDProvider()
g.Go(func() error {
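As a rough illustration of the coordination the diff introduces (a sketch under my reading of it, not the actual etcd code): the failpoint goroutine publishes an injection report on a buffered channel so it never blocks, and the traffic side can later use that report to split its measurements into before/after windows.

```go
package main

import (
	"fmt"
	"time"
)

// injectionReport mirrors the rough shape of failpoint.InjectionReport in the diff:
// when the injection started and ended, as offsets from baseTime (assumption).
type injectionReport struct {
	Start, End time.Duration
}

func main() {
	baseTime := time.Now()
	finishTraffic := make(chan struct{})
	// Buffered with capacity 1 so the failpoint goroutine never blocks on a slow reader.
	failpointInjected := make(chan injectionReport, 1)

	// Failpoint goroutine: inject, report when it happened, then tell traffic to wind down.
	go func() {
		start := time.Since(baseTime)
		time.Sleep(10 * time.Millisecond) // placeholder for the real injection
		failpointInjected <- injectionReport{Start: start, End: time.Since(baseTime)}
		close(finishTraffic)
	}()

	// Traffic side: generate load until told to stop, then read the report to know
	// which requests ran before vs. after the injection.
	<-finishTraffic
	report := <-failpointInjected
	fmt.Printf("failpoint injected between %v and %v\n", report.Start, report.End)
}
```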
Also, not for this PR, but we should probably get rid of the errgroup.Group; it's probably a good idea to just use a wait group and goroutines.
Why? This code was changed to use errgroup to make error handling cleaner.
Right, but IIUC errgroups are only useful if you return an error, and we always return nil AFAICT.
Ok, makes sense.
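For illustration, a minimal comparison of the two approaches, assuming the goroutines really always return nil (as the comment above suggests for the current traffic/failpoint closures):

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sync/errgroup"
)

func main() {
	// errgroup: Wait returns the first non-nil error, which only pays off
	// if the goroutines can actually fail.
	g := errgroup.Group{}
	g.Go(func() error {
		fmt.Println("traffic goroutine (errgroup)")
		return nil // in the robustness tests these closures always return nil
	})
	_ = g.Wait()

	// Plain WaitGroup: the always-nil return value disappears entirely.
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		fmt.Println("traffic goroutine (WaitGroup)")
	}()
	wg.Wait()
}
```

With errgroup the only benefit left is the slightly shorter g.Go call; the error plumbing it exists for goes unused.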
tests/robustness/traffic/traffic.go (outdated)
t.Errorf("Requiring minimal %f qps before failpoint injection for test results to be reliable, got %f qps", profile.MinimalQPS, beforeFailpointStats.QPS()) | ||
} | ||
if afterFailpointStats.QPS() < profile.MinimalQPS { | ||
t.Errorf("Requiring minimal %f qps after failpoint injection for test results to be reliable, got %f qps", profile.MinimalQPS, afterFailpointStats.QPS()) |
Thinking more here, do we want the tests to run if we deem them to be non-reliable? Is there value in failing fast here?
Usually errors are found when the cluster transitions from unhealthy to healthy status. You want to ensure there is proper QPS coverage both while the failpoint is injected (see #17775) and after the cluster recovers. For example, when a member is killed, you want to see how the cluster reacts to that member rejoining it.
This validation is to ensure that the test is strictly validating its assumptions: we want to make sure that the period after failpoint injection is also covered by requests. If we didn't, over time we might introduce a regression. For example, we could remove the health check after a failpoint, which could allow clusters to be left broken and never recover after failpoint injection. We need to make sure that all failpoints we introduce restore the cluster to health, so that we not only find issues while the failpoint is running (for example a network split), but also its consequences (when the separated member rejoins the cluster).
I want to make the validation even stricter here, to ensure not only total qps, but to guarantee per-member qps. This way we detect cases where a member didn't recover from the failpoint, but the cluster is still serving data due to quorum availability.
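A hedged sketch of what such a stricter per-member check could look like (memberStats and validatePerMemberQPS are hypothetical names, not the real report types under tests/robustness):

```go
package main

import "fmt"

// memberStats is a stand-in for whatever per-member statistics the traffic report exposes.
type memberStats struct {
	Member string
	QPS    float64
}

// validatePerMemberQPS flags any member whose post-failpoint QPS is below the minimum,
// so a member that never recovered is caught even when the quorum kept serving traffic.
func validatePerMemberQPS(stats []memberStats, minimalQPS float64) error {
	for _, s := range stats {
		if s.QPS < minimalQPS {
			return fmt.Errorf("member %s served %.2f qps after failpoint injection, want at least %.2f", s.Member, s.QPS, minimalQPS)
		}
	}
	return nil
}

func main() {
	stats := []memberStats{{"m0", 210}, {"m1", 0}, {"m2", 190}}
	fmt.Println(validatePerMemberQPS(stats, 100))
}
```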
Heh, needed to drop the validation for now. I was wrong to assume this validation would pass. Leaving a TODO for the future.
}
lg.Info("Triggering failpoint", zap.String("failpoint", failpoint.Name()))
start := time.Since(baseTime)
err = failpoint.Inject(ctx, t, lg, clus)
I think we should keep the original retry here, just in case the failpoint HTTP call times out.
Prefer to retry HTTP calls and not whole failpoints. The reason is that a failpoint seems simple and retryable, as it's just a single Inject function. Underneath, it's a multi-stage stateful process, with some stages not retryable, or at least not from the beginning. For example, a simple KILL failpoint first needs to kill the process, wait for it to exit, and start it back again. If we failed on the wait, can we really retry and kill it again? What about a failure on start?
Failpoint injection has more nuances that make failpoints not easy to blindly retry, and trying to make them externally retryable makes the internal code unnecessarily complicated. I would prefer that failpoints are able to decide how to handle their internal failures and whether they can retry on their own.
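To make the multi-stage point concrete, here is a sketch of a KILL-style injection with illustrative names (not the real failpoint package API): each stage changes cluster state, so only the failpoint itself knows whether re-running from the top is safe.

```go
package main

import (
	"context"
	"fmt"
)

// memberProcess abstracts the etcd member process; illustrative only.
type memberProcess interface {
	Kill() error
	Wait(ctx context.Context) error
	Start(ctx context.Context) error
}

// injectKill shows why an external "retry the whole Inject" wrapper is risky:
// after the first stage the process is already gone, so re-running from the top
// would kill a dead process, and a failure on Start leaves the member down.
func injectKill(ctx context.Context, p memberProcess) error {
	if err := p.Kill(); err != nil {
		// Retrying this stage may be fine, but that decision belongs to the failpoint itself.
		return fmt.Errorf("kill member: %w", err)
	}
	if err := p.Wait(ctx); err != nil {
		// The process may or may not have exited; restarting the whole sequence is not safe.
		return fmt.Errorf("wait for exit: %w", err)
	}
	if err := p.Start(ctx); err != nil {
		// Failing here means the member stays down; blindly re-running Kill() makes it worse.
		return fmt.Errorf("restart member: %w", err)
	}
	return nil
}

// fakeProcess is a trivial stand-in so the sketch runs.
type fakeProcess struct{ running bool }

func (f *fakeProcess) Kill() error                    { f.running = false; return nil }
func (f *fakeProcess) Wait(ctx context.Context) error { return nil }
func (f *fakeProcess) Start(ctx context.Context) error {
	f.running = true
	return nil
}

func main() {
	fmt.Println(injectKill(context.Background(), &fakeProcess{running: true}))
}
```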
Thanks for the comment. It sounds good to me.
Hmm, looks like validation of traffic after failpoint injection might not work as well:
Force-pushed from 8e98288 to 7d02451
Or maybe it just shows that qps was head heavy, i.e. mainly driven by requests before failpoint injection. I think we might need to increase the request timeout enough so that the success rate before the failpoint is 100%.
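A small sketch of what "head heavy" means here, assuming we can bucket successful requests by their offset from baseTime (the op type is hypothetical, not the actual traffic report code):

```go
package main

import (
	"fmt"
	"time"
)

// op is a stand-in for one successful client request, recorded as an offset from baseTime.
type op struct{ at time.Duration }

// qps computes requests per second over a window.
func qps(ops []op, window time.Duration) float64 {
	if window <= 0 {
		return 0
	}
	return float64(len(ops)) / window.Seconds()
}

func main() {
	injectionTime := 5 * time.Second
	totalDuration := 15 * time.Second
	ops := []op{{1 * time.Second}, {2 * time.Second}, {3 * time.Second}, {12 * time.Second}}

	// Split requests at the injection time; a "head heavy" run has almost all of
	// them in the before window, inflating the overall QPS number.
	var before, after []op
	for _, o := range ops {
		if o.at < injectionTime {
			before = append(before, o)
		} else {
			after = append(after, o)
		}
	}
	fmt.Printf("before: %.2f qps, after: %.2f qps\n",
		qps(before, injectionTime), qps(after, totalDuration-injectionTime))
}
```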
Force-pushed from 981a651 to c189e77
Signed-off-by: Marek Siarkowicz <[email protected]>
Force-pushed from c189e77 to f285330
Increasing the request timeout 5 times to 200ms doesn't help. At that point, increasing it further has the reverse effect of reducing qps. For now I will drop the QPS validation post failpoint injection.
Defer to @MadhavJivrajani @siyuanfoundation @ArkaSaha30 @fuweid to review robustness test PRs.
Not sure that I understand it correctly.
Yes, I tried to also have a QPS validation after the failpoint is injected, but it causes too much flakiness.
LGTM
Fixes #17775
cc @ahrtr @siyuanfoundation @fuweid @MadhavJivrajani