
Don't require minimal for failpoint injection period #17825

Merged
merged 1 commit into etcd-io:main from robustness-qps on Apr 22, 2024

Conversation

serathius
Member

@serathius serathius changed the title from "Don't require qps requirements failpoint injection period" to "Don't require minimal for failpoint injection period" on Apr 19, 2024
triggerTimeout = time.Minute
waitBetweenFailpointTriggers = time.Second
failpointInjectionsCount = 1
failpointInjectionsRetries = 3
Contributor


Why are the retries removed? Were they ever useful?

Member Author


It was introduced as a band-aid solution to flakes. It reduced the problem a little, but there were still some failpoint issues that retries didn't help with. Over time we improved things, especially after we discovered an issue with process execution.

Hard to say if removing it will expose some issues; if it does, we should see them in CI.

Member Author

@serathius serathius Apr 22, 2024


Checked logs from one of the robustness test runs https://github.com/etcd-io/etcd/actions/runs/8781749612. Didn't find any "Failed to trigger failpoint" logs.

tests/robustness/main_test.go
@@ -110,27 +110,30 @@ func (s testScenario) run(ctx context.Context, t *testing.T, lg *zap.Logger, clu
defer cancel()
g := errgroup.Group{}
var operationReport, watchReport []report.ClientReport
finishTraffic := make(chan struct{})
failpointInjected := make(chan failpoint.InjectionReport, 1)

// using baseTime time-measuring operation to get monotonic clock reading
// see https://github.com/golang/go/blob/master/src/time/time.go#L17
baseTime := time.Now()
ids := identity.NewIDProvider()
g.Go(func() error {
Contributor


Also, not for this PR, but we should probably get rid of the errgroup.Group; it's probably a good idea to just use a sync.WaitGroup and goroutines.

Member Author


Why? This code was changed to use errgroup to make error handling cleaner.

Contributor


Right, but IIUC errgroups are useful only if you return an error; we always return nil, AFAICT.

Member Author


Ok, makes sense.
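A minimal sketch of the suggested alternative, assuming the goroutines never return a real error; runScenario and its comments are illustrative, not the actual test code:

package main

import "sync"

func runScenario() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		// drive traffic against the cluster until signalled to stop
	}()
	go func() {
		defer wg.Done()
		// trigger the failpoint, then report the injection result
	}()
	// No error aggregation is needed when the goroutines only ever
	// return nil, so a plain WaitGroup replaces the errgroup.Group.
	wg.Wait()
}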

t.Errorf("Requiring minimal %f qps before failpoint injection for test results to be reliable, got %f qps", profile.MinimalQPS, beforeFailpointStats.QPS())
}
if afterFailpointStats.QPS() < profile.MinimalQPS {
t.Errorf("Requiring minimal %f qps after failpoint injection for test results to be reliable, got %f qps", profile.MinimalQPS, afterFailpointStats.QPS())
Contributor


Thinking more here, do we want the tests to keep running if we deem the results non-reliable? Is there value in failing fast here?

Member Author


Usually errors are found when the cluster transitions from an unhealthy to a healthy status. You want to ensure there is proper QPS coverage both while the failpoint is injected (see #17775) and after the cluster recovers. For example, when a member is killed, you want to see how the cluster reacts to the member rejoining it.

This validation ensures that the test strictly validates its assumptions: we want to make sure that the period after failpoint injection is also covered by requests. If we didn't, over time we might introduce a regression. For example, we could remove the health check after the failpoint, which could leave clusters broken and never recovering after failpoint injection. We need to make sure that every failpoint we introduce restores the cluster to health, so that we find not only issues that occur while the failpoint is active (for example a network split), but also its consequences (when the separated member rejoins the cluster).

Member Author


I want to make the validation even stricter here, to check not only total QPS but also to guarantee per-member QPS. This way we detect cases where a member didn't recover from the failpoint but the cluster is still serving data thanks to quorum availability.
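As a rough sketch of what such a check might look like, assuming per-member traffic stats were available; validatePerMemberQPS and its signature are hypothetical, not an existing robustness framework API:

package validate

import "testing"

// validatePerMemberQPS is a hypothetical helper: it fails the test if any
// single member served fewer queries per second than the required minimum
// after failpoint injection, even if the cluster as a whole met the target.
func validatePerMemberQPS(t *testing.T, memberQPS map[string]float64, minimalQPS float64) {
	for member, qps := range memberQPS {
		if qps < minimalQPS {
			t.Errorf("Requiring minimal %f qps on member %s after failpoint injection, got %f qps", minimalQPS, member, qps)
		}
	}
}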

Member Author


Heh, I needed to drop the validation for now. I was wrong to assume this validation would pass. Leaving a TODO for the future.

}
lg.Info("Triggering failpoint", zap.String("failpoint", failpoint.Name()))
start := time.Since(baseTime)
err = failpoint.Inject(ctx, t, lg, clus)
Member


I think we should keep the original retry here, just in case the failpoint HTTP call times out.

Member Author


I prefer to retry HTTP calls rather than whole failpoints. The reason is that a failpoint looks simple and retryable, since it's just a single Inject function, but underneath it's a multi-stage stateful process, with some stages not retryable, or at least not from the beginning. For example, a simple KILL failpoint first needs to kill the process, wait for it to exit, and then start it back up again. If we fail on the wait, can we really retry and kill it again? What about a failure on start?

Failpoint injection has nuances that make it hard to blindly retry, and trying to make failpoints externally retryable makes the internal code unnecessarily complicated. I would prefer that failpoints be able to decide how to handle their internal failures and whether they can retry on their own.
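To illustrate the distinction, here is a sketch of retrying only the HTTP call that enables a gofail-style failpoint (an idempotent request), rather than the whole Inject sequence; the endpoint layout, status handling, and retry parameters are assumptions, not the exact robustness framework code:

package failpoint

import (
	"context"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// enableFailpoint retries only the HTTP request that activates a failpoint.
// Enabling a failpoint is idempotent, so repeating the call on a timeout is
// safe, unlike retrying a multi-stage failpoint such as a process kill.
func enableFailpoint(ctx context.Context, gofailEndpoint, name, payload string) error {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodPut,
			gofailEndpoint+"/"+name, strings.NewReader(payload))
		if err != nil {
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusNoContent || resp.StatusCode == http.StatusOK {
				return nil
			}
			lastErr = fmt.Errorf("unexpected status %q enabling failpoint %s", resp.Status, name)
		} else {
			lastErr = err
		}
		time.Sleep(100 * time.Millisecond)
	}
	return lastErr
}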

Member


Thanks for the comment. It sounds good to me.

@serathius
Member Author

Hmm, looks like the validation of traffic after failpoint injection might not work either:

2024-04-19T20:41:33.6949208Z     logger.go:146: 2024-04-19T20:41:33.230Z	INFO	Reporting complete traffic	{"successes": 323, "failures": 465, "successRate": 0.4098984771573604, "period": "2.804586175s", "qps": 115.16850609876518}
2024-04-19T20:41:33.6951530Z     logger.go:146: 2024-04-19T20:41:33.230Z	INFO	Reporting traffic before failure injection	{"successes": 210, "failures": 115, "successRate": 0.6461538461538462, "period": "977.189618ms", "qps": 214.9019966358259}
2024-04-19T20:41:33.6953790Z     logger.go:146: 2024-04-19T20:41:33.230Z	INFO	Reporting traffic during failure injection	{"successes": 8, "failures": 108, "successRate": 0.06896551724137931, "period": "386.950683ms", "qps": 20.674469257882148}
2024-04-19T20:41:33.6955721Z     logger.go:146: 2024-04-19T20:41:33.230Z	INFO	Reporting traffic after failure injection	{"successes": 105, "failures": 242, "successRate": 0.3025936599423631, "period": "1.440445874s", "qps": 72.89409612346185}
2024-04-19T20:41:33.6957188Z     traffic.go:131: Requiring minimal 100.000000 qps after failpoint injection for test results to be reliable, got 72.894096 qps

@serathius serathius force-pushed the robustness-qps branch 3 times, most recently from 8e98288 to 7d02451 on April 20, 2024 08:08
@serathius
Member Author

serathius commented Apr 20, 2024

Or maybe it just shows that the QPS was head-heavy, i.e. mainly driven by requests sent before failpoint injection.
Logs show that only 30% of requests succeeded after failpoint injection, which might imply that latency increases after failpoint injection. But the success rate before the failpoint was also pretty low, 65%, which implies that the request timeout is too low.

I think we might need to increase the request timeout enough that the success rate before the failpoint is 100%.

@serathius serathius force-pushed the robustness-qps branch 3 times, most recently from 981a651 to c189e77 on April 20, 2024 08:31
@serathius
Member Author

I think we might need to increase the request timeout enough that the success rate before the failpoint is 100%.

Increasing the request timeout 5 times, to 200ms, doesn't help. At that point, increasing it further has the reverse effect of reducing QPS. For now I will drop the QPS validation post failpoint injection.
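Rough intuition for why a longer timeout can reduce QPS (illustrative numbers only, assuming the traffic generator keeps a roughly fixed number of requests in flight): QPS ≈ in-flight requests / average latency. With 10 in-flight requests and 50 ms average latency that gives ~200 QPS; if slow requests are allowed to run up to a 200 ms timeout instead of failing earlier, the average latency climbs and those same 10 slots might yield well under 100 QPS, even though more of the requests eventually succeed.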

@ahrtr
Member

ahrtr commented Apr 20, 2024

defer to @MadhavJivrajani @siyuanfoundation @ArkaSaha30 @fuweid to review robustness test PRs.

@fuweid
Member

fuweid commented Apr 21, 2024

For now I will drop the QPS validation post failpoint injection.

Not sure that I understand it correctly.
For the QPS profile, it seems the point is to make sure that the failpoint is triggered under a certain request pressure.
If so, the change looks good~.

@serathius
Member Author

For now I will drop the QPS validation post failpoint injection.

Not sure that I understand it correctly. For the QPS profile, it seems the point is to make sure that the failpoint is triggered under a certain request pressure. If so, the change looks good~.

Yes, I tried to also have a QPS validation after the failpoint is injected, but it causes too much flakiness.

Member

@fuweid fuweid left a comment


LGTM

@serathius serathius merged commit 062a0ea into etcd-io:main Apr 22, 2024
44 checks passed
Development

Successfully merging this pull request may close these issues.

Make robustness qps requirements less fragile to CI performance