
Don't require minimal for failpoint injection period #17825

Merged
merged 1 commit into etcd-io:main from robustness-qps on Apr 22, 2024

Conversation

serathius
Member

@serathius serathius changed the title from "Don't require qps requirements failpoint injection period" to "Don't require minimal for failpoint injection period" on Apr 19, 2024
triggerTimeout = time.Minute
waitBetweenFailpointTriggers = time.Second
failpointInjectionsCount = 1
failpointInjectionsRetries = 3
Contributor


Why are the retries removed? Were they ever useful?

Member Author


It was introduced as a band-aid solution to flakes. It reduced the problem a little, but there were still some failpoint issues that retries didn't help with. Over time we improved things, especially after we discovered an issue with process execution.

Hard to say if removing it will expose some issues; if it does, we should see them in CI.

Member Author

@serathius serathius Apr 22, 2024


Checked logs from one of the robustness test runs https://github.com/etcd-io/etcd/actions/runs/8781749612. Didn't find any "Failed to trigger failpoint" logs.

tests/robustness/main_test.go
@@ -110,27 +110,30 @@ func (s testScenario) run(ctx context.Context, t *testing.T, lg *zap.Logger, clu
defer cancel()
g := errgroup.Group{}
var operationReport, watchReport []report.ClientReport
finishTraffic := make(chan struct{})
failpointInjected := make(chan failpoint.InjectionReport, 1)

// using baseTime time-measuring operation to get monotonic clock reading
// see https://github.com/golang/go/blob/master/src/time/time.go#L17
baseTime := time.Now()
ids := identity.NewIDProvider()
g.Go(func() error {
Contributor


Also, not for this PR, but we should probably get rid of the errgroup.Group; it's probably a good idea to just use a sync.WaitGroup and goroutines.

Member Author


Why? This code was changed to use errgroup to make error handling cleaner.

Contributor


Right, but IIUC errgroups are useful only if you return an error; we always return nil, AFAICT.

Member Author


Ok, makes sense.
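A minimal sketch of the suggested alternative, assuming the goroutines never return a real error; runScenario and its comments are illustrative, not the actual test code:

package main

import "sync"

func runScenario() {
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		// drive traffic against the cluster until signalled to stop
	}()
	go func() {
		defer wg.Done()
		// trigger the failpoint, then report the injection result
	}()
	// No error aggregation is needed when the goroutines only ever
	// return nil, so a plain WaitGroup replaces the errgroup.Group.
	wg.Wait()
}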

t.Errorf("Requiring minimal %f qps before failpoint injection for test results to be reliable, got %f qps", profile.MinimalQPS, beforeFailpointStats.QPS())
}
if afterFailpointStats.QPS() < profile.MinimalQPS {
t.Errorf("Requiring minimal %f qps after failpoint injection for test results to be reliable, got %f qps", profile.MinimalQPS, afterFailpointStats.QPS())
Contributor


Thinking more here, do we want the tests to keep running if we deem the results non-reliable? Is there value in failing fast here?

Member Author


Usually errors are found when the cluster transitions from an unhealthy to a healthy status. You want to ensure there is proper QPS coverage both while the failpoint is injected (see #17775) and after the cluster recovers. For example, when a member is killed, you want to see how the cluster reacts to the member rejoining it.

This validation ensures that the test strictly validates its assumptions: we want to make sure that the period after failpoint injection is also covered by requests. If we didn't, over time we might introduce a regression. For example, we could remove the health check after the failpoint, which could leave clusters broken and never recovering after failpoint injection. We need to make sure that every failpoint we introduce restores the cluster to health, so that we find not only issues that occur while the failpoint is active (for example a network split), but also its consequences (when the separated member rejoins the cluster).

Member Author


I want to make the validation even stricter here, to check not only total QPS but also to guarantee per-member QPS. This way we detect cases where a member didn't recover from the failpoint but the cluster is still serving data thanks to quorum availability.
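As a rough sketch of what such a check might look like, assuming per-member traffic stats were available; validatePerMemberQPS and its signature are hypothetical, not an existing robustness framework API:

package validate

import "testing"

// validatePerMemberQPS is a hypothetical helper: it fails the test if any
// single member served fewer queries per second than the required minimum
// after failpoint injection, even if the cluster as a whole met the target.
func validatePerMemberQPS(t *testing.T, memberQPS map[string]float64, minimalQPS float64) {
	for member, qps := range memberQPS {
		if qps < minimalQPS {
			t.Errorf("Requiring minimal %f qps on member %s after failpoint injection, got %f qps", minimalQPS, member, qps)
		}
	}
}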

Member Author


Heh, I needed to drop the validation for now. I was wrong to assume this validation would pass. Leaving a TODO for the future.

}
lg.Info("Triggering failpoint", zap.String("failpoint", failpoint.Name()))
start := time.Since(baseTime)
err = failpoint.Inject(ctx, t, lg, clus)
Member


I think we should keep the original retry here, just in case the failpoint HTTP call times out.

Member Author


I prefer to retry HTTP calls rather than whole failpoints. The reason is that a failpoint looks simple and retryable, since it's just a single Inject function, but underneath it's a multi-stage stateful process, with some stages not retryable, or at least not from the beginning. For example, a simple KILL failpoint first needs to kill the process, wait for it to exit, and then start it back up again. If we fail on the wait, can we really retry and kill it again? What about a failure on start?

Failpoint injection has nuances that make it hard to blindly retry, and trying to make failpoints externally retryable makes the internal code unnecessarily complicated. I would prefer that failpoints be able to decide how to handle their internal failures and whether they can retry on their own.
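To illustrate the distinction, here is a sketch of retrying only the HTTP call that enables a gofail-style failpoint (an idempotent request), rather than the whole Inject sequence; the endpoint layout, status handling, and retry parameters are assumptions, not the exact robustness framework code:

package failpoint

import (
	"context"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// enableFailpoint retries only the HTTP request that activates a failpoint.
// Enabling a failpoint is idempotent, so repeating the call on a timeout is
// safe, unlike retrying a multi-stage failpoint such as a process kill.
func enableFailpoint(ctx context.Context, gofailEndpoint, name, payload string) error {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodPut,
			gofailEndpoint+"/"+name, strings.NewReader(payload))
		if err != nil {
			return err
		}
		resp, err := http.DefaultClient.Do(req)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusNoContent || resp.StatusCode == http.StatusOK {
				return nil
			}
			lastErr = fmt.Errorf("unexpected status %q enabling failpoint %s", resp.Status, name)
		} else {
			lastErr = err
		}
		time.Sleep(100 * time.Millisecond)
	}
	return lastErr
}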

Member


Thanks for the comment. It sounds good to me.

@serathius
Member Author

Hmm, looks like the validation of traffic after failpoint injection might not work either:

2024-04-19T20:41:33.6949208Z     logger.go:146: 2024-04-19T20:41:33.230Z	INFO	Reporting complete traffic	{"successes": 323, "failures": 465, "successRate": 0.4098984771573604, "period": "2.804586175s", "qps": 115.16850609876518}
2024-04-19T20:41:33.6951530Z     logger.go:146: 2024-04-19T20:41:33.230Z	INFO	Reporting traffic before failure injection	{"successes": 210, "failures": 115, "successRate": 0.6461538461538462, "period": "977.189618ms", "qps": 214.9019966358259}
2024-04-19T20:41:33.6953790Z     logger.go:146: 2024-04-19T20:41:33.230Z	INFO	Reporting traffic during failure injection	{"successes": 8, "failures": 108, "successRate": 0.06896551724137931, "period": "386.950683ms", "qps": 20.674469257882148}
2024-04-19T20:41:33.6955721Z     logger.go:146: 2024-04-19T20:41:33.230Z	INFO	Reporting traffic after failure injection	{"successes": 105, "failures": 242, "successRate": 0.3025936599423631, "period": "1.440445874s", "qps": 72.89409612346185}
2024-04-19T20:41:33.6957188Z     traffic.go:131: Requiring minimal 100.000000 qps after failpoint injection for test results to be reliable, got 72.894096 qps

@serathius serathius force-pushed the robustness-qps branch 3 times, most recently from 8e98288 to 7d02451 on April 20, 2024 08:08
@serathius
Member Author

serathius commented Apr 20, 2024

Or maybe it just shows that the QPS was head-heavy, i.e. mainly driven by requests sent before failpoint injection.
Logs show that only 30% of requests succeeded after failpoint injection, which might imply that latency increases after failpoint injection. But the success rate before the failpoint was also pretty low, 65%, which implies that the request timeout is too low.

I think we might need to increase the request timeout enough that the success rate before the failpoint is 100%.

@serathius serathius force-pushed the robustness-qps branch 3 times, most recently from 981a651 to c189e77 on April 20, 2024 08:31
@serathius
Member Author

I think we might need to increase the request timeout enough that the success rate before the failpoint is 100%.

Increasing the request timeout 5 times, to 200ms, doesn't help. At that point, increasing it further has the reverse effect of reducing QPS. For now I will drop the QPS validation post failpoint injection.
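Rough intuition for why a longer timeout can reduce QPS (illustrative numbers only, assuming the traffic generator keeps a roughly fixed number of requests in flight): QPS ≈ in-flight requests / average latency. With 10 in-flight requests and 50 ms average latency that gives ~200 QPS; if slow requests are allowed to run up to a 200 ms timeout instead of failing earlier, the average latency climbs and those same 10 slots might yield well under 100 QPS, even though more of the requests eventually succeed.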

@ahrtr
Member

ahrtr commented Apr 20, 2024

defer to @MadhavJivrajani @siyuanfoundation @ArkaSaha30 @fuweid to review robustness test PRs.

@fuweid
Member

fuweid commented Apr 21, 2024

For now I will drop the QPS validation post failpoint injection.

Not sure that I understand it correctly.
For the QPS profile, it seems the point is to make sure that the failpoint is triggered under a certain request pressure.
If so, the change looks good~.

@serathius
Member Author

For now I will drop the QPS validation post failpoint injection.

Not sure that I understand it correctly. For the QPS profile, it seems the point is to make sure that the failpoint is triggered under a certain request pressure. If so, the change looks good~.

Yes, I tried to also have a QPS validation after the failpoint is injected, but it causes too much flakiness.

Member

@fuweid fuweid left a comment


LGTM

@serathius serathius merged commit 062a0ea into etcd-io:main Apr 22, 2024
44 checks passed
Development

Successfully merging this pull request may close these issues.

Make robustness qps requirements less fragile to CI performance