Description
When a cluster (in a real environment or in simulation) does not work as expected, it takes time to identify which component is causing the problem.
Since GRV (get read version) is critical to transaction correctness and performance, we should consider adding unit tests that check its contract.
Correctness
- GRV should monotonically increase, even in different failure scenarios (which will be described later). A test workload can have multiple clients issue GRV requests and check that the returned versions monotonically increase both per client and across clients;
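The check a test workload would perform can be sketched as follows. This is a minimal illustration, not FoundationDB code: `GrvObservation` and `find_grv_violations` are hypothetical names, timestamps are assumed to come from a single shared clock (as in simulation), and "monotonic" is assumed to mean non-decreasing in real-time order, that is, a request issued after another request has completed must not see a smaller version. Concurrent requests are not ordered against each other.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GrvObservation:
    """One completed GRV request, as recorded by a (hypothetical) test client."""
    client: str
    start: float   # time the request was issued (shared clock, assumed)
    end: float     # time the reply was received (shared clock, assumed)
    version: int   # read version returned


def find_grv_violations(
    observations: List[GrvObservation],
) -> List[Tuple[GrvObservation, GrvObservation]]:
    """Return pairs (earlier, later) where `later` was issued after `earlier`
    completed but received a smaller version.

    Sequential requests from one client satisfy earlier.end <= later.start,
    so this single rule covers both the per-client and cross-client checks.
    The quadratic scan is fine for a small test; it is not tuned for scale.
    """
    violations = []
    for later in observations:
        for earlier in observations:
            if earlier is later:
                continue
            if earlier.end <= later.start and later.version < earlier.version:
                violations.append((earlier, later))
    return violations
```

A workload would collect observations from every client, run the check once at the end, and fail the test if the returned list is non-empty.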
Performance
- GRV latency should be similar for each client from each proxy;
- GRV throughput should match the expected level;
- GRV performance should not degrade much (the acceptable amount will be quantified) when a partial failure happens.
Partial failure: Failure that does not trigger master recovery.
- The network between a proxy and the master or a resolver is slow, so the latency on these links is higher;
- A proxy has a noisy neighbor and gets less CPU, cache, and memory bandwidth;
If only one proxy has a partial failure, an ideal system should redirect traffic to the other, healthy proxies. The GRV latency should not degrade much, and the GRV throughput should decrease only in proportion to the number of degraded proxies.
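The proportionality expectation can be turned into a simple pass/fail threshold. The sketch below is illustrative only: the function names and the `tolerance` parameter are hypothetical, and it assumes that healthy proxies share load evenly and that a degraded proxy contributes nothing in the worst case.

```python
def min_expected_throughput(
    baseline: float,
    total_proxies: int,
    degraded_proxies: int,
    tolerance: float = 0.0,
) -> float:
    """Lower bound on GRV throughput when some proxies are degraded.

    baseline:   throughput measured with all proxies healthy
    tolerance:  extra allowed slack as a fraction (hypothetical knob; the
                issue leaves the acceptable degradation to be quantified)
    """
    if total_proxies <= 0:
        raise ValueError("total_proxies must be positive")
    if not 0 <= degraded_proxies <= total_proxies:
        raise ValueError("degraded_proxies must be between 0 and total_proxies")
    healthy_fraction = (total_proxies - degraded_proxies) / total_proxies
    return baseline * healthy_fraction * (1.0 - tolerance)


def throughput_ok(
    measured: float,
    baseline: float,
    total_proxies: int,
    degraded_proxies: int,
    tolerance: float = 0.0,
) -> bool:
    """True if measured throughput dropped no more than proportionally."""
    floor = min_expected_throughput(
        baseline, total_proxies, degraded_proxies, tolerance
    )
    return measured >= floor
```

For example, with a baseline of 100000 GRV/s and one of four proxies degraded, the floor with zero tolerance is 75000 GRV/s; a measurement below that would indicate that throughput dropped more than proportionally.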
This is orthogonal to the failure monitoring project.
This issue focuses on testing and understanding whether the GRV contract is upheld and how the system's GRV requests react to failures.
cc. @sfc-gh-kmakino @sears @yliucode