Refactor and property test callbacks; log replicas on timeout #4030

jmckenzie-dev · 2025-04-01T17:03:16Z

Includes rewrite of Ideal CL metrics calculation

Patch by Josh McKenzie; reviewed by Jon Meredith and David Capwell for CASSANDRA-20505

- Includes rewrite of Ideal CL metrics calculation Patch by Josh McKenzie; reviewed by Jon Meredith and David Capwell for CASSANDRA-XXXXXX

src/java/org/apache/cassandra/batchlog/BatchlogManager.java

jonmeredith · 2025-04-01T20:48:13Z

src/java/org/apache/cassandra/locator/ReplicaPlan.java

+        public void checkStillAppliesTo(ClusterMetadata newMetadata)
+        {
+            stillAppliesTo(newMetadata);
+        }


Why not rename the stillAppliesTo method?

So this came in with TCM and I'm not sure the provenance, but we have:
public boolean stillAppliesTo(...) which is overridden in some child classes to actually return false in cases where it doesn't apply vs. the base that throws, and then we have usage of the method that takes the form:

if (stillAppliesTo(thing)) // looks like vestigial noop return; } // end of method

at the tail end of a method to throw if things have shifted.

My real beef here was that we had a method signature that looked like a noop from the outside because the semantics of the base version of the call are basically "return true or throw". I was going for a lightweight wrapper to shift the verb from "hey, I expect to get a boolean back", to "hey, void return. You either work or you barf."

In theory I could go deeper on the surgery to get the ReplicaPlan hierarchy away from overloading the usage patterns of this base method but... meh. Figured this semantic sugar verb wrapper would be enough to make me stop being sad.

src/java/org/apache/cassandra/service/StorageProxy.java

src/java/org/apache/cassandra/service/WriteResponseHandler.java

jonmeredith · 2025-04-01T22:10:54Z

src/java/org/apache/cassandra/service/WriteResponseHandler.java

+     * TODO: this method is brittle for its purpose of deciding when we should fail a query;
+     *       this needs to be aware of which nodes are live/down


is this TODO carried through in the refactor or should there be a JIRA here that tracks the work?

Carried through in the refactor. Probably should be a JIRA but I don't quite know the original context. Looks like it was Benedict on CASSANDRA-14404 but the method's been touched twice since then and then we have TCM in the mix, so not sure if it still applies.

jonmeredith · 2025-04-01T23:03:17Z

test/unit/org/apache/cassandra/utils/RandomHelpers.java

+    }
+
+    /**
+     * TODO: Add {@code Gen<BigInteger>} support into {@link AbstractTypeGenerators#PRIMITIVE_TYPE_DATA_GENS} and call that, similar to {@link #nextUUID}


still todo?

Yeah; added a note to make a follow up JIRA to move both nextBigInteger and nextBigDecimal into AbstractTypeGenerators.java.

test/unit/org/apache/cassandra/utils/RandomHelpers.java

jonmeredith · 2025-04-01T23:13:40Z

test/unit/org/apache/cassandra/utils/RandomHelpers.java

+        ByteBuffer bb = ByteBuffer.wrap(new byte[numBytes]);
+        for (int i = 0; i < numBytes / 4; i++)
+            bb.putInt(nextInt());


org.apache.cassandra.index.sai.SAITester.Randomization#nextBigInteger(int) is interesting, passing the random source into BigInteger to generate. It doesn't look like nextBoundedBigInteger that calls this is used.

Not here; that's used on 2 other patches that I'm working to upstream separately so it's a shared code artifact between them. 😇

I like the simplicity of the SAI impl but this impl lets us lean on RandomGenerator instead of Random which is where the ability to lean on the better period length and distribution of MersenneTwister comes in. Is it going to realistically make a big difference here? Almost certainly not. Did it let me sleep better at night when writing fuzz testing for paging across tombstones and I originally authored it?

Yes. Yes it did.

I'm receptive to the argument that we should K.I.S.S. on this one, but otherwise I'm fine leaving as is if you are.

jonmeredith · 2025-04-01T23:28:13Z

test/unit/org/apache/cassandra/net/CallbackPropertyTest.java

+        {
+            case PAXOS:
+                minRF = 2;
+                testedCL = EnumSet.of(ConsistencyLevel.SERIAL);


Won't overwriting the class level testedCLs mean the tested CLs are dependent on the order runTestMatrix is called if it isn't reset for each type?

... I believe you're right. Think we need to set that back in the NORMAL and default case to the default value of:

private EnumSet<ConsistencyLevel> defaultCLTests = EnumSet.of(ConsistencyLevel.ALL, ConsistencyLevel.QUORUM, ConsistencyLevel.ONE); private EnumSet<ConsistencyLevel> testedCL = EnumSet.of(ConsistencyLevel.ALL, ConsistencyLevel.QUORUM, ConsistencyLevel.ONE);

I originally wrote these tests as 1 property test per file but then realized I could easily co-locate them all in the same file, but that certainly shifted the implications in terms of durable static state now didn't it? 🤔

jonmeredith · 2025-04-01T23:59:46Z

test/unit/org/apache/cassandra/net/CallbackPropertyTest.java

+    {
+        runTestMatrix(CallbackType.EACH_QUORUM, () -> {
+            debugLog("---------------- CYCLE. rf: {}. nd1: {}. nd2: {}. quorum: {}", rf, dc1NodesDown, dc2NodesDown, quorum);
+            EachQuorumResponseHandler callback = new EachQuorumResponseHandler(eachQuorumPlan, null, WriteType.SIMPLE, null, requestTime);


I ran the test with coverage on and noticed the ideal CL stuff wasn't being excercised, so I tried adding

callback.trackIdealCL(writeReplicaPlan.withConsistencyLevel(ConsistencyLevel.ALL));

and found a counterexample

java.lang.AssertionError: Property falsified after 7 example(s) Smallest found falsifying value(s) :- {rf: 6, cl: ONE, nodesDown: 0} Cause was :- java.lang.AssertionError: Expected successful callback resolution but saw exception. at org.junit.Assert.fail(Assert.java:88) at org.apache.cassandra.net.CallbackPropertyTest.validateExceptionState(CallbackPropertyTest.java:773)

I'm assuming setting an ideal CL should never fail the request.

I'll take a look at this; I added the idealCL stuff last week (and by that I mean I rewrote it) and these property tests predated that change. I didn't think to update the property tests to include that aspect as well (given it's exercised at least nominally in WriteResponseHandlerTest); I'm glad you thought to check this.

Ok, I think this is because in the runMultiDCTest case, we don't actually initialize writeReplicaPlan but instead relied on eachQuorumPlan and localQuorumPlan. This is an artifact of the previous hierarchy of having an abstract base that each individual CL-based property test extended from and needing that context constructed and stored in the base to be available on each test invocation. I've tweaked it so all the write side tests regardless of CL use writeReplicaPlan, which will then not be null and then not NPE on the CallbackResponseTracker#buildDCMap call. Prelim "set idealCL and run this" is passing fine on both testEachQuorum and testLocalQuorum locally.

Open question as to whether we think we should gracefully allow passing in a null ReplicaPlan.ForWrite as an argument to that method; it's kind of nonsensical to say "build a DC map for me of this Non-Thing".

jonmeredith · 2025-04-02T17:48:17Z

Looking some more at coverage testing for the CallbackResponseTracker QT test.

It currently doesn't exercise a few aspects:

idealCL
pending endpoints during bootstrap
building a DCMap with non NetworkTopolgyStrategy -- not surprising as not exercised anywhere but EACH_QUORUM but that means it isn't tested, perhaps it should just assert that NTS is in use instead and leave anybody using it differently to extend the method.
recording duplicate responses
getting finalized failures (before or after finalization)

jmckenzie-dev · 2025-04-02T20:30:06Z

Looking some more at coverage testing for the CallbackResponseTracker QT test.

It currently doesn't exercise a few aspects:

idealCL

pending endpoints during bootstrap

building a DCMap with non NetworkTopolgyStrategy -- not surprising as not exercised anywhere but EACH_QUORUM but that means it isn't tested, perhaps it should just assert that NTS is in use instead and leave anybody using it differently to extend the method.

recording duplicate responses

getting finalized failures (before or after finalization)

I'm pro adding all of the above w/the following context:

idealCL coverage: just on or off, not necessarily checking that it's state as calculated is correct since we do that in WriteResponseHandlerTest, albeit at a reduced coverage of permutations. Since it's a metric thing I'm not too worried about property testing the hell out of it, at least as part of this PR
asserting NTS in use and blowing up if not
duplicate responses (I'd added this manually in some of the property tests at one point and then it dropped back out during rebase across... many branches. Will re-add)
finalized failures - should be able to snapshot 1-N of the callbacks during run through and then duplicate re-running them.

jmckenzie-dev added 3 commits March 31, 2025 17:18

Refactor and property test callbacks; log replicas on timeout

43f3100

- Includes rewrite of Ideal CL metrics calculation Patch by Josh McKenzie; reviewed by Jon Meredith and David Capwell for CASSANDRA-XXXXXX

Tweak to make JDK11 compat

d0c6177

Remove some leftover logging

ae87ae4

jmckenzie-dev requested a review from dcapwell April 1, 2025 20:14

jonmeredith reviewed Apr 2, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor and property test callbacks; log replicas on timeout #4030

Refactor and property test callbacks; log replicas on timeout #4030

jmckenzie-dev commented Apr 1, 2025

jonmeredith Apr 1, 2025

jmckenzie-dev Apr 2, 2025

jonmeredith Apr 1, 2025

jmckenzie-dev Apr 2, 2025

jonmeredith Apr 1, 2025

jmckenzie-dev Apr 2, 2025

jonmeredith Apr 1, 2025

jmckenzie-dev Apr 2, 2025

jonmeredith Apr 1, 2025

jmckenzie-dev Apr 2, 2025

jonmeredith Apr 1, 2025

jmckenzie-dev Apr 2, 2025

jmckenzie-dev Apr 2, 2025

jonmeredith commented Apr 2, 2025

jmckenzie-dev commented Apr 2, 2025

		* TODO: this method is brittle for its purpose of deciding when we should fail a query;
		* this needs to be aware of which nodes are live/down

Refactor and property test callbacks; log replicas on timeout #4030

Are you sure you want to change the base?

Refactor and property test callbacks; log replicas on timeout #4030

Conversation

jmckenzie-dev commented Apr 1, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jonmeredith commented Apr 2, 2025

jmckenzie-dev commented Apr 2, 2025