Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor and property test callbacks; log replicas on timeout #4030

Open
wants to merge 3 commits into
base: trunk
Choose a base branch
from

Conversation

jmckenzie-dev
Copy link
Contributor

  • Includes rewrite of Ideal CL metrics calculation

Patch by Josh McKenzie; reviewed by Jon Meredith and David Capwell for CASSANDRA-20505

- Includes rewrite of Ideal CL metrics calculation

Patch by Josh McKenzie; reviewed by Jon Meredith and David Capwell for CASSANDRA-XXXXXX
@jmckenzie-dev jmckenzie-dev requested a review from dcapwell April 1, 2025 20:14
Comment on lines +369 to +372
public void checkStillAppliesTo(ClusterMetadata newMetadata)
{
stillAppliesTo(newMetadata);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not rename the stillAppliesTo method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this came in with TCM and I'm not sure the provenance, but we have:
public boolean stillAppliesTo(...) which is overridden in some child classes to actually return false in cases where it doesn't apply vs. the base that throws, and then we have usage of the method that takes the form:

if (stillAppliesTo(thing)) // looks like vestigial noop
   return;
} // end of method

at the tail end of a method to throw if things have shifted.

My real beef here was that we had a method signature that looked like a noop from the outside because the semantics of the base version of the call are basically "return true or throw". I was going for a lightweight wrapper to shift the verb from "hey, I expect to get a boolean back", to "hey, void return. You either work or you barf."

In theory I could go deeper on the surgery to get the ReplicaPlan hierarchy away from overloading the usage patterns of this base method but... meh. Figured this semantic sugar verb wrapper would be enough to make me stop being sad.

Comment on lines +211 to +212
* TODO: this method is brittle for its purpose of deciding when we should fail a query;
* this needs to be aware of which nodes are live/down
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this TODO carried through in the refactor or should there be a JIRA here that tracks the work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Carried through in the refactor. Probably should be a JIRA but I don't quite know the original context. Looks like it was Benedict on CASSANDRA-14404 but the method's been touched twice since then and then we have TCM in the mix, so not sure if it still applies.

}

/**
* TODO: Add {@code Gen<BigInteger>} support into {@link AbstractTypeGenerators#PRIMITIVE_TYPE_DATA_GENS} and call that, similar to {@link #nextUUID}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still todo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah; added a note to make a follow up JIRA to move both nextBigInteger and nextBigDecimal into AbstractTypeGenerators.java.

Comment on lines +138 to +140
ByteBuffer bb = ByteBuffer.wrap(new byte[numBytes]);
for (int i = 0; i < numBytes / 4; i++)
bb.putInt(nextInt());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

org.apache.cassandra.index.sai.SAITester.Randomization#nextBigInteger(int) is interesting, passing the random source into BigInteger to generate. It doesn't look like nextBoundedBigInteger that calls this is used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not here; that's used on 2 other patches that I'm working to upstream separately so it's a shared code artifact between them. 😇

I like the simplicity of the SAI impl but this impl lets us lean on RandomGenerator instead of Random which is where the ability to lean on the better period length and distribution of MersenneTwister comes in. Is it going to realistically make a big difference here? Almost certainly not. Did it let me sleep better at night when writing fuzz testing for paging across tombstones and I originally authored it?

Yes. Yes it did.

I'm receptive to the argument that we should K.I.S.S. on this one, but otherwise I'm fine leaving as is if you are.

{
case PAXOS:
minRF = 2;
testedCL = EnumSet.of(ConsistencyLevel.SERIAL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't overwriting the class level testedCLs mean the tested CLs are dependent on the order runTestMatrix is called if it isn't reset for each type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I believe you're right. Think we need to set that back in the NORMAL and default case to the default value of:

    private EnumSet<ConsistencyLevel> defaultCLTests = EnumSet.of(ConsistencyLevel.ALL, ConsistencyLevel.QUORUM, ConsistencyLevel.ONE);
    private EnumSet<ConsistencyLevel> testedCL = EnumSet.of(ConsistencyLevel.ALL, ConsistencyLevel.QUORUM, ConsistencyLevel.ONE);

I originally wrote these tests as 1 property test per file but then realized I could easily co-locate them all in the same file, but that certainly shifted the implications in terms of durable static state now didn't it? 🤔

{
runTestMatrix(CallbackType.EACH_QUORUM, () -> {
debugLog("---------------- CYCLE. rf: {}. nd1: {}. nd2: {}. quorum: {}", rf, dc1NodesDown, dc2NodesDown, quorum);
EachQuorumResponseHandler callback = new EachQuorumResponseHandler(eachQuorumPlan, null, WriteType.SIMPLE, null, requestTime);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the test with coverage on and noticed the ideal CL stuff wasn't being excercised, so I tried adding

            callback.trackIdealCL(writeReplicaPlan.withConsistencyLevel(ConsistencyLevel.ALL));

and found a counterexample

java.lang.AssertionError: Property falsified after 7 example(s) 
Smallest found falsifying value(s) :-
{rf: 6, cl: ONE, nodesDown: 0}
Cause was :-
java.lang.AssertionError: Expected successful callback resolution but saw exception.
	at org.junit.Assert.fail(Assert.java:88)
	at org.apache.cassandra.net.CallbackPropertyTest.validateExceptionState(CallbackPropertyTest.java:773)

I'm assuming setting an ideal CL should never fail the request.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a look at this; I added the idealCL stuff last week (and by that I mean I rewrote it) and these property tests predated that change. I didn't think to update the property tests to include that aspect as well (given it's exercised at least nominally in WriteResponseHandlerTest); I'm glad you thought to check this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think this is because in the runMultiDCTest case, we don't actually initialize writeReplicaPlan but instead relied on eachQuorumPlan and localQuorumPlan. This is an artifact of the previous hierarchy of having an abstract base that each individual CL-based property test extended from and needing that context constructed and stored in the base to be available on each test invocation. I've tweaked it so all the write side tests regardless of CL use writeReplicaPlan, which will then not be null and then not NPE on the CallbackResponseTracker#buildDCMap call. Prelim "set idealCL and run this" is passing fine on both testEachQuorum and testLocalQuorum locally.

Open question as to whether we think we should gracefully allow passing in a null ReplicaPlan.ForWrite as an argument to that method; it's kind of nonsensical to say "build a DC map for me of this Non-Thing".

@jonmeredith
Copy link
Contributor

Looking some more at coverage testing for the CallbackResponseTracker QT test.

It currently doesn't exercise a few aspects:

  • idealCL
  • pending endpoints during bootstrap
  • building a DCMap with non NetworkTopolgyStrategy -- not surprising as not exercised anywhere but EACH_QUORUM but that means it isn't tested, perhaps it should just assert that NTS is in use instead and leave anybody using it differently to extend the method.
  • recording duplicate responses
  • getting finalized failures (before or after finalization)

@jmckenzie-dev
Copy link
Contributor Author

Looking some more at coverage testing for the CallbackResponseTracker QT test.

It currently doesn't exercise a few aspects:

  • idealCL
  • pending endpoints during bootstrap
  • building a DCMap with non NetworkTopolgyStrategy -- not surprising as not exercised anywhere but EACH_QUORUM but that means it isn't tested, perhaps it should just assert that NTS is in use instead and leave anybody using it differently to extend the method.
  • recording duplicate responses
  • getting finalized failures (before or after finalization)

I'm pro adding all of the above w/the following context:

  1. idealCL coverage: just on or off, not necessarily checking that it's state as calculated is correct since we do that in WriteResponseHandlerTest, albeit at a reduced coverage of permutations. Since it's a metric thing I'm not too worried about property testing the hell out of it, at least as part of this PR
  2. asserting NTS in use and blowing up if not
  3. duplicate responses (I'd added this manually in some of the property tests at one point and then it dropped back out during rebase across... many branches. Will re-add)
  4. finalized failures - should be able to snapshot 1-N of the callbacks during run through and then duplicate re-running them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants