Skip to content

Conversation

dfordivam
Copy link
Contributor

@dfordivam dfordivam commented Oct 3, 2025

As the validator backend is being stopped in this test, the event history sync is expected to not be in sync with other SVs

Fixes #2513

Pull Request Checklist

Cluster Testing

  • If a cluster test is required, comment /cluster_test on this PR to request it, and ping someone with access to the DA-internal system to approve it.
  • If a hard-migration test is required (from the latest release), comment /hdm_test on this PR to request it, and ping someone with access to the DA-internal system to approve it.

PR Guidelines

  • Include any change that might be observable by our partners or affect their deployment in the release notes.
  • Specify fixed issues with Fixes #n, and mention issues worked on using #n
  • Include a screenshot for frontend-related PRs - see README or use your favorite screenshot tool

Merge Guidelines

  • Make the git commit message look sensible when squash-merging on GitHub (most likely: just copy your PR description).

As the validator backend is being stopped in this test, the event
history sync is expected to not be in sync with other SVs

Signed-off-by: Divam <[email protected]>
@dfordivam
Copy link
Contributor Author

dfordivam commented Oct 3, 2025

It seems that CI had some problem, a re-run might fix it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logs pointed respectively:
Simtime: Test failed: 'SvTimeBasedRoundMgmtIntegrationTest/round management', message: forAll failed, because...
Wallclock: Test failed: 'SvReconcileSynchronizerConfigIntegrationTest/SV automation reconcile amulet config change to domain parameter', message: forAll failed, because...

This fix might still make sense, but it's not the one we stumbled upon, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in wallclock this test was also run, and the failure was reported for tests which ran together.

Copy link
Contributor

@OriolMunoz-da OriolMunoz-da Oct 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what I'm saying is that the failure happened, for wallclock tests, in SvReconcileSynchronizerConfigIntegrationTest as this log indicates:

│2025-10-02T13:23:23.288Z [⋮] WARN - c.d.c.LogReporter:reporter=scala-test (---) - Test failed: '***SvReconcileSynchronizerConfigIntegrationTest***/SV automation reconcile amulet config change to ┤
domain parameter', message: forAll failed, because:                                                                                                                                           │
│  at index 0, Vector((EventHistoryItem

Furthermore, the test SvDevNetReonboardingIntegrationTest.scala (which you've disabled here), which was scheduled to run afterwards, does not even manage to run because the plugin failed before it:

│2025-10-02T13:23:24.133Z [⋮] ERROR - o.l.s.i.t.***SvDevNetReonboardingIntegrationTest***:SvDevNetReonboardingIntegrationTest (20e07fc75471f6eb977531fc84ce3f87-newSpan--d5cbdb014e5d3ee8) - Failed ┤
creating environment: Creating fixture (EnvironmentSetup.scala:154) java.io.UncheckedIOException: Could not create Prometheus HTTP server

The error "Could not create Prometheus HTTP server" happens for silly reasons when the plugin fails, as it did above.
The point is, the flake happened in SvReconcileSynchronizerConfigIntegrationTest, not in SvDevNetReonboardingIntegrationTest, which didn't manage to run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the logs again, the SvDevNetReonboardingIntegrationTest did ran before the failure happened.

2025-10-02T13:14:30.5404944Z [info] Test still running after 1 minute, 9 seconds: suite name: SvDevNetReonboardingIntegrationTest, test name: Reonboarding an SV
 with the same name removes the old SV from PartyToParticipant and the Decentralized Namespace. 
 
 2025-10-02T13:23:00.2196524Z [info] Test still running after 9 minutes, 39 seconds: suite name: SvDevNetReonboardingIntegrationTest, test name: Reonboarding an 
SV with the same name removes the old SV from PartyToParticipant and the Decentralized Namespace. 

Though I agree that this fix is not sufficient to avoid the failure happening due to the plugin.
If a test which disables the EventHistorySanityCheckPlugin is run along with another test which does not disable it, then the failure could still happen in the test which did not disable the plugin, since the underlying DB used by the tests is the same in both. (AFAIK the DB is not reset when running different tests).

So the correct fix would be to run the tests requiring EventHistorySanityCheckPlugin to be disabled, separate from other tests. Perhaps the tests with the plugin disabled should be run in a separate bucket? (as done in test-full-class-names-*.log?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(let me number things so I don't get lost 😄 )

  1. ah, I see where the confusion is 😃 the test shows up as "running" according to the Test still running after log, but there's also this:
│2025-10-02T13:23:23.285Z [⋮] DEBUG - c.d.c.i.ConcurrentEnvironmentLimiter$ (---) - org.lfdecentralizedtrust.splice.integration.tests.SvDevNetReonboardingIntegrationTest: Queued => Running after 606s (3 pending)

and

│2025-10-02T13:23:24.131Z [⋮] DEBUG - o.l.s.i.t.SvDevNetReonboardingIntegrationTest:SvDevNetReonboardingIntegrationTest (20e07fc75471f6eb977531fc84ce3f87-newSpan--d5cbdb014e5d3ee8) - Starting creating environment for SvDevNetReonboardingIntegrationTest, test 'Reonboarding an SV with the same name removes the old SV from PartyToParticipant and the Decentralized Namespace':

so it's not actually doing any work until 13:23:24.131Z... sorry I didn't mention/catch that earlier 🤦


  1. As for running the tests using the plugin... we have the same problem in our other 2 sanity check plugins. So far when checking for test failures we've accepted "Prometheus failure = some other test failed before, likely in a plugin". I'm not too worried about that behavior (as annoying as it is), but of course we still want to figure out how it's flaking, because it could be a bug.

  1. I'm happy to merge this PR because it still fixes stuff, but SvTimeBasedRoundMgmtIntegrationTest and SvReconcileSynchronizerConfigIntegrationTest might flake again.

Copy link
Contributor Author

@dfordivam dfordivam Oct 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, how did you see these DEBUG messages?

Given this, the diff of the events reported by the plugin, after running the
SvReconcileSynchronizerConfigIntegrationTest, could be due to some real problem in synchronizer's workings / sync of events logic.

The SV4 synchronizer emitted a reject verdict with record_time "1:39:36.000156Z" (having confirming_parties: [dso, sv4, digital-asset-eng-4]. This verdict was captured in SV1's (or SV2) scan-app, but was not part of the other SV's scan events.
Its hard to say what could be the cause of this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, how did you see these DEBUG messages?

it's in the logs in the drive I shared with you, let me know if you don't have access / lost the link

You should also be able to find there by updateid (or similar) the verdict and updates as they happen across all 4 SVs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Mismatched EventHistoryItems in EventHistorySanityCheckPlugin

5 participants