Redundant manifest generations due to race condition #6850
The IT logs from the failed subtest (i=1). Note how two of the workers receive token
The request logs for that subtest, showing only the initial three requests, not the redirects. There are three requests, one for each worker. Two of them run more or less in parallel; the third is a straggler. The first two requests observe the execution from the previous subtest (i=0) and initiate the second execution: one request actually starts it, the other idempotently adopts it. This execution finishes before the straggler even attempts to start an execution. When the straggler finally does, it observes two completed executions: the one from the previous subtest (line 79) and the one from this subtest (line 81), which has just completed. The straggler then launches a third execution (line 83) because it is still under the false impression that the manifest object is absent: it checks for the presence of the manifest only once, at the beginning.
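The interleaving described above can be replayed deterministically in a small simulation. This is only a sketch under stated assumptions, not Azul's actual code: `ManifestService`, `start_or_adopt`, `complete`, `simulate` and the `recheck` flag are all hypothetical names, and re-checking for the manifest right before starting an execution is shown merely as one conceivable mitigation, not as the fix chosen for this issue.

```python
class ManifestService:
    """Hypothetical stand-in for the manifest service (not Azul's API)."""

    def __init__(self) -> None:
        # Assumption: the manifest object is absent at the start of the
        # subtest, while the execution from the previous subtest (i=0)
        # is still listed as completed.
        self.manifest_present = False
        self.executions = ['succeeded']

    def start_or_adopt(self) -> int:
        # Idempotently adopt a running execution, otherwise start a new one.
        for index, state in enumerate(self.executions):
            if state == 'running':
                return index
        self.executions.append('running')
        return len(self.executions) - 1

    def complete(self, index: int) -> None:
        # A finished execution leaves the manifest object behind.
        self.executions[index] = 'succeeded'
        self.manifest_present = True


def simulate(recheck: bool) -> int:
    """Replay the race; return the number of executions started in this subtest."""
    service = ManifestService()
    # All three workers check for the manifest once, at the beginning.
    saw_absent = [not service.manifest_present for _ in range(3)]
    first = service.start_or_adopt()   # worker 0 starts the execution
    second = service.start_or_adopt()  # worker 1 adopts the same execution
    assert first == second
    service.complete(first)            # finishes before the straggler acts
    # The straggler either trusts its stale observation or checks again.
    absent = (not service.manifest_present) if recheck else saw_absent[2]
    if absent:
        service.start_or_adopt()       # redundant third execution overall
    return len(service.executions) - 1  # exclude the previous subtest's execution
```

With `recheck=False` the straggler acts on its stale observation and a redundant execution is started; with `recheck=True` it notices the manifest and starts nothing. A re-check only narrows the window between the check and the start, so it would reduce but not necessarily eliminate the race.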
We can identify the following relevant event types:
A: the service detects that the requested manifest is absent
The following sequence represents the race:
@hannes-ucsc: "Assignee to file a partial PR that adds a workaround by setting the number of workers for that integration test from 3 to 1."
The race manifests intermittently as a failed assertion in an integration test that is intended to catch such races.
https://gitlab.azul.data.humancellatlas.org/ucsc/azul/-/jobs/71737
https://ucsc-gi.slack.com/archives/C705Y6G9Z/p1737762174254489?thread_ts=1737595686.060409&cid=C705Y6G9Z