Description
Current behavior 😯
The gix-ref-tests::refs packed::iter::performance test fails on a riscv64 machine when I run all tests and force archives to be regenerated by running:
GIX_TEST_IGNORE_ARCHIVES=1 cargo nextest run --no-default-features --features max-pure --all --no-fail-fast
The way it fails is:
Error: PermanentlyLocked { resource_path: "make_repository_with_lots_of_packed_refs", mode: AfterDurationWithBackoff(180s), attempts: 202 }
With a bit of further context:
SLOW [>180.000s] gix-ref-tests::refs packed::iter::performance
FAIL [ 180.162s] gix-ref-tests::refs packed::iter::performance
--- STDOUT: gix-ref-tests::refs packed::iter::performance ---
running 1 test
test packed::iter::performance has been running for over 60 seconds
test packed::iter::performance ... FAILED
failures:
failures:
packed::iter::performance
test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 124 filtered out; finished in 180.12s
--- STDERR: gix-ref-tests::refs packed::iter::performance ---
Error: PermanentlyLocked { resource_path: "make_repository_with_lots_of_packed_refs", mode: AfterDurationWithBackoff(180s), attempts: 202 }
PASS [ 212.628s] gix-ref-tests::refs packed::find::find_speed
------------
Summary [ 631.358s] 2538 tests run: 2537 passed (8 slow, 1 leaky), 1 failed, 3 skipped
FAIL [ 180.162s] gix-ref-tests::refs packed::iter::performance
error: test run failed
So this looks like a deadlock, though I don't know that it is one. This does not happen on an x64 machine. The x64 machine I have been testing on is faster, but it has the same number of cores as the riscv64 machine: four each. The riscv64 machine has this cat /proc/cpuinfo output, and more information on its hardware is available in zlib-ng/zlib-ng#1705 (comment) and on this page.
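The error above reports a lock that could not be obtained within 180 seconds after 202 attempts. As a rough illustration of how an error of that shape can arise, here is a minimal sketch of lock-file acquisition with a deadline and backoff. It is not the code gitoxide uses: the names AcquireError, LockGuard, and acquire_with_backoff, the lock-file name, and the delays are all hypothetical and chosen only for this example.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, ErrorKind};
use std::path::{Path, PathBuf};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical error type, loosely modeled on the shape of the reported error.
#[derive(Debug)]
enum AcquireError {
    PermanentlyLocked { resource_path: PathBuf, attempts: usize },
    Io(io::Error),
}

/// Hypothetical guard: the lock file is removed when the guard is dropped.
#[derive(Debug)]
struct LockGuard {
    path: PathBuf,
}

impl Drop for LockGuard {
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.path);
    }
}

/// Try to create `lock_path` exclusively, sleeping with a growing delay
/// between attempts, and give up once `max_wait` has elapsed.
fn acquire_with_backoff(lock_path: &Path, max_wait: Duration) -> Result<LockGuard, AcquireError> {
    let start = Instant::now();
    let mut delay = Duration::from_millis(1);
    let mut attempts = 0;
    loop {
        attempts += 1;
        match OpenOptions::new().write(true).create_new(true).open(lock_path) {
            Ok(_) => return Ok(LockGuard { path: lock_path.to_owned() }),
            Err(err) if err.kind() == ErrorKind::AlreadyExists => {
                if start.elapsed() >= max_wait {
                    return Err(AcquireError::PermanentlyLocked {
                        resource_path: lock_path.to_owned(),
                        attempts,
                    });
                }
                thread::sleep(delay);
                // Double the delay up to an arbitrary cap (assumption for this sketch).
                delay = (delay * 2).min(Duration::from_millis(50));
            }
            Err(err) => return Err(AcquireError::Io(err)),
        }
    }
}

fn main() {
    let lock_path = std::env::temp_dir().join(format!("lock-sketch-{}.lock", std::process::id()));
    let _ = fs::remove_file(&lock_path); // clear any stale file from an earlier run
    let wait = Duration::from_millis(100);

    // The first caller obtains the lock and keeps holding it.
    let holder = acquire_with_backoff(&lock_path, wait).expect("nothing else holds this lock yet");

    // A second caller cannot obtain it before the deadline and gives up.
    match acquire_with_backoff(&lock_path, wait) {
        Err(AcquireError::PermanentlyLocked { resource_path, attempts }) => {
            println!("gave up on {} after {attempts} attempts", resource_path.display())
        }
        other => println!("unexpected outcome: {other:?}"),
    }

    // Once the holder releases the lock, acquisition succeeds again.
    drop(holder);
    println!("after release: {}", acquire_with_backoff(&lock_path, wait).is_ok());
}
```

In this sketch, a waiter gives up whenever the holder keeps the lock longer than the deadline, even if the holder would eventually finish. So an error of this shape does not by itself distinguish a true deadlock from a holder that is merely slow. Whether either is what happens in the failing test is not established here.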
Full details of all max-pure runs are in this gist, which supersedes the earlier gist linked in rust-lang/libz-sys#218 (comment) where I first noticed the problem. The significance of that older gist is that it shows the problem happens even with max. Other runs I've done to investigate this have been with max-pure, to examine the problem independently of rust-lang/libz-sys#200 (zlib-ng/zlib-ng#1705).
The strangest thing is that the problem only occurs if I rebuild the code. Specifically, cleaning with git restore . followed by gix clean -xd -m '*generated*' -e is insufficient to cause the next full run to have the failure, but cleaning with git restore . followed by gix clean -xde is sufficient to cause the next full run to have the failure. I have verified that content under target/ is the only content reported by gix clean -xdn as eligible for deletion after running gix clean -xd -m '*generated*' -e, so it appears that, somehow, rebuilding is part of what is needed to produce the problem.
The "table of contents" in the readme of the new gist serves both as a summary of which kinds of runs produce which results, in the order in which I did the runs, and as a collection of links to the nine runs in case one wishes to examine them in detail. Although it is already present there, I quote it here:
All cleaning included git restore . where applicable, even if not mentioned in the summaries below.
run-1-x64-use-archives.txt - start out completely clean, no GIX_TEST_IGNORE_ARCHIVES, passes
run-2-x64-ignore-archives.txt - clean with gix clean -xde and set GIX_TEST_IGNORE_ARCHIVES=1, passes
run-3-riscv64-use-archives.txt - start out completely clean, passes
run-4-riscv64-ignore-archives.txt - clean with gix clean -xde and set GIX_TEST_IGNORE_ARCHIVES=1, failure
run-5-riscv64-ignore-archives-rerun.txt - equivalent rerun, clean with gix clean -xde (though not shown in transcript) and set GIX_TEST_IGNORE_ARCHIVES=1, failure
run-6-riscv64-ignore-archives-rerun-without-cleaning.txt - no cleaning, set GIX_TEST_IGNORE_ARCHIVES=1, passes
run-7-riscv64-ignore-archives-rerun-partial-clean.txt - clean with gix clean -xd -m '*generated*' -e and set GIX_TEST_IGNORE_ARCHIVES=1, passes
run-8-riscv64-ignore-archives-rerun-partial-clean-again.txt - equivalent rerun, clean with gix clean -xd -m '*generated*' -e and set GIX_TEST_IGNORE_ARCHIVES=1, passes
run-9-riscv64-ignore-archives-rerun-full-clean.txt - rerun as in runs 4 and 5, clean with gix clean -xde and set GIX_TEST_IGNORE_ARCHIVES=1, failure
All tests in the newer gist were run at the current tip of main, 612896d. They were all run on Ubuntu 24.04 LTS systems. The two x64 runs were on one system, and the seven riscv64 runs were, of course, on a different system (but the same system as each other). Because the older gist also shows the problem, it is not new, at least not newer than be2f093.
Expected behavior 🤔
All tests should pass.
Secondarily, it seems to me that when git restore . has been run and there are no intervening commands that modify the working tree, gix clean -xd -m '*generated*' -e should be as good as gix clean -xde at resetting state associated with fixtures that are forced to be rerun due to the use of GIX_TEST_IGNORE_ARCHIVES=1, at least when repeated re-running has not identified any nondeterminism in failures that occur related to fixtures. However, I am not certain that there is necessarily a specific, separate bug that corresponds to this second expectation.
Git behavior
Probably not applicable, unless the speed at which the git commands in the fixture script are run turns out to be a contributing factor (but I think that would still not be git behavior that corresponds to the code in gitoxide where the test fails).
Steps to reproduce 🕹
On a riscv64 machine running GNU/Linux (and specifically Ubuntu 24.04 LTS, if one wishes to reproduce the setup I used), either clone the gitoxide repo, or run git restore . and gix clean -xde in the already cloned gitoxide repo (after ensuring that one has no valuable modifications that could be lost by doing so). Then run:
GIX_TEST_IGNORE_ARCHIVES=1 cargo nextest run --no-default-features --features max-pure --all --no-fail-fast
To observe that it reliably happens when run this way, run gix clean -xde and then that test command again. This can be done as many times as desired.
To observe that rebuilding seems to be required to produce the problem, replace gix clean -xde with gix clean -xd -m '*generated*' -e in the above procedure and verify that (except on the first run, which is already clean) the problem does not occur. Going back to gix clean -xde verifies that the order is not the cause.
In case it turns out to be relevant, the git command on the system I used reports its version as git version 2.43.0 when running git version. It is more specifically the downstream version 1:2.43.0-1ubuntu7.1 (packaged for Ubuntu 24.04 LTS), as revealed by running apt list git.
It occurs to me that the inability to produce the problem without having just recently rebuilt might potentially be due to the effect of rebuilding on dynamic clock speed. However, I would expect at least some nondeterminism in the failures to be observed if this were the case, since the failing test is not one of the earliest tests to run. I may be able to investigate that further if other approaches do not reveal what is going on here.
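One way the clock-speed idea could be checked is to sample the current CPU frequency while the tests run and compare the values between a run that follows a rebuild and one that does not. The following is a minimal sketch, assuming the kernel exposes the Linux cpufreq sysfs interface (cpufreq/scaling_cur_freq, in kHz) under /sys/devices/system/cpu. I have not confirmed that the riscv64 machine exposes it, and the function names are hypothetical.

```rust
use std::fs;
use std::path::Path;

/// Parse the contents of a cpufreq `scaling_cur_freq` file
/// (a decimal number of kHz) into MHz.
fn parse_khz_to_mhz(contents: &str) -> Option<f64> {
    contents.trim().parse::<u64>().ok().map(|khz| khz as f64 / 1000.0)
}

/// Read the current frequency of every `cpuN` directory under `sysfs_cpu_root`
/// that has a readable `cpufreq/scaling_cur_freq` file.
/// Returns an empty list if the directory or the files are missing.
fn current_frequencies_mhz(sysfs_cpu_root: &Path) -> Vec<(String, f64)> {
    let mut out = Vec::new();
    let Ok(entries) = fs::read_dir(sysfs_cpu_root) else {
        return out;
    };
    for entry in entries.flatten() {
        let name = entry.file_name().to_string_lossy().into_owned();
        // Only consider directories named cpu0, cpu1, and so on.
        let is_cpu = name
            .strip_prefix("cpu")
            .map_or(false, |rest| !rest.is_empty() && rest.chars().all(|c| c.is_ascii_digit()));
        if !is_cpu {
            continue;
        }
        let freq_file = entry.path().join("cpufreq").join("scaling_cur_freq");
        if let Ok(contents) = fs::read_to_string(freq_file) {
            if let Some(mhz) = parse_khz_to_mhz(&contents) {
                out.push((name, mhz));
            }
        }
    }
    out.sort_by(|a, b| a.0.cmp(&b.0));
    out
}

fn main() {
    let samples = current_frequencies_mhz(Path::new("/sys/devices/system/cpu"));
    if samples.is_empty() {
        println!("no cpufreq information available on this system");
    }
    for (cpu, mhz) in samples {
        println!("{cpu}: {mhz:.0} MHz");
    }
}
```

Running something like this periodically in a second terminal during both kinds of runs would show whether the clock speed at the time the failing test starts actually differs between them.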