armv7-linux-androideabi cross test "attempt to subtract with overflow" in miniz_oxide 0.8.5 #1894
Comments
Even though I haven't read (or understood) the entire issue yet, I thought it would be a good idea to ping @oyvindln in case they have seen this already.
Thanks. Besides where quoted or linked above, the failures can be observed on CI in #1895, such as in this currently most recent run, where the first relevant backtrace is here.
Thanks for the ping - this does look like it could be a regression in miniz_oxide. I can do some testing on my Android TV box that runs armv7 Android and see if I can hit the issue. It would be easiest if it was possible to reduce the issue to just the input to miniz_oxide. Is there some way to more easily run tests on 32-bit and s390x on GitHub CI?

EDIT: So I didn't look closely enough at first - it's failing on 32-bit armv7 Android but not on i386 (or arm32v7) - that makes it a lot more confusing. If it was a simple regression in miniz_oxide from some changes I made, I would have expected it to fail on other 32-bit platforms as well, but that doesn't seem to be the case. The compress_fast function is one part that wasn't changed at all between those versions; the changes were all elsewhere. I did make some changes more recently that yielded performance improvements to it, but those have not landed in a release yet.
Current behavior 😯
Background
cf7f34d (#1882) bumped `prodash` from 29.0.0 to 29.0.1 in `Cargo.toml`, as well as various resolved dependency versions in `Cargo.lock`, which included bumping `miniz_oxide` from 0.8.0 to 0.8.5.

Around that time and starting somewhat earlier, I was experimenting with running tests via `cross` in the hope of finding a way to reproduce some `s390x-unknown-linux-gnu` failures without s390x hardware. As noted in #1890, I was unable to reproduce those failing s390x tests that way, and I still have not managed to do so.

However, I also experimented with another target in `cross`, with the intention that it would help me understand which failures were target-specific. I didn't get failures in s390x, but I expected I would, and that the other target could serve as a kind of control. It turned out that s390x was the control and the other target eventually had unexpected failures.

Specifically, for the other target, I chose `armv7-linux-androideabi`. I chose this because it had been partially tested via `cross` on CI before that test was replaced with a more comprehensive non-Android arm32v7 containerized (but non-`cross`) test on ARM hardware in #1777.

In its past testing on CI, `armv7-linux-androideabi` had problems using the external `git` command. Upgrading `git` in the `cross` container and patching the runner script to avoid preloading an unusable shared library overcame this, and I was able to run all tests with `cross` using that target.

All tests passed--aside from one fuzzing-related performance test that had to be skipped due to slowdown from QEMU emulation of an unrelated architecture without hardware acceleration, and a couple of cases where tests were marked with the wrong `#[cfg(...)]`, which was easily fixed. But...
Then `prodash`, and various locked dependencies, were upgraded

Starting after a rebase that included cf7f34d in the history, several test cases failed in `cross` runs on the `armv7-linux-androideabi` target. In the recent run presented here, this is eight tests. Running `git revert --no-commit cf7f34d` and rerunning the tests confirmed that this change is what triggered the failures.
I did a full `gix clean -xde` between the runs to ensure they did not contaminate each other. I also ran it a number of times with the intention of verifying that the effect is consistent. I initially thought the effect was completely consistent based on some very similar-looking runs, but actually the tests that fail differ; I suspect this is related to threading. I sometimes forced the Docker image to be rebuilt in between as well, to ensure contamination in the Docker image itself was not the cause (though I was not aware of a way that could happen), which it was not.

Full test run output of the failures, and of how they go away without the changes from cf7f34d, is presented in this gist. Although not shown there, I have also tried running the tests with `--locked`. This does not make a difference, which makes me think that the upgrade to `prodash` itself is relevant, though the way it is relevant could still in principle be through a dependency.
The failures are all specific to `armv7-linux-androideabi` with `cross`, or at least they do not happen in any of the other ways I have tested or that are tested on CI. In particular, this does not happen on the same system when running the tests natively, nor when running them on that system with `cross` with the `s390x-unknown-linux-gnu` target. Nor do they fail on either of the 32-bit targets tested (without `cross`) on CI, one of which is a non-Android ARMv7 target.

Indirect dependencies that may be relevant
The changes in prodash 29.0.1 are listed as replacing `humantime` with `jiff` and bumping `jiff` to version 0.2, which the diff confirms. Assuming the latest patch versions were used in both 0.1 and 0.2 in practice, there are a number of relevant differences, but none seem like they would affect `miniz_oxide`.

The reason I was looking for an effect on `miniz_oxide` is that the test failures are panics that happen there. gitoxide uses `miniz_oxide` through `flate2`. If there is any connection to `prodash` or `jiff`, I don't know what it is. But if there isn't, then I don't know why passing `--locked` made no difference: with or without `--locked`, the failures happened with the changes from cf7f34d, and did not happen without them.
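In case it helps with reducing the issue to just the input handed to `miniz_oxide` (as suggested in the comment above), the code path gitoxide exercises is, roughly, zlib streams driven through `flate2`. Here is a minimal sketch of that path, assuming `flate2`'s default `rust_backend` feature (which delegates to `miniz_oxide`); it is not the actual wrapper code in gitoxide:

```rust
use std::io::{Read, Write};

use flate2::{read::ZlibDecoder, write::ZlibEncoder, Compression};

// Minimal sketch: compress and decompress a buffer through flate2. With the
// default `rust_backend` feature, the deflate side is implemented by
// miniz_oxide, which is where the reported panics occur.
fn zlib_roundtrip(data: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut encoder = ZlibEncoder::new(Vec::new(), Compression::fast());
    encoder.write_all(data)?;
    let compressed = encoder.finish()?;

    let mut decoder = ZlibDecoder::new(&compressed[..]);
    let mut decompressed = Vec::new();
    decoder.read_to_end(&mut decompressed)?;
    Ok(decompressed)
}
```

If the input buffer of one of the failing tests could be captured, feeding it through something like this might make a standalone reproducer for the `miniz_oxide` side.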
But if we look at the `Cargo.lock` changes in cf7f34d, the `flate2` and `miniz_oxide` crates were among those upgraded:

Furthermore, the versions of `miniz_oxide` listed in `Cargo.lock` seem to be used even when I do not pass `--locked` or `--frozen`. The compiled version is 0.8.5 when the tests fail, and 0.8.0 when the tests pass with the revert, as though the locked versions are being used. Perhaps there is something I don't understand about when `Cargo.lock` versions are guaranteed to be used?

Which test modules have failures
Because I am using `cross`, I am using `cargo test` (via `cross test`) rather than `cargo nextest`. The output is accordingly much harder (at least for me) to read. Here's the broad summary at the end of the whole run:

There are always failures in these three modules and no others. Details about all the specific failures in the linked gist are as follows.
Summary of `gix` failures

The failures to which `-p gix --test gix` pertains were:

Here are links to their full details in the gist:
remote::fetch::blocking_and_async_io::fetch_more_packs_than_can_be_handled
remote::fetch::blocking_and_async_io::fetch_with_multi_round_negotiation
repository::object::commit_as::specify_committer_and_author
Summary of `gix-odb-tests` failures

The failures to which `-p gix-odb-tests --test integrate` pertains were:

Here are links to their full details in the gist:
odb::store::dynamic::write
odb::store::loose::write::collisions_do_not_cause_failure
odb::store::loose::write::it_writes_objects_with_similar_permissions
odb::store::loose::write::read_and_write
Summary of `gix-pack-tests` failures

The failure to which `-p gix-pack-tests --test pack` pertains was:

Here is a link to its full details in the gist:
pack::data::output::count_and_entries::traversals
Analysis
I don't know what's going on, but all of the failures are "attempt to subtract with overflow" errors in `miniz_oxide`. This is the case both in the run shown in the gist and in other runs where the tests that fail are sometimes slightly different. All the failures seem to resemble each other pretty closely. Here's the first failure output with its backtrace, to illustrate (the others can be viewed in the gist through the specific links, lest this issue become too long):

Looking at this step in the backtrace:
That is a reference to this line in `miniz_oxide` 0.8.5:

The blame shows the most recent change to that line as a removal of the `as usize` cast on `probe_pos`. This is from commit Frommi/miniz_oxide@0f50464, where several `as usize` casts are removed. But I think this cannot be the cause of the new failure, since this change is present in all versions of `miniz_oxide` since 0.7.2.
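As general context on the panic message itself (this is standard Rust behavior, not anything specific to `miniz_oxide`'s logic): an unsigned subtraction that underflows panics with "attempt to subtract with overflow" when overflow checks are enabled, which is the default for the debug profile these `cross test` runs presumably use, whereas a release build would wrap silently. A tiny sketch, with values chosen purely for illustration:

```rust
use std::hint::black_box;

fn main() {
    // Illustrative values only; black_box just keeps the compiler from
    // evaluating the subtraction at compile time.
    let pos: usize = black_box(0);
    let probe_pos: usize = black_box(1);

    // With overflow checks enabled (the debug-profile default), this line
    // panics with "attempt to subtract with overflow"; with them disabled
    // (the release default) it wraps around to usize::MAX instead.
    let dist = pos - probe_pos;
    println!("{dist}");
}
```

That does not explain why only this target hits the underflow, but it may explain why a problem of this kind would not surface in release builds of the same code.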
Inspecting a broader diff, there are substantial changes, including to `deflate/core.rs`, from 0.8.0 to 0.8.5, but I don't know which would cause this. Maybe something in the changelog would give a hint.

Assuming the intermediate versions are able to be resolved without causing problems with other dependencies, I might manage to bisect the `miniz_oxide` versions, or even `miniz_oxide` commits, on one of the failing gitoxide tests. But perhaps you will have a more specific idea of what is going on.

Expected behavior 🤔
Ordinarily I would say I expect that all tests pass. But `cross` images are strange environments in which to run `gitoxide` tests, because they are x86-64 containers in which code for another architecture is built and then, through QEMU, run. I generally expect a `cross` image to require significant customization to be usable for running `gitoxide` tests, since `gitoxide` interacts with various parts of the system, even in a `max-pure` build as used here.

If I had always observed these failures, then I would have assumed the problem was just with the container as I had set it up, and not assumed a bug. But it is strange that it was working fine (once I got `git` to a new enough version and kept `ld.so` messages from polluting `stderr`, as described below), but then cf7f34d broke it. That suggests to me that there is a bug somewhere.

Maybe there is a long-standing bug in gitoxide that this surfaces. Or maybe there is something wrong with the tools in the `cross` image I am using (either in the original image, or from my modifications, detailed below). Or maybe there is a bug in `miniz_oxide`, or in a crate that uses it such as `flate2`.

Git behavior
In Git itself
In the sense of `git` behavior that corresponds to the affected behavior of `gitoxide`, this is not applicable as far as I know.

In how we use it
I also suspect that the way the image is customized to be able to use `git` for tests that run it, either directly or through `.gitignore`d fixtures, is not relevant here. But just in case it is, the most `git`-relevant changes in the container are:

- Enable the `git-core` PPA.
- Upgrade `git`. This is still the x86-64 `git`. (`cross` images are x86-64 images with cross toolchains installed. It is possible to install `git` in a `cross` image for some `cross` targets, but probably not this one.)
- Patch the runner script `cross` uses, to keep it from setting `LD_PRELOAD` to include a `.so` file implementing the C++ standard library. This could be useful in a C++ Android program, but when `ld.so` attempts to preload it in an x86-64 executable, it writes this text to standard error:

Then the program actually runs. But when we capture standard error and parse or make assertions about it, we get failures. So I kept `LD_PRELOAD` from being set to that path, to avoid this.
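For a sense of why that stray message matters (a hypothetical sketch, not the actual assertion code in the test suite): anything that runs a child process and expects its standard error to be empty, or parses it, fails as soon as `ld.so` prepends its warning, even though the child itself ran fine.

```rust
use std::process::Command;

fn main() -> std::io::Result<()> {
    // Hypothetical stand-in for fixture/assertion code that treats any
    // stderr output as a failure. If ld.so emits a "cannot be preloaded"
    // warning before the program starts, this assertion trips even though
    // the command succeeded.
    let output = Command::new("git").arg("--version").output()?;
    assert!(output.status.success());
    assert!(
        output.stderr.is_empty(),
        "unexpected stderr: {}",
        String::from_utf8_lossy(&output.stderr)
    );
    Ok(())
}
```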
Full details of how I customized the container, including those not related to making `git` usable, are included below.

Steps to reproduce 🕹
Customizing the container
I'll open a pull request with the changes. That pull request will not fix this issue, but it will be possible to look at the changes in it, at least initially. However, I also intend to improve that pull request, both to be able to show these failures better--there are other problems on CI related to configuring `binfmt_misc` to use QEMU that currently obscure them--and to include further changes, both that I have already made and that I plan to make, for testing s390x. So the information about how the observations reported here were produced seems valuable to present here.

I made this `etc/docker/test-cross.toml` file:

And this `etc/docker/Dockerfile.test-cross` file:

And added these recipes to the `justfile`:

Running the tests
Then, as shown in the gist, I ran various commands, the most important of which were this command, producing the errors:
And these commands, cleaning the build directory and running the tests with the changes of cf7f34d undone. (To do this recently without having to manually resolve a conflict--but not originally, as shown in this older gist--I had to also revert 00d1a00, but that is not the cause.)
I have done this quite a few times; failures appear consistently (though, as noted above, exactly which tests fail can vary). Enabling the backtrace is not the cause. (That would be strange, but who knows--this is already strange.) I have also run the tests multiple times without `RUST_BACKTRACE=1`.
.The images are not stale
I have also tried forcing all Docker images to be reacquired or rebuilt, even before the second run, by running the following commands, where the third command is the one that actually deletes images, and the purpose of the others is to verify that there are no containers running or stopped, that there are images before I delete them, and that there are no images after I delete them. Please note that you might not want to run these commands, since they remove any Docker images that are not in use by containers.
I would not expect that to have made a difference, and it never did. The main value of it was to let me make sure the first run in the gist would show the whole process of customizing the Docker image, in case somehow that turns out to be relevant.
Minor patches to the test suite
Finally, though I think this is the least important for this issue, the changes to allow tests to pass instead of failing, or to run and pass instead of being skipped, were:
There are a few other changes on the feature branch, but you are unlikely to want to see them in this issue. They do not affect code exercised in any of the tests, and they will be included in the forthcoming PR (which, to be clear, is not expected to fix this issue).
Edits: The pull request
That PR is #1895. The commit at the point described above is 5d4eed7. Later commits may include modifications to some of the files shown here. (The history to that commit can be seen here, though it's possible that could eventually be GC'd following a rebase.)

As of e207b27 ("Configure binfmt_misc for QEMU on CI"), CI in #1895 shows the same bug as here, without any unrelated failures. So that may now be a good way to examine the bug. (Comparison of the failures there to those in the gist may also illuminate something about what changes and what stays the same across different runs.) For convenience, here's a failing run, and:
- `-p gix --test gix` failures
- `-p gix-odb-tests --test integrate` failures
- `-p gix-pack-tests --test pack` failures

The possible role of `cargo test` (vs. `cargo nextest`) seems small
On this separate branch (currently at this commit), I ran an experiment where other tests, especially the `test-32bit` jobs, are changed to be more like these, including by making them use `cargo test`. This does not ever seem to cause them to fail.

So while the nondeterminism in exactly which test cases fail suggests a possible role of `cargo test` using multithreading, it does not seem to be a significant contributing factor to the effect:

- The failures happen only on `armv7-linux-androideabi`.
- The other jobs do not fail even when they use `cargo test` so they are more similar to how `armv7-linux-androideabi` is tested.