Instability of some numerical results (condor vs CI) #200
Comments
After #198 the test of …
This also seems relevant: #60
Could you clarify where the failure occurred? Naively I would have thought that the platform on GitHub CI should be near identical between runs, so tests should not fail randomly there. Results from condor could still be different, though.
We're seeing it here in #204 with millipede_wilks.
Following up: the issue in #204 is not related to this after all.
Some bookkeeping of sporadic test failures, cause still unknown: https://github.com/icecube/skymap_scanner/actions/runs/6180290095/job/16776550606
The spline tables for all these cases are obtained using …
This is a good idea if we continue to rely on remote storage. What do you think with regard to #166?
I think the containers can be dataless and cvmfs-less if we really wanted, but in that case we should ensure the file transfer mechanism is robust. I don't think I have run into such issues on sub-2, but that could be because it's using the spline tables on cvmfs rather than over http.
I believe issues in downloads should result in truncated files rather than corrupted data, but a checksum is definitely a good idea.
I've seen corrupted data without truncation when transferring files. The probability of a bit flip that passes the TCP checksum is low, but it does happen when you move around enough bytes. Note that this does not happen with CVMFS, since it uses checksums internally. The easiest thing to do is to put a checksum file next to the file you want to download and, if it's available, perform the checksum test.
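A minimal sketch of that sidecar-checksum idea in Python, assuming a SHA-256 digest is published next to each downloaded file; the paths and file names below are placeholders, not files the project actually ships:

```python
import hashlib
from pathlib import Path


def verify_sidecar_checksum(data_path: Path, chunk_size: int = 1 << 20) -> bool:
    """Return True if data_path matches the digest in <name>.sha256 (or no sidecar exists)."""
    sidecar = data_path.with_name(data_path.name + ".sha256")
    if not sidecar.exists():
        return True  # nothing published to check against

    # sha256sum-style sidecar: "<hex digest>  <filename>"
    expected = sidecar.read_text().split()[0].strip().lower()

    digest = hashlib.sha256()
    with data_path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected


# Hypothetical usage after a wget/http download:
# assert verify_sidecar_checksum(Path("/tmp/splines/some_table.fits")), "corrupted or truncated download"
```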
The numerical issues in the tests are quite confounding. They seem to oscillate between two results for millipede_wilks, e.g. here and here. Basically, for what appears to be the same seed particle, the millipede unfolded particle can be different. Checking the OS indicates the CI runners are on the same OS (ubuntu_amd64). This doesn't rule out data corruption over wget, but the behaviour doesn't appear completely random either.
It looks like numpy picks up avx512 on some processors but not on others, which could explain why splinempe and wilks are the ones that fail.
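One way to confirm what each runner actually detects (a sketch; `__cpu_features__` is a private numpy attribute whose location depends on the numpy version, so treat this as a best-effort probe):

```python
import numpy as np

# numpy >= 1.24 can report this directly; it prints the SIMD extensions
# found / not found at runtime (AVX2, AVX512F, AVX512_SKX, ...).
if hasattr(np, "show_runtime"):
    np.show_runtime()

# The underlying detection table moved between numpy 1.x and 2.x.
try:
    from numpy.core._multiarray_umath import __cpu_features__   # numpy 1.x
except (ImportError, AttributeError):
    from numpy._core._multiarray_umath import __cpu_features__  # numpy 2.x

print({name: on for name, on in __cpu_features__.items() if name.startswith("AVX512")})
```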
More testing indicates it's not something in numpy/python/seeding but in the minimization/fitting with millipede (and possibly splinempe?). See the discussion on Slack for comparisons.
Possibly relevant: numpy/numpy#23523
That reminds me that you can force different optimizations with OpenBLAS: one setting for avx2 and another for avx (see the sketch below).
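The exact commands are not spelled out above, but assuming the mechanism is OpenBLAS's core-type override (which matches the Haswell/Sandybridge names used in the following comments), a sketch would look like:

```python
import os

# OPENBLAS_CORETYPE is read when OpenBLAS is loaded, so it has to be set
# before the first numpy import in the process (or exported in the shell /
# condor submit environment before launching python).
#   Haswell     -> avx2 kernels
#   Sandybridge -> avx kernels
#   SkylakeX    -> avx512 kernels
os.environ["OPENBLAS_CORETYPE"] = "Haswell"  # force avx2

import numpy as np  # noqa: E402  -- imported after setting the env var on purpose
```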
That did it for avx2, and it matches what I'm seeing on NPX. Setting it to Sandybridge does not recover what I get on cobalt6, though.
Testing on an AMD chip with avx512 indicates that its default is equivalent to Haswell, which is avx2.
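To double-check which kernel set OpenBLAS actually selected on a given machine, here is a sketch assuming numpy is linked against OpenBLAS and the threadpoolctl package is installed:

```python
from pprint import pprint

import numpy as np  # make sure OpenBLAS gets loaded
from threadpoolctl import threadpool_info

# For OpenBLAS, threadpoolctl reports the detected (or forced) core under
# "architecture" (e.g. "Haswell", "SkylakeX", "Zen"), alongside the version.
pprint([info for info in threadpool_info() if info.get("internal_api") == "openblas"])
```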
Testing a variety of OpenBLAS flags on Cobalt06 …
As we have been discussing for a while, it seems that the numerical results of millipede are sometimes not stable across runs on different platforms, in spite of containerisation.
While I am not sure there is an "issue" to solve, I would like to track some observations about this behaviour here. Updates to come.