Incorrect results with Archer #42

Open
rolandschulz opened this issue May 17, 2017 · 5 comments
@rolandschulz

With Archer (dc4e363) built out of source with LLVM 4.0 and OMP-TR4, running the GROMACS unit tests:

git init gromacs && cd gromacs
git fetch https://gerrit.gromacs.org/gromacs refs/changes/48/6648/1 && git checkout FETCH_HEAD
mkdir build && cd build
CC=clang-archer CXX=clang-archer++ cmake -GNinja -DGMX_OPENMP_MAX_THREADS=256 -DGMX_BUILD_HELP=OFF -DBUILD_SHARED_LIBS=yes -DCMAKE_BUILD_TYPE=RelWithDebInfo -DGMX_HWLOC=no .. -DGMX_SIMD=None -DTMPI_ATOMICS_DISABLED=yes -DGMX_MPI=on
ninja check

Produces:

The following tests FAILED:
          7 - EwaldUnitTests (Failed)
         12 - MdrunUtilityMpiUnitTests (Failed)
         23 - CorrelationsTest (Failed)

When compiling and running with clang 4.0 without Archer (with or without OMP-TR4), all tests pass. All unit tests also pass with VS2015, ICC 16 and 17, GCC 4.8-7.1, and a few other compilers we check less regularly. Thus it is highly unlikely that the unit test failures are caused by problems in the GROMACS source code.

@dongahn
Contributor

dongahn commented May 17, 2017

@rolandschulz: thank you for using Archer on your build-and-test system.

Could you describe more specifically how those unit tests failed? A common cause of something like this in our environment is that Archer/TSan detects an error (directly in the unit test code or in some other tester component) and makes the process exit with return code 66. See https://github.com/google/sanitizers/wiki/ThreadSanitizerFlags

Could you try setting the environment variable TSAN_OPTIONS="exitcode=0" before you run these unit tests and see whether it makes any difference?
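
A minimal sketch of the suggestion above, assuming a POSIX shell. The helper name `run_tsan_exit0` is hypothetical and not part of Archer or TSan; only the `exitcode` flag itself comes from the ThreadSanitizer flags page linked above. It runs one command with `TSAN_OPTIONS="exitcode=0"` and prints the resulting exit status, so that a TSan report alone should no longer turn an otherwise passing run into a failing one.

```shell
# Hypothetical helper: run a command with TSan's exit-code override set
# to 0 and report the exit status the command actually returned.
run_tsan_exit0() {
  # env scopes the variable to this one command only.
  env TSAN_OPTIONS="exitcode=0" "$@"
  status=$?
  echo "exit status: $status"
  return "$status"
}

# Demo with a trivial command; in the build directory one would pass
# the real test command instead, e.g.: run_tsan_exit0 ninja check
run_tsan_exit0 true
```

If a test still fails under this setting, the failure comes from the test's own assertions rather than from TSan's exit code.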

@rolandschulz
Author

All three failures are incorrect results, not TSan errors. All three also still occur with OMP_NUM_THREADS=1. Thus it seems that the LLVM pass added by Archer causes incorrect results. When compiling and running all unit tests with clang 4.0 with TSan (without Archer), all unit tests give the correct result (some tests produce false-positive TSan warnings, but these are different tests from the ones that fail with Archer). For example, running the first failing test by itself gives:

./bin/ewald-test --gtest_filter=SaneInput1/PmeBSplineModuliCorrectnessTest.ReproducesValues/0
Note: Google Test filter = SaneInput1/PmeBSplineModuliCorrectnessTest.ReproducesValues/0
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from SaneInput1/PmeBSplineModuliCorrectnessTest
[ RUN      ] SaneInput1/PmeBSplineModuliCorrectnessTest.ReproducesValues/0
../src/testutils/refdata.cpp:900: Failure
  In item: /X/Length
   Actual: '-1782689792'
Reference: '64'
Google Test trace:
../src/gromacs/ewald/tests/pmebsplinetest.cpp:95: Testing B-spline moduli creation (plain) for PME order 3, grid size 64 32 64
../src/testutils/refdata.cpp:900: Failure
  In item: /Y/Length
   Actual: '-1782689792'
Reference: '32'
Google Test trace:
../src/gromacs/ewald/tests/pmebsplinetest.cpp:95: Testing B-spline moduli creation (plain) for PME order 3, grid size 64 32 64
../src/testutils/refdata.cpp:900: Failure
  In item: /Z/Length
   Actual: '-1782689792'
Reference: '64'
Google Test trace:
../src/gromacs/ewald/tests/pmebsplinetest.cpp:95: Testing B-spline moduli creation (plain) for PME order 3, grid size 64 32 64
[  FAILED  ] SaneInput1/PmeBSplineModuliCorrectnessTest.ReproducesValues/0, where GetParam() = (12-byte object <40-00 00-00 20-00 00-00 40-00 00-00>, 3, 4-byte object <00-00 00-00>) (52 ms)
[----------] 1 test from SaneInput1/PmeBSplineModuliCorrectnessTest (55 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (56 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SaneInput1/PmeBSplineModuliCorrectnessTest.ReproducesValues/0, where GetParam() = (12-byte object <40-00 00-00 20-00 00-00 40-00 00-00>, 3, 4-byte object <00-00 00-00>)

 1 FAILED TEST
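
As a reading aid for the log above: the 12-byte `GetParam()` object can be decoded as three 32-bit integers, which matches the grid size 64 32 64 named in the trace. The sketch below assumes 4-byte little-endian integers; the function name `decode_le32` is made up for illustration and is not part of GROMACS or Google Test.

```shell
# Hypothetical helper: decode a gtest byte dump such as
# "40-00 00-00 20-00 00-00" into 32-bit integers.
# Assumes little-endian byte order; values are printed as unsigned.
decode_le32() {
  # Turn dashes into spaces so every byte becomes its own word.
  set -- $(printf '%s' "$1" | tr '-' ' ')
  while [ "$#" -ge 4 ]; do
    printf '%d\n' "$(( 0x$1 | (0x$2 << 8) | (0x$3 << 16) | (0x$4 << 24) ))"
    shift 4
  done
}

# Bytes copied from the GetParam() dump in the log; prints 64, 32, 64.
decode_le32 "40-00 00-00 20-00 00-00 40-00 00-00"
```

So the test input carries the expected lengths, while the reported `Actual` value of -1782689792 appears for all three dimensions.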

@dongahn
Contributor

dongahn commented May 17, 2017

Thanks. This seems like something that @simoatze should look into. @rolandschulz: how should we reproduce these failures?

@rolandschulz
Author

I provided the git, cmake, and ninja commands in my first message. They should let you reproduce the error. If you have difficulty reproducing it, I'm happy to help.

@simoatze
Member

@rolandschulz thanks for reporting these issues. I'll look into it as soon as I can and get back to you.

@jprotze jprotze added the bug label May 18, 2017