Incorrect results with Archer #42

rolandschulz · 2017-05-17T16:32:13Z

With Archer (dc4e363) built out of source with LLVM 4.0 and OMP-TR4, running the GROMACS unit tests:

git init gromacs && cd gromacs
git fetch https://gerrit.gromacs.org/gromacs refs/changes/48/6648/1 && git checkout FETCH_HEAD
mkdir build && cd build
CC=clang-archer CXX=clang-archer++ cmake -GNinja -DGMX_OPENMP_MAX_THREADS=256 -DGMX_BUILD_HELP=OFF -DBUILD_SHARED_LIBS=yes -DCMAKE_BUILD_TYPE=RelWithDebInfo -DGMX_HWLOC=no .. -DGMX_SIMD=None -DTMPI_ATOMICS_DISABLED=yes -DGMX_MPI=on
ninja check

Produces:

The following tests FAILED:
          7 - EwaldUnitTests (Failed)
         12 - MdrunUtilityMpiUnitTests (Failed)
         23 - CorrelationsTest (Failed)

When compiling and running with clang 4.0 without Archer (with or without OMP-TR4), all tests pass. All unit tests also pass with VS2015, ICC 16 & 17, GCC 4.8-7.1, and a few other compilers we check less regularly. Thus it is highly unlikely that the unit test failures are caused by problems in the GROMACS source code.


dongahn · 2017-05-17T16:49:36Z

@rolandschulz: thank you for using Archer on your build-and-test system.

Could you describe more specifically how those unit tests failed? A common cause of something like this in our environment is that Archer/TSan detects an error (directly in the unit test code or in some other tester component) and causes the test to exit with return code 66. See https://github.com/google/sanitizers/wiki/ThreadSanitizerFlags

Could you try setting the environment variable TSAN_OPTIONS="exitcode=0" before you run these unit tests and see if this makes any difference?
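
A minimal sketch of that suggestion, assuming a POSIX shell started in the GROMACS build directory from the original report. The test binary path is the one quoted later in this thread; the existence check around it is only an illustrative guard, not part of any GROMACS or Archer tooling.

```shell
# Make TSan/Archer reports non-fatal for the process exit status.
# Without this, TSan exits with code 66 when it reports an error.
export TSAN_OPTIONS="exitcode=0"

# Assumed location of one failing test binary, relative to the build directory.
test_bin=./bin/ewald-test

if [ -x "$test_bin" ]; then
    "$test_bin"
    echo "exit status: $?"
else
    echo "skipping: $test_bin not found (run this from the build directory)"
fi
```

If the tests still fail with this setting, the failures come from the test assertions themselves rather than from TSan's exit code.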

rolandschulz · 2017-05-17T17:48:21Z

All 3 failures are incorrect results, not TSan errors. All 3 also still occur with OMP_NUM_THREADS=1. Thus it seems that the LLVM pass added by Archer causes incorrect results. When compiling and running all unit tests with clang 4.0 with TSan (without Archer), all unit tests give the correct result (some tests produce false-positive TSan warnings, but those are different tests from the ones that fail with Archer). For example, running the first failing test by itself gives:

./bin/ewald-test --gtest_filter=SaneInput1/PmeBSplineModuliCorrectnessTest.ReproducesValues/0
Note: Google Test filter = SaneInput1/PmeBSplineModuliCorrectnessTest.ReproducesValues/0
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from SaneInput1/PmeBSplineModuliCorrectnessTest
[ RUN      ] SaneInput1/PmeBSplineModuliCorrectnessTest.ReproducesValues/0
../src/testutils/refdata.cpp:900: Failure
  In item: /X/Length
   Actual: '-1782689792'
Reference: '64'
Google Test trace:
../src/gromacs/ewald/tests/pmebsplinetest.cpp:95: Testing B-spline moduli creation (plain) for PME order 3, grid size 64 32 64
../src/testutils/refdata.cpp:900: Failure
  In item: /Y/Length
   Actual: '-1782689792'
Reference: '32'
Google Test trace:
../src/gromacs/ewald/tests/pmebsplinetest.cpp:95: Testing B-spline moduli creation (plain) for PME order 3, grid size 64 32 64
../src/testutils/refdata.cpp:900: Failure
  In item: /Z/Length
   Actual: '-1782689792'
Reference: '64'
Google Test trace:
../src/gromacs/ewald/tests/pmebsplinetest.cpp:95: Testing B-spline moduli creation (plain) for PME order 3, grid size 64 32 64
[  FAILED  ] SaneInput1/PmeBSplineModuliCorrectnessTest.ReproducesValues/0, where GetParam() = (12-byte object <40-00 00-00 20-00 00-00 40-00 00-00>, 3, 4-byte object <00-00 00-00>) (52 ms)
[----------] 1 test from SaneInput1/PmeBSplineModuliCorrectnessTest (55 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (56 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SaneInput1/PmeBSplineModuliCorrectnessTest.ReproducesValues/0, where GetParam() = (12-byte object <40-00 00-00 20-00 00-00 40-00 00-00>, 3, 4-byte object <00-00 00-00>)

 1 FAILED TEST

dongahn · 2017-05-17T17:55:32Z

Thanks. This seems like something that @simoatze should look into. @rolandschulz: how should we reproduce these failures?

rolandschulz · 2017-05-17T18:36:31Z

I provided the git, cmake, and ninja commands in my first message. That should let you reproduce the error. If you have difficulty reproducing it, I'm happy to help.

simoatze · 2017-05-17T18:37:44Z

@rolandschulz thanks for reporting these issues. I'll look into it as soon as I can and get back to you.

jprotze added the bug label May 18, 2017
