Instability of some numerical results (condor vs CI) #200
Comments
After #198 the test of …
This also seems relevant: #60
Could you clarify where the failure occurred? Naively I would have thought that the platform on GitHub CI should be near identical between runs, so tests should not fail randomly there. Results from condor could still be different, though.
We're seeing it here in #204 with millipede_wilks.
Following up: the issue in #204 is not related to this after all.
Some bookkeeping of sporadic test failures, cause still unknown: https://github.com/icecube/skymap_scanner/actions/runs/6180290095/job/16776550606
The spline tables for all these cases are obtained using …
This is a good idea if we continue to rely on remote storage. What do you think with regard to #166?
I think the containers can be dataless and cvmfs-less if we really wanted, but in that case we should ensure the file transfer mechanism is robust. I don't think I have run into such issues on sub-2, but that could be because it's using the spline tables on cvmfs rather than over http.
I believe issues in downloads should result in truncated files rather than corrupted data, but a checksum is definitely a good idea.
I've seen corrupted data without truncation when transferring files. The probability of a bit flip that passes the TCP checksum is low, but it does happen when you move around enough bytes. Note that this does not happen with CVMFS, since it uses checksums internally. The easiest thing to do is to put a checksum file next to the file you want to download and, if it's available, perform the checksum test.
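A minimal sketch of that sidecar-checksum idea in Python, assuming a SHA-256 digest is published next to each downloaded file; the paths and file names below are placeholders, not files the project actually ships:

```python
import hashlib
from pathlib import Path


def verify_sidecar_checksum(data_path: Path, chunk_size: int = 1 << 20) -> bool:
    """Return True if data_path matches the digest in <name>.sha256 (or no sidecar exists)."""
    sidecar = data_path.with_name(data_path.name + ".sha256")
    if not sidecar.exists():
        return True  # nothing published to check against

    # sha256sum-style sidecar: "<hex digest>  <filename>"
    expected = sidecar.read_text().split()[0].strip().lower()

    digest = hashlib.sha256()
    with data_path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected


# Hypothetical usage after a wget/http download:
# assert verify_sidecar_checksum(Path("/tmp/splines/some_table.fits")), "corrupted or truncated download"
```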
The numerical issues in the tests are quite confounding. They seem to oscillate between two results for millipede_wilks, e.g. here and here. Basically, for what appears to be the same seed particle, the millipede unfolded particle can be different. Checking the OS indicates the CI runners are on the same OS (ubuntu_amd64). This doesn't rule out data corruption over wget, but the behaviour doesn't appear completely random either.
It looks like numpy picks up avx512 on some processors but not on others, which could explain why splinempe and wilks are the ones that fail.
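One way to confirm what each runner actually detects (a sketch; `__cpu_features__` is a private numpy attribute whose location depends on the numpy version, so treat this as a best-effort probe):

```python
import numpy as np

# numpy >= 1.24 can report this directly; it prints the SIMD extensions
# found / not found at runtime (AVX2, AVX512F, AVX512_SKX, ...).
if hasattr(np, "show_runtime"):
    np.show_runtime()

# The underlying detection table moved between numpy 1.x and 2.x.
try:
    from numpy.core._multiarray_umath import __cpu_features__   # numpy 1.x
except (ImportError, AttributeError):
    from numpy._core._multiarray_umath import __cpu_features__  # numpy 2.x

print({name: on for name, on in __cpu_features__.items() if name.startswith("AVX512")})
```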
More testing indicates it's not something in numpy/python/seeding but in the minimization/fitting with millipede (and possibly splinempe?). See the discussion on Slack for comparisons.
Possibly relevant: numpy/numpy#23523
That reminds me that you can force different optimizations with OpenBLAS: one setting for avx2 and another for avx (see the sketch below).
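The exact commands are not spelled out above, but assuming the mechanism is OpenBLAS's core-type override (which matches the Haswell/Sandybridge names used in the following comments), a sketch would look like:

```python
import os

# OPENBLAS_CORETYPE is read when OpenBLAS is loaded, so it has to be set
# before the first numpy import in the process (or exported in the shell /
# condor submit environment before launching python).
#   Haswell     -> avx2 kernels
#   Sandybridge -> avx kernels
#   SkylakeX    -> avx512 kernels
os.environ["OPENBLAS_CORETYPE"] = "Haswell"  # force avx2

import numpy as np  # noqa: E402  -- imported after setting the env var on purpose
```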
That did it for avx2, and it matches what I'm seeing on NPX. Setting it to Sandybridge does not recover what I get on cobalt6, though.
Testing on an AMD chip with avx512 indicates that its default is equivalent to Haswell, which is avx2.
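To double-check which kernel set OpenBLAS actually selected on a given machine, here is a sketch assuming numpy is linked against OpenBLAS and the threadpoolctl package is installed:

```python
from pprint import pprint

import numpy as np  # make sure OpenBLAS gets loaded
from threadpoolctl import threadpool_info

# For OpenBLAS, threadpoolctl reports the detected (or forced) core under
# "architecture" (e.g. "Haswell", "SkylakeX", "Zen"), alongside the version.
pprint([info for info in threadpool_info() if info.get("internal_api") == "openblas"])
```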
Testing a variety of OpenBLAS flags on Cobalt06 …
As we have been discussing for a while, it seems that the numerical results of millipede are sometimes not stable across runs on different platforms, in spite of containerisation.
While I am not sure there is an "issue" to solve, I would like to track some observations about this behaviour here. Updates to come.