
Issue when running multiple servers in parallel #212

Closed
tianluyuan opened this issue Aug 24, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@tianluyuan
Contributor

Possibly related to #200, but I think this is a different bug.

When running a bunch of parallel servers manually with xargs -P, I am finding that a large fraction of them often fail badly. When I rerun a failed scan by itself, it gives a much more reasonable result.

When I dig into the results a bit, I find that the parallel scan gives some pixels unreasonably low llhs (sometimes 0). Comparing against the rerun, standalone scan, some pixels are identical (e.g. pixel 3 below), but the parallel run yields substantially lower llh values for others (e.g. for pixel 0, the llh in y.result is lower than in x.result below).

In [27]: y.result['nside-8'][:3]
Out[27]: 
array([(0, 3825.3995561 , 235051.68226907, 239625.45479989),
       (3, 4585.14457398,  96840.76958162, 103768.83013312),
       (4, 3824.43218579, 236898.32821561, 243579.87750991)],
      dtype=[('index', '<i8'), ('llh', '<f8'), ('E_in', '<f8'), ('E_tot', '<f8')])

In [28]: x.result['nside-8'][:3]
Out[28]: 
array([(0, 4585.16535784,   96293.03433672,  101469.4216979 ),
       (1, 4678.94490939, 1307900.34804634, 1307900.34804634),
       (3, 4585.14457398,   96840.76958162,  103768.83013312)],
      dtype=[('index', '<i8'), ('llh', '<f8'), ('E_in', '<f8'), ('E_tot', '<f8')])
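The mismatch can be checked programmatically on the pixel indices common to both scans. A minimal sketch, using the values printed above; `y_res` and `x_res` are stand-ins for `y.result['nside-8']` and `x.result['nside-8']`:

```python
import numpy as np

# Structured dtype matching the scan results shown above.
dt = [('index', '<i8'), ('llh', '<f8'), ('E_in', '<f8'), ('E_tot', '<f8')]

# Stand-ins for y.result['nside-8'] and x.result['nside-8'],
# filled with the values from the output above.
y_res = np.array([(0, 3825.3995561, 235051.68226907, 239625.45479989),
                  (3, 4585.14457398, 96840.76958162, 103768.83013312),
                  (4, 3824.43218579, 236898.32821561, 243579.87750991)], dtype=dt)
x_res = np.array([(0, 4585.16535784, 96293.03433672, 101469.4216979),
                  (1, 4678.94490939, 1307900.34804634, 1307900.34804634),
                  (3, 4585.14457398, 96840.76958162, 103768.83013312)], dtype=dt)

# Compare llh values on the pixel indices present in both scans.
common = np.intersect1d(y_res['index'], x_res['index'])
for idx in common:
    ly = y_res[y_res['index'] == idx]['llh'][0]
    lx = x_res[x_res['index'] == idx]['llh'][0]
    if not np.isclose(ly, lx):
        print(f"pixel {idx}: parallel llh {ly:.5f} vs standalone llh {lx:.5f}")
```

On the data above this flags pixel 0 (parallel llh ~3825 vs standalone ~4585) and leaves pixel 3 untouched, since the two scans agree there.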

The fact that a standalone scan produces meaningful results makes me think this is not caused by the reconstruction, but rather by some difference in the data itself. However, it's hard to debug: the condor output files are empty, and the error files do not help in tracking this down.

@tianluyuan tianluyuan added the bug Something isn't working label Aug 24, 2023
@tianluyuan
Contributor Author

Testing with a 10 s stagger between starting the server jobs seems to lead to sane results over all jobs, so it might have something to do with the servers all starting up at once.
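The staggered launch described above can be scripted rather than done by hand. A minimal sketch; `SERVER_CMD` and `launch_staggered` are hypothetical stand-ins, not part of the project's code:

```python
import subprocess
import sys
import time


def launch_staggered(cmd, n_servers, stagger_s=10.0):
    """Launch n_servers copies of cmd, sleeping stagger_s between starts
    so the servers don't all initialize at the same moment."""
    procs = []
    for i in range(n_servers):
        procs.append(subprocess.Popen(cmd))
        if i < n_servers - 1:
            time.sleep(stagger_s)
    # Wait for every server and return their exit codes.
    return [p.wait() for p in procs]


# Hypothetical stand-in for the real server command line.
SERVER_CMD = [sys.executable, "-c", "print('server up')"]
```

The default `stagger_s=10.0` mirrors the 10 s delay that gave sane results in the test above; unlike `xargs -P`, this serializes only the startups, not the runs.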

@ric-evans
Member

I'm not sure how you were running the servers, but since we run servers in isolated containers both for testing and in skydriver, there may be unknown consequences of not doing so. Moving forward, this shouldn't affect skydriver. Closing.
