
Issue when running multiple servers in parallel #212

Closed
tianluyuan opened this issue Aug 24, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@tianluyuan
Contributor

Possibly related to #200, but I think this is a different bug.

When running a bunch of parallel servers manually with xargs -P, I am finding that a large fraction of them often fail badly. When I rerun a failed scan by itself, it gives a much more reasonable result.

When I dig into the results a bit, I find that the parallel scan gives some pixels unreasonably low llhs (sometimes 0). Comparing against the rerun, standalone scan, some pixels are identical (e.g. pixel 3 below), but the parallel run yields substantially lower llh values for others (e.g. for pixel 0, the llh in y.result is lower than in x.result below).

In [27]: y.result['nside-8'][:3]
Out[27]: 
array([(0, 3825.3995561 , 235051.68226907, 239625.45479989),
       (3, 4585.14457398,  96840.76958162, 103768.83013312),
       (4, 3824.43218579, 236898.32821561, 243579.87750991)],
      dtype=[('index', '<i8'), ('llh', '<f8'), ('E_in', '<f8'), ('E_tot', '<f8')])

In [28]: x.result['nside-8'][:3]
Out[28]: 
array([(0, 4585.16535784,   96293.03433672,  101469.4216979 ),
       (1, 4678.94490939, 1307900.34804634, 1307900.34804634),
       (3, 4585.14457398,   96840.76958162,  103768.83013312)],
      dtype=[('index', '<i8'), ('llh', '<f8'), ('E_in', '<f8'), ('E_tot', '<f8')])
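The mismatch can be checked programmatically on the pixel indices common to both scans. A minimal sketch, using the values printed above; `y_res` and `x_res` are stand-ins for `y.result['nside-8']` and `x.result['nside-8']`:

```python
import numpy as np

# Structured dtype matching the scan results shown above.
dt = [('index', '<i8'), ('llh', '<f8'), ('E_in', '<f8'), ('E_tot', '<f8')]

# Stand-ins for y.result['nside-8'] and x.result['nside-8'],
# filled with the values from the output above.
y_res = np.array([(0, 3825.3995561, 235051.68226907, 239625.45479989),
                  (3, 4585.14457398, 96840.76958162, 103768.83013312),
                  (4, 3824.43218579, 236898.32821561, 243579.87750991)], dtype=dt)
x_res = np.array([(0, 4585.16535784, 96293.03433672, 101469.4216979),
                  (1, 4678.94490939, 1307900.34804634, 1307900.34804634),
                  (3, 4585.14457398, 96840.76958162, 103768.83013312)], dtype=dt)

# Compare llh values on the pixel indices present in both scans.
common = np.intersect1d(y_res['index'], x_res['index'])
for idx in common:
    ly = y_res[y_res['index'] == idx]['llh'][0]
    lx = x_res[x_res['index'] == idx]['llh'][0]
    if not np.isclose(ly, lx):
        print(f"pixel {idx}: parallel llh {ly:.5f} vs standalone llh {lx:.5f}")
```

On the data above this flags pixel 0 (parallel llh ~3825 vs standalone ~4585) and leaves pixel 3 untouched, since the two scans agree there.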

The fact that a standalone scan produces meaningful results makes me think this is not caused by the reconstruction, but rather by some difference in the data itself. However, it's hard to debug: the condor output files are empty, and the error files do not help in tracking this down.

@tianluyuan tianluyuan added the bug Something isn't working label Aug 24, 2023
@tianluyuan
Contributor Author

Testing with a 10 s stagger between starting the server jobs seems to lead to sane results over all jobs, so it might have something to do with the servers all starting up at once.
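The staggered launch described above can be scripted rather than done by hand. A minimal sketch; `SERVER_CMD` and `launch_staggered` are hypothetical stand-ins, not part of the project's code:

```python
import subprocess
import sys
import time


def launch_staggered(cmd, n_servers, stagger_s=10.0):
    """Launch n_servers copies of cmd, sleeping stagger_s between starts
    so the servers don't all initialize at the same moment."""
    procs = []
    for i in range(n_servers):
        procs.append(subprocess.Popen(cmd))
        if i < n_servers - 1:
            time.sleep(stagger_s)
    # Wait for every server and return their exit codes.
    return [p.wait() for p in procs]


# Hypothetical stand-in for the real server command line.
SERVER_CMD = [sys.executable, "-c", "print('server up')"]
```

The default `stagger_s=10.0` mirrors the 10 s delay that gave sane results in the test above; unlike `xargs -P`, this serializes only the startups, not the runs.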

@ric-evans
Member

I'm not sure how you were running the servers, but since we run servers in isolated containers both for testing and in skydriver, there may be unknown consequences of not doing so. Moving forward, this shouldn't affect skydriver. Closing.
