Possibly related to #200, but I think this is a different bug.
When running a bunch of parallel servers manually with `xargs -P`, I find that a large fraction of the scans often fail badly. When I rerun a failed scan by itself, it gives a much more reasonable result.
When I dig into the results a bit, I find that the parallel scan gives some pixels with unreasonably low llhs (sometimes 0). Comparing against the rerun, standalone scan, I see that some pixels are identical (e.g. pixel `3` below), but for others the parallel run yields a substantially lower llh value (e.g. pixel `0` is lower for `y.result` than for `x.result` below).
The fact that a standalone scan yields meaningful results makes me think this is not caused by the reconstruction, but that there may instead be some differences in the data itself. However, it's hard to debug because the condor output files are empty and the error files do not help in tracking this down.
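The pixel-by-pixel comparison between a parallel run and a standalone rerun can be sketched roughly as follows. This is only an illustration: it assumes each result has already been loaded into a plain dict mapping pixel id to llh, and the function name `compare_llh` and the example values are hypothetical, not part of the scanner's API or taken from real output.

```python
import math


def compare_llh(standalone, parallel, rel_tol=1e-9):
    """Classify pixels by how their llh differs between two runs.

    Both arguments are assumed to be dicts of {pixel_id: llh}; how the
    result files are actually read is outside this sketch.
    """
    report = {
        "identical": [],
        "lower_in_parallel": [],
        "higher_in_parallel": [],
        "missing": [],
    }
    for pixel in sorted(set(standalone) | set(parallel)):
        # A pixel present in only one run cannot be compared.
        if pixel not in standalone or pixel not in parallel:
            report["missing"].append(pixel)
            continue
        a, b = standalone[pixel], parallel[pixel]
        if math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0):
            report["identical"].append(pixel)
        elif b < a:
            report["lower_in_parallel"].append(pixel)
        else:
            report["higher_in_parallel"].append(pixel)
    return report


if __name__ == "__main__":
    # Made-up values for illustration only.
    standalone = {0: 120.5, 3: 98.2}
    parallel = {0: 0.0, 3: 98.2}
    print(compare_llh(standalone, parallel))
```

Listing the pixels that are lower in the parallel run, rather than only counting failed scans, may help show whether the bad values cluster on particular pixels or servers.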
Testing with a 10 s stagger between server job starts seems to lead to sane results across all jobs, so it might have something to do with the servers all starting up at once.
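A staggered launch like the one described above could look roughly like this. It is a minimal sketch, not the setup actually used: the helper names `launch_staggered` and `wait_all` are hypothetical, and the server command lines are left to the caller since the issue does not show them. The `sleep` and `popen` parameters exist only so the timing can be swapped out when testing.

```python
import subprocess
import time


def launch_staggered(commands, stagger_seconds=10.0,
                     sleep=time.sleep, popen=subprocess.Popen):
    """Start each command as its own process, pausing between starts.

    `commands` is a list of argument lists, e.g. [["prog", "arg"], ...].
    All processes still run concurrently; only their start times are
    spread out by `stagger_seconds`.
    """
    procs = []
    for index, cmd in enumerate(commands):
        if index > 0:
            # Wait before every start except the first one.
            sleep(stagger_seconds)
        procs.append(popen(cmd))
    return procs


def wait_all(procs):
    """Block until every process exits and return the exit codes in order."""
    return [proc.wait() for proc in procs]
```

Unlike a plain `xargs -P`, this keeps the jobs parallel while avoiding a simultaneous startup, which is the condition the stagger test points at.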
I'm not sure how you were running the servers, but since we run servers in isolated containers for testing and in skydriver, there may be unknown consequences of not doing so. Moving forward, this shouldn't affect skydriver. Closing.