Merging simulation files loses the information on total number of simulated events #177
Do you mean the …
Hi @Elisa-Visentin, in lstchain the `num_showers` parameter is stacked during merging, which should be correct since it indeed represents the total number of individual events simulated. Another possibility would be to store the simulation config of all files instead of only the first one; I saw that done somewhere once (forgot when).
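A minimal sketch of the stacking idea, assuming the simulation config can be treated as a plain dict (the key names `num_showers` and `energy_range_min` are illustrative, not the exact lstchain schema): sum `num_showers` and require every other entry to be identical across files.

```python
# Hypothetical sketch, not the actual lstchain merge code.

def merge_sim_configs(configs):
    """Sum num_showers across files; require all other keys to match."""
    merged = dict(configs[0])
    for cfg in configs[1:]:
        for key, value in cfg.items():
            if key == "num_showers":
                merged["num_showers"] += value
            elif merged.get(key) != value:
                # Refuse to silently merge inconsistent productions
                raise ValueError(f"Inconsistent simulation config key: {key}")
    return merged

configs = [
    {"num_showers": 2000, "energy_range_min": 0.005},
    {"num_showers": 1000, "energy_range_min": 0.005},
]
print(merge_sim_configs(configs)["num_showers"])  # 3000
```

Raising on any mismatch (rather than averaging) sidesteps the weighting problem discussed below, at the cost of refusing to merge heterogeneous productions.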
If we stack `num_showers` while merging (by summing over the files, or by multiplying by the number of files), we should also update the other information in the simulation config (i.e., average or sum the other parameters over the files): keeping all parameters but one equal to those of the first file, while stacking the number of showers, could be misleading (in particular if you use processed data without knowing the code details). I am not sure about storing one line per file (I should check the whole code: the file would then have to be read appropriately everywhere). BTW, you must merge at DL1, or the code changes will be useless (maybe worth adding to the docs). And, depending on the chosen parameters, some runs might not survive even the DL0→DL1 step 🤔 (so you would need to count the number of simtel files).
"we should also change the other information in the simulation config" — if we want other settings to change, I think the only option would be to store all the (unique) configs (while stacking `num_showers`). But that would also require finding a way to tag the events associated with a given config, maybe by adding the obs_id associated with each. This is considerably more complex, though, and maybe not that useful.
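A rough sketch of the "store all unique configs and tag events by obs_id" idea, using plain dicts with illustrative key names (not the real file schema): deduplicate the per-file configs and keep a small obs_id → config-index map.

```python
# Hypothetical sketch: deduplicate per-file simulation configs and
# record which config each obs_id belongs to.

def build_config_index(per_file_configs):
    """per_file_configs: list of (obs_id, config_dict) pairs.

    Returns (unique_configs, obs_id_to_config_index).
    """
    unique_configs = []
    obs_id_to_config = {}
    for obs_id, cfg in per_file_configs:
        # frozenset of items gives a hashable identity for a flat config
        key = frozenset(cfg.items())
        for i, stored in enumerate(unique_configs):
            if frozenset(stored.items()) == key:
                obs_id_to_config[obs_id] = i
                break
        else:
            obs_id_to_config[obs_id] = len(unique_configs)
            unique_configs.append(cfg)
    return unique_configs, obs_id_to_config

pairs = [
    (1, {"energy_range_min": 0.005}),
    (2, {"energy_range_min": 0.005}),
    (3, {"energy_range_min": 0.010}),  # a run with a different config
]
uniq, index = build_config_index(pairs)
print(len(uniq), index[3])  # 2 1
```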
Right, we would need to either do that or keep files with 0 events...
To evaluate sensitivity you use a (set of) merged file(s), one per pointing direction. I think the per-run obs_id information is not needed, because you will use all the runs (so it does not matter whether run 1 has 2000 showers and run 2 has 1000: after merging you have 3000 simulated showers). I should check the configuration entries: if none of them (except `num_showers`) depends on the run or on the number of runs, we can update `num_showers` while iterating over the files to store the total number of simulated showers; otherwise we can do the same but update the other parameters too (e.g., if stored, the area over which showers are simulated, which could be an average over the runs). As for files with 0 events, we should check what happens in the merge (are they skipped? does anything crash?). We should also evaluate the required disk space and number of files: empty tables will hopefully not be huge, but it may not be worth storing them. Alternatively, we could create (somewhere in the analysis folder) a summary file recording a few numbers during the analysis (number of DL0 runs, runs lost to cuts, runs lost to bugs/crashes...), or simply count the number of DL0 files when needed (this can change over time, so we should perhaps use the number of DL1 log files instead; a manual, tedious solution).
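A very small sketch of the counting alternative: instead of storing empty tables, count the surviving files per data level by globbing the analysis tree. The directory layout and file extensions below are assumptions, not the actual production layout.

```python
# Hypothetical bookkeeping sketch; adapt paths/extensions to the real
# analysis folder structure.
from pathlib import Path

def stage_counts(analysis_dir):
    """Count files per data level under an analysis folder."""
    base = Path(analysis_dir)
    return {
        "dl0": len(list(base.glob("DL0/*.simtel.gz"))),
        "dl1": len(list(base.glob("DL1/*.h5"))),
        "dl1_logs": len(list(base.glob("DL1/logs/*.log"))),
    }
```

Comparing `dl0` against `dl1` (or against the DL1 log count, which is more stable over time) directly gives the number of runs lost before merging.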
Yes, I would really avoid averaging, as it would lead to incorrect relative weights for events simulated under different conditions: e.g., if the energy range changes between files, events with energy below the averaged minimum would be wrongly ignored. At that point it would be preferable, in my opinion, to keep the files separate. I don't see a perfect solution for handling 0-event files (especially if they occur before DL1), except for an additional log file/database keeping tabs on the number of successfully simulated simtel files and on the transitions to higher data levels.
The current HDF5 file merging script copies the simulation information of only the first file (and assumes it is consistent with the following files). This can be error-prone, but it is coherent with current productions.
Then, at usage time, the number of simulation files is needed to correctly weight the MC data. It is therefore extracted from the event list by counting unique `obs_id` values.
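A toy illustration of the failure mode, using plain dicts instead of the real DL1 tables (the `obs_id` and `intensity` fields are assumptions in the spirit of the pipeline): a quality cut that removes every event of one obs_id makes the unique-obs_id count, and hence the MC weighting, miss one simulated file.

```python
# Toy event list: obs_id 2 has a single faint event that a quality cut
# will remove entirely.
events = [
    {"obs_id": 1, "intensity": 120.0},
    {"obs_id": 1, "intensity": 80.0},
    {"obs_id": 2, "intensity": 30.0},
    {"obs_id": 3, "intensity": 200.0},
]

n_files_before = len({e["obs_id"] for e in events})
surviving = [e for e in events if e["intensity"] > 50]  # quality cut
n_files_after = len({e["obs_id"] for e in surviving})   # obs_id 2 is gone

print(n_files_before, n_files_after)  # 3 2
```

Weighting with `n_files_after` instead of the true 3 files overestimates the effective simulated statistics per file, biasing the result.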
This can lead to underestimated results if the number of events per simulation file is low (so that cuts remove some obs_ids entirely).
Example: processing of Helium from this production: /fefs/aswg/data/mc/DL0/LSTProd2/TestDataset/Helium/sim_telarray/node_corsika_theta_68.284_az_3*
More than 10% of obs_ids are removed during the DL1→DL1_stereo step when an 'intensity > 50 p.e.' quality cut is applied.