Alternative variable binning approach #849

JanWeldert · 2025-02-14T13:13:40Z

Similar to #835, this PR introduces the option to use different (regular) binnings within one analysis.
Which binning an event uses depends on a separate variable called cut_var. This can be, for example, the pid value, but also the number of hit modules.
I tried to modify as little code as possible while still providing all changes necessary to use the new binning type. An example notebook is also provided. The PR introduces a new binning class, VarBinning, which essentially holds multiple MultiDimBinning objects plus one OneDimBinning that represents the cut variable. The main change when using VarBinning is that histogramming no longer happens in a dedicated stage but in the output function of the pipeline. Consequently, a pipeline using VarBinning cannot have a hist stage.
A VarBinning is defined by passing a list in the binning config file.
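
To make the idea concrete, here is a minimal NumPy sketch of "one binning per cut-variable bin". It is not the PISA implementation and does not use VarBinning, MultiDimBinning or OneDimBinning; the function `split_and_histogram`, the dict-of-arrays event layout and the variable names are assumptions made only for illustration.

```python
import numpy as np


def split_and_histogram(events, cut_var, cut_edges, binnings):
    """Histogram events with a different binning per cut-variable bin.

    events    : dict of equally long 1D arrays, including "weights"
    cut_var   : key of the variable that selects the binning
    cut_edges : edges of the cut-variable bins (length n + 1)
    binnings  : list of n dicts mapping variable name -> bin edges
    (hypothetical layout, chosen for this sketch only)
    """
    if len(binnings) != len(cut_edges) - 1:
        raise ValueError("need exactly one binning per cut-variable bin")
    hists = []
    for i, binning in enumerate(binnings):
        lo, hi = cut_edges[i], cut_edges[i + 1]
        # Half-open intervals [lo, hi): no event is selected twice.
        mask = (events[cut_var] >= lo) & (events[cut_var] < hi)
        sample = np.column_stack([events[name][mask] for name in binning])
        hist, _ = np.histogramdd(
            sample,
            bins=[binning[name] for name in binning],
            weights=events["weights"][mask],
        )
        hists.append(hist)
    return hists


events = {
    "pid": np.array([0.1, 0.2, 0.7, 0.9, 0.6]),
    "reco_energy": np.array([5.0, 20.0, 8.0, 50.0, 30.0]),
    "reco_coszen": np.array([-0.5, 0.5, -0.2, 0.3, 0.8]),
    "weights": np.ones(5),
}
cut_edges = [0.0, 0.5, 1.0]
binnings = [
    # pid in [0, 0.5): coarse, energy only
    {"reco_energy": np.array([1.0, 10.0, 100.0])},
    # pid in [0.5, 1): finer in energy, plus a zenith dimension
    {
        "reco_energy": np.array([1.0, 10.0, 30.0, 100.0]),
        "reco_coszen": np.array([-1.0, 0.0, 1.0]),
    },
]
hists = split_and_histogram(events, "pid", cut_edges, binnings)
```

The result is a list with one histogram per cut-variable bin, each with its own shape, which mirrors why the pipeline output becomes a list rather than a single object.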

thehrh · 2025-02-20T12:46:05Z

Before doing a more detailed review, let me try to summarise a few key aspects of this PR:

- One binning dimension can now also serve as the variable that determines which binning is used in the remaining dimensions. Each binning is still always hyperrectangular unless masking is used (same syntax as dependent binnings: one universal mask or a list of masks). It seems unlikely that more than one such distinguished dimension will be required in practice.
- In contrast to Introducing support for variable binning with event classes (species) #835, we are not introducing yet another events class and need no "event species names" (this role is played by the bins in the distinguished dimension).
- The modifications of parse_pipeline_config, allowing it to instantiate the new binning_dict entry, are moderate in both cases.
- Container finally implements the "sum" translation mode for histogramming events (low effort, uses the pre-existing array_to_binned method).
- This mode replaces a pipeline's utils.hist service when a VarBinning is set as the output_binning:
  - All events are jointly processed by the pipeline. Only at the end of _get_outputs are they split into separate ContainerSets, one per bin in the distinguished dimension, and histogrammed (using the now built-in functionality) according to the appropriate MultiDimBinning in the remaining dimensions.
  - Accordingly, a list of MapSets is returned, requiring minor modifications to the DistributionMaker class.
- A major benefit with respect to Introducing support for variable binning with event classes (species) #835 is that it removes the need for many invasive and often boilerplate changes to Map functions, whose thoroughness only emphasises the large maintenance burden that comes with them.
  - Possibly, if the binning had been more flexible from the start, the class would have been integrated into Map like this.
  - But the solution in this PR shows that a Map needn't be aware of this type of variable binning in order to conduct a statistical analysis. Instead, only a few simple additions to the Analysis class are required, similar to what we already have there for distinguishing between a DistributionMaker and a Detectors instance.
  - The drawback is of course that users who are, e.g., interactively computing metric values will have to perform the sum manually, in the same way as fit_recursively does.
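
The manual sum mentioned in the last point can be sketched as follows. This is only an illustration of why summing works for bin-wise additive metrics such as a chi-square; it is not the code in fit_recursively, and the helper names `chi2` and `total_metric`, as well as the use of plain NumPy arrays in place of MapSets, are assumptions.

```python
import numpy as np


def chi2(expected, observed):
    """Pearson chi-square summed over all bins of one histogram."""
    expected = np.asarray(expected, dtype=float)
    observed = np.asarray(observed, dtype=float)
    filled = expected > 0  # skip bins with zero expectation to avoid 0/0
    diff = observed[filled] - expected[filled]
    return float(np.sum(diff**2 / expected[filled]))


def total_metric(expected_list, observed_list, metric=chi2):
    """Sum an additive metric over a list of differently binned histograms."""
    if len(expected_list) != len(observed_list):
        raise ValueError("both lists need one histogram per binning")
    return sum(metric(e, o) for e, o in zip(expected_list, observed_list))


# One histogram per bin of the distinguished dimension; shapes may differ.
expected_list = [
    np.array([4.0, 9.0]),
    np.array([[1.0, 4.0], [16.0, 25.0]]),
]
observed_list = [
    np.array([6.0, 6.0]),
    np.array([[1.0, 6.0], [16.0, 20.0]]),
]
total = total_metric(expected_list, observed_list)
```

Because each event lands in exactly one histogram, the per-histogram contributions are independent and can simply be added; a metric that is not a sum over bins would need separate treatment.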

In conclusion, I am highly in favour of the approach proposed in this PR over that in #835.

marialiubarska

Hi, sorry for being very late to this discussion. I think the code looks good and it is a good option for a variable binning with one "split" variable.

I understand that #835 introduces a lot of changes. While I tried to avoid affecting existing functionality as much as possible, I will fully understand if people don't feel comfortable pushing it to main and prefer adding this version instead.

In my case I specifically needed to introduce arbitrary cuts for classification, so I don't think this solution would be suitable for my analysis. However, since I might be the only person who needs this functionality for now, I would not have a problem working in a separate branch.

thehrh · 2025-02-21T10:09:31Z

Hi Maria, a set of n "arbitrary" cuts/selection criteria can be represented by a one-dimensional binning too, can't it? You would just have to evaluate which one of your cuts each event satisfies and add one unique number per such cut to each event in the events file before running the pipeline, then define the one-dimensional split bin edges accordingly. This way, double counting by PISA wouldn't even be a concern (you as analyser would need to make sure the cuts are mutually exclusive anyway, I presume).
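
A rough sketch of that labelling step, in plain NumPy and independent of PISA: the function `label_events`, the example cuts, and the choice of half-integer bin edges are assumptions for illustration, and the sketch additionally assumes that every event passes exactly one cut.

```python
import numpy as np


def label_events(cut_masks):
    """Return one integer label per event from mutually exclusive cuts."""
    masks = np.asarray(cut_masks, dtype=bool)  # shape (n_cuts, n_events)
    n_passed = masks.sum(axis=0)
    if np.any(n_passed != 1):
        raise ValueError("each event must satisfy exactly one cut")
    # Index of the single True entry in each column.
    return masks.argmax(axis=0)


# Hypothetical event variables and selection criteria.
energy = np.array([3.0, 12.0, 40.0, 7.0])
n_hits = np.array([5, 20, 8, 30])
cuts = [
    (energy < 10) & (n_hits < 10),
    (energy < 10) & (n_hits >= 10),
    energy >= 10,
]
labels = label_events(cuts)

# Edges at half-integers, so that label k falls into split bin k.
split_edges = np.arange(len(cuts) + 1) - 0.5
```

The `labels` array would be stored as an extra per-event variable, and `split_edges` would then define the one-dimensional split binning over it.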

JanWeldert added 5 commits January 3, 2025 09:42

Add variable binning and adjust output functions

fcae83c

Force events representation and add histogramming to translations

8c20429

Adjust analysis.py and config parser

35dcf0e

Add example

81c308b

Expand example a bit

eaf6235

JanWeldert requested review from thehrh and marialiubarska February 14, 2025 13:13

fit_hypo does not know hypo_asimov_dist

3215d55

thehrh mentioned this pull request Feb 20, 2025

Stage/service to define different binning for analysis #108

Open

thehrh added this to the PISA 4.2 milestone Feb 20, 2025

marialiubarska approved these changes Feb 21, 2025
