Alternative variable binning approach #849

Draft
wants to merge 6 commits into
base: master
from

Conversation

JanWeldert
Collaborator

Similar to #835, this PR introduces the option to use different (regular) binnings in an analysis.
Which events use which binning depends on a separate variable called cut_var. This can be, for example, the pid value, but also the number of hit modules.
I tried to modify as little code as possible while still providing all changes necessary to use the new binning type. An example notebook is also provided. This PR introduces a new binning class, VarBinning, which basically just holds multiple MultiDimBinning objects and one OneDimBinning that represents the cut variable. The main change when using the VarBinning class is that the histogramming no longer happens in the dedicated stage but in the output function of the pipeline. Consequently, a pipeline using VarBinning cannot have a hist stage.
A VarBinning is defined by passing a list in the binning config file.
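
To make the idea concrete, here is a minimal NumPy-only sketch of what "split on the cut variable, then histogram each subset with its own binning" amounts to. It does not use the PISA API: the function `histogram_by_cut_var`, the dict-of-arrays event format, and the variable names `reco_energy` and `reco_coszen` are assumptions made purely for illustration.

```python
import numpy as np

def histogram_by_cut_var(events, cut_var, cut_edges, binnings):
    """Split events on `cut_var` and histogram each subset with its own binning.

    events    : dict of equal-length 1D arrays, must contain "weights"
    cut_edges : edges of the one-dimensional cut-variable binning
    binnings  : one dict {variable name: bin edges} per cut-variable bin
    (Hypothetical helper, not part of PISA.)
    """
    cut_edges = np.asarray(cut_edges)
    if len(binnings) != len(cut_edges) - 1:
        raise ValueError("need exactly one binning per cut-variable bin")
    hists = []
    for lo, hi, binning in zip(cut_edges[:-1], cut_edges[1:], binnings):
        # Half-open interval [lo, hi): each event lands in at most one subset
        sel = (events[cut_var] >= lo) & (events[cut_var] < hi)
        sample = np.column_stack([events[name][sel] for name in binning])
        hist, _ = np.histogramdd(
            sample, bins=list(binning.values()), weights=events["weights"][sel]
        )
        hists.append(hist)
    return hists

events = {
    "pid": np.array([0.1, 0.2, 0.7, 0.9, 0.95]),
    "reco_energy": np.array([5.0, 20.0, 8.0, 50.0, 70.0]),
    "reco_coszen": np.array([-0.5, -0.2, -0.9, 0.1, -0.1]),
    "weights": np.array([1.0, 2.0, 1.0, 0.5, 0.5]),
}
cut_edges = [0.0, 0.5, 1.0]
binnings = [
    # low pid: energy only
    {"reco_energy": np.array([1.0, 10.0, 100.0])},
    # high pid: energy and zenith
    {"reco_energy": np.array([1.0, 10.0, 100.0]),
     "reco_coszen": np.array([-1.0, 0.0, 1.0])},
]
hists = histogram_by_cut_var(events, "pid", cut_edges, binnings)
```

The result is a list with one histogram per cut-variable bin, and the histograms may differ in shape and even in dimensionality, which is why the output cannot be a single regular map.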

@thehrh
Contributor

thehrh commented Feb 20, 2025

Before doing a more detailed review, let me try to summarise a few key aspects of this PR:

  • one binning dimension can now also serve as the variable on which the binnings in the remaining dimensions depend (these are still always hyperrectangular unless masking is used, with the same syntax as for dependent binnings: one universal mask or a list of masks); it seems unlikely that more than one such distinguished dimension will be required in practice
  • in contrast to #835 (Introducing support for variable binning with event classes (species)), we are not introducing yet another events class and need no "event species names" (this role is played by the bins in the distinguished dimension)
  • the modifications of parse_pipeline_config, allowing it to instantiate the new binning_dict entry, are moderate in both cases
  • Container finally implements the "sum" translation mode for histogramming events (low effort, uses the pre-existing array_to_binned method)
  • this mode replaces a pipeline's utils.hist service when a VarBinning is set as the output_binning
    • all events are jointly processed by the pipeline: only at the end of _get_outputs are they split into separate ContainerSets, one per bin in the distinguished dimension, and histogrammed (using the now built-in functionality) according to the appropriate MultiDimBinning in the remaining dimensions
    • accordingly, a list of MapSets is returned, requiring minor modifications to the DistributionMaker class
  • a major benefit with respect to #835 is that it removes the need for many invasive and often boilerplate changes to Map functions, whose thoroughness only emphasises the large maintenance burden that accompanies them
    • Possibly, if the binning had been more flexible from the start, the class would have been integrated into Map like this.
    • But the solution in this PR shows that a Map needn't be aware of this type of variable binning for conducting statistical analysis. Instead, only fairly few simple additions to the Analysis class are required, similar to what we already have there when we are distinguishing between a DistributionMaker and a Detectors instance.
    • The drawback is of course that users who are e.g. interactively computing metric values will have to manually perform the sum in the same way as fit_recursively does it.
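
For interactive use, the manual sum mentioned in the last point could look roughly like the sketch below. It works on plain NumPy histograms rather than MapSets, and both `pearson_chi2` and `total_metric` are hypothetical helpers; this is not how fit_recursively is implemented, only an illustration of summing a metric over one histogram per split bin.

```python
import numpy as np

def pearson_chi2(expected, observed):
    """Pearson chi-square between two histograms of identical shape."""
    expected = np.asarray(expected, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))

def total_metric(expected_hists, observed_hists, metric=pearson_chi2):
    """Sum a per-histogram metric over all bins of the split variable."""
    if len(expected_hists) != len(observed_hists):
        raise ValueError("expected and observed lists differ in length")
    return sum(metric(e, o) for e, o in zip(expected_hists, observed_hists))

# One histogram per split bin; shapes may differ between split bins.
expected = [np.array([4.0, 9.0]), np.array([[1.0, 4.0], [16.0, 25.0]])]
observed = [np.array([6.0, 9.0]), np.array([[1.0, 6.0], [16.0, 20.0]])]
total = total_metric(expected, observed)
```

Because the metric is additive over bins, summing the per-histogram values gives the same number as evaluating it on one combined set of bins.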

In conclusion, I am highly in favour of the approach proposed in this PR over that in #835.

Contributor

@marialiubarska marialiubarska left a comment

Hi, sorry for being very late to this discussion. I think the code looks good and it is a good option for a variable binning with one "split" variable.

I understand that #835 introduces a lot of changes. While I tried to avoid affecting existing functionality as much as possible, I will fully understand if people don't feel comfortable pushing it to main and prefer adding this version instead.

In my case, I specifically needed to introduce arbitrary cuts for classification, so I don't think this solution would be suitable for my analysis. However, since I might be the only person who needs this functionality for now, I would not have a problem working in a separate branch.

@thehrh
Contributor

thehrh commented Feb 21, 2025

Hi Maria, a set of n "arbitrary" cuts/selection criteria can be represented by a one-dimensional binning too, can't it? You would just have to evaluate which one of your cuts each event satisfies and add one unique number per such cut to each event in the events file before running the pipeline, then define the one-dimensional split bin edges accordingly. This way, double counting by PISA wouldn't even be a concern (you as the analyser would need to make sure the cuts are mutually exclusive anyway, I presume).
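
As a rough sketch of that preprocessing step (plain NumPy, independent of PISA; the helper `label_events`, the example cuts, and the variable names are all made up for illustration):

```python
import numpy as np

def label_events(events, cuts):
    """Return one integer label per event: the index of the cut it passes.

    Events passing no cut get -1. Overlapping cuts raise an error, since
    the labelling is only meaningful for mutually exclusive selections.
    (Hypothetical helper, not part of PISA.)
    """
    masks = np.array([cut(events) for cut in cuts])  # shape (n_cuts, n_events)
    n_passed = masks.sum(axis=0)
    if np.any(n_passed > 1):
        raise ValueError("cuts are not mutually exclusive")
    return np.where(n_passed == 1, masks.argmax(axis=0), -1)

events = {
    "pid": np.array([0.9, 0.3, 0.5, 0.95]),
    "n_hit_modules": np.array([7, 3, 4, 2]),
    "reco_energy": np.array([40.0, 10.0, 60.0, 15.0]),
}
cuts = [
    lambda ev: (ev["pid"] > 0.8) & (ev["n_hit_modules"] >= 5),
    lambda ev: (ev["pid"] <= 0.8) & (ev["reco_energy"] < 30.0),
    lambda ev: (ev["pid"] <= 0.8) & (ev["reco_energy"] >= 30.0),
]
labels = label_events(events, cuts)

# Split-variable bin edges with one bin centred on each integer label;
# the label -1 (no cut passed) lies below the first edge and is excluded.
split_edges = np.arange(len(cuts) + 1) - 0.5
```

The `labels` array would be stored as an extra per-event variable in the events file, and `split_edges` would define the one-dimensional split binning.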


3 participants