
Non-reproducibility in PF GPU clustering in 12834.42[23] workflows #47233

Description (opened by @makortel, Contributor)

The PR GPU tests in #47226 (comment) showed differences in workflow 12834.422 in ParticleFlow/PFClusterV, such as:

[screenshot of RelMon comparison plots]

Workflow 12834.423 showed the same kind of differences in ParticleFlow/PFClusterV, and also differences in ParticleFlow/pfClusterHBHEAlpakaV, such as:

[screenshot of RelMon comparison plots]

Activity

makortel commented on Jan 31, 2025

assign heterogeneous

cmsbuild commented on Jan 31, 2025

New categories assigned: heterogeneous

@fwyzard, @makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented on Jan 31, 2025

cms-bot internal usage

cmsbuild commented on Jan 31, 2025

A new Issue was created by @makortel.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented on Jan 31, 2025

#47227 (comment) showed a different set of differences in 12834.422 ParticleFlow/PFClusterV:

[screenshot of RelMon comparison plots]

jsamudio commented on Jan 31, 2025

I believe the plots in ParticleFlow/pfClusterHBHEAlpakaV look about as expected between the legacy format and the Alpaka version. IIRC, and if the configuration in GitHub is to be believed, this is the comparison it is doing; see slide 7 in http://cds.cern.ch/record/2898660. For the plots in ParticleFlow/PFClusterV, I am not so sure.

makortel commented on Jan 31, 2025

> I believe the plots in ParticleFlow/pfClusterHBHEAlpakaV look about as expected between the legacy format and the Alpaka version.

The problem is that these plots are different between different executions (whatever that meant in the tests of those PRs). The RelMon doesn't seem to show very well what the differences really are for 2D plots, such as pfCluster_RecHitMultiplicity_GPUvsCPU, but at least the number of entries seems to be different (1376 vs 1161).
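
For reference, one quick way to inspect such a 2D monitor element directly, outside of RelMon, is a short ROOT macro along these lines. The file names and the in-file path below are placeholders, not the actual outputs of these PR tests:

// compare_entries.C -- quick ROOT sketch (file names and in-file path are placeholders)
#include "TFile.h"
#include "TH2.h"
#include <cstdio>

void compare_entries() {
  TFile* f1 = TFile::Open("DQM_baseline.root");  // placeholder file names
  TFile* f2 = TFile::Open("DQM_other.root");
  // placeholder in-file path to the monitor element mentioned above
  const char* path = "DQMData/.../ParticleFlow/pfClusterHBHEAlpakaV/pfCluster_RecHitMultiplicity_GPUvsCPU";
  auto* h1 = f1 ? dynamic_cast<TH2*>(f1->Get(path)) : nullptr;
  auto* h2 = f2 ? dynamic_cast<TH2*>(f2->Get(path)) : nullptr;
  if (!h1 || !h2) {
    std::printf("histogram not found\n");
    return;
  }
  // print the total number of entries, then the bin-by-bin difference
  std::printf("entries: %.0f vs %.0f\n", h1->GetEntries(), h2->GetEntries());
  for (int ix = 1; ix <= h1->GetNbinsX(); ++ix)
    for (int iy = 1; iy <= h1->GetNbinsY(); ++iy) {
      double d = h1->GetBinContent(ix, iy) - h2->GetBinContent(ix, iy);
      if (d != 0)
        std::printf("bin (%d,%d): %+g\n", ix, iy, d);
    }
}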

jsamudio commented on Jan 31, 2025

> > I believe the plots in ParticleFlow/pfClusterHBHEAlpakaV look about as expected between the legacy format and the Alpaka version.
>
> The problem is that these plots are different between different executions (whatever that meant in the tests of those PRs). The RelMon doesn't seem to show very well what the differences really are for 2D plots, such as pfCluster_RecHitMultiplicity_GPUvsCPU, but at least the number of entries seems to be different (1376 vs 1161).

Okay, I see now. I think I need to do some testing.

jsamudio commented on Feb 2, 2025

Okay, I looked just a bit deeper and found where the discrepancy is occurring, at least for ParticleFlow/pfClusterHBHEAlpakaV. I tested on CMSSW_15_0_X_2025-01-31-2300 with and without the changes from #47226. What I am seeing is that once #47226 is applied, every HCAL rechit in the first 2 events triggers the if statement here:

// skip bad channels
if (rh.chi2() < 0)
  return false;

Since there are no valid HCAL rechits, the PF rechit collection is empty and the clustering does not run on these two events (seen in the screenshot).

[screenshot]

Looking at the values of rh.chi2() on the GPU, they all appear to be -0.0. I checked the energies as well, and those at least look reasonable. I am not sure if the collection is somehow corrupted or if the HCAL rechit chi2 calculation is not functioning properly.
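
As an aside, a chi2 that is genuinely the IEEE-754 value -0.0 would not make rh.chi2() < 0 true, since negative zero compares equal to zero. So if every rechit is skipped, whatever is being read is presumably not a true negative zero but something genuinely negative that merely prints as -0.0 at limited precision. A standalone sketch of the comparison behavior (not CMSSW code):

// negzero.cc -- standalone illustration of IEEE-754 negative-zero comparisons
#include <cstdio>

int main() {
  float negzero = -0.0f;   // a true negative zero
  float tinyneg = -1e-6f;  // a small but genuinely negative value
  std::printf("-0.0f < 0 -> %d\n", negzero < 0);  // prints 0: negative zero is not less than zero
  std::printf("-1e-6 < 0 -> %d\n", tinyneg < 0);  // prints 1: any truly negative value is
  std::printf("%.1f %.1f\n", negzero, tinyneg);   // both print as "-0.0" at this precision
  return 0;
}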

makortel commented on Feb 3, 2025

Thanks @jsamudio! I'll take a deeper look at #47226 as well.

makortel commented on Feb 4, 2025

A couple of observations on step3 of workflow 12834.423:

  • module pfRecHitSoAProducerHBHEOnly (of type PFRecHitSoAProducerHCAL) consumes PortableDeviceCollection<hcal::HcalRecHitSoALayout> from module hbheOnlyRecHitToSoA
  • hbheOnlyRecHitToSoA (HCALRecHitSoAProducer) consumes edm::SortedCollection<HBHERecHit> from hbhereco
  • hbhereco (HcalRecHitSoAToLegacy via SwitchProducer) consumes PortableHostCollection<hcal::HcalRecHitSoALayout> from hbheRecHitProducerPortable
  • hbheRecHitProducerPortable (HBHERecHitProducerPortable) consumes PortableDeviceCollection<hcal::HcalPhase{0,1}DigiSoALayout> from hcalDigisPortable
  • hcalDigisPortable is a HcalDigisSoAProducer, which is changed in #47226 (Migrate BeamSpotDeviceProducer and HcalDigisSoAProducer to rely on implicit host-to-device copy)

I have a vague recollection that there was an idea to make PF use the HCAL RecHits in GPU memory more or less directly, without going back and forth through the CPU and the legacy collection. Is this still the plan?

makortel commented on Feb 4, 2025

All my tests so far suggest the output of hbheRecHitProducerPortable is the same with and without #47226.

makortel commented on Feb 4, 2025

Umm, the chi2 is not copied in

template <>
void CaloRecHitSoAProducer<HCAL>::convertRecHit(hcal::RecHitHostCollection::View::element to,
                                                const HCAL::CaloRecHitType& from) {
  // Fill SoA from HCAL rec hit
  to.detId() = from.id().rawId();
  to.energy() = from.energy();
  to.timeM0() = from.time();
}

so in this workflow the chi2() in
// skip bad channels
if (rh.chi2() < 0)
  return false;

reads uninitialized data (and its behavior is thus undefined).
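
For reference, a minimal sketch of the kind of change that would close this hole (not necessarily what #47256 actually does), assuming the SoA view exposes a writable chi2() column and the legacy HCAL rec hit provides chi2():

template <>
void CaloRecHitSoAProducer<HCAL>::convertRecHit(hcal::RecHitHostCollection::View::element to,
                                                const HCAL::CaloRecHitType& from) {
  // Fill SoA from HCAL rec hit
  to.detId() = from.id().rawId();
  to.energy() = from.energy();
  to.timeM0() = from.time();
  // sketch: also copy the chi2, so the downstream "skip bad channels" check reads defined data
  // (assumes the SoA layout has a chi2 column and the legacy rec hit type provides chi2())
  to.chi2() = from.chi2();
}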

makortel commented on Feb 4, 2025

With #47256 I no longer see differences between running with and without #47226.

fwyzard commented on Feb 4, 2025

> I have a vague recollection that there was an idea to make PF use the HCAL RecHits in GPU memory more or less directly, without going back and forth through the CPU and the legacy collection. Is this still the plan?

We are already doing that in the HLT menu.

I guess the offline workflow needs to be updated?

makortel commented on Feb 5, 2025

> I guess the offline workflow needs to be updated?

Do we want to follow up on this question in this issue, in a separate issue, or is this reminder enough?

makortel commented on Feb 5, 2025

> > I guess the offline workflow needs to be updated?
>
> Do we want to follow up on this question in this issue, in a separate issue, or is this reminder enough?

Ah, I see #47263 now, never mind.
