Activity
makortel commented on Jan 31, 2025
assign heterogeneous
makortel commented on Jan 31, 2025
@jsamudio
cmsbuild commented on Jan 31, 2025
New categories assigned: heterogeneous
@fwyzard, @makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks
cmsbuild commented on Jan 31, 2025
cms-bot internal usage
cmsbuild commented on Jan 31, 2025
A new Issue was created by @makortel.
@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
makortel commented on Jan 31, 2025
#47227 (comment) showed in 12834.422 ParticleFlow/PFClusterV different differences
jsamudio commented on Jan 31, 2025
I believe the plots in ParticleFlow/pfClusterHBHEAlpakaV look about as expected between the legacy format and the Alpaka version. IIRC, and if the configuration on GitHub is to be believed, this is the comparison it is doing. See slide 7 in http://cds.cern.ch/record/2898660. For plots in ParticleFlow/PFClusterV, I am not so sure.
makortel commented on Jan 31, 2025
The problem is that these plots are different between different executions (whatever that meant in the tests of those PRs). RelMon doesn't seem to show very well what the differences really are for 2D plots, such as pfCluster_RecHitMultiplicity_GPUvsCPU, but at least the number of entries seems to be different (1376 vs 1161).
jsamudio commented on Jan 31, 2025
Okay I see now. I think I need to do some testing.
jsamudio commented on Feb 2, 2025
Okay, I looked a bit deeper and found where the discrepancy occurs, at least for ParticleFlow/pfClusterHBHEAlpakaV. I tested on CMSSW_15_0_X_2025-01-31-2300 with and without the changes from #47226. What I am seeing is that once #47226 is applied, every HCAL rechit in the first 2 events triggers the if statement here:
cmssw/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
Lines 62 to 64 in 902cf5d
Since there are no valid HCAL rechits, the PF rechit collection is empty and the clustering does not run on these two events (seen in the screenshot).
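The chain of consequences described here (every rechit fails a cut, so the PF rechit collection is empty, so clustering has nothing to run on) can be sketched in a few lines of plain C++. This is only an illustration: `RecHit`, `passesCut`, `selectPFRecHits` and `clusteringRuns` are hypothetical names, and the cut itself (reject negative chi2) is an assumption, since the referenced lines of the kernel are not reproduced in this thread.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one HCAL rechit; this is not the CMSSW type.
struct RecHit {
  float energy;
  float chi2;
};

// Assumed cut, for illustration only: reject hits with a negative chi2.
bool passesCut(const RecHit& rh) { return rh.chi2 >= 0.f; }

// Keep only the hits that pass the cut.
std::vector<RecHit> selectPFRecHits(const std::vector<RecHit>& hits) {
  std::vector<RecHit> out;
  for (const auto& rh : hits)
    if (passesCut(rh))
      out.push_back(rh);
  return out;
}

// Clustering only has work to do when at least one PF rechit survives.
bool clusteringRuns(const std::vector<RecHit>& hits) {
  return !selectPFRecHits(hits).empty();
}
```

With reasonable energies but a negative chi2 on every hit, `selectPFRecHits` returns an empty vector and `clusteringRuns` is false, which would match the symptom seen in the first two events.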
Looking at the values of rh.chi2() on the GPU, they all appear to be -0.0. I checked the energies as well and those at least look reasonable. I am not sure whether the collection is somehow corrupted or the HCAL rechit chi2 calculation is not functioning properly.
makortel commented on Feb 3, 2025
Thanks @jsamudio! I'll take a deeper look at #47226 as well.
makortel commented on Feb 4, 2025
A couple of observations on step3 of 12834.423:
pfRecHitSoAProducerHBHEOnly (of type PFRecHitSoAProducerHCAL) consumes PortableDeviceCollection<hcal::HcalRecHitSoALayout> from module hbheOnlyRecHitToSoA
hbheOnlyRecHitToSoA (HCALRecHitSoAProducer) consumes edm::SortedCollection<HBHERecHit> from hbhereco
hbhereco (HcalRecHitSoAToLegacy via SwitchProducer) consumes PortableHostCollection<hcal::HcalRecHitSoALayout> from hbheRecHitProducerPortable
hbheRecHitProducerPortable (HBHERecHitProducerPortable) consumes PortableDeviceCollection<hcal::HcalPhase{0,1}DigiSoALayout> from hcalDigisPortable
hcalDigisPortable is the HcalDigisSoAProducer that is changed in "Migrate BeamSpotDeviceProducer and HcalDigisSoAProducer to rely on implicit host-to-device copy" #47226
I have a vague recollection there was an idea to make the PF use the HCAL RecHits more or less directly in GPU memory, without going back and forth between the CPU and the legacy collection. Is this still the plan?
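The consumption chain listed in this comment can also be written down as a small lookup table and walked upstream, which makes it easy to check whether a given module sits downstream of the one changed in #47226. The sketch below is plain C++ and not CMSSW code; the module labels are taken from the list above, while `consumesFrom`, `upstreamChain` and `dependsOn` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Module label -> label of the module it consumes from, as listed in the comment.
const std::map<std::string, std::string>& consumesFrom() {
  static const std::map<std::string, std::string> m = {
      {"pfRecHitSoAProducerHBHEOnly", "hbheOnlyRecHitToSoA"},
      {"hbheOnlyRecHitToSoA", "hbhereco"},
      {"hbhereco", "hbheRecHitProducerPortable"},
      {"hbheRecHitProducerPortable", "hcalDigisPortable"},
  };
  return m;
}

// Walk upstream from 'start' and return every module visited, in order.
// Assumes the table contains no cycles, which holds for the chain above.
std::vector<std::string> upstreamChain(const std::string& start) {
  std::vector<std::string> chain{start};
  const auto& m = consumesFrom();
  for (auto it = m.find(start); it != m.end(); it = m.find(it->second))
    chain.push_back(it->second);
  return chain;
}

// True if 'consumer' depends, directly or indirectly, on 'producer'.
bool dependsOn(const std::string& consumer, const std::string& producer) {
  const auto chain = upstreamChain(consumer);
  for (std::size_t i = 1; i < chain.size(); ++i)
    if (chain[i] == producer)
      return true;
  return false;
}
```

Here `dependsOn("pfRecHitSoAProducerHBHEOnly", "hcalDigisPortable")` is true, with the legacy hbhereco collection as an intermediate step in the chain.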
makortel commented on Feb 4, 2025
All my tests so far suggest the output of hbheRecHitProducerPortable is the same with and without #47226.
makortel commented on Feb 4, 2025
Umm, the chi2 is not copied in
cmssw/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/CaloRecHitSoAProducer.cc
Lines 73 to 80 in ddb0b6e
so in this workflow the chi2() in
cmssw/RecoParticleFlow/PFRecHitProducer/plugins/alpaka/PFRecHitProducerKernel.dev.cc
Lines 62 to 64 in 902cf5d
reads uninitialized data (and its behavior is thus undefined).
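A minimal sketch of this failure mode, in plain C++ rather than the actual Alpaka code: a legacy-to-SoA conversion fills the energy column but skips chi2, so a downstream cut on chi2 sees whatever was in the buffer. All names here (`LegacyRecHit`, `RecHitColumns`, `toColumns`, `countAccepted`) are hypothetical, and the cut (chi2 >= 0) is an assumption. Reading truly uninitialized memory is undefined behavior, so the sketch models the leftover buffer contents with an explicit `stale` value instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical legacy rechit and column layout; not the CMSSW types.
struct LegacyRecHit {
  float energy;
  float chi2;
};

struct RecHitColumns {
  std::vector<float> energy;
  std::vector<float> chi2;
};

// Convert to columns. 'stale' stands in for whatever the buffer held before
// the conversion; a real uninitialized buffer holds indeterminate values.
RecHitColumns toColumns(const std::vector<LegacyRecHit>& hits, bool copyChi2, float stale) {
  RecHitColumns soa;
  soa.energy.assign(hits.size(), stale);
  soa.chi2.assign(hits.size(), stale);
  for (std::size_t i = 0; i < hits.size(); ++i) {
    soa.energy[i] = hits[i].energy;
    if (copyChi2)
      soa.chi2[i] = hits[i].chi2;  // the column that was being skipped
  }
  return soa;
}

// Count the hits that an assumed downstream cut (chi2 >= 0) would keep.
std::size_t countAccepted(const RecHitColumns& soa) {
  assert(soa.energy.size() == soa.chi2.size());
  std::size_t n = 0;
  for (float c : soa.chi2)
    if (c >= 0.f)
      ++n;
  return n;
}
```

With `copyChi2 == false` the number of accepted hits depends only on `stale`, not on the input: the result changes with whatever happened to be in memory, which would be consistent with plots that differ between executions. Copying the column, or at least initializing it to a defined value, makes the outcome deterministic.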
makortel commented on Feb 4, 2025
With #47256 I no longer see differences between running with and without #47226.
fwyzard commented on Feb 4, 2025
We are already doing that in the HLT menu.
I guess the offline workflow needs to be updated?
makortel commented on Feb 5, 2025
Do we want to follow up on this question in this issue, in a separate issue, or is this reminder enough?
makortel commented on Feb 5, 2025
Ah, I see #47263 now, never mind.
ParticleFlow/PFClusterV in 12834.42[23] GPU workflows #47406