-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GPU clusterizer with neural networks #13981
Conversation
…l executions. Clusters are not getting published yet (FIXME)
…alues "GPU_proc.[setting]=...;..."
Please consider the following formatting changes to AliceO2Group#13610
Please consider the following formatting changes to AliceO2Group#13709
GPU/GPUTracking/CMakeLists.txt
Outdated
PUBLIC_LINK_LIBRARIES O2::GPUUtils | ||
O2::GPUCommon | ||
O2::ReconstructionDataFormats | ||
O2::TPCFastTransformation | ||
O2::ML | ||
PRIVATE_LINK_LIBRARIES O2::DataFormatsTPC | ||
SOURCES ${SRCS_DATATYPES}) | ||
add_compile_definitions(GPUCA_HAS_ONNX=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidrohr This is needed, otherwise the code doesn't find GPUCA_HAS_ONNX internally. E.g. tpcNNClusterer objects are not created
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot globally add compile definitions, it should be bound to a target, then we have to understand why it didn't work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found it: See the new commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But wasn't it already there in my change:
target_compile_definitions(${targetName} PRIVATE GPUCA_O2_LIB GPUCA_TPC_GEOMETRY_O2 GPUCA_HAS_ONNX=1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But GPUCA_HAS_ONNX=1 is needed in two places. Not sure which changes you refer to, but I saw it only in one place before. Maybe I missed it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, now I understand. You added the library O2::ML
to the GPUDataTypes library, not the the GPUTracking library, same for the first define. Can you remove both from GPUDataTypes, and add both only to GPUTracking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Marking as WIP to disable the CI until preparatory PRs are merged. |
Please consider the following formatting changes to AliceO2Group#13981
Triggered the CI to see what the build / unit tests result in with the new changes |
Error while checking build/O2/fullCI_slc9 for 78c342d at 2025-03-14 03:57:
Full log here. |
I don't like to copy&paste the toNative function. |
} else { | ||
runKernel<GPUTPCCFDeconvolution>({GetGrid(clusterer.mPmemory->counters.nPositions, lane), {iSector}}); | ||
DoDebugAndDump(RecoStep::TPCClusterFinding, 262144 << 4, clusterer, &GPUTPCClusterFinder::DumpChargeMap, *mDebugFile, "Split Charges"); | ||
runKernel<GPUTPCCFClusterizer>({GetGrid(clusterer.mPmemory->counters.nClusters, lane, GPUReconstruction::krnlDeviceType::CPU), {iSector}}, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you incorrectly added a GPUReconstruction::krnlDeviceType::CPU
option for the default code path, which has to go. Otherwise it will break clusterization on the GPU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed it also for my kernels now as they should (in the future) also run on GPU memory
@@ -1073,4 +1172,4 @@ int32_t GPUChainTracking::RunTPCClusterizer(bool synchronizeOutput) | |||
|
|||
#endif | |||
return 0; | |||
} | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your editor seems to regular misformat the files removing newlines at end of files. Could you please have a look and fix this?
GPU/GPUTracking/kernels.cmake
Outdated
@@ -111,7 +114,15 @@ o2_gpu_add_kernel("GPUTPCCFNoiseSuppression, noiseSuppression" "= TPCCLUS | |||
o2_gpu_add_kernel("GPUTPCCFNoiseSuppression, updatePeaks" "= TPCCLUSTERFINDER" LB) | |||
o2_gpu_add_kernel("GPUTPCCFDeconvolution" "= TPCCLUSTERFINDER" LB) | |||
o2_gpu_add_kernel("GPUTPCCFClusterizer" "= TPCCLUSTERFINDER" LB int8_t onlyMC) | |||
o2_gpu_add_kernel("GPUTPCCFMCLabelFlattener, setRowOffsets" "= TPCCLUSTERFINDER") | |||
if(NOT ALIGPU_BUILD_TYPE STREQUAL "Standalone") | |||
o2_gpu_add_kernel("GPUTPCNNClusterizerKernels, runCfClusterizer" "= TPCNNCLUSTERFINDER" LB GPUConstantMem* processors uint8_t sector int8_t dtype int8_t onlyMC uint batchStart) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The GPUConstantMem* processors
here seems pretty weird, that should not be there. Are you sure you need it? If yes, I need to check what is going wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I need it. When you call runKernel, the runKernelBackendInternal function is called internally (GPUReconstructionCPU.h / .cxx). This function generates a thread instance with this signature:
T::template Thread<I>(x.nBlocks, 1, iB, 0, smem, T::Processor(*mHostConstantMem)[y.index], args...);
We did not define the Processor function in my classes and the use instances of GPUTPCNNClusterizer and GPUTPCClusterFinder where just (semi-)random pointers. This lead to the pointer change inside the kernel that I mentioned to you in person. The easy way to fix this for now was to pass the GPUConstantMem instance and use the class-instances directly from there. Then everything works.
You can check GPUTPCCFClusterizer.h, there is a Processor defined. However I need both instances of GPUTPCNNClsuterizer and the normal CFClusterFinder in my kernels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you derive from GPUKernelTemplate, which has the Processor
function defined here:
https://github.com/AliceO2Group/AliceO2/blob/705ebfb083c41183183c554c0cb17a6a9423e4c5/GPU/GPUTracking/Base/GPUGeneralKernels.h#L83
Took me a while to understand... :(.
The problem is that you pass in iSector twice:
runKernel<GPUTPCNNClusterizerKernels, GPUTPCNNClusterizerKernels::determineClass1Labels>({GetGrid(iSize, lane, GPUReconstruction::krnlDeviceType::CPU), {iSector}}, processors(), iSector, clustererNN.nnClusterizerDtype, 0, batchStart);
With that, you get the iSector'th instance of processors, and in there the iSector'th instance of the clusterer.
That is indeed not obvious, I'll have to add some protection here, sorry that you run into this...
Please change the first {iSector} to krnlRunRangeNone
, then it will work without passing it in manually.
Note that passing it in manually will not work on the GPU, it must be passed in automatically since on the GPU this processors struct lies in constant memory and there is no pointer to it.
PR looks good now, except for newlines at end of files... |
This PR brings a neural network based implementation for the online GPU cluster finder. The code is parallelized for GPU application but requires the correct ONNX runtime version to run on GPUs (see alidist: alisw/alidist#5622).