
GPU stream implementation for ONNX runtime #14117


Merged: 51 commits into AliceO2Group:dev from onnx_gpu_implementation on Apr 20, 2025

Conversation

ChSonnabend (Collaborator)

No description provided.

Contributor

REQUEST FOR PRODUCTION RELEASES:
To request that your PR be included in production software, please add the corresponding "async-" labels to your PR. Add the labels directly (if you have the permissions) or add a comment of the form below (note that labels are separated by a ","):

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and remove <label3>; a concrete example follows the list of labels below.

The following labels are available:
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5
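
For example, a comment like the following (using labels from the list above) would add the first two labels and remove the third:

+async-label async-2024-pp-apass1, async-2024-PbPb-apass2, !async-2023-pp-apass4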

@davidrohr (Collaborator) left a comment

looks ok, just one change needed

#else // HIP
void* GPUReconstructionHIP::getGPUPointer(void* ptr)
{
void* retVal = nullptr;
GPUChkErr(hipHostGetDevicePointer(&retVal, ptr, 0));
return retVal;
}

#ifdef GPUCA_HAS_ONNX
int32_t GPUReconstructionCUDA::SetONNXGPUStream(OrtSessionOptions* session_options, int32_t stream)

This should say HIP instead of CUDA.
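
For reference, a minimal hedged sketch (not the code in this PR) of what the HIP/ROCm-side stream setter could look like, handing an existing hipStream_t to the ONNX Runtime ROCm execution provider. OrtROCMProviderOptions and Ort::SessionOptions::AppendExecutionProvider_ROCM are from the public ONNX Runtime C++ API; the function name and overall shape only mirror the diff above and are assumptions:

// Sketch under assumptions: reuse the reconstruction's own HIP stream in the ORT ROCm EP.
#include <onnxruntime_cxx_api.h>
#include <hip/hip_runtime.h>

void SetONNXGPUStreamROCmSketch(Ort::SessionOptions& session_options, hipStream_t stream, int32_t deviceId)
{
  OrtROCMProviderOptions rocm_options{};
  rocm_options.device_id = deviceId;          // run the session on this GPU
  rocm_options.has_user_compute_stream = 1;   // tell ORT to reuse our stream
  rocm_options.user_compute_stream = static_cast<void*>(stream);
  session_options.AppendExecutionProvider_ROCM(rocm_options);
}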

@@ -16,6 +16,7 @@
#include "GPUReconstructionCUDAIncludesHost.h"

#include <cuda_profiler_api.h>
#include "ML/OrtInterface.h"

Why do you need the Interface header here?

@ChSonnabend (Collaborator, Author) commented Mar 28, 2025

The actual CCDB fetching and loading of models will be adjusted in the next PR. For now, this will build and run with some (technically unnecessary) extra processing steps for model initialization.
Triggering the CI now...

@ChSonnabend marked this pull request as ready for review March 28, 2025 09:08
@ChSonnabend requested a review from a team as a code owner March 28, 2025 09:08
@alibuild (Collaborator)

Error while checking build/O2/fullCI_slc9 for 46fb1e1 at 2025-03-28 11:28:

## sw/BUILD/O2-latest/log
/sw/SOURCES/O2/14117-slc9_x86-64/0/GPU/GPUTracking/Base/cuda/GPUReconstructionCUDA.cu(671): error: identifier "CreateCUDAProviderOptions" is undefined
/sw/SOURCES/O2/14117-slc9_x86-64/0/GPU/GPUTracking/Base/cuda/GPUReconstructionCUDA.cu(678): error: identifier "UpdateCUDAProviderOptionsWithValue" is undefined
/sw/SOURCES/O2/14117-slc9_x86-64/0/GPU/GPUTracking/Base/cuda/GPUReconstructionCUDA.cu(679): error: identifier "SessionOptionsAppendExecutionProvider_CUDA_V2" is undefined
/sw/SOURCES/O2/14117-slc9_x86-64/0/GPU/GPUTracking/Base/cuda/GPUReconstructionCUDA.cu(682): error: identifier "ReleaseCUDAProviderOptions" is undefined
ninja: build stopped: subcommand failed.

Full log here.
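
For context, CreateCUDAProviderOptions, UpdateCUDAProviderOptionsWithValue, SessionOptionsAppendExecutionProvider_CUDA_V2 and ReleaseCUDAProviderOptions are members of the OrtApi struct in ONNX Runtime's C API rather than free functions, and they require a sufficiently recent ORT build. Below is a hedged sketch of how they are typically reached; whether that is the actual cause of this failure is not established by the log, and the function name and parameters are placeholders:

// Sketch under assumptions: configure the CUDA execution provider through the V2
// provider-options C API, reached via the OrtApi struct from Ort::GetApi().
#include <onnxruntime_cxx_api.h>
#include <cuda_runtime.h>
#include <string>

void AppendCudaProviderSketch(OrtSessionOptions* session_options, cudaStream_t stream, int deviceId)
{
  const OrtApi& api = Ort::GetApi();
  OrtCUDAProviderOptionsV2* cuda_options = nullptr;
  Ort::ThrowOnError(api.CreateCUDAProviderOptions(&cuda_options));
  const std::string device = std::to_string(deviceId);
  const char* keys[] = {"device_id"};
  const char* values[] = {device.c_str()};
  Ort::ThrowOnError(api.UpdateCUDAProviderOptions(cuda_options, keys, values, 1));
  // Pointer-valued option: hand over an existing CUDA stream.
  Ort::ThrowOnError(api.UpdateCUDAProviderOptionsWithValue(cuda_options, "user_compute_stream", static_cast<void*>(stream)));
  Ort::ThrowOnError(api.SessionOptionsAppendExecutionProvider_CUDA_V2(session_options, cuda_options));
  api.ReleaseCUDAProviderOptions(cuda_options);
}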

@ChSonnabend marked this pull request as draft March 28, 2025 12:09
@ChSonnabend (Collaborator, Author)

Marking as draft until shadowProcessors are implemented

… will merge AliceO2Group#14069 to have the changes in GPUChainTrackingClusterizer.
@ChSonnabend (Collaborator, Author)

Triggering the CI

@alibuild (Collaborator)

Error while checking build/O2/fullCI_slc9 for 5f741fc at 2025-04-11 15:14:

## sw/BUILD/O2-latest/log
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

Full log here.

@ChSonnabend (Collaborator, Author)

Interestingly, the ORT variables are all OFF / 0 in the slc9 builder, even though the ROCm and CUDA builds should be enabled. Are there dependencies missing in the build container?

From the previous build:

+++ export ORT_ROCM_BUILD=0
+++ ORT_ROCM_BUILD=0
+++ export ORT_CUDA_BUILD=OFF
+++ ORT_CUDA_BUILD=OFF
+++ export ORT_MIGRAPHX_BUILD=0
+++ ORT_MIGRAPHX_BUILD=0
+++ export ORT_TENSORRT_BUILD=0
+++ ORT_TENSORRT_BUILD=0
++ echo 'ORT_ROCM_BUILD: 0'
ORT_ROCM_BUILD: 0
++ echo 'ORT_CUDA_BUILD: OFF'
ORT_CUDA_BUILD: OFF

@alibuild (Collaborator)

Error while checking build/O2/fullCI_slc9 for 70907aa at 2025-04-11 21:25:

## sw/BUILD/O2-latest/log
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

Full log here.

@davidrohr (Collaborator)

I also saw that the ORT variables are OFF, and I cannot really understand why. However, why do we get the link failure? That seems to be a genuine error: if the ORT variables are off, you cannot expect all dependencies to be available, i.e. the software should also build correctly with the GPU stuff available but the ONNX ORT stuff not available.

/sw/slc9_x86-64/GCC-Toolchain/v14.2.0-alice2-1/bin/../lib/gcc/x86_64-unknown-linux-gnu/14.2.0/../../../../x86_64-unknown-linux-gnu/bin/ld: GPU/GPUTracking/Base/cuda/CMakeFiles/O2lib-GPUTrackingCUDA.dir/GPUReconstructionCUDA.cu.o:(.data.rel.ro._ZTVN2o23gpu21GPUReconstructionCUDAE[_ZTVN2o23gpu21GPUReconstructionCUDAE]+0x100): undefined reference to `o2::gpu::GPUReconstructionCUDA::SetONNXGPUStream(Ort::SessionOptions&, int, int*)'

That should be fixed in any case. Could you have a look (while the ORT variables are still OFF)?

As for why the ORT variables are off, I have no idea. The FullCI should use this Docker container: registry.cern.ch/alisw/slc9-gpu-builder (@singiamtel: correct me if I am wrong). It is created from these files: https://github.com/alisw/docks/tree/master/slc9-gpu-builder, so it should contain all dependencies.

@alibuild (Collaborator)

Error while checking build/O2/fullCI_slc9 for e46cdfa at 2025-04-13 13:16:

## sw/BUILD/O2Physics-latest/log
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
(the above line is repeated 18 times in the log)


## sw/BUILD/O2-full-system-test-latest/log
Detected critical problem in logfile sim.log


## sw/BUILD/o2checkcode-latest/log
--
========== List of errors found ==========
++ GRERR=0
++ grep -v clang-diagnostic-error error-log.txt
++ grep ' error:'
grep: error-log.txt: binary file matches
++ GRERR=1
++ [[ 1 == 0 ]]
++ mkdir -p /sw/INSTALLROOT/1324b702156a43154d5a6574c37d878d8a5f4bb6/slc9_x86-64/o2checkcode/1.0-local168/etc/modulefiles
++ cat
--

Full log here.

@davidrohr (Collaborator)

Thanks, the linker issue is fixed. It now fails due to the FullCI simulation issue, which I just fixed. The next time the FullCI runs, it should pass.

And I think I understand what happens with the ORT builds.
@singiamtel @ktf: I believe we build ONNXRuntime in another container without CUDA/ROCm support and push that into the binary package repository. When the FullCI runs, it then picks up that binary package from the store, which was built without GPU support, so the slc9-gpu-builder container builds O2 against an ONNXRuntime that does not have GPU support.
I assume that to fix this we have to create different package hashes depending on whether GPU support is available or not. We should discuss tomorrow at CERN; if I understand correctly, that functionality should already be there.

@davidrohr (Collaborator) left a comment

@ChSonnabend: I have a couple of mostly cosmetic comments, please have a look. Besides that, the PR is fine with me now, and it passes the CI, now also compiling with ORT_ROCM_BUILD=1. So once you have addressed my points and you write that you are satisfied with the current state, we can merge it.

PRIVATE_INCLUDE_DIRECTORIES
${CMAKE_SOURCE_DIR}/Detectors/Base/src
${CMAKE_SOURCE_DIR}/Detectors/TRD/base/src
${CMAKE_SOURCE_DIR}/DataFormats/Reconstruction/src
${CMAKE_CURRENT_SOURCE_DIR}
TARGETVARNAME targetName)

message("Compile definitions for ONNX runtime (CUDA):")

I would not put all of this here. The CUDA build is only done when compiling with CUDA, so the printout will only be there if we build for CUDA, which makes no sense for the ROCm variable, to be honest. Do we actually need this at all, since we anyway have this printout in the cmake line with -DCMAKE....?

In general, I would propose to do the following:

  • Remove the printouts here.
  • Create a separate PR, in which:
    • you move the include(dependencies/FindONNXRuntime.cmake) from the main CMakeLists.txt to dependencies/O2Dependencies.cmake. In there, you create a one-line printout that shows "ONNXRuntime Found: 0/1 ORT_CUDA_BUILD 0/1 ...".

target_compile_definitions(${targetName} PRIVATE
GPUCA_HAS_ONNX=1
ORT_ROCM_BUILD=$<BOOL:${ORT_ROCM_BUILD}>
ORT_CUDA_BUILD=$<BOOL:${ORT_CUDA_BUILD}>

Do we need the ROCM and MIGRAPHX compile definitions for CUDA? I assume we can remove them here and add them only for the ROCm one?

o2_add_library(ML
SOURCES src/OrtInterface.cxx
TARGETVARNAME targetName
PRIVATE_LINK_LIBRARIES O2::Framework ONNXRuntime::ONNXRuntime)

# Pass ORT variables as a preprocessor definition
target_compile_definitions(${targetName} PRIVATE

Instead of setting 0/1 definitions, I would set only the =1 definition if the CMake variable is set. Then, in the code further below, you don't need
#if defined(FOO) && FOO == 1, but you can simply use #ifdef FOO.
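
A minimal sketch of that simplification on the C++ side (the macro name is taken from the snippet below; the assumption is that CMake adds it only when the CUDA build of ONNX Runtime is enabled):

// If CMake defines ORT_CUDA_BUILD only when the feature is on, the guard collapses
// from "#if defined(ORT_CUDA_BUILD) && ORT_CUDA_BUILD == 1" to a plain #ifdef:
#ifdef ORT_CUDA_BUILD
  // CUDA-specific ONNX Runtime setup
#else
  // build without GPU support in ONNX Runtime
#endif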


void GPUReconstructionCUDA::SetONNXGPUStream(Ort::SessionOptions& session_options, int32_t stream, int32_t* deviceId)
{
#if defined(ORT_CUDA_BUILD) && ORT_CUDA_BUILD == 1

If you do as I wrote above, you can use a simple #ifdef here.

PRIVATE_INCLUDE_DIRECTORIES
${CMAKE_SOURCE_DIR}/Detectors/Base/src
${CMAKE_SOURCE_DIR}/Detectors/TRD/base/src
${CMAKE_SOURCE_DIR}/DataFormats/Reconstruction/src
${GPUCA_HIP_SOURCE_DIR}
TARGETVARNAME targetName)

message("Compile definitions for ONNX runtime (HIP / ROCM):")

same as I commented for CUDA applies here

PUBLIC_INCLUDE_DIRECTORIES ${INCDIRS}
SOURCES ${SRCS} ${SRCS_NO_CINT} ${SRCS_NO_H})

target_include_directories(
${targetName}
PRIVATE $<TARGET_PROPERTY:O2::Framework,INTERFACE_INCLUDE_DIRECTORIES>)

target_compile_definitions(${targetName} PRIVATE GPUCA_O2_LIB GPUCA_TPC_GEOMETRY_O2 GPUCA_HAS_ONNX=1)
target_compile_definitions(${targetName} PRIVATE

please don't change the formatting if not needed.

@@ -42,6 +42,7 @@
#ifdef GPUCA_HAS_ONNX
#include "GPUTPCNNClusterizerKernels.h"
#include "GPUTPCNNClusterizerHost.h"
// #include "ML/3rdparty/GPUORTFloat16.h"

Why do you add a commented-out include?

@@ -118,6 +119,7 @@ GPURecoWorkflowSpec::GPURecoWorkflowSpec(GPURecoWorkflowSpec::CompletionPolicyDa
mConfig.reset(new GPUO2InterfaceConfiguration);
mConfParam.reset(new GPUSettingsO2);
mTFSettings.reset(new GPUSettingsTF);
mNNClusterizerSettings.reset(new GPUSettingsProcessingNNclusterizer);

You don't need this here; all these settings should already be available from mConfig->configProcessing.nn.

o2::tpc::NeuralNetworkClusterizer nnClusterizerFetcher;
nnClusterizerFetcher.initCcdbApi(mNNClusterizerSettings->nnCCDBURL);
std::map<std::string, std::string> ccdbSettings = {
{"nnCCDBURL", mNNClusterizerSettings->nnCCDBURL},

Strictly speaking, all CCDB-related settings so far are in GPUSettingsO2, not in GPUSettingsProcessing. Perhaps it would be cleaner to also move the NN CCDB settings there, but since they are anyway encapsulated in your own struct, I don't care much.

@alibuild (Collaborator)

Error while checking build/O2/fullCI_slc9 for a67b634 at 2025-04-19 11:43:

## sw/BUILD/O2Physics-latest/log
Error in cling::AutoLoadingVisitor::InsertIntoAutoLoadingState:
(the above line is repeated 18 times in the log)


## sw/BUILD/o2checkcode-latest/log
--
========== List of errors found ==========
++ GRERR=0
++ grep -v clang-diagnostic-error error-log.txt
++ grep ' error:'
grep: error-log.txt: binary file matches
++ GRERR=1
++ [[ 1 == 0 ]]
++ mkdir -p /sw/INSTALLROOT/05b665234baab729088cfcd7ed4a6c4f4e367afa/slc9_x86-64/o2checkcode/1.0-local226/etc/modulefiles
++ cat
--

Full log here.

Please consider the following formatting changes to AliceO2Group#14117
@ChSonnabend (Collaborator, Author)

Technically, the Ort Session is now capable of reserving arena memory with a custom allocator using the volatile memory allocation (tested via print-outs). Unfortunately, when using IO binding I haven't figured out a way yet to also use volatile allocation for the tensor allocation, which is actually the significant part of the total memory. For now this isn't used anyway, so I am adjusting for the comments and it is good to go from my side (I will open a separate PR for that once it is figured out).
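
For reference, a hedged sketch of the general ONNX Runtime mechanism for binding a tensor that wraps externally allocated device memory, so that ORT does not allocate it itself (Ort::MemoryInfo, Ort::Value::CreateTensor and Ort::IoBinding are public ORT C++ API; the buffer, node names and shape are made-up placeholders). Whether such a buffer can be backed by the volatile allocation mentioned above is exactly the open question here:

// Sketch under assumptions: wrap a caller-owned CUDA buffer in an Ort::Value and bind
// it as a model input, letting ORT allocate only the output on the GPU.
#include <onnxruntime_cxx_api.h>
#include <array>

void BindExternalInputSketch(Ort::Session& session, float* deviceBuffer, size_t nElements)
{
  // Describe where the buffer lives: CUDA device memory on GPU 0 (placeholder values).
  Ort::MemoryInfo gpuInfo("Cuda", OrtDeviceAllocator, /*device_id=*/0, OrtMemTypeDefault);

  // Placeholder shape; in a real model this must match the input node.
  std::array<int64_t, 2> shape{1, static_cast<int64_t>(nElements)};
  Ort::Value input = Ort::Value::CreateTensor<float>(gpuInfo, deviceBuffer, nElements, shape.data(), shape.size());

  Ort::IoBinding binding(session);
  binding.BindInput("input", input);      // "input" is a placeholder node name
  binding.BindOutput("output", gpuInfo);  // let ORT allocate the output on the GPU

  session.Run(Ort::RunOptions{nullptr}, binding);
}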

@alibuild (Collaborator)

Error while checking build/O2/fullCI_slc9 for 4d3f54d at 2025-04-20 01:22:

## sw/BUILD/O2-latest/log
/sw/SOURCES/O2/14117-slc9_x86-64/0/GPU/GPUTracking/Base/cuda/GPUReconstructionCUDA.cu(643): error: identifier "CreateCUDAProviderOptions" is undefined
/sw/SOURCES/O2/14117-slc9_x86-64/0/GPU/GPUTracking/Base/cuda/GPUReconstructionCUDA.cu(650): error: expression must have class type but it has type "OrtCUDAProviderOptionsV2 *"
/sw/SOURCES/O2/14117-slc9_x86-64/0/GPU/GPUTracking/Base/cuda/GPUReconstructionCUDA.cu(651): error: identifier "UpdateCUDAProviderOptionsWithValue" is undefined
/sw/SOURCES/O2/14117-slc9_x86-64/0/GPU/GPUTracking/Base/cuda/GPUReconstructionCUDA.cu(652): error: cannot convert to incomplete class "OrtCUDAProviderOptionsV2"
/sw/SOURCES/O2/14117-slc9_x86-64/0/GPU/GPUTracking/Base/cuda/GPUReconstructionCUDA.cu(655): error: identifier "ReleaseCUDAProviderOptions" is undefined
ninja: build stopped: subcommand failed.

Full log here.

@davidrohr (Collaborator) left a comment

looks good now, can be squash-merged once CI is green

@davidrohr enabled auto-merge (squash) April 20, 2025 08:02
@davidrohr disabled auto-merge April 20, 2025 10:20
@davidrohr merged commit 497d53f into AliceO2Group:dev Apr 20, 2025
13 checks passed
@ChSonnabend deleted the onnx_gpu_implementation branch April 25, 2025 08:30
Labels: None yet
3 participants