TensorRT-YOLO provides example code for multi-threaded and multi-process inference for both Python and C++ developers.
TensorRT-YOLO allows multiple threads to share a single engine for inference. One engine can support multiple execution contexts simultaneously, so multiple threads share the same model weights and parameters while only one copy is kept in host or GPU memory. As a result, even when a model is cloned several times, its memory footprint does not grow linearly with the number of clones.
TensorRT-YOLO provides the following interfaces for model cloning (using DetectModel as an example):
- Python: `DetectModel.clone()`
- C++: `DetectModel::clone()`
```python
import cv2
from tensorrt_yolo.infer import InferOption, DetectModel, generate_labels, visualize

# Configure inference options
option = InferOption()
option.enable_swap_rb()

# Initialize the model
model = DetectModel("yolo11n-with-plugin.engine", option)

# Load an image
im = cv2.imread("test_image.jpg")

# Perform model prediction
result = model.predict(im)
print(f"==> Detection result: {result}")

# Visualize the detection results
labels = generate_labels("labels.txt")
vis_im = visualize(im, result, labels)
cv2.imwrite("vis_image.jpg", vis_im)

# Clone the model and perform prediction
clone_model = model.clone()
clone_result = clone_model.predict(im)
print(f"==> Cloned model detection result: {clone_result}")
```
```cpp
#include <memory>

#include <opencv2/opencv.hpp>

// For ease of use, the module uses only standard libraries apart from CUDA and TensorRT
#include "deploy/model.hpp"   // Contains class definitions for model inference
#include "deploy/option.hpp"  // Contains configuration classes for inference options
#include "deploy/result.hpp"  // Contains definitions for inference results

int main() {
    // Configure inference options
    deploy::InferOption option;
    option.enableSwapRB();  // Enable channel swapping (BGR to RGB)

    // Initialize the model
    auto model = std::make_unique<deploy::DetectModel>("yolo11n-with-plugin.engine", option);

    // Load an image
    cv::Mat cvim = cv::imread("test_image.jpg");
    deploy::Image im(cvim.data, cvim.cols, cvim.rows);

    // Perform model prediction
    deploy::DetResult result = model->predict(im);

    // Visualization (code omitted)
    // ...  // Visualization code is not provided and can be implemented as needed

    // Clone the model and perform prediction
    auto clone_model = model->clone();
    deploy::DetResult clone_result = clone_model->predict(im);

    return 0;  // Program ends successfully
}
```
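Because each clone owns its own execution context while sharing the engine, a common pattern is to hand one clone to every worker thread. The following is a minimal Python sketch of that pattern under stated assumptions: the image paths, thread count, and `worker` function are illustrative and are not part of the repository's shipped examples.

```python
import threading

import cv2
from tensorrt_yolo.infer import InferOption, DetectModel

option = InferOption()
option.enable_swap_rb()

# The base model owns the engine; each worker thread receives its own lightweight clone.
base_model = DetectModel("yolo11n-with-plugin.engine", option)


def worker(model: DetectModel, image_path: str) -> None:
    # Hypothetical per-thread task: read an image and run inference on this thread's clone.
    im = cv2.imread(image_path)
    result = model.predict(im)
    print(f"[{threading.current_thread().name}] {image_path}: {result}")


# Image paths below are placeholders for illustration only.
threads = [
    threading.Thread(target=worker, args=(base_model.clone(), path), name=f"worker-{i}")
    for i, path in enumerate(["test_image_1.jpg", "test_image_2.jpg"])
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```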
Due to Python's Global Interpreter Lock (GIL), multi-threading cannot fully utilize multi-core CPU performance for compute-intensive tasks. For this reason, both multi-processing and multi-threading approaches to concurrent inference are provided for Python. The two are compared below:
| | Resource Usage | Compute-Intensive Tasks | I/O-Intensive Tasks | Inter-process/Thread Communication |
|---|---|---|---|---|
| Multi-processing | High | Fast | Fast | Slow |
| Multi-threading | Low | Slow | Relatively Fast | Fast |
Note
The above analysis is theoretical. In practice, aggregating results across processes adds inter-process communication overhead, and real workloads are rarely purely compute-bound or I/O-bound, so each task should be benchmarked and tuned individually.
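For compute-bound Python workloads where the GIL becomes the bottleneck, inference can instead be distributed over a process pool. The sketch below is an illustration rather than the repository's example code: it creates an independent model in each worker process (TensorRT engines and CUDA contexts cannot be shared across process boundaries), and the image paths, pool size, and helper names are assumptions.

```python
import multiprocessing as mp

import cv2
from tensorrt_yolo.infer import InferOption, DetectModel

_model = None  # per-process model instance


def init_worker(engine_path: str) -> None:
    # Runs once in every worker process and builds an independent model.
    global _model
    option = InferOption()
    option.enable_swap_rb()
    _model = DetectModel(engine_path, option)


def infer(image_path: str) -> str:
    # Print inside the worker so the detection result never has to cross process boundaries.
    im = cv2.imread(image_path)
    result = _model.predict(im)
    print(f"==> {image_path}: {result}")
    return image_path


if __name__ == "__main__":
    # Placeholder image paths for illustration only.
    images = ["test_image_1.jpg", "test_image_2.jpg", "test_image_3.jpg"]
    # "spawn" avoids forking a process that has already initialized the CUDA runtime.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2, initializer=init_worker,
                  initargs=("yolo11n-with-plugin.engine",)) as pool:
        pool.map(infer, images)
```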
C++ multi-threading inference has significant advantages in terms of resource usage and performance, making it the best choice for concurrent inference.
Hardware Configuration:
- CPU: AMD Ryzen 7 5700X 8-Core Processor
- GPU: NVIDIA GeForce RTX 2080 Ti 22GB
Model: YOLO11x
| Number of Models | `model.clone()` | No Cloning |
|---|---|---|
| 1 | 408M | 408M |
| 2 | 536M | 716M |
| 3 | 662M | 1092M |
| 4 | 790M | 1470M |
Important
After using `model.clone()`, GPU memory usage still increases slightly, for the following reasons:

- **Pre-allocation of Input/Output GPU Memory**
  - Each `context` requires independent GPU memory buffers for model inputs and outputs. Even though multiple `contexts` share the same `engine`, input/output buffers need to be allocated separately, which consumes additional GPU memory.
  - The GPU memory usage of the input/output buffers depends on the shape and data type of the model's inputs and outputs. For larger inputs and outputs, this memory usage can be significant.
- **Context GPU Memory Overhead**
  - Each `context` is an independent execution context used to manage the execution flow of model inference. Even though the `engine` is shared, each `context` still requires additional GPU memory to store execution states, intermediate results, and temporary buffers.
  - The GPU memory usage of a `context` is typically much smaller than that of an `engine`, but it increases gradually as the number of `contexts` grows.
- **CUDA Runtime Overhead**
  - The CUDA runtime requires additional GPU memory for each `context` to manage execution flows, kernel launches, and memory operations.
  - TensorRT may use optimization techniques (e.g., memory reuse, kernel fusion) during inference, which can introduce additional GPU memory overhead.