TensorRT-YOLO provides example code for multi-threaded and multi-process inference for both Python and C++ developers.
TensorRT-YOLO allows multiple threads to share a single engine for inference. One engine can support multiple execution contexts simultaneously, so multiple threads share the same model weights and parameters while only one copy is kept in host or GPU memory. As a result, even when a model is cloned several times, its memory footprint does not grow linearly with the number of clones.
TensorRT-YOLO provides the following interfaces for model cloning (using DetectModel as an example):
- Python: `DetectModel.clone()`
- C++: `DetectModel::clone()`
```python
import cv2
from tensorrt_yolo.infer import InferOption, DetectModel, generate_labels, visualize

# Configure inference options
option = InferOption()
option.enable_swap_rb()

# Initialize the model
model = DetectModel("yolo11n-with-plugin.engine", option)

# Load an image
im = cv2.imread("test_image.jpg")

# Perform model prediction
result = model.predict(im)
print(f"==> Detection result: {result}")

# Visualize the detection results
labels = generate_labels("labels.txt")
vis_im = visualize(im, result, labels)
cv2.imwrite("vis_image.jpg", vis_im)

# Clone the model and perform prediction
clone_model = model.clone()
clone_result = clone_model.predict(im)
print(f"==> Cloned model detection result: {clone_result}")
```
```cpp
#include <memory>

#include <opencv2/opencv.hpp>

// For ease of use, the module uses only standard libraries apart from CUDA and TensorRT
#include "deploy/model.hpp"   // Contains class definitions for model inference
#include "deploy/option.hpp"  // Contains configuration classes for inference options
#include "deploy/result.hpp"  // Contains definitions for inference results

int main() {
    // Configure inference options
    deploy::InferOption option;
    option.enableSwapRB();  // Enable channel swapping (BGR to RGB)

    // Initialize the model
    auto model = std::make_unique<deploy::DetectModel>("yolo11n-with-plugin.engine", option);

    // Load an image
    cv::Mat cvim = cv::imread("test_image.jpg");
    deploy::Image im(cvim.data, cvim.cols, cvim.rows);

    // Perform model prediction
    deploy::DetResult result = model->predict(im);

    // Visualization (code omitted)
    // ...  // Visualization code is not provided and can be implemented as needed

    // Clone the model and perform prediction
    auto clone_model = model->clone();
    deploy::DetResult clone_result = clone_model->predict(im);

    return 0;  // Program ends successfully
}
```
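Because each clone owns its own execution context while sharing the engine, a common pattern is to hand one clone to every worker thread. The following is a minimal Python sketch of that pattern under stated assumptions: the image paths, thread count, and `worker` function are illustrative and are not part of the repository's shipped examples.

```python
import threading

import cv2
from tensorrt_yolo.infer import InferOption, DetectModel

option = InferOption()
option.enable_swap_rb()

# The base model owns the engine; each worker thread receives its own lightweight clone.
base_model = DetectModel("yolo11n-with-plugin.engine", option)


def worker(model: DetectModel, image_path: str) -> None:
    # Hypothetical per-thread task: read an image and run inference on this thread's clone.
    im = cv2.imread(image_path)
    result = model.predict(im)
    print(f"[{threading.current_thread().name}] {image_path}: {result}")


# Image paths below are placeholders for illustration only.
threads = [
    threading.Thread(target=worker, args=(base_model.clone(), path), name=f"worker-{i}")
    for i, path in enumerate(["test_image_1.jpg", "test_image_2.jpg"])
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```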
Due to Python's Global Interpreter Lock (GIL), multi-threading cannot fully utilize multi-core CPU performance for compute-intensive tasks. For this reason, both multi-processing and multi-threading approaches to concurrent inference are provided for Python. The two are compared below:
| | Resource Usage | Compute-Intensive Tasks | I/O-Intensive Tasks | Inter-process/Thread Communication |
|---|---|---|---|---|
| Multi-processing | High | Fast | Fast | Slow |
| Multi-threading | Low | Slow | Relatively Fast | Fast |
Note
The above analysis is theoretical. In practice, aggregating results across processes adds inter-process communication overhead, and real workloads are rarely purely compute-bound or I/O-bound, so each task should be benchmarked and tuned individually.
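For compute-bound Python workloads where the GIL becomes the bottleneck, inference can instead be distributed over a process pool. The sketch below is an illustration rather than the repository's example code: it creates an independent model in each worker process (TensorRT engines and CUDA contexts cannot be shared across process boundaries), and the image paths, pool size, and helper names are assumptions.

```python
import multiprocessing as mp

import cv2
from tensorrt_yolo.infer import InferOption, DetectModel

_model = None  # per-process model instance


def init_worker(engine_path: str) -> None:
    # Runs once in every worker process and builds an independent model.
    global _model
    option = InferOption()
    option.enable_swap_rb()
    _model = DetectModel(engine_path, option)


def infer(image_path: str) -> str:
    # Print inside the worker so the detection result never has to cross process boundaries.
    im = cv2.imread(image_path)
    result = _model.predict(im)
    print(f"==> {image_path}: {result}")
    return image_path


if __name__ == "__main__":
    # Placeholder image paths for illustration only.
    images = ["test_image_1.jpg", "test_image_2.jpg", "test_image_3.jpg"]
    # "spawn" avoids forking a process that has already initialized the CUDA runtime.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2, initializer=init_worker,
                  initargs=("yolo11n-with-plugin.engine",)) as pool:
        pool.map(infer, images)
```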
C++ multi-threading inference has significant advantages in terms of resource usage and performance, making it the best choice for concurrent inference.
Hardware Configuration:
- CPU: AMD Ryzen 7 5700X 8-Core Processor
- GPU: NVIDIA GeForce RTX 2080 Ti 22GB
Model: YOLO11x
| Number of Models | `model.clone()` | No Cloning |
|---|---|---|
| 1 | 408M | 408M |
| 2 | 536M | 716M |
| 3 | 662M | 1092M |
| 4 | 790M | 1470M |
Important
After using `model.clone()`, GPU memory usage still increases slightly, for the following reasons:

- **Pre-allocation of Input/Output GPU Memory**
  - Each `context` requires independent GPU memory buffers for model inputs and outputs. Even though multiple `contexts` share the same `engine`, input/output buffers need to be allocated separately, which consumes additional GPU memory.
  - The GPU memory usage of the input/output buffers depends on the shape and data type of the model's inputs and outputs. For larger inputs and outputs, this memory usage can be significant.
- **Context GPU Memory Overhead**
  - Each `context` is an independent execution context used to manage the execution flow of model inference. Even though the `engine` is shared, each `context` still requires additional GPU memory to store execution states, intermediate results, and temporary buffers.
  - The GPU memory usage of a `context` is typically much smaller than that of an `engine`, but it increases gradually as the number of `contexts` grows.
- **CUDA Runtime Overhead**
  - The CUDA runtime requires additional GPU memory for each `context` to manage execution flows, kernel launches, and memory operations.
  - TensorRT may use optimization techniques (e.g., memory reuse, kernel fusion) during inference, which can introduce additional GPU memory overhead.