---
title: Benchmarking via onnxruntime_perf_test
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

Now that you have set up and run the ONNX model (for example, SqueezeNet), you can benchmark its inference performance using Python-based timing or a dedicated tool such as **onnxruntime_perf_test**. This helps you evaluate ONNX Runtime efficiency on Azure Arm64-based Cobalt 100 instances.

You can also compare inference times between Cobalt 100 (Arm64) and similar D-series x86_64-based virtual machines on Azure.
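If you want a quick Python-based timing check before building the native benchmarking tool, the sketch below times repeated inferences with the onnxruntime Python package. It assumes the `squeezenet-int8.onnx` model from the earlier steps is in the current directory and that the model takes a 1x3x224x224 float32 input; adjust the path and shape to match your setup.

```python
# Minimal Python timing sketch (assumes onnxruntime and numpy are installed,
# and that squeezenet-int8.onnx is in the current directory).
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("squeezenet-int8.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

# SqueezeNet typically expects a 1x3x224x224 float32 tensor; adjust if your model differs.
data = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Warm-up run so session initialization is not counted in the timings.
session.run(None, {input_name: data})

latencies = []
for _ in range(100):
    start = time.perf_counter()
    session.run(None, {input_name: data})
    latencies.append(time.perf_counter() - start)

print(f"Average inference time: {1000 * sum(latencies) / len(latencies):.3f} ms")
print(f"Throughput: {len(latencies) / sum(latencies):.2f} inferences/sec")
```

This gives a rough baseline only; the **onnxruntime_perf_test** tool below reports richer statistics such as percentile latencies and CPU usage.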

## Run the performance tests using onnxruntime_perf_test
**onnxruntime_perf_test** is a performance benchmarking tool included in the ONNX Runtime source code. It measures the inference performance of ONNX models under various runtime conditions, such as the CPU, GPU, or other execution providers.

### Install Required Build Tools

```console
sudo apt update
sudo apt install -y build-essential cmake git unzip pkg-config
sudo apt install -y protobuf-compiler libprotobuf-dev libprotoc-dev
```
Then verify the protobuf compiler installation:
```console
protoc --version
```
You should see output similar to:

```output
libprotoc 3.21.12
```
### Build ONNX Runtime from Source

The benchmarking tool, **onnxruntime_perf_test**, isn't distributed as a pre-built binary for any platform, so you must build it from source. Expect the build to take around 40-50 minutes.

Clone the onnxruntime repository:
```console
git clone --recursive https://github.com/microsoft/onnxruntime
cd onnxruntime
```
Now build the benchmark tool:

```console
./build.sh --config Release --build_dir build/Linux --build_shared_lib --parallel --build --update --skip_tests
```
This builds the benchmark tool at ./build/Linux/Release/onnxruntime_perf_test.

### Run the benchmark
Now that the benchmarking tool has been built, you can benchmark the **squeezenet-int8.onnx** model as follows:

```console
./build/Linux/Release/onnxruntime_perf_test -e cpu -r 100 -m times -s -Z -I <path-to-squeezenet-int8.onnx>
```
- **-e cpu**: Use the CPU execution provider (not the GPU or any other backend).
- **-r 100**: Run 100 inferences.
- **-m times**: Use "repeat N times" mode.
- **-s**: Show detailed statistics.
- **-Z**: Disable intra-op thread spinning, which reduces CPU usage when threads are idle between runs (see the sketch after this list).
- **-I**: Provide the ONNX model path directly, without using input/output test data.
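If you later reproduce this benchmark from Python instead of the native tool, the `-Z` behavior can be approximated through a session configuration entry. The sketch below is illustrative only: it assumes the `session.intra_op.allow_spinning` config key available in recent ONNX Runtime releases and reuses the `squeezenet-int8.onnx` path from earlier, so verify both against the version you build.

```python
# Rough Python counterpart of the -Z flag: disable intra-op thread spinning
# through a session config entry (key assumed from recent ONNX Runtime releases).
import onnxruntime as ort

opts = ort.SessionOptions()
opts.add_session_config_entry("session.intra_op.allow_spinning", "0")

# Create the session with spinning disabled, using the CPU execution provider.
session = ort.InferenceSession(
    "squeezenet-int8.onnx",  # assumed model path from the earlier steps
    sess_options=opts,
    providers=["CPUExecutionProvider"],
)
```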

Running **onnxruntime_perf_test** produces output similar to:

```output
Disabling intra-op thread spinning between runs
Session creation time cost: 0.0102016 s
First inference time cost: 2 ms
Total inference time cost: 0.185739 s
Total inference requests: 100
Average inference time cost: 1.85739 ms
Total inference run time: 0.18581 s
Number of inferences per second: 538.184
Avg CPU usage: 96 %
Peak working set size: 36696064 bytes
Avg CPU usage:96
Peak working set size:36696064
Runs:100
Min Latency: 0.00183404 s
Max Latency: 0.00190312 s
P50 Latency: 0.00185674 s
P90 Latency: 0.00187215 s
P95 Latency: 0.00187393 s
P99 Latency: 0.00190312 s
P999 Latency: 0.00190312 s
```
### Benchmark Metrics Explained

- **Average Inference Time**: The mean time taken to process a single inference request across all runs. Lower values indicate faster model execution.
- **Throughput**: The number of inference requests processed per second. Higher throughput reflects the model's ability to handle larger workloads efficiently.
- **CPU Utilization**: The percentage of CPU resources used during inference. A value close to 100% indicates full CPU usage, which is expected during performance benchmarking.
- **Peak Memory Usage**: The maximum amount of system memory (RAM) consumed during inference. Lower memory usage is beneficial for resource-constrained environments.
- **P50 Latency (Median Latency)**: The time below which 50% of inference requests complete. It represents typical latency under normal load (the sketch after this list shows how the percentile statistics are derived from per-run latencies).
- **Latency Consistency**: The stability of latency values across all runs. "Consistent" indicates predictable inference performance with minimal jitter.
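To make the relationship between the per-run measurements and the reported statistics concrete, here is a small sketch that computes the same kind of summary from a list of recorded latencies. The sample values are placeholders, not figures taken from the benchmark output above.

```python
# Illustrative computation of the summary statistics from per-run latencies (in seconds).
# The latency values below are placeholders, not results from the benchmark run above.
import numpy as np

latencies = np.array([0.00184, 0.00185, 0.00186, 0.00187, 0.00190])

average_ms = 1000 * latencies.mean()
throughput = len(latencies) / latencies.sum()  # inferences per second
p50, p90, p95, p99 = np.percentile(latencies, [50, 90, 95, 99])

print(f"Average inference time: {average_ms:.3f} ms")
print(f"Throughput: {throughput:.2f} inferences/sec")
print(f"P50: {1000 * p50:.3f} ms, P90: {1000 * p90:.3f} ms, "
      f"P95: {1000 * p95:.3f} ms, P99: {1000 * p99:.3f} ms")
```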

### Benchmark summary on Arm64
Here is a summary of the benchmark results collected on an Arm64 **D4ps_v6 Ubuntu Pro 24.04 LTS virtual machine**.

| **Metric**                 | **Value**                     |
|----------------------------|-------------------------------|
| **Average Inference Time** | 1.857 ms                      |
| **Throughput**             | 538.18 inferences/sec         |
| **CPU Utilization**        | 96%                           |
| **Peak Memory Usage**      | 36.70 MB                      |
| **P50 Latency**            | 1.857 ms                      |
| **P90 Latency**            | 1.872 ms                      |
| **P95 Latency**            | 1.874 ms                      |
| **P99 Latency**            | 1.903 ms                      |
| **P999 Latency**           | 1.903 ms                      |
| **Max Latency**            | 1.903 ms                      |
| **Latency Consistency**    | Consistent                    |

### Benchmark summary on x86
Here is a summary of the benchmark results collected on an x86_64 **D4s_v6 Ubuntu Pro 24.04 LTS virtual machine**.

| **Metric**                 | **Value**                     |
|----------------------------|-------------------------------|
| **Average Inference Time** | 1.413 ms                      |
| **Throughput**             | 707.48 inferences/sec         |
| **CPU Utilization**        | 100%                          |
| **Peak Memory Usage**      | 38.80 MB                      |
| **P50 Latency**            | 1.396 ms                      |
| **P90 Latency**            | 1.501 ms                      |
| **P95 Latency**            | 1.520 ms                      |
| **P99 Latency**            | 1.794 ms                      |
| **P999 Latency**           | 1.794 ms                      |
| **Max Latency**            | 1.794 ms                      |
| **Latency Consistency**    | Consistent                    |

### Highlights from Ubuntu Pro 24.04 Arm64 Benchmarking

When comparing the results on Arm64 and x86_64 virtual machines:
- **Low-Latency Inference:** Average inference times on Arm64 were a consistent ~1.86 ms.
- **Strong and Stable Throughput:** The `squeezenet-int8.onnx` model sustained over 538 inferences/sec on D4ps_v6 instances.
- **Lightweight Resource Footprint:** Peak memory usage stayed below 37 MB, with CPU utilization around 96%, which is well suited to efficient edge or cloud inference.
- **Consistent Performance:** P50, P95, and Max latency remained tightly bound, showing reliable performance on Azure Cobalt 100 Arm-based infrastructure.

You have now benchmarked ONNX Runtime on an Azure Cobalt 100 Arm64 virtual machine and compared the results with x86_64.