 RUN cd /workspace/trtis-kaldi-backend && wget https://github.com/NVIDIA/tensorrt-inference-server/releases/download/v1.9.0/v1.9.0_ubuntu1804.custombackend.tar.gz -O custom-backend-sdk.tar.gz && tar -xzf custom-backend-sdk.tar.gz
-RUN cd /workspace/trtis-kaldi-backend/ && make && cp libkaldi-trtisbackend.so /workspace/model-repo/kaldi_online/1/ && cd - && rm -r /workspace/trtis-kaldi-backend
Kaldi/SpeechRecognition/README.md (+26 −28)
@@ -46,15 +46,18 @@ A reference model is used by all test scripts and benchmarks presented in this r
 Details about parameters can be found in the [Parameters](#parameters) section.

 * `model path`: Configured to use the pretrained LibriSpeech model.
+* `use_tensor_cores`: 1
+* `main_q_capacity`: 30000
+* `aux_q_capacity`: 400000
 * `beam`: 10
+* `num_channels`: 4000
 * `lattice_beam`: 7
 * `max_active`: 10,000
 * `frame_subsampling_factor`: 3
 * `acoustic_scale`: 1.0
-* `num_worker_threads`: 20
-* `max_execution_batch_size`: 256
-* `max_batch_size`: 4096
-* `instance_group.count`: 2
+* `num_worker_threads`: 40
+* `max_batch_size`: 400
+* `instance_group.count`: 1

 ## Setup

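For orientation, the decoder parameters listed in this hunk are passed to the custom Kaldi backend through the `parameters` map of `model-repo/kaldi_online/config.pbtxt`. The snippet below is a minimal sketch of how the updated values could be written in the TRTIS model-configuration text format; it is an illustration only, and the authoritative key list is the config file shipped in the repository.

```
# Sketch of decoder-related entries in model-repo/kaldi_online/config.pbtxt.
# Values follow the updated defaults listed above; key names mirror the
# parameter names and may differ from the shipped config file.
parameters [
  {
    key: "beam"
    value: { string_value: "10" }
  },
  {
    key: "lattice_beam"
    value: { string_value: "7" }
  },
  {
    key: "max_active"
    value: { string_value: "10000" }
  },
  {
    key: "num_channels"
    value: { string_value: "4000" }
  },
  {
    key: "use_tensor_cores"
    value: { string_value: "1" }
  },
  {
    key: "num_worker_threads"
    value: { string_value: "40" }
  }
]
```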
@@ -134,9 +137,8 @@ The model configuration parameters are passed to the model and have an impact o

 The inference engine configuration parameters configure the inference engine. They impact performance, but not accuracy.

-* `max_batch_size`: The maximum number of inference channels opened at a given time. If set to `4096`, then one instance will handle at most 4096 concurrent users.
+* `max_batch_size`: The size of one execution batch on the GPU. This parameter should be set as large as necessary to saturate the GPU, but not bigger. Larger batches will lead to a higher throughput, smaller batches to lower latency.
 * `num_worker_threads`: The number of CPU threads for the postprocessing CPU tasks, such as lattice determinization and text generation from the lattice.
-* `max_execution_batch_size`: The size of one execution batch on the GPU. This parameter should be set as large as necessary to saturate the GPU, but not bigger. Larger batches will lead to a higher throughput, smaller batches to lower latency.
 * `input.WAV_DATA.dims`: The maximum number of samples per chunk. The value must be a multiple of `frame_subsampling_factor * chunks_per_frame`.

 ### Inference process
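The engine-level settings described above (`max_batch_size`, `instance_group.count`, `input.WAV_DATA.dims`) are regular fields of the same `config.pbtxt` rather than entries in the `parameters` map. A minimal sketch follows, with the `data_type` and `dims` values assumed for illustration rather than taken from the shipped file:

```
# Sketch of engine-level fields in model-repo/kaldi_online/config.pbtxt.
# max_batch_size and instance_group.count use the updated values above;
# the WAV_DATA data_type and dims shown here are illustrative only.
name: "kaldi_online"
max_batch_size: 400
input [
  {
    name: "WAV_DATA"
    data_type: TYPE_FP32
    dims: [ 8160 ]   # must be a multiple of frame_subsampling_factor * chunks_per_frame
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
```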
@@ -156,7 +158,7 @@ The client can be configured through a set of parameters that define its behavio
 -u <URL for inference service and its gRPC port>
 -o : Only feed each channel at realtime speed. Simulates online clients.
 -p : Print text outputs
-
+-b : Print partial (best path) text outputs
 ```

 ### Input/Output
@@ -187,13 +189,8 @@ Even if only the best path is used, we are still generating a full lattice for b

 Support for Kaldi ASR models that are different from the provided LibriSpeech model is experimental. However, it is possible to modify the [Model Path](#model-path) section of the config file `model-repo/kaldi_online/config.pbtxt` to set up your own model.

-The models and Kaldi allocators are currently not shared between instances. This means that if your model is large, you may end up with not enough memory on the GPU to store two different instances. If that's the case,
-you can set `count` to `1` in the [`instance_group` section](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#instance-groups) of the config file.
-
 ## Performance

-The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference).
-

 ### Metrics

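To run a different Kaldi model, the path entries referenced by the [Model Path](#model-path) section of `model-repo/kaldi_online/config.pbtxt` are what need to change. The sketch below only illustrates the pattern; the key names are placeholders, not the keys used by the shipped backend, and the paths are hypothetical.

```
# Illustrative only: key names below are placeholders for the path entries
# found in the Model Path section of the shipped config.pbtxt. Replace the
# string_value paths with the files of your own Kaldi model.
parameters [
  {
    key: "nnet3_rxfilename"        # acoustic model (placeholder key)
    value: { string_value: "/data/models/my_model/final.mdl" }
  },
  {
    key: "fst_rxfilename"          # decoding graph HCLG.fst (placeholder key)
    value: { string_value: "/data/models/my_model/HCLG.fst" }
  },
  {
    key: "word_syms_rxfilename"    # word symbol table (placeholder key)
    value: { string_value: "/data/models/my_model/words.txt" }
  }
]
```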
@@ -207,8 +204,7 @@ Latency is defined as the delay between the availability of the last chunk of au
 4. *Server:* Compute inference of last chunk
 5. *Server:* Generate the raw lattice for the full utterance
 6. *Server:* Determinize the raw lattice
-7. *Server:* Generate the text output associated with the best path in the determinized lattice
-8. *Client:* Receive text output
+8. *Client:* Receive lattice output
 9. *Client:* Call callback with output
 10. ***t1** <- Current time*

@@ -219,20 +215,18 @@ The latency is defined such as `latency = t1 - t0`.
 Our results were obtained by:

 1. Building and starting the server as described in [Quick Start Guide](#quick-start-guide).
-2. Running `scripts/run_inference_all_v100.sh` and `scripts/run_inference_all_t4.sh`
-
-
-| GPU | Realtime I/O | Number of parallel audio channels | Throughput (RTFX) | Latency ||||