
Commit 8afe37f
update readme for GPU
RJKeevil authored and riccardopinosio committed Apr 25, 2024
1 parent 06e8d2c commit 8afe37f
Showing 1 changed file with 13 additions and 0 deletions: README.md
@@ -12,6 +12,8 @@ The goal of this library is to provide an easy, scalable, and hassle-free way to
2. Hassle-free and performant production use: we exclusively support ONNX exports of huggingface models. PyTorch transformer models that don't have an ONNX version can be easily exported to ONNX via [huggingface optimum](https://huggingface.co/docs/optimum/index) and then used with the library
3. Run on your hardware: this library is for those who want to run transformer models tightly coupled with their Go applications, without the performance drawbacks of hitting a REST API or the hassle of setting up and maintaining, for example, a Python RPC service that talks to Go.
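As a concrete example of the optimum export mentioned above, a hub model can be converted to ONNX from the command line (the model name below is purely illustrative; any PyTorch transformer on the hub works the same way):

```shell
# Install the optimum exporters and convert a hub model to ONNX.
# The model name and output directory are illustrative.
pip install "optimum[exporters]"
optimum-cli export onnx --model distilbert-base-uncased-finetuned-sst-2-english ./onnx-model/
```

The resulting `./onnx-model/` directory contains the `.onnx` file plus the tokenizer configuration, which is what the library consumes.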

We support all GPU/accelerator backends supported by ONNXRuntime.

## Why

Developing and fine-tuning transformer models with the huggingface Python library is a great experience, but if your production stack is golang-based, reliably deploying and scaling the resulting PyTorch models can be challenging and requires quite some setup. This library aims to let you lift and shift your Python model and use the same huggingface pipelines you use for development for inference in a Go application.
@@ -32,6 +34,15 @@ Implementations for additional pipelines will follow. We also very gladly accept

Hugot can be used both as a library and as a command-line application. See below for usage instructions.

Hugot now also supports the following accelerator backends:
- CUDA (tested)
- TensorRT (untested)
- DirectML (untested)
- CoreML (untested)
- OpenVINO (untested)

Please help us out by testing the untested options above and providing feedback, good or bad!
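Selecting one of these backends at session creation might look like the sketch below. This is illustrative pseudocode only: the option name `WithCuda` and its parameters are assumptions, so check the library's options documentation for the real API.

```go
// Hypothetical sketch: create a session on the CUDA backend.
// WithCuda and its parameter map are illustrative, not the verified API.
session, err := hugot.NewSession(
    hugot.WithCuda(map[string]string{
        "device_id": "0",
    }),
)
if err != nil {
    panic(err)
}
defer session.Destroy()
```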

## Limitations

Apart from the fact that only the aforementioned pipelines are currently implemented, the current limitations are:
@@ -196,6 +207,8 @@ session, err := hugot.NewSession(

Setting InterOpNumThreads and IntraOpNumThreads to 1 restricts each goroutine's call to a single core, greatly reducing locking and cache penalties. Disabling CpuMemArena and MemPattern skips pre-allocation of some memory structures, which increases latency but improves throughput efficiency.
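With those settings, inference can be fanned out across goroutines, one call per core. A minimal self-contained sketch of that pattern (here `infer` is just a stand-in for a real pipeline call, not the library's API):

```go
package main

import (
	"fmt"
	"sync"
)

// infer is a stand-in for a real pipeline call; with InterOpNumThreads
// and IntraOpNumThreads set to 1, each such call stays on a single core,
// so fanning calls out across goroutines scales across cores.
func infer(input string) string {
	return "label:" + input
}

// runParallel runs infer on each input in its own goroutine,
// preserving input order in the results slice.
func runParallel(inputs []string) []string {
	results := make([]string, len(inputs))
	var wg sync.WaitGroup
	for i, in := range inputs {
		wg.Add(1)
		go func(i int, in string) {
			defer wg.Done()
			results[i] = infer(in)
		}(i, in)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(runParallel([]string{"a", "b", "c"}))
}
```

Each goroutine writes to its own index of the results slice, so no mutex is needed around the writes.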

For GPU, the config above also applies. We are still determining the optimal GPU configuration: whether it is better to run in parallel or with a single thread, and what input batch size is fastest.

## Contributing

### Development environment
