🚀 The feature, motivation and pitch
I've been dabbling with integrating local models into desktop apps, and one thing I found quite handy for development was being able to fetch prebuilt binaries for llama.cpp from their releases page. These can easily be dropped into a desktop application during development and spawned from the main process of whatever framework is in use (e.g. Electron, Wails, Tauri). Running the llama-server binary then allows application developers to utilise local LMs with simple API calls.
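For reference, here's roughly what that workflow looks like today from an Electron main process. This is only a sketch: the binary path, model file, and port are placeholders, and the `/v1/chat/completions` call relies on llama-server's OpenAI-compatible API.

```ts
import { spawn } from "node:child_process";
import path from "node:path";

// Spawn the prebuilt llama-server binary shipped alongside the app.
// The resources/bin layout and model filename are placeholders for this sketch.
const serverPath = path.join(process.resourcesPath, "bin", "llama-server");
const server = spawn(serverPath, ["-m", "models/llama.gguf", "--port", "8080"]);

server.stderr.on("data", (chunk) => console.log(`[llama-server] ${chunk}`));

// Once the server is up, the app talks to it over plain HTTP via
// llama-server's OpenAI-compatible chat completions endpoint.
async function complete(prompt: string): Promise<string> {
  const res = await fetch("http://127.0.0.1:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data: any = await res.json();
  return data.choices[0].message.content;
}

// Kill the child process on quit so no orphaned server is left behind.
process.on("exit", () => server.kill());
```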
In a similar vein, I was wondering whether it would be possible to do two things:
- Design an "ET server" binary which, when run with any `.pte` file, essentially serves the model for inference using the ET runtime (perhaps it already exists and I may have not found it). The server could be spawned using something like `et-server --model my_model.pte --port 8080` (a rough usage sketch follows this list).
- Distribute `et-server` as a precompiled binary for different targets (Windows, macOS, Linux, etc.) via GH Releases or another channel. I think it could make it easier for developers to bundle and distribute et-server in their desktop applications without needing to build from source.
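To make the first point more concrete, here's a purely hypothetical sketch of how a desktop app might consume such a binary. `et-server` doesn't exist today; the flags just mirror the command above, and the `/generate` endpoint and request shape are made up for illustration.

```ts
import { spawn, type ChildProcess } from "node:child_process";

// Hypothetical: spawn the bundled et-server binary with a .pte model.
// Neither the binary nor its HTTP API exists yet; the flags mirror the
// proposal above (--model, --port) and everything else is illustrative.
function startEtServer(modelPath: string, port: number): ChildProcess {
  return spawn("et-server", ["--model", modelPath, "--port", String(port)]);
}

// Hypothetical request shape: a generic "inputs" payload, since .pte models
// are not all text-in/text-out (which is exactly the hard part noted below).
async function runInference(port: number, inputs: unknown): Promise<unknown> {
  const res = await fetch(`http://127.0.0.1:${port}/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ inputs }),
  });
  return res.json();
}

const server = startEtServer("my_model.pte", 8080);
process.on("exit", () => server.kill());
```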
I can see how the first point might be particularly challenging, especially since the nature of inputs can vary depending on the model and its task. Perhaps it would be better suited for torchchat.
Either way, just some thoughts and would love to know if there are better ways to handle this pain point of bringing the ET runtime closer to app dev! Thanks for reading :)
cc @mergennachin @iseeyuan @lucylq @helunwencser @tarun292 @kimishpatel @jackzhxng