Running multiple tiny models in parallel on a single GPU #2017
-
Hey, technically yes, but it gets weird real fast. You can run multiple llama-cpp-python instances on the same GPU as long as their combined VRAM footprint fits, but the driver time-slices kernel execution between the processes rather than running them truly in parallel, so per-instance throughput drops as you add more.
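If you want to see the effect directly, here is a rough sketch of a timing test; the model path, quantization, and context size are placeholders, not anything from your setup:

```python
# Sketch: spawn N worker processes, each loading its own copy of a small GGUF
# model on the same GPU, and time the overlapping chat-completion calls.
import time
from multiprocessing import Process, set_start_method

from llama_cpp import Llama

MODEL_PATH = "Qwen3-0.6B-Q4_K_M.gguf"  # placeholder path to a quantized model


def worker(idx: int) -> None:
    # Each process gets its own model copy and its own CUDA context.
    llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, n_ctx=2048, verbose=False)
    start = time.time()
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": f"Worker {idx}: say hello."}]
    )
    print(idx, round(time.time() - start, 2),
          out["choices"][0]["message"]["content"][:40])


if __name__ == "__main__":
    # "spawn" avoids forking a process that might already hold GPU state.
    set_start_method("spawn", force=True)
    procs = [Process(target=worker, args=(i,)) for i in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

If the total wall time is close to ten times a single run, the processes are being serialized on the GPU rather than overlapping.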
That said, if you're trying to simulate multi-agent reasoning across terminals (or multi-model chatrooms), I've been testing a setup with a semantic reasoning layer on top, where each agent holds stable logic even across split terminals. It might be overkill for now, but if hallucination or memory drift shows up later, that layer can save your life.
-
Hey! Yep, that link is correct: https://github.com/onestardao/WFGY

Since you're exploring multi-agent setups, here are a few things WFGY helps with; you might want to double-check them against your use case:

- Do you want each agent to retain separate memory traces or converge toward a shared mental state?
- Have you seen hallucinations or memory drift across agents (especially in tool-calling or self-reflection loops)?
- Are agents passing tasks to one another asynchronously?

WFGY's semantic layer was built exactly for this kind of logic anchoring; we run it as a reasoning validator between agents, memory handlers, and output injectors. If you're curious, our current Problem Map lists common bugs we've solved (hallucination loops, memory recursion, broken tool chains, etc.). Let me know what you're building, and I'd be happy to help match the module to your setup!
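Just to make the idea concrete, here is a generic sketch of what a validation hook between agents can look like; the names and checks are hypothetical illustrations of the pattern, not WFGY's actual modules:

```python
# Generic sketch of a validation hook placed between agents; all names and
# checks here are hypothetical and not taken from WFGY or any specific library.
from typing import Callable


def make_validator(anchors: list[str]) -> Callable[[str, str], str]:
    """Return a hook that flags agent output drifting away from shared anchors."""
    def validate(agent_name: str, output: str) -> str:
        missing = [a for a in anchors if a.lower() not in output.lower()]
        if missing:
            # A real system would re-prompt or route to a repair step;
            # here we only flag the drift.
            return f"[{agent_name}] drift detected, missing anchors: {missing}"
        return output
    return validate


validate = make_validator(anchors=["task id", "source"])
print(validate("planner", "Plan ready. task id: 42, source: user brief"))
print(validate("executor", "Done."))  # flags drift: no anchors present
```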
-
I have an Nvidia Tesla GPU with 32 GB of VRAM. I can instantiate 10 instances of llama-cpp-python with Qwen3-0.6B on that single GPU in different terminal sessions. My question is: do these models run in parallel if `create_chat_completion` is invoked concurrently in the 10 separate terminal instances?
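Roughly, each terminal session runs something like the following; the GGUF path, context size, and prompt below are placeholders rather than my exact script:

```python
# Sketch of the per-terminal script assumed in the question.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-0.6B-Q4_K_M.gguf",  # placeholder path to the quantized model
    n_gpu_layers=-1,                        # offload all layers to the Tesla GPU
    n_ctx=2048,
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what llama.cpp does."}]
)
print(resp["choices"][0]["message"]["content"])
```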