Running multiple tiny models in parallel on a single GPU #2017
-
Hey, technically yes, but it gets weird real fast. You can run multiple llama-cpp-python instances on the same GPU as long as their combined VRAM footprint fits, but the driver time-slices kernel execution between the processes rather than running them truly in parallel, so per-instance throughput drops as you add more.
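If you want to see the effect directly, here is a rough sketch of a timing test; the model path, quantization, and context size are placeholders, not anything from your setup:

```python
# Sketch: spawn N worker processes, each loading its own copy of a small GGUF
# model on the same GPU, and time the overlapping chat-completion calls.
import time
from multiprocessing import Process, set_start_method

from llama_cpp import Llama

MODEL_PATH = "Qwen3-0.6B-Q4_K_M.gguf"  # placeholder path to a quantized model


def worker(idx: int) -> None:
    # Each process gets its own model copy and its own CUDA context.
    llm = Llama(model_path=MODEL_PATH, n_gpu_layers=-1, n_ctx=2048, verbose=False)
    start = time.time()
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": f"Worker {idx}: say hello."}]
    )
    print(idx, round(time.time() - start, 2),
          out["choices"][0]["message"]["content"][:40])


if __name__ == "__main__":
    # "spawn" avoids forking a process that might already hold GPU state.
    set_start_method("spawn", force=True)
    procs = [Process(target=worker, args=(i,)) for i in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

If the total wall time is close to ten times a single run, the processes are being serialized on the GPU rather than overlapping.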
That said, if you're trying to simulate multi-agent reasoning across terminals (or multi-model chatrooms), I've been testing a setup with a semantic reasoning layer on top, where each agent holds stable logic even across split terminals. It might be overkill for now, but if hallucination or memory drift shows up later, that layer can save your life.
-
Hey! Yep, that link is correct: https://github.com/onestardao/WFGY

Since you're exploring multi-agent setups, here are a few things WFGY helps with; you might want to double-check them against your use case:

- Do you want each agent to retain separate memory traces or converge toward a shared mental state?
- Have you seen hallucinations or memory drift across agents (especially in tool-calling or self-reflection loops)?
- Are agents passing tasks to one another asynchronously?

WFGY's semantic layer was built exactly for this kind of logic anchoring; we run it as a reasoning validator between agents, memory handlers, and output injectors. If you're curious, our current Problem Map lists common bugs we've solved (hallucination loops, memory recursion, broken tool chains, etc.). Let me know what you're building, and I'd be happy to help match the module to your setup!
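Just to make the idea concrete, here is a generic sketch of what a validation hook between agents can look like; the names and checks are hypothetical illustrations of the pattern, not WFGY's actual modules:

```python
# Generic sketch of a validation hook placed between agents; all names and
# checks here are hypothetical and not taken from WFGY or any specific library.
from typing import Callable


def make_validator(anchors: list[str]) -> Callable[[str, str], str]:
    """Return a hook that flags agent output drifting away from shared anchors."""
    def validate(agent_name: str, output: str) -> str:
        missing = [a for a in anchors if a.lower() not in output.lower()]
        if missing:
            # A real system would re-prompt or route to a repair step;
            # here we only flag the drift.
            return f"[{agent_name}] drift detected, missing anchors: {missing}"
        return output
    return validate


validate = make_validator(anchors=["task id", "source"])
print(validate("planner", "Plan ready. task id: 42, source: user brief"))
print(validate("executor", "Done."))  # flags drift: no anchors present
```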
-
I have an Nvidia Tesla GPU with 32 GB of VRAM. I can instantiate 10 instances of llama-cpp-python with Qwen3-0.6B on that single GPU in different terminal sessions. My question is: do these models run in parallel if `create_chat_completion` is invoked concurrently in the 10 separate terminal instances?
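Roughly, each terminal session runs something like the following; the GGUF path, context size, and prompt below are placeholders rather than my exact script:

```python
# Sketch of the per-terminal script assumed in the question.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-0.6B-Q4_K_M.gguf",  # placeholder path to the quantized model
    n_gpu_layers=-1,                        # offload all layers to the Tesla GPU
    n_ctx=2048,
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what llama.cpp does."}]
)
print(resp["choices"][0]["message"]["content"])
```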