RPC with tensor override #17075
Unanswered · yggdrasil75 asked this question in Q&A
I have a system with two 3090s and a system with a 395+. The 395+ doesn't have PCIe slots, but it runs Vulkan at decent speeds.

I want to run a larger MoE split across the two systems. With GLM Air I'm getting 191 t/s pp and 17 t/s tg, which is quite good for a 106B model at Q6_K.

I figured I could do better by using the 3090s as CUDA devices and putting ffn_up on the 395+, but that falls apart: I get 3 t/s pp and 4 t/s tg.
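For context, the failing run looked something like this (a reconstruction, not my exact command: the model path and address are placeholders, and I'm assuming the RPC buffer type is addressed by the `RPC[host:port]` name that `--list-devices` reports):

```sh
# Placeholder path/address. The 3090s are local CUDA0/CUDA1; the 395+ runs
# rpc-server, and every tensor whose name contains "ffn_up" is forced onto it.
llama-cli -m glm-air-q6_k.gguf \
  --rpc 192.168.1.50:50052 \
  -ngl 99 \
  -ot 'ffn_up=RPC[192.168.1.50:50052]'
```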
Is it possible to offload the tensors by layer per device? I.e., layers 1-14 on rpc0 with the ffn_up tensors from those layers in rpc0's system RAM, layers 15-28 on rpc1 with those layers' ffn_up in rpc1's system RAM, and the remaining layers entirely on vulkan0?
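The sketch below is the kind of per-layer `--override-tensor` mapping I have in mind, using regexes against the `blk.N` tensor names (illustrative only: the addresses are made up, the blk indices are 0-based, and I don't know whether an RPC host's system RAM can be targeted separately from its GPU buffer at all — that's essentially the question):

```sh
# Hypothetical: ffn_up for blk.0-blk.13 pinned to rpc0, blk.14-blk.27 to rpc1;
# everything not overridden is placed by the normal offload logic.
llama-cli -m glm-air-q6_k.gguf \
  --rpc 10.0.0.1:50052,10.0.0.2:50052 \
  -ngl 99 \
  -ot 'blk\.([0-9]|1[0-3])\.ffn_up=RPC[10.0.0.1:50052]' \
  -ot 'blk\.(1[4-9]|2[0-7])\.ffn_up=RPC[10.0.0.2:50052]'
```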