Load all MoE experts during warmup #11571

Open
wants to merge 2 commits into master
Conversation

fairydreaming (Collaborator)

This PR is a somewhat crude hack that allows loading all experts in MoE models during warmup.

The hacky part is the warmup detection: I explicitly examine the ubatch tokens to detect the warmup.
I couldn't find a better way to do it; let me know if one exists.

If the model is warming up, n_expert_used is set to n_expert, which causes all of the model's experts to be loaded into memory during warmup.

Fixes #11163
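
For illustration, here is a minimal, self-contained sketch of the idea described above (not the actual diff in this PR): a hypothetical is_warmup_ubatch() check stands in for the ubatch-token inspection, and the struct/field names (llama_ubatch, n_tokens, token) as well as the assumed two-token BOS/EOS warmup batch only loosely mirror llama.cpp.

```cpp
// Sketch only: detect the warmup ubatch by its token content and, if detected,
// route through every expert so all of them are paged into memory up front.
#include <cstdint>
#include <cstdio>

using llama_token = int32_t;

// Simplified stand-in for llama.cpp's micro-batch (names are assumptions).
struct llama_ubatch {
    int32_t            n_tokens;
    const llama_token *token;
};

// Hypothetical warmup check: assume the warmup decode submits a tiny batch
// consisting of just the BOS and EOS tokens, so comparing the ubatch against
// that pattern identifies it.
static bool is_warmup_ubatch(const llama_ubatch & ub, llama_token bos, llama_token eos) {
    return ub.n_tokens == 2 && ub.token[0] == bos && ub.token[1] == eos;
}

int main() {
    const llama_token bos = 1; // placeholder special-token IDs
    const llama_token eos = 2;

    const int32_t n_expert      = 256; // total experts in the model
    int32_t       n_expert_used = 8;   // experts routed per token in normal inference

    const llama_token  warmup_tokens[] = { bos, eos };
    const llama_ubatch ub              = { 2, warmup_tokens };

    // The core of the PR: during warmup, use all experts so every expert
    // tensor gets touched (and therefore loaded) once.
    if (is_warmup_ubatch(ub, bos, eos)) {
        n_expert_used = n_expert;
    }

    std::printf("n_expert_used = %d\n", n_expert_used);
    return 0;
}
```

Because the override applies only when the warmup batch is detected, normal inference still routes through the usual top-k experts.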

cpumaxx (Contributor) commented Feb 3, 2025

A quick test with R1 on llama-server shows all experts loaded into memory during warmup. Inference started immediately once the web interface was available.
I will also test a large non-MoE model to make sure there are no regressions in that case.
Thanks for this fix!


Successfully merging this pull request may close these issues.

Misc. bug: model warmup doesn't work correctly for MoE models