Load all MoE experts during warmup #11571

Open
wants to merge 2 commits into master
Conversation

fairydreaming (Collaborator)

This PR is a somewhat crude hack that allows loading all experts in MoE models during warmup.

The hacky part is the warmup detection: I explicitly examine the ubatch tokens to detect the warmup.
I couldn't find a better way to do it; let me know if one exists.

If the model is warming up, n_expert_used is set to n_expert, which causes all of the model's experts to be loaded into memory during warmup.

Fixes #11163
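
For illustration, here is a minimal, self-contained sketch of the idea described above (not the actual diff in this PR): a hypothetical is_warmup_ubatch() check stands in for the ubatch-token inspection, and the struct/field names (llama_ubatch, n_tokens, token) as well as the assumed two-token BOS/EOS warmup batch only loosely mirror llama.cpp.

```cpp
// Sketch only: detect the warmup ubatch by its token content and, if detected,
// route through every expert so all of them are paged into memory up front.
#include <cstdint>
#include <cstdio>

using llama_token = int32_t;

// Simplified stand-in for llama.cpp's micro-batch (names are assumptions).
struct llama_ubatch {
    int32_t            n_tokens;
    const llama_token *token;
};

// Hypothetical warmup check: assume the warmup decode submits a tiny batch
// consisting of just the BOS and EOS tokens, so comparing the ubatch against
// that pattern identifies it.
static bool is_warmup_ubatch(const llama_ubatch & ub, llama_token bos, llama_token eos) {
    return ub.n_tokens == 2 && ub.token[0] == bos && ub.token[1] == eos;
}

int main() {
    const llama_token bos = 1; // placeholder special-token IDs
    const llama_token eos = 2;

    const int32_t n_expert      = 256; // total experts in the model
    int32_t       n_expert_used = 8;   // experts routed per token in normal inference

    const llama_token  warmup_tokens[] = { bos, eos };
    const llama_ubatch ub              = { 2, warmup_tokens };

    // The core of the PR: during warmup, use all experts so every expert
    // tensor gets touched (and therefore loaded) once.
    if (is_warmup_ubatch(ub, bos, eos)) {
        n_expert_used = n_expert;
    }

    std::printf("n_expert_used = %d\n", n_expert_used);
    return 0;
}
```

Because the override applies only when the warmup batch is detected, normal inference still routes through the usual top-k experts.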

cpumaxx (Contributor) commented Feb 3, 2025

A quick test with R1 on llama-server shows all experts loaded into memory during warmup. Inference started immediately once the web interface was available.
I will also test a large non-MoE model to make sure there are no regressions in that case.
Thanks for this fix!


Successfully merging this pull request may close these issues.

Misc. bug: model warmup doesn't work correctly for MoE models