I have an issue serving the Llama2-70B-GGML model. The 70B Llama-2 model uses grouped-query attention (GQA), unlike the earlier 65B LLaMA model. llama.cpp handles this by having the user specify an n_gqa parameter in the model hyperparameters, which feels a little hacky 🤔
I would love to work on adding support for n_gqa in this crate; I think it can be added to the Llama model's hyperparameters, as sketched below:
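Here is a minimal sketch of what that could look like, assuming a llama.cpp-style hyperparameters struct. The names (`Hyperparameters`, `n_vocab`, `n_embd`, `n_head`, `n_layer`, `n_gqa`, `n_head_kv`) are my assumptions following llama.cpp conventions; the crate's actual Llama struct may be laid out differently.

```rust
/// Hypothetical sketch of Llama hyperparameters extended with `n_gqa`.
/// Field names follow llama.cpp conventions and are not the crate's
/// actual definition.
#[derive(Debug, Clone, Copy)]
struct Hyperparameters {
    /// Vocabulary size.
    n_vocab: usize,
    /// Embedding dimension.
    n_embd: usize,
    /// Number of attention (query) heads.
    n_head: usize,
    /// Number of transformer layers.
    n_layer: usize,
    /// Grouped-query attention factor: query heads per key/value head.
    /// 1 gives classic multi-head attention (e.g. 7B/13B); Llama-2 70B uses 8.
    n_gqa: usize,
}

impl Hyperparameters {
    /// Number of key/value heads implied by the GQA factor.
    fn n_head_kv(&self) -> usize {
        self.n_head / self.n_gqa
    }
}

fn main() {
    // Llama-2 70B: 64 query heads grouped over 8 key/value heads.
    let hp = Hyperparameters {
        n_vocab: 32_000,
        n_embd: 8192,
        n_head: 64,
        n_layer: 80,
        n_gqa: 8,
    };
    println!("kv heads: {}", hp.n_head_kv()); // prints: kv heads: 8
}
```

Since GGML model files at this point don't store the GQA factor, the value would presumably have to come from the user (as llama.cpp does with its parameter) rather than be read from the file, with a default of 1 to keep existing models working unchanged.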