Labels: generation quality, performance, research 🔬
Description
Speculative sampling is explained here: https://arxiv.org/abs/2302.01318
In simpler terms here (a minimal sketch of the core accept/reject loop follows these links):
- Combine large LLM with small LLM for faster inference #630 (comment)
- Combine large LLM with small LLM for faster inference #630 (comment)
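
In case it helps, here is a minimal, self-contained sketch of the accept/reject rule from the paper, over a toy 4-token vocabulary. The `draft_probs`/`target_probs` functions and all constants are illustrative stand-ins, not llama.cpp API; a real implementation would condition both models on the accepted prefix and score all `K` drafted positions with a single batched pass of the main model:

```cpp
// Sketch of speculative sampling (Chen et al., arXiv:2302.01318).
// The two "models" are fixed toy distributions; real draft/target
// models would condition on the tokens generated so far.
#include <cstdio>
#include <random>
#include <vector>
#include <algorithm>

static std::mt19937 rng(42);

static std::vector<float> draft_probs (int /*pos*/) { return {0.50f, 0.30f, 0.15f, 0.05f}; }
static std::vector<float> target_probs(int /*pos*/) { return {0.40f, 0.35f, 0.20f, 0.05f}; }

static int sample(const std::vector<float> & p) {
    std::discrete_distribution<int> d(p.begin(), p.end());
    return d(rng);
}

int main() {
    const int K = 4; // tokens drafted per target pass
    std::uniform_real_distribution<float> unif(0.0f, 1.0f);

    // 1) draft K tokens cheaply with the small model
    std::vector<int> drafted;
    for (int i = 0; i < K; ++i) drafted.push_back(sample(draft_probs(i)));

    // 2) score all K positions with the target model (one batched pass
    //    in a real implementation), then accept/reject left to right
    std::vector<int> accepted;
    for (int i = 0; i < K; ++i) {
        const auto q = draft_probs(i);
        const auto p = target_probs(i);
        const int  x = drafted[i];
        if (unif(rng) < std::min(1.0f, p[x] / q[x])) {
            accepted.push_back(x);
        } else {
            // on rejection: resample from the residual max(0, p - q),
            // renormalized, and discard everything drafted after it
            std::vector<float> r(p.size());
            float sum = 0.0f;
            for (size_t t = 0; t < p.size(); ++t) { r[t] = std::max(0.0f, p[t] - q[t]); sum += r[t]; }
            for (auto & v : r) v /= sum;
            accepted.push_back(sample(r));
            break;
        }
    }

    printf("accepted %zu token(s):", accepted.size());
    for (int t : accepted) printf(" %d", t);
    printf("\n");
}
```

The key property, per the paper, is that both accepted tokens and the resampled token on rejection are distributed exactly according to the target model, so output quality is unchanged while the main model runs once per batch of drafted tokens.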
To start, the "draft" model can be produced with the train-text-from-scratch example, using the same vocab as LLaMA. Later, we can try utilizing better models.
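
For reference, such a draft could be trained roughly like the shakespeare run in that example's README; the flags below mirror that README at the time and may have changed since, so treat this as a sketch rather than an exact invocation:

```sh
./bin/train-text-from-scratch \
        --vocab-model ../models/ggml-vocab.bin \
        --ctx 64 --embd 256 --head 8 --layer 16 \
        --checkpoint-in  chk-shakespeare-256x16.bin \
        --checkpoint-out chk-shakespeare-256x16.bin \
        --model-out ggml-shakespeare-256x16-f32.bin \
        --train-data "shakespeare.txt" \
        -t 6 -b 16 -n 32 --seed 1 --adam-iter 16
```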
We also assume that batching multiple tokens with the "main" model is significantly faster than processing the tokens one-by-one. This may not yet be the case, but it will be when we close ggml-org/ggml#293
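
To make that requirement concrete, here is the back-of-the-envelope from the paper, assuming (as its analysis does) that acceptances are i.i.d. with per-token acceptance rate $\alpha$ and that $K$ tokens are drafted per target pass; both symbols are illustrative parameters, not measured values:

$$
\mathbb{E}[\text{tokens per target pass}] \;=\; \frac{1 - \alpha^{K+1}}{1 - \alpha}
$$

For example, with $\alpha = 0.8$ and $K = 4$, each batched target pass yields about $(1 - 0.8^5)/(1 - 0.8) \approx 3.36$ tokens, so speculation only wins if evaluating 5 tokens in one batch is meaningfully cheaper than 3-4 single-token passes.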