Commit

Add number for latency comparison (#4612)
This PR adds latency comparison
tohtana authored Nov 3, 2023
1 parent 58d3b65 commit ff53c22
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions blogs/deepspeed-fastgen/README.md
@@ -165,10 +165,12 @@ When vLLM preempts the ongoing generation of previous requests, the generation l
### D. Token Level Timing Analysis

Figure 5 displays the P50, P90, and P95 latencies of the generation processes. Both vLLM and DeepSpeed-FastGen exhibit similar P50 latencies, but vLLM demonstrates significantly higher latencies for P90 and P95.
Regarding the P95 latencies, DeepSpeed-FastGen achieved a 3.7x reduction relative to vLLM.

This discrepancy stems from a noticeable spike in vLLM's generation latency whenever it preempts ongoing generation to process new prompts.
In contrast, DeepSpeed-FastGen typically processes new prompts and the generation for previous requests concurrently, leading to much more consistent generation latency.
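
To make the percentile comparison concrete, the following is a minimal sketch of how P50, P90, and P95 can be computed from per-token latencies. It is not the benchmark script used for Figure 5: the helper names `percentile` and `latency_summary` are hypothetical, the nearest-rank definition is one of several common percentile conventions, and the latency values are made up purely for illustration.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest sample such that at least p% of all
    samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    if not (isinstance(p, int) and 1 <= p <= 100):
        raise ValueError("p must be an integer between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_summary(samples):
    """Summarize a list of latencies by the percentiles discussed above."""
    return {f"P{p}": percentile(samples, p) for p in (50, 90, 95)}


# Hypothetical per-token latencies in milliseconds (illustrative only).
# "steady" has no spikes; "spiky" stalls on 8 of every 100 tokens.
steady = [20.0] * 100
spiky = [20.0] * 92 + [200.0] * 8

steady_summary = latency_summary(steady)
spiky_summary = latency_summary(spiky)

# How many times higher the spiky P95 is than the steady P95.
p95_ratio = spiky_summary["P95"] / steady_summary["P95"]

print(steady_summary)
print(spiky_summary)
print(p95_ratio)
```

In this toy data the two series share the same P50 and P90, while the occasional stalls show up only at P95. That is the general pattern the text describes: infrequent spikes leave the median untouched but dominate the tail percentiles.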


<div align="center">
<img src="assets/images/token_latency.png" alt="" width="400"/><br>
