@@ -89,8 +89,8 @@ Benchmarks run on an 8xA100-80GB, power limited to 330W with a hybrid cube mesh
 | Llama-2-70B | Base | OOM ||
 | | 8-bit | 19.13 | 1322.58 |
 | | 4-bit (G=32) | 25.25 | 1097.66 |
-| Llama-3-8B | Base | 93.95 | 1508.18 |
-| | 8-bit | 114.35 | 978.02 |
+| Llama-3-8B | Base | 94.25 | 1411.95 |
+| | 8-bit | 139.55 | 1047.23 |
 
 ### Speculative Sampling
 [Verifier: Llama-70B (int4), Draft: Llama-7B (int4)](./scripts/speculate_70B_int4.sh): 48.4 tok/s
@@ -106,10 +106,10 @@ Benchmarks run on an 8xA100-80GB, power limited to 330W with a hybrid cube mesh
 | | 2 | 21.32 | 1481.87 |
 | | 4 | 38.01 | 1340.76 |
 | | 8 | 62.50 | 1135.29 |
-| Llama-3-8B | 1 | 93.97 | 1508.46 |
-| | 2 | 149.44 | 1358.63 |
-| | 4 | 217.80 | 1218.76 |
-| | 8 | 271.03 | 1041.99 |
+| Llama-3-8B | 1 | 94.19 | 1411.76 |
+| | 2 | 150.48 | 1208.80 |
+| | 4 | 219.77 | 991.63 |
+| | 8 | 274.65 | 768.55 |
 
 ### Tensor Parallelism + Quantization
 | Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) |