Commit b3aea52

Support for GQA and Llama2-70b

1 parent e61d4d3 · commit b3aea52

File tree: 8 files changed, +94 -61 lines

README.md

Lines changed: 9 additions & 29 deletions
@@ -161,10 +161,12 @@ WikiText, so scores are not necessarily comparable to other Llama benchmarks.
 Since many seem to be interested in running 65B models, I can confirm that this works with two 24 GB GPUs. The
 following benchmarks are from a 4090 + 3090-Ti with `-gs 17.2,24`:

-| Model    | Size | groupsize | act | Seq. len.            | VRAM      | Prompt    | Best   | Worst  | Ppl  |
-|----------|------|-----------|-----|----------------------|-----------|-----------|--------|--------|------|
-| Llama    | 65B  | 128       | yes | 2,048 t              | 39,804 MB | 1,109 t/s | 20 t/s | 18 t/s | 4.20 |
-| Llama    | 65B  | 32        | yes | 2,048 t              | 43,424 MB | 1,037 t/s | 17 t/s | 16 t/s | 4.11 |
+| Model   | Size | groupsize | act | Seq. len.      | VRAM      | Prompt    | Best   | Worst   | Ppl   |
+|---------|------|-----------|-----|----------------|-----------|-----------|--------|---------|-------|
+| Llama   | 65B  | 128       | yes | 2,048 t        | 39,804 MB | 1,109 t/s | 20 t/s | 18 t/s  | 4.20  |
+| Llama   | 65B  | 32        | yes | 2,048 t        | 43,424 MB | 1,037 t/s | 17 t/s | 16 t/s  | 4.11  |
+| Llama-2 | 70B  | 128       | yes | 2,048 t        | 40,680 MB | 1,037 t/s | 17 t/s | 14 t/s  | 4.15  |
+| Llama-2 | 70B  | 32        | yes | 2,048 t        | 36,815 MB | 1,037 t/s | 15 t/s | 12 t/s  | 4.10  |


 ### Testing long sequences
@@ -192,28 +194,6 @@ confirmed to be working right now.

 ## Recent updates

-**2023-06-02**: Web UI is now in a fairly working state. Expect it to be a little scuffed in places. There will be a
-rewrite at some point to make the client-side code less seizure-inducing. It has multibot mode, chat rewind and editing
-features, sessions, and more. I'm going to build it out with support for instruct prompting and such, in time.
-
-**2023-06-04**: Refactored a whole bunch to move more of the work into the extension, setting up for more tuning
-options to come soon and eventually auto tuning. Also optimized a little, for about a 5% speedup.
-
-**2023-06-06**: Some minor optimizations. Also it should now compile the extension more easily and run more seamlessly
-on Windows.
-
-**2023-06-09**: Fused most of the self-attention step. More to come. Slight speedup already, but more importantly went
-from 69% actual CPU utilization to 37%. This should do a lot to address the bottleneck on CPUs with lower
-single-threaded performance.
-
-**2023-06-10**: Docker support now! And some minor optimizations. Cleaned up the project a bit.
-
-**2023-06-11**: Added some concurrency a couple of places. It's only beneficial on the 4090, on small models where the
-cores are somewhat underutilized and the L2 cache can keep up. For the 3090 it's detrimental to performance, so it's
-disabled by default. YMMV. Use `-cs` to try it out.
-
-**2023-06-17**: Fixed a nasty bug in the fused attention that was causing slightly incorrect cache states on 13B and
-33B models. You definitely want to update.
-
-**2023-06-18**: LoRA support now. Still needs a lot of testing and some optimization, and currently you can't stack
-multiple LoRAs during the same inference. There's also no support in the web UI yet.
+**2023-07-19**: Added support for grouped-query attention and Llama-2 70b. There's still a bit of optimization to do,
+since it slows down considerably on very long sequences despite GQA having the potential to be faster. Also could use
+some more thorough testing.
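
A note on the VRAM numbers above: the K/V cache in this commit is allocated over `num_key_value_heads` rather than `num_attention_heads` (see the `ExLlamaCache` change in `model.py` below), so with GQA the cache shrinks by the ratio of query heads to KV heads. A rough sketch of that saving, assuming the published Llama shapes (80 layers, head_dim 128, 64 query heads, and 8 KV heads for Llama-2 70B); these figures are assumptions, not read from this repo:

```python
# Rough fp16 K/V-cache size per sequence. The layer/head counts are assumptions taken
# from the published Llama configs, not from this repo.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size = 1, bytes_per_el = 2):
    # 2x for keys and values, matching the p_key_states / p_value_states tensors in ExLlamaCache
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_el

mha_65b = kv_cache_bytes(n_layers = 80, n_kv_heads = 64, head_dim = 128, seq_len = 2048)  # Llama 65B, no GQA
gqa_70b = kv_cache_bytes(n_layers = 80, n_kv_heads = 8,  head_dim = 128, seq_len = 2048)  # Llama-2 70B, GQA

print(f"65B MHA cache: {mha_65b / 1024**2:.0f} MB")  # ~5120 MB
print(f"70B GQA cache: {gqa_70b / 1024**2:.0f} MB")  # ~640 MB
```

The table's VRAM column also includes weights and temporary buffers, which is why the difference there is smaller than the cache saving alone.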

exllama_ext/cuda_func/q4_attn.cu

Lines changed: 17 additions & 15 deletions
@@ -14,7 +14,7 @@ const int THREADS_X = 32;
 const int THREADS_Y = 1;
 const int THREADS_Z = 4;
 const int BLOCKSIZE_X = 2; // 2*half == 1*uint32_t
-const int BLOCKSIZE_Z = 4; // num_heads must be divisible by BLOCKSIZE_Z
+const int BLOCKSIZE_Z = 4; // num_heads must be divisible by BLOCKSIZE_Z TODO: Check that this is the case when Llama2-34b releases

 __global__ void update_cache_kernel
 (
@@ -23,21 +23,21 @@ __global__ void update_cache_kernel
     half* __restrict__ key_cache,
     half* __restrict__ value_cache,
     const int head_dim,
-    const int num_heads,
+    const int num_kv_heads,
     const int q_len,
     const int max_seq_len,
     const int past_len
 )
 {
-    //int state_shape[] = { num_heads, q_len, head_dim };
-    int state_stride[] = { head_dim, head_dim * num_heads, 1 };
-    int state_pos[] = { 0, 0, 0 };
+    //int state_shape[] = { num_kv_heads, q_len, head_dim };
+    int state_stride[] = { head_dim, head_dim * num_kv_heads, 1 };
+    int state_pos[] = { 0, 0, 0 };

-    //int cache_shape[] = { num_heads, max_seq_len, head_dim };
-    int cache_stride[] = { max_seq_len * head_dim, head_dim, 1 };
-    int cache_pos[] = { 0, past_len, 0 };
+    //int cache_shape[] = { num_kv_heads, max_seq_len, head_dim };
+    int cache_stride[] = { max_seq_len * head_dim, head_dim, 1 };
+    int cache_pos[] = { 0, past_len, 0 };

-    int size[] = { num_heads, q_len, head_dim };
+    int size[] = { num_kv_heads, q_len, head_dim };

     int x = (blockIdx.x * THREADS_X + threadIdx.x) * BLOCKSIZE_X;
     int y = blockIdx.y * THREADS_Y + threadIdx.y;
@@ -92,6 +92,7 @@ void q4_attn_cuda
     const int dim,
     const int head_dim,
     const int num_heads,
+    const int num_kv_heads,
     const int past_len,
     half* key_cache,
     half* value_cache,
@@ -117,10 +118,11 @@ void q4_attn_cuda
     (
         ((head_dim + THREADS_X - 1) / THREADS_X + BLOCKSIZE_X - 1) / BLOCKSIZE_X,
         q_len,
-        ((num_heads + THREADS_Z - 1) / THREADS_Z + BLOCKSIZE_Z - 1) / BLOCKSIZE_Z
+        ((num_kv_heads + THREADS_Z - 1) / THREADS_Z + BLOCKSIZE_Z - 1) / BLOCKSIZE_Z
     );

     int _rows_per_batch = q_len * num_heads;
+    int _rows_per_batch_kv = q_len * num_kv_heads;

     CudaBuffers* buffers = get_buffers(device_index);

@@ -158,11 +160,11 @@ void q4_attn_cuda
         // Positional embeddings q, k

         rope_cuda(tuningParams, query_states, sin, cos, bsz, _rows_per_batch, head_dim, num_heads, past_len);
-        rope_cuda(tuningParams, key_states, sin, cos, bsz, _rows_per_batch, head_dim, num_heads, past_len);
+        rope_cuda(tuningParams, key_states, sin, cos, bsz, _rows_per_batch_kv, head_dim, num_kv_heads, past_len);

         // Update cache tensors with projected k, v

-        update_cache_kernel<<<blocks, threads>>>(key_states, value_states, key_cache, value_cache, head_dim, num_heads, q_len, max_seq_len, past_len);
+        update_cache_kernel<<<blocks, threads>>>(key_states, value_states, key_cache, value_cache, head_dim, num_kv_heads, q_len, max_seq_len, past_len);
     }
     else
     {
@@ -178,20 +180,20 @@ void q4_attn_cuda
         // str_1: project q, positions q, sync

         q4_matmul_cuda(tuningParams, temp_x, q_len, q_proj, query_states, q_a ? true : false, str_1);
-        rope_cuda(tuningParams, query_states, sin, cos, bsz, _rows_per_batch, head_dim, num_heads, past_len, str_1);
+        rope_cuda(tuningParams, query_states, sin, cos, bsz, _rows_per_batch, head_dim, num_kv_heads, past_len, str_1);
         cudaEventRecord(sync_1, str_1);

         // str_2: project k, positions k, sync

         q4_matmul_cuda(tuningParams, temp_x, q_len, k_proj, key_states, k_a ? true : false, str_2);
-        rope_cuda(tuningParams, key_states, sin, cos, bsz, _rows_per_batch, head_dim, num_heads, past_len, str_2);
+        rope_cuda(tuningParams, key_states, sin, cos, bsz, _rows_per_batch_kv, head_dim, num_kv_heads, past_len, str_2);
         cudaEventRecord(sync_2, str_2);

         // str_3: project v, wait for str_2, copy (k,v) to cache, sync

         q4_matmul_cuda(tuningParams, temp_x, q_len, v_proj, value_states, v_a ? true : false, buffers->alt_stream_3);
         cudaStreamWaitEvent(str_3, sync_2, 0);
-        update_cache_kernel<<<blocks, threads, 0, str_3>>>(key_states, value_states, key_cache, value_cache, head_dim, num_heads, q_len, max_seq_len, past_len);
+        update_cache_kernel<<<blocks, threads, 0, str_3>>>(key_states, value_states, key_cache, value_cache, head_dim, num_kv_heads, q_len, max_seq_len, past_len);
         cudaEventRecord(sync_3, str_3);

         // default: wait for str_1 and str_3
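
Not part of the commit, but for orientation: a plain-PyTorch restatement of the copy `update_cache_kernel` performs after this change, using the strides and offsets from the kernel above and assuming a single batch item:

```python
import torch

def update_cache_reference(key_states, value_states, key_cache, value_cache, past_len):
    # key_states / value_states: (q_len, num_kv_heads * head_dim), as produced by k_proj / v_proj
    # key_cache / value_cache:   (num_kv_heads, max_seq_len, head_dim)
    num_kv_heads, max_seq_len, head_dim = key_cache.shape
    q_len = key_states.shape[0]

    # state_stride = { head_dim, head_dim * num_kv_heads, 1 }: the source is (q_len, num_kv_heads, head_dim)
    # in memory, read here as (num_kv_heads, q_len, head_dim)
    k_src = key_states.view(q_len, num_kv_heads, head_dim).permute(1, 0, 2)
    v_src = value_states.view(q_len, num_kv_heads, head_dim).permute(1, 0, 2)

    # cache_pos = { 0, past_len, 0 }: new tokens land right after the existing past_len entries
    key_cache[:, past_len:past_len + q_len, :] = k_src
    value_cache[:, past_len:past_len + q_len, :] = v_src

# Toy sizes: 2 KV heads, cache length 16, head_dim 4, 3 new tokens appended at position 5
kc = torch.zeros(2, 16, 4, dtype = torch.float16)
vc = torch.zeros(2, 16, 4, dtype = torch.float16)
ks = torch.randn(3, 2 * 4, dtype = torch.float16)
vs = torch.randn(3, 2 * 4, dtype = torch.float16)
update_cache_reference(ks, vs, kc, vc, past_len = 5)
```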

exllama_ext/cuda_func/q4_attn.cuh

Lines changed: 1 addition & 0 deletions
@@ -29,6 +29,7 @@ void q4_attn_cuda
     const int dim,
     const int head_dim,
     const int num_heads,
+    const int num_kv_heads,
     const int past_len,
     half* key_cache,
     half* value_cache,

exllama_ext/exllama_ext.cpp

Lines changed: 2 additions & 0 deletions
@@ -437,6 +437,7 @@ void q4_attn
     int q_len,
     int past_len,
     int num_heads,
+    int num_kv_heads,
     int head_dim,
     torch::Tensor key_cache,
     torch::Tensor value_cache,
@@ -488,6 +489,7 @@ void q4_attn
         dim,
         head_dim,
         num_heads,
+        num_kv_heads,
         past_len,
         (half*) key_cache.data_ptr(),
         (half*) value_cache.data_ptr(),

model.py

Lines changed: 48 additions & 11 deletions
@@ -54,6 +54,13 @@ def __init__(self, model_config_path):
         self.rms_norm_eps = read_config["rms_norm_eps"]
         self.vocab_size = read_config["vocab_size"]

+        if "num_key_value_heads" in read_config:
+            self.num_key_value_heads = read_config["num_key_value_heads"]
+            self.num_key_value_groups = self.num_attention_heads // self.num_key_value_heads
+        else:
+            self.num_key_value_heads = self.num_attention_heads
+            self.num_key_value_groups = 1
+
         self.rotary_embedding_base = 10000  # Constant used for pretrained models, leave as is unless retraining
         self.head_dim = self.hidden_size // self.num_attention_heads

@@ -288,11 +295,23 @@ def __init__(self, config, tensors, key, sin, cos, index):
         self.index = index

         self.q_proj = Ex4bitLinear(config, self.config.hidden_size, self.config.num_attention_heads * self.config.head_dim, False, tensors, key + ".q_proj")
-        self.k_proj = Ex4bitLinear(config, self.config.hidden_size, self.config.num_attention_heads * self.config.head_dim, False, tensors, key + ".k_proj")
-        self.v_proj = Ex4bitLinear(config, self.config.hidden_size, self.config.num_attention_heads * self.config.head_dim, False, tensors, key + ".v_proj")
+        self.k_proj = Ex4bitLinear(config, self.config.hidden_size, self.config.num_key_value_heads * self.config.head_dim, False, tensors, key + ".k_proj")
+        self.v_proj = Ex4bitLinear(config, self.config.hidden_size, self.config.num_key_value_heads * self.config.head_dim, False, tensors, key + ".v_proj")
         self.o_proj = Ex4bitLinear(config, self.config.num_attention_heads * self.config.head_dim, self.config.hidden_size, False, tensors, key + ".o_proj")


+    def repeat_kv(self, hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
+
+        # TODO: This seems inefficient. It should be possible to broadcast in the attention matmul to avoid building
+        # temporary K/V tensors like this
+
+        batch, num_key_value_heads, slen, head_dim = hidden_states.shape
+        if n_rep == 1: return hidden_states
+
+        hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
+        return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
+
+
     def fused(self, hidden_states, cache, buffer, input_layernorm, lora):

         bsz, q_len, _ = hidden_states.size()
@@ -315,9 +334,9 @@ def fused(self, hidden_states, cache, buffer, input_layernorm, lora):

         # Project q, k, v, apply position embeddings to k and v, update cache

-        query_states = torch.empty((bsz, q_len, self.config.hidden_size), dtype = torch.float16, device = hidden_states.device)
-        key_states = torch.empty((bsz, q_len, self.config.hidden_size), dtype = torch.float16, device = hidden_states.device)
-        value_states = torch.empty((bsz, q_len, self.config.hidden_size), dtype = torch.float16, device = hidden_states.device)
+        query_states = torch.empty((bsz, q_len, self.config.num_attention_heads * self.config.head_dim), dtype = torch.float16, device = hidden_states.device)
+        key_states = torch.empty((bsz, q_len, self.config.num_key_value_heads * self.config.head_dim), dtype = torch.float16, device = hidden_states.device)
+        value_states = torch.empty((bsz, q_len, self.config.num_key_value_heads * self.config.head_dim), dtype = torch.float16, device = hidden_states.device)

         cuda_ext.exllama_ext.q4_attn(hidden_states,
                                      input_layernorm.weight,
@@ -333,6 +352,7 @@ def fused(self, hidden_states, cache, buffer, input_layernorm, lora):
                                      q_len,
                                      past_len,
                                      self.config.num_attention_heads,
+                                     self.config.num_key_value_heads,
                                      self.config.head_dim,
                                      cache.key_states[self.index],
                                      cache.value_states[self.index],
@@ -349,11 +369,16 @@ def fused(self, hidden_states, cache, buffer, input_layernorm, lora):
         key_states = cache.key_states[self.index].narrow(2, 0, past_len + q_len)
         value_states = cache.value_states[self.index].narrow(2, 0, past_len + q_len)

+        # Repeat K/V heads if num_key_value_heads < num_attention_heads
+
+        query_states.transpose_(1, 2)
+        key_states = self.repeat_kv(key_states, self.config.num_key_value_groups)
+        value_states = self.repeat_kv(value_states, self.config.num_key_value_groups)
+
         # Attention
         # TODO: Figure out if we can use cublasHgemmStridedBatched() to do this matmul without reshaping. Torch uses
         # gemmStridedBatchedEx() internally, so it should be possible.

-        query_states.transpose_(1, 2)
         key_states.transpose_(2, 3)
         attn_weights = torch.matmul(query_states, key_states)
         attn_weights /= math.sqrt(self.config.head_dim)
@@ -383,11 +408,11 @@ def forward(self, hidden_states, cache, buffer, lora):
         key_states = self.k_proj.forward(hidden_states, lora)

         cuda_ext.exllama_ext.rope_(query_states, self.sin, self.cos, past_len, self.config.num_attention_heads, self.config.head_dim)
-        cuda_ext.exllama_ext.rope_(key_states, self.sin, self.cos, past_len, self.config.num_attention_heads, self.config.head_dim)
+        cuda_ext.exllama_ext.rope_(key_states, self.sin, self.cos, past_len, self.config.num_key_value_heads, self.config.head_dim)

         query_states = query_states.view(bsz, q_len, self.config.num_attention_heads, self.config.head_dim).transpose(1, 2)
-        key_states = key_states.view(bsz, q_len, self.config.num_attention_heads, self.config.head_dim).transpose(1, 2)
-        value_states = self.v_proj.forward(hidden_states, lora).view(bsz, q_len, self.config.num_attention_heads, self.config.head_dim).transpose(1, 2)
+        key_states = key_states.view(bsz, q_len, self.config.num_key_value_heads, self.config.head_dim).transpose(1, 2)
+        value_states = self.v_proj.forward(hidden_states, lora).view(bsz, q_len, self.config.num_key_value_heads, self.config.head_dim).transpose(1, 2)

         # Add keys and values to cache

@@ -401,6 +426,11 @@ def forward(self, hidden_states, cache, buffer, lora):
         key_states = cache.key_states[self.index].narrow(2, 0, past_len + q_len)
         value_states = cache.value_states[self.index].narrow(2, 0, past_len + q_len)

+        # Repeat K/V heads if num_key_value_heads < num_attention_heads
+
+        key_states = self.repeat_kv(key_states, self.config.num_key_value_groups)
+        value_states = self.repeat_kv(value_states, self.config.num_key_value_groups)
+
         # Attention

         # -- HF Transformers regular attention, faster on shorter sequences, same VRAM usage
@@ -508,8 +538,8 @@ def __init__(self, model, batch_size = 1, max_seq_len = -1, copy_from = None):

         if copy_from is None:

-            p_key_states = torch.zeros(self.batch_size, self.config.num_attention_heads, self.max_seq_len, self.config.head_dim, dtype = torch.float16, device = self.model.config.device_map.layers[i])
-            p_value_states = torch.zeros(self.batch_size, self.config.num_attention_heads, self.max_seq_len, self.config.head_dim, dtype = torch.float16, device = self.model.config.device_map.layers[i])
+            p_key_states = torch.zeros(self.batch_size, self.config.num_key_value_heads, self.max_seq_len, self.config.head_dim, dtype = torch.float16, device = self.model.config.device_map.layers[i])
+            p_value_states = torch.zeros(self.batch_size, self.config.num_key_value_heads, self.max_seq_len, self.config.head_dim, dtype = torch.float16, device = self.model.config.device_map.layers[i])

         else:

@@ -520,6 +550,13 @@ def __init__(self, model, batch_size = 1, max_seq_len = -1, copy_from = None):
         self.value_states.append(p_value_states)


+    def zero(self):
+
+        for i in range(self.config.num_hidden_layers):
+            self.key_states[i].zero_()
+            self.value_states[i].zero_()
+
+
     def clone(self):

         new = ExLlamaCache(self.model, batch_size = self.batch_size, max_seq_len = self.max_seq_len, copy_from = self)
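
The TODO on `repeat_kv` above suggests broadcasting inside the attention matmul instead of materializing expanded K/V tensors. A minimal sketch of that idea (illustrative only, not part of this commit), checking that it produces the same attention scores as the expand-and-reshape path; it assumes, as `repeat_kv` does, that each consecutive block of `n_rep` query heads shares one KV head:

```python
import torch

def attn_scores_repeat(q, k, n_rep):
    # q: (bsz, n_heads, q_len, head_dim), k: (bsz, n_kv_heads, kv_len, head_dim)
    bsz, n_kv, kv_len, hd = k.shape
    k_rep = k[:, :, None].expand(bsz, n_kv, n_rep, kv_len, hd).reshape(bsz, n_kv * n_rep, kv_len, hd)
    return torch.matmul(q, k_rep.transpose(2, 3))

def attn_scores_broadcast(q, k, n_rep):
    # Same scores without building k_rep: fold query heads into (n_kv_heads, n_rep) groups
    # and let K broadcast over the n_rep dimension.
    bsz, n_heads, q_len, hd = q.shape
    q_g = q.view(bsz, k.shape[1], n_rep, q_len, hd)
    scores = torch.matmul(q_g, k.unsqueeze(2).transpose(3, 4))
    return scores.view(bsz, n_heads, q_len, -1)

q = torch.randn(1, 8, 5, 16)   # 8 query heads
k = torch.randn(1, 2, 7, 16)   # 2 KV heads -> n_rep = 4
assert torch.allclose(attn_scores_repeat(q, k, 4), attn_scores_broadcast(q, k, 4), atol = 1e-5)
```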

perplexity.py

Lines changed: 2 additions & 2 deletions
@@ -14,7 +14,7 @@
 '''

 class Perplexity:
-    def __init__(self, method="default", model=None, cache=None, tokenizer=None):
+    def __init__(self, method="default", model = None, cache = None, tokenizer = None):
         # This needs to be loaded by calling .load()
         self.dataset_chunks = []

@@ -36,7 +36,7 @@ def _next_logits(self, input_ids, apply_lora, last_id_only = True):
         # n_logits = []
         # a = 0
         # while a < input_ids.shape[-1]:
-        #     b = min(input_ids.shape[-1], a + 2048)  # TODO: Should this be a config parameter?
+        #     b = min(input_ids.shape[-1], a + 2048)
         #     n_logits.append(self.model.forward(input_ids[:, a:b], self.cache, last_id_only, lora = apply_lora))
         #     a = b
         #

test_benchmark_inference.py

Lines changed: 5 additions & 0 deletions
@@ -129,6 +129,9 @@ def mem(name, total = False):
 torch.cuda.reset_peak_memory_stats("cuda")
 mem("Model")

+cache = ExLlamaCache(model)
+mem("Cache")
+
 # Load LoRA

 lora = None
@@ -230,8 +233,10 @@ def mem(name, total = False):

 begin()

+ppl.cache.zero()
 model.config.matmul_recons_thd = 1
 ppl.test(8, lora = lora, tag = " (reconstruct)")
+ppl.cache.zero()
 model.config.matmul_recons_thd = 0
 ppl.test(8, lora = lora, tag = " (quant, token)", ppl_token = True)

tokenizer.py

Lines changed: 10 additions & 4 deletions
@@ -8,11 +8,17 @@ def __init__(self, tokenizer_model_path):

         self.path = tokenizer_model_path
         self.tokenizer = SentencePieceProcessor(model_file = self.path)
+
+        self.unk_token = "<unk>"
+        self.bos_token = "<s>"
+        self.eos_token = "</s>"
+        self.unk_token_id = self.tokenizer.unk_id()
         self.eos_token_id = self.tokenizer.eos_id()
         self.bos_token_id = self.tokenizer.bos_id()
-        self.pad_token_id = 0
+        self.pad_token_id = 0  # self.tokenizer.pad_id()
         self.newline_token_id = 13

+
     # Encode string

@@ -21,22 +27,22 @@ def encode(self, text):

             # text is a list of strings

-            list_ids = self.tokenizer.Encode(text)
+            list_ids = self.tokenizer.EncodeAsIds(text)
             max_length = max([len(ids) for ids in list_ids])

             padded_ids = []
             for ids in list_ids:
                 padding = torch.full((max_length - len(ids),), self.pad_token_id)
                 sequence = torch.tensor(ids)
-                padded_ids.append(torch.cat((padding, sequence), dim = 0))
+                padded_ids.append(torch.cat((padding, sequence), dim = 0).long())

             return torch.stack(padded_ids, dim = 0)

         else:

             # text is a single string

-            ids = self.tokenizer.Encode(text)
+            ids = self.tokenizer.EncodeAsIds(text)
             return torch.tensor(ids).unsqueeze(0)

     def decode(self, ids):
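
For illustration, this is roughly what the new left-padding path in `encode()` produces for a list of strings. The token ids below are made up rather than the output of a real SentencePiece model; an actual call goes through `EncodeAsIds` and pads on the left with `pad_token_id = 0`:

```python
import torch

pad_token_id = 0
list_ids = [[4, 5, 6], [7, 8, 9, 10, 11]]   # hypothetical token ids for two strings

max_length = max(len(ids) for ids in list_ids)
padded_ids = []
for ids in list_ids:
    padding = torch.full((max_length - len(ids),), pad_token_id)
    sequence = torch.tensor(ids)
    padded_ids.append(torch.cat((padding, sequence), dim = 0).long())  # .long() keeps the batch integer-typed

batch = torch.stack(padded_ids, dim = 0)
print(batch)
# tensor([[ 0,  0,  4,  5,  6],
#         [ 7,  8,  9, 10, 11]])
```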
