@@ -0,0 +1,21 @@
FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel

RUN apt-get update && apt-get install -y git python3-virtualenv wget

# RUN pip install -U --no-cache-dir git+https://github.com/facebookresearch/llama-recipes.git@eafea7b366bde9dc3f0b66a4cb0a8788f560c793
RUN pip install -U transformers==4.32.0 accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy
RUN pip install -U auto-gptq optimum peft

WORKDIR /workspace
# Set up server requirements
COPY ./fast_api_requirements.txt fast_api_requirements.txt
RUN pip install --no-cache-dir --upgrade -r fast_api_requirements.txt

ENV HUGGINGFACE_TOKEN="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"
ENV HUGGINGFACE_REPO="gchauhan/ycchen-4"

# Copy over single file server
COPY ./main.py main.py
COPY ./api.py api.py
# Run the server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
@@ -0,0 +1,16 @@
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

RUN apt-get update && apt-get install -y git python3-virtualenv wget

RUN pip install -U --no-cache-dir git+https://github.com/facebookresearch/llama-recipes.git@eafea7b366bde9dc3f0b66a4cb0a8788f560c793

WORKDIR /workspace

RUN wget https://gist.githubusercontent.com/mreso/ec65015cbfbd395f0c2adc17147adf1f/raw/41070f1058820b9e89bae885968cc666a7d6aa59/custom_dataset.py

ENV HUGGINGFACE_TOKEN="YOUR_TOKEN"
ENV HUGGINGFACE_REPO="YOUR_USERNAME/YOUR_REPO"

COPY train.py ./

CMD [ "python", "train.py"]
@@ -0,0 +1,71 @@
# Llama-recipes Example
This example demonstrates how to fine-tune and serve a Llama 2 model with llama-recipes for submission to the LLM efficiency challenge, using the [lit-gpt](../lit-gpt/) example as a template.
Llama-recipes provides an easy way to fine-tune a Llama 2 model on custom datasets using efficient techniques like LoRA or Llama-Adapter.

# Getting Started
In order to use llama-recipes we need to install the following pip package:

```
pip install llama-recipes
```

To obtain access to the model weights you need to fill out this [form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) to accept the license terms and acceptable use policy.

After access has been granted, you also need to request access to the model you want to fine-tune through your Hugging Face account. In this example we will continue with the 7B parameter version, available under the identifier `meta-llama/Llama-2-7b-hf`.

**NOTE** In this example the training result is uploaded to and downloaded from huggingface_hub. Authentication is handled through a token created in the settings of your Hugging Face account.
Make sure the token has write access and set the env variables in both Dockerfiles to your token and repo:

```bash
ENV HUGGINGFACE_TOKEN="YOUR_TOKEN"
ENV HUGGINGFACE_REPO="YOUR_USERNAME/YOUR_REPO"
```
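
Before building the Docker images it can help to check that the token works and that the target repo exists. The following is a minimal sketch using `huggingface_hub`; it assumes `HUGGINGFACE_TOKEN` and `HUGGINGFACE_REPO` are already set in your local environment, mirroring the `ENV` variables above:

```python
import os

from huggingface_hub import HfApi, login

# Assumes HUGGINGFACE_TOKEN and HUGGINGFACE_REPO are exported locally,
# matching the ENV variables in the Dockerfiles.
login(token=os.environ["HUGGINGFACE_TOKEN"])

api = HfApi()
# Creating (or re-using) the repo requires a token with write access.
api.create_repo(repo_id=os.environ["HUGGINGFACE_REPO"], repo_type="model", exist_ok=True)
print(api.whoami()["name"])
```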

# Fine-tune The Model
With llama-recipes it's possible to fine-tune Llama on custom data with a single command. To fine-tune on a custom dataset we need to implement a function (`get_custom_dataset`) that provides the custom dataset, following this example [custom_dataset.py](https://github.com/facebookresearch/llama-recipes/blob/main/examples/custom_dataset.py).
We can then train on this dataset using the following command:

```bash
python3 -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name meta-llama/Llama-2-7b-hf --dataset custom_dataset --custom_dataset.file /workspace/custom_dataset.py --output_dir /volume/output_dir
```

**Note** The custom dataset in this example is dialog-based. This is only due to the nature of the example and is not a requirement of the custom dataset functionality. The name of the `get_custom_dataset` function can also be changed on the command line with the syntax `/workspace/custom_dataset.py:get_foo_dataset`. For other examples of `get_custom_dataset` functions, have a look at the [built-in datasets in llama-recipes](https://github.com/facebookresearch/llama-recipes/blob/main/src/llama_recipes/datasets/__init__.py). A minimal sketch of such a function is shown below.
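
As a rough illustration, here is a minimal sketch of a `get_custom_dataset` implementation. The dataset name, prompt format, and column names are made up for this example and are not part of the submission:

```python
# custom_dataset.py (sketch) -- llama-recipes calls get_custom_dataset(dataset_config, tokenizer, split)
import datasets


def get_custom_dataset(dataset_config, tokenizer, split):
    # Hypothetical instruction dataset with "instruction" and "response" columns
    dataset = datasets.load_dataset("my_org/my_instruction_data", split=split)

    def tokenize(sample):
        prompt = f"Instruction: {sample['instruction']}\nAnswer: "
        prompt_ids = tokenizer.encode(tokenizer.bos_token + prompt, add_special_tokens=False)
        answer_ids = tokenizer.encode(sample["response"] + tokenizer.eos_token, add_special_tokens=False)
        input_ids = prompt_ids + answer_ids
        return {
            "input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),
            # Mask out the prompt so the loss is only computed on the answer tokens
            "labels": [-100] * len(prompt_ids) + answer_ids,
        }

    return dataset.map(tokenize, remove_columns=list(dataset.features))
```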

# Create Submission
*Note* For a submission to the competition only the inference part (Dockerfile) is necessary. The training Docker (Dockerfile.train) is only needed to replicate the submission in case you end up among the top 3 contestants.

## Prepare Leaderboard Submission
The inference Docker downloads the base and LoRA weights from huggingface_hub. For the submission it is assumed that the trained weights have been uploaded to a repo on huggingface_hub and that the env variables HUGGINGFACE_TOKEN and HUGGINGFACE_REPO in the [Dockerfile](./Dockerfile) have been updated accordingly.
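
If the base model and the LoRA adapter are stored in separate repos, loading them together typically looks like the sketch below. The repo ids are placeholders; the server in [main.py](./main.py) in this example simply loads the single model repo given by HUGGINGFACE_REPO:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo ids: the base model and the repo holding the trained LoRA adapter
BASE_MODEL = "meta-llama/Llama-2-7b-hf"
LORA_REPO = "YOUR_USERNAME/YOUR_REPO"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16, device_map="auto")
# Apply the LoRA adapter downloaded from huggingface_hub on top of the base weights
model = PeftModel.from_pretrained(base, LORA_REPO)
model.eval()
```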

To create the zip file for submission to the eval bot use the following commands:
```bash
cd neurips_llm_efficiency_challenge/sample-submissions
rm llama_recipes/Dockerfile.train
zip -r llama_recipes.zip llama_recipes
```
*Note* 1. Make sure to zip only the llama_recipes folder and not to include any other sample submission in the zip file. 2. We delete llama_recipes/Dockerfile.train as a precaution to avoid errors in case the submission logic changes.

## Run Training And Inference Docker Locally
To build and run the training Docker locally we need to execute:

```bash
docker build -f ./Dockerfile.train -t llama_recipes_train .

docker run --gpus "device=0" --rm -ti llama_recipes_train
```

The inference Docker can be created and started locally with:

```bash
docker build -f ./Dockerfile -t llama_recipes_inference .

docker run --gpus "device=0" -p 8080:80 --rm -ti llama_recipes_inference
```

To test the inference Docker we can run the following queries:

```bash
curl -X POST -H "Content-Type: application/json" -d '{"text": "What is the capital of France?"}' http://localhost:8080/tokenize
# or
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "What is the capital of France?"}' http://localhost:8080/process
```
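
The same endpoints can also be exercised from Python, for example with `requests`. This is a small sketch that assumes the inference Docker from above is running on localhost:8080; the payload fields mirror the models in [api.py](./api.py):

```python
import requests

BASE_URL = "http://localhost:8080"

# Tokenize a piece of text
resp = requests.post(f"{BASE_URL}/tokenize", json={"text": "What is the capital of France?"})
print(resp.json()["tokens"])

# Generate a completion; fields mirror ProcessRequest in api.py
payload = {"prompt": "What is the capital of France?", "max_new_tokens": 32, "temperature": 0.8}
resp = requests.post(f"{BASE_URL}/process", json=payload)
result = resp.json()
print(result["text"])
print(result["logprob"])
```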
@@ -0,0 +1,38 @@
from pydantic import BaseModel

from typing import List, Dict, Optional


class ProcessRequest(BaseModel):
    prompt: str
    num_samples: int = 1
    max_new_tokens: int = 50
    top_k: int = 200
    temperature: float = 0.8
    seed: Optional[int] = None
    echo_prompt: Optional[bool] = None
    stop_sequences: Optional[List[str]] = None


class Token(BaseModel):
    text: str
    logprob: float
    top_logprob: Dict[str, float]


class ProcessResponse(BaseModel):
    text: str
    tokens: List[Token]
    logprob: float
    request_time: float


class TokenizeRequest(BaseModel):
    text: str
    truncation: bool = True
    max_length: int = 2048


class TokenizeResponse(BaseModel):
    tokens: List[int]
    request_time: float
@@ -0,0 +1,4 @@
# FAST API
fastapi>=0.68.0,<0.69.0
pydantic>=1.8.0,<2.0.0
uvicorn>=0.15.0,<0.16.0
@@ -0,0 +1,152 @@
from fastapi import FastAPI

import logging
import os
# os.environ['CUDA_VISIBLE_DEVICES'] = '1'
import time

import torch
from huggingface_hub import login
from transformers import LlamaTokenizer, LlamaForCausalLM, AutoModelForCausalLM, AutoTokenizer
from transformers.generation.stopping_criteria import StoppingCriteria
from peft import PeftModel

torch.set_float32_matmul_precision("high")

from api import (
    ProcessRequest,
    ProcessResponse,
    TokenizeRequest,
    TokenizeResponse,
    Token,
)

app = FastAPI()

logger = logging.getLogger(__name__)
# Configure the logging module
logging.basicConfig(level=logging.INFO)

# login(token=os.environ["HUGGINGFACE_TOKEN"])

#MODEL_PATH = "ycchen/yc-test1"
MODEL_PATH = os.environ["HUGGINGFACE_REPO"]
#LORA_PATH = "ycchen/final-lora-r64-ep5"

print('Model:', MODEL_PATH)
#print('LoRA:', LORA_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

class StopAtSpecificTokenCriteria(StoppingCriteria):
    def __init__(self, stop_sequences, tokenizer):
        super().__init__()
        self.stop_sequences = stop_sequences
        self.tokenizer = tokenizer

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the most recently generated token contains one of the stop sequences
        last_token = self.tokenizer._convert_id_to_token(input_ids[0, -1].item())
        if isinstance(last_token, bytes):
            last_token = last_token.decode()
        return any(ss in last_token for ss in self.stop_sequences)

LLAMA2_CONTEXT_LENGTH = 4096


@app.post("/process")
async def process_request(input_data: ProcessRequest) -> ProcessResponse:
    if input_data.seed is not None:
        torch.manual_seed(input_data.seed)

    # print(input_data)

    encoded = tokenizer(input_data.prompt, return_tensors="pt")

    prompt_length = encoded["input_ids"][0].size(0)
    max_returned_tokens = prompt_length + input_data.max_new_tokens
    assert max_returned_tokens <= LLAMA2_CONTEXT_LENGTH, (
        max_returned_tokens,
        LLAMA2_CONTEXT_LENGTH,
    )

    stop_sequences = input_data.stop_sequences if input_data.stop_sequences is not None else []
    stopping_criteria = StopAtSpecificTokenCriteria(
        stop_sequences=stop_sequences,
        tokenizer=tokenizer,
    )

    t0 = time.perf_counter()
    encoded = {k: v.to("cuda") for k, v in encoded.items()}
    with torch.no_grad():
        # Hard-coded end-of-text token id of the underlying model
        eos_token_id = [151643]

        temperature = max(1e-3, input_data.temperature)
        outputs = model.generate(
            **encoded,
            max_new_tokens=input_data.max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_k=input_data.top_k,
            min_new_tokens=1,
            eos_token_id=eos_token_id,
            stopping_criteria=[stopping_criteria],
            return_dict_in_generate=True,
            output_scores=True,
        )
        # outputs.sequences = outputs.sequences[:, :-1]
        # outputs.scores = outputs.scores[:-1]

    t = time.perf_counter() - t0
    if not input_data.echo_prompt:
        output = tokenizer.decode(outputs.sequences[0][prompt_length:], skip_special_tokens=True)
    else:
        output = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)

    tokens_generated = outputs.sequences[0].size(0) - prompt_length
    logger.info(
        f"Time for inference: {t:.02f} sec total, {tokens_generated / t:.02f} tokens/sec"
    )

    logger.info(f"Memory used: {torch.cuda.max_memory_reserved() / 1e9:.02f} GB")
    generated_tokens = []

    # Per-step log-probabilities over the vocabulary for every generated position
    log_probs = torch.log(torch.stack(outputs.scores, dim=1).softmax(-1))

    # Generated continuation (prompt stripped) and the log-prob of each sampled token
    gen_sequences = outputs.sequences[:, encoded["input_ids"].shape[-1]:]
    gen_logprobs = torch.gather(log_probs, 2, gen_sequences[:, :, None]).squeeze(-1)

    # Most likely token and its log-prob at each position (reported as top_logprob)
    top_indices = torch.argmax(log_probs, dim=-1)
    top_logprobs = torch.gather(log_probs, 2, top_indices[:, :, None]).squeeze(-1)
    top_indices = top_indices.tolist()[0]
    top_logprobs = top_logprobs.tolist()[0]

    # Loop variable named `tok` (not `t`) so the request time measured above is not overwritten
    for tok, lp, tlp in zip(gen_sequences.tolist()[0], gen_logprobs.tolist()[0], zip(top_indices, top_logprobs)):
        idx, val = tlp
        tok_str = tokenizer.decode(idx)
        token_tlp = {tok_str: val}
        generated_tokens.append(
            Token(text=tokenizer.decode(tok), logprob=lp, top_logprob=token_tlp)
        )
    logprob_sum = gen_logprobs.sum().item()

    return ProcessResponse(
        text=output, tokens=generated_tokens, logprob=logprob_sum, request_time=t
    )


@app.post("/tokenize")
async def tokenize(input_data: TokenizeRequest) -> TokenizeResponse:
    t0 = time.perf_counter()
    # Honor the truncation settings from the request
    encoded = tokenizer(
        input_data.text,
        truncation=input_data.truncation,
        max_length=input_data.max_length,
    )
    t = time.perf_counter() - t0
    tokens = encoded["input_ids"]
    return TokenizeResponse(tokens=tokens, request_time=t)
@@ -0,0 +1,31 @@
import os

from huggingface_hub import login, HfApi
from llama_recipes.finetuning import main as finetuning

def main():
    # Authenticate with the Hugging Face Hub using the token from the environment
    login(token=os.environ["HUGGINGFACE_TOKEN"])

    kwargs = {
        "model_name": "meta-llama/Llama-2-7b-hf",
        "use_peft": True,
        "peft_method": "lora",
        "quantization": True,
        "batch_size_training": 2,
        "dataset": "custom_dataset",
        "custom_dataset.file": "./custom_dataset.py",
        "output_dir": "./output_dir",
    }

    finetuning(**kwargs)

    # Upload the trained LoRA weights to the repo used by the inference Dockerfile
    api = HfApi()

    api.upload_folder(
        folder_path="./output_dir/",
        repo_id=os.environ["HUGGINGFACE_REPO"],
        repo_type="model",
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,21 @@
FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel

RUN apt-get update && apt-get install -y git python3-virtualenv wget

# RUN pip install -U --no-cache-dir git+https://github.com/facebookresearch/llama-recipes.git@eafea7b366bde9dc3f0b66a4cb0a8788f560c793
RUN pip install -U transformers==4.32.0 accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy
RUN pip install -U auto-gptq optimum peft

WORKDIR /workspace
# Set up server requirements
COPY ./fast_api_requirements.txt fast_api_requirements.txt
RUN pip install --no-cache-dir --upgrade -r fast_api_requirements.txt

ENV HUGGINGFACE_TOKEN="YOUR_TOKEN"
ENV HUGGINGFACE_REPO="YOUR_USERNAME/YOUR_REPO"

# Copy over single file server
COPY ./main.py main.py
COPY ./api.py api.py
# Run the server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
@@ -0,0 +1,16 @@
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

RUN apt-get update && apt-get install -y git python3-virtualenv wget

RUN pip install -U --no-cache-dir git+https://github.com/facebookresearch/llama-recipes.git@eafea7b366bde9dc3f0b66a4cb0a8788f560c793

WORKDIR /workspace

RUN wget https://gist.githubusercontent.com/mreso/ec65015cbfbd395f0c2adc17147adf1f/raw/41070f1058820b9e89bae885968cc666a7d6aa59/custom_dataset.py

ENV HUGGINGFACE_TOKEN="YOUR_TOKEN"
ENV HUGGINGFACE_REPO="YOUR_USERNAME/YOUR_REPO"

COPY train.py ./

CMD [ "python", "train.py"]