@@ -0,0 +1,21 @@
FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel

RUN apt-get update && apt-get install -y git python3-virtualenv wget

# RUN pip install -U --no-cache-dir git+https://github.com/facebookresearch/llama-recipes.git@eafea7b366bde9dc3f0b66a4cb0a8788f560c793
RUN pip install -U transformers==4.32.0 accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy
RUN pip install -U auto-gptq optimum peft

WORKDIR /workspace
# Set up server requirements
COPY ./fast_api_requirements.txt fast_api_requirements.txt
RUN pip install --no-cache-dir --upgrade -r fast_api_requirements.txt

ENV HUGGINGFACE_TOKEN="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"
ENV HUGGINGFACE_REPO="gchauhan/ycchen-4"

# Copy over single file server
COPY ./main.py main.py
COPY ./api.py api.py
# Run the server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
@@ -0,0 +1,16 @@
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

RUN apt-get update && apt-get install -y git python3-virtualenv wget

RUN pip install -U --no-cache-dir git+https://github.com/facebookresearch/llama-recipes.git@eafea7b366bde9dc3f0b66a4cb0a8788f560c793

WORKDIR /workspace

RUN wget https://gist.githubusercontent.com/mreso/ec65015cbfbd395f0c2adc17147adf1f/raw/41070f1058820b9e89bae885968cc666a7d6aa59/custom_dataset.py

ENV HUGGINGFACE_TOKEN="YOUR_TOKEN"
ENV HUGGINGFACE_REPO="YOUR_USERNAME/YOUR_REPO"

COPY train.py ./

CMD [ "python", "train.py"]
@@ -0,0 +1,71 @@
# Llama-recipes Example
This example demonstrates how to fine-tune and serve a Llama 2 model with llama-recipes for submission to the LLM efficiency challenge, using the [lit-gpt](../lit-gpt/) example as a template.
Llama-recipes provides an easy way to fine-tune a Llama 2 model on custom datasets using efficient techniques like LoRA or Llama-Adapter.

# Getting Started
In order to use llama-recipes we need to install the following pip package:

```
pip install llama-recipes
```

To obtain access to the model weights you need to fill out this [form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) to accept the license terms and acceptable use policy.

After access has been granted, you also need to request access to the model you want to fine-tune through your Hugging Face account. In this example we will continue with the 7B parameter version, available under the identifier `meta-llama/Llama-2-7b-hf`.

**NOTE** In this example the training result is uploaded to and downloaded from huggingface_hub. Authentication is handled through a token created in the settings of your Hugging Face account.
Make sure the token has write access and set the env variables in both Dockerfiles to your token and repo:

```bash
ENV HUGGINGFACE_TOKEN="YOUR_TOKEN"
ENV HUGGINGFACE_REPO="YOUR_USERNAME/YOUR_REPO"
```
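
Before building the Docker images it can help to check that the token works and that the target repo exists. The following is a minimal sketch using `huggingface_hub`; it assumes `HUGGINGFACE_TOKEN` and `HUGGINGFACE_REPO` are already set in your local environment, mirroring the `ENV` variables above:

```python
import os

from huggingface_hub import HfApi, login

# Assumes HUGGINGFACE_TOKEN and HUGGINGFACE_REPO are exported locally,
# matching the ENV variables in the Dockerfiles.
login(token=os.environ["HUGGINGFACE_TOKEN"])

api = HfApi()
# Creating (or re-using) the repo requires a token with write access.
api.create_repo(repo_id=os.environ["HUGGINGFACE_REPO"], repo_type="model", exist_ok=True)
print(api.whoami()["name"])
```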

# Fine-tune The Model
With llama-recipes it's possible to fine-tune Llama on custom data with a single command. To fine-tune on a custom dataset we need to implement a function (`get_custom_dataset`) that provides the custom dataset, following this example [custom_dataset.py](https://github.com/facebookresearch/llama-recipes/blob/main/examples/custom_dataset.py).
We can then train on this dataset using the following command:

```bash
python3 -m llama_recipes.finetuning --use_peft --peft_method lora --quantization --model_name meta-llama/Llama-2-7b-hf --dataset custom_dataset --custom_dataset.file /workspace/custom_dataset.py --output_dir /volume/output_dir
```

**Note** The custom dataset in this example is dialog-based. This is only due to the nature of the example and is not a requirement of the custom dataset functionality. The name of the `get_custom_dataset` function can also be changed on the command line with the syntax `/workspace/custom_dataset.py:get_foo_dataset`. For other examples of `get_custom_dataset` functions, have a look at the [built-in datasets in llama-recipes](https://github.com/facebookresearch/llama-recipes/blob/main/src/llama_recipes/datasets/__init__.py). A minimal sketch of such a function is shown below.
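
As a rough illustration, here is a minimal sketch of a `get_custom_dataset` implementation. The dataset name, prompt format, and column names are made up for this example and are not part of the submission:

```python
# custom_dataset.py (sketch) -- llama-recipes calls get_custom_dataset(dataset_config, tokenizer, split)
import datasets


def get_custom_dataset(dataset_config, tokenizer, split):
    # Hypothetical instruction dataset with "instruction" and "response" columns
    dataset = datasets.load_dataset("my_org/my_instruction_data", split=split)

    def tokenize(sample):
        prompt = f"Instruction: {sample['instruction']}\nAnswer: "
        prompt_ids = tokenizer.encode(tokenizer.bos_token + prompt, add_special_tokens=False)
        answer_ids = tokenizer.encode(sample["response"] + tokenizer.eos_token, add_special_tokens=False)
        input_ids = prompt_ids + answer_ids
        return {
            "input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),
            # Mask out the prompt so the loss is only computed on the answer tokens
            "labels": [-100] * len(prompt_ids) + answer_ids,
        }

    return dataset.map(tokenize, remove_columns=list(dataset.features))
```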

# Create Submission
*Note* For a submission to the competition only the inference part (Dockerfile) is necessary. The training Docker (Dockerfile.train) is only needed to replicate the submission in case you end up among the top 3 contestants.

## Prepare Leaderboard Submission
The inference Docker downloads the base and LoRA weights from huggingface_hub. For the submission it is assumed that the trained weights have been uploaded to a repo on huggingface_hub and that the env variables HUGGINGFACE_TOKEN and HUGGINGFACE_REPO in the [Dockerfile](./Dockerfile) have been updated accordingly.
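
If the base model and the LoRA adapter are stored in separate repos, loading them together typically looks like the sketch below. The repo ids are placeholders; the server in [main.py](./main.py) in this example simply loads the single model repo given by HUGGINGFACE_REPO:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo ids: the base model and the repo holding the trained LoRA adapter
BASE_MODEL = "meta-llama/Llama-2-7b-hf"
LORA_REPO = "YOUR_USERNAME/YOUR_REPO"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16, device_map="auto")
# Apply the LoRA adapter downloaded from huggingface_hub on top of the base weights
model = PeftModel.from_pretrained(base, LORA_REPO)
model.eval()
```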

To create the zip file for submission to the eval bot use the following commands:
```bash
cd neurips_llm_efficiency_challenge/sample-submissions
rm llama_recipes/Dockerfile.train
zip -r llama_recipes.zip llama_recipes
```
*Note* 1. Make sure to zip only the llama_recipes folder and not to include any other sample submission in the zip file. 2. We delete llama_recipes/Dockerfile.train as a precaution to avoid errors in case the submission logic changes.

## Run Training And Inference Docker Locally
To build and run the training Docker locally we need to execute:

```bash
docker build -f ./Dockerfile.train -t llama_recipes_train .

docker run --gpus "device=0" --rm -ti llama_recipes_train
```

The inference Docker can be created and started locally with:

```bash
docker build -f ./Dockerfile -t llama_recipes_inference .

docker run --gpus "device=0" -p 8080:80 --rm -ti llama_recipes_inference
```

To test the inference Docker we can run the following queries:

```bash
curl -X POST -H "Content-Type: application/json" -d '{"text": "What is the capital of France?"}' http://localhost:8080/tokenize
# or
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "What is the capital of France?"}' http://localhost:8080/process
```
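
The same endpoints can also be exercised from Python, for example with `requests`. This is a small sketch that assumes the inference Docker from above is running on localhost:8080; the payload fields mirror the models in [api.py](./api.py):

```python
import requests

BASE_URL = "http://localhost:8080"

# Tokenize a piece of text
resp = requests.post(f"{BASE_URL}/tokenize", json={"text": "What is the capital of France?"})
print(resp.json()["tokens"])

# Generate a completion; fields mirror ProcessRequest in api.py
payload = {"prompt": "What is the capital of France?", "max_new_tokens": 32, "temperature": 0.8}
resp = requests.post(f"{BASE_URL}/process", json=payload)
result = resp.json()
print(result["text"])
print(result["logprob"])
```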
@@ -0,0 +1,38 @@
from pydantic import BaseModel

from typing import List, Dict, Optional


class ProcessRequest(BaseModel):
    prompt: str
    num_samples: int = 1
    max_new_tokens: int = 50
    top_k: int = 200
    temperature: float = 0.8
    seed: Optional[int] = None
    echo_prompt: Optional[bool] = None
    stop_sequences: Optional[List[str]] = None


class Token(BaseModel):
    text: str
    logprob: float
    top_logprob: Dict[str, float]


class ProcessResponse(BaseModel):
    text: str
    tokens: List[Token]
    logprob: float
    request_time: float


class TokenizeRequest(BaseModel):
    text: str
    truncation: bool = True
    max_length: int = 2048


class TokenizeResponse(BaseModel):
    tokens: List[int]
    request_time: float
@@ -0,0 +1,4 @@
# FAST API
fastapi>=0.68.0,<0.69.0
pydantic>=1.8.0,<2.0.0
uvicorn>=0.15.0,<0.16.0
@@ -0,0 +1,152 @@
from fastapi import FastAPI

import logging
import os
# os.environ['CUDA_VISIBLE_DEVICES'] = '1'
import time

import torch
from huggingface_hub import login
from transformers import LlamaTokenizer, LlamaForCausalLM, AutoModelForCausalLM, AutoTokenizer
from transformers.generation.stopping_criteria import StoppingCriteria
from peft import PeftModel

torch.set_float32_matmul_precision("high")

from api import (
    ProcessRequest,
    ProcessResponse,
    TokenizeRequest,
    TokenizeResponse,
    Token,
)

app = FastAPI()

logger = logging.getLogger(__name__)
# Configure the logging module
logging.basicConfig(level=logging.INFO)

# login(token=os.environ["HUGGINGFACE_TOKEN"])

#MODEL_PATH = "ycchen/yc-test1"
MODEL_PATH = os.environ["HUGGINGFACE_REPO"]
#LORA_PATH = "ycchen/final-lora-r64-ep5"

print('Model:', MODEL_PATH)
#print('LoRA:', LORA_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

class StopAtSpecificTokenCriteria(StoppingCriteria):
    def __init__(self, stop_sequences, tokenizer):
        super().__init__()
        self.stop_sequences = stop_sequences
        self.tokenizer = tokenizer

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the most recently generated token contains one of the stop sequences
        last_token = self.tokenizer._convert_id_to_token(input_ids[0, -1].item())
        if isinstance(last_token, bytes):
            last_token = last_token.decode()
        return any(ss in last_token for ss in self.stop_sequences)

LLAMA2_CONTEXT_LENGTH = 4096


@app.post("/process")
async def process_request(input_data: ProcessRequest) -> ProcessResponse:
    if input_data.seed is not None:
        torch.manual_seed(input_data.seed)

    # print(input_data)

    encoded = tokenizer(input_data.prompt, return_tensors="pt")

    prompt_length = encoded["input_ids"][0].size(0)
    max_returned_tokens = prompt_length + input_data.max_new_tokens
    assert max_returned_tokens <= LLAMA2_CONTEXT_LENGTH, (
        max_returned_tokens,
        LLAMA2_CONTEXT_LENGTH,
    )

    stop_sequences = input_data.stop_sequences if input_data.stop_sequences is not None else []
    stopping_criteria = StopAtSpecificTokenCriteria(
        stop_sequences=stop_sequences,
        tokenizer=tokenizer,
    )

    t0 = time.perf_counter()
    encoded = {k: v.to("cuda") for k, v in encoded.items()}
    with torch.no_grad():
        # Hard-coded end-of-text token id of the underlying model
        eos_token_id = [151643]

        temperature = max(1e-3, input_data.temperature)
        outputs = model.generate(
            **encoded,
            max_new_tokens=input_data.max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_k=input_data.top_k,
            min_new_tokens=1,
            eos_token_id=eos_token_id,
            stopping_criteria=[stopping_criteria],
            return_dict_in_generate=True,
            output_scores=True,
        )
        # outputs.sequences = outputs.sequences[:, :-1]
        # outputs.scores = outputs.scores[:-1]

    t = time.perf_counter() - t0
    if not input_data.echo_prompt:
        output = tokenizer.decode(outputs.sequences[0][prompt_length:], skip_special_tokens=True)
    else:
        output = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)

    tokens_generated = outputs.sequences[0].size(0) - prompt_length
    logger.info(
        f"Time for inference: {t:.02f} sec total, {tokens_generated / t:.02f} tokens/sec"
    )

    logger.info(f"Memory used: {torch.cuda.max_memory_reserved() / 1e9:.02f} GB")
    generated_tokens = []

    # Per-step log-probabilities over the vocabulary for every generated position
    log_probs = torch.log(torch.stack(outputs.scores, dim=1).softmax(-1))

    # Generated continuation (prompt stripped) and the log-prob of each sampled token
    gen_sequences = outputs.sequences[:, encoded["input_ids"].shape[-1]:]
    gen_logprobs = torch.gather(log_probs, 2, gen_sequences[:, :, None]).squeeze(-1)

    # Most likely token and its log-prob at each position (reported as top_logprob)
    top_indices = torch.argmax(log_probs, dim=-1)
    top_logprobs = torch.gather(log_probs, 2, top_indices[:, :, None]).squeeze(-1)
    top_indices = top_indices.tolist()[0]
    top_logprobs = top_logprobs.tolist()[0]

    # Loop variable named `tok` (not `t`) so the request time measured above is not overwritten
    for tok, lp, tlp in zip(gen_sequences.tolist()[0], gen_logprobs.tolist()[0], zip(top_indices, top_logprobs)):
        idx, val = tlp
        tok_str = tokenizer.decode(idx)
        token_tlp = {tok_str: val}
        generated_tokens.append(
            Token(text=tokenizer.decode(tok), logprob=lp, top_logprob=token_tlp)
        )
    logprob_sum = gen_logprobs.sum().item()

    return ProcessResponse(
        text=output, tokens=generated_tokens, logprob=logprob_sum, request_time=t
    )


@app.post("/tokenize")
async def tokenize(input_data: TokenizeRequest) -> TokenizeResponse:
    t0 = time.perf_counter()
    # Honor the truncation settings from the request
    encoded = tokenizer(
        input_data.text,
        truncation=input_data.truncation,
        max_length=input_data.max_length,
    )
    t = time.perf_counter() - t0
    tokens = encoded["input_ids"]
    return TokenizeResponse(tokens=tokens, request_time=t)
@@ -0,0 +1,31 @@
import os

from huggingface_hub import login, HfApi
from llama_recipes.finetuning import main as finetuning

def main():
    # Authenticate with the Hugging Face Hub using the token from the environment
    login(token=os.environ["HUGGINGFACE_TOKEN"])

    kwargs = {
        "model_name": "meta-llama/Llama-2-7b-hf",
        "use_peft": True,
        "peft_method": "lora",
        "quantization": True,
        "batch_size_training": 2,
        "dataset": "custom_dataset",
        "custom_dataset.file": "./custom_dataset.py",
        "output_dir": "./output_dir",
    }

    finetuning(**kwargs)

    # Upload the trained LoRA weights to the repo used by the inference Dockerfile
    api = HfApi()

    api.upload_folder(
        folder_path="./output_dir/",
        repo_id=os.environ["HUGGINGFACE_REPO"],
        repo_type="model",
    )


if __name__ == "__main__":
    main()
@@ -0,0 +1,21 @@
FROM pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel

RUN apt-get update && apt-get install -y git python3-virtualenv wget

# RUN pip install -U --no-cache-dir git+https://github.com/facebookresearch/llama-recipes.git@eafea7b366bde9dc3f0b66a4cb0a8788f560c793
RUN pip install -U transformers==4.32.0 accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy
RUN pip install -U auto-gptq optimum peft

WORKDIR /workspace
# Set up server requirements
COPY ./fast_api_requirements.txt fast_api_requirements.txt
RUN pip install --no-cache-dir --upgrade -r fast_api_requirements.txt

ENV HUGGINGFACE_TOKEN="YOUR_TOKEN"
ENV HUGGINGFACE_REPO="YOUR_USERNAME/YOUR_REPO"

# Copy over single file server
COPY ./main.py main.py
COPY ./api.py api.py
# Run the server
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
@@ -0,0 +1,16 @@
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

RUN apt-get update && apt-get install -y git python3-virtualenv wget

RUN pip install -U --no-cache-dir git+https://github.com/facebookresearch/llama-recipes.git@eafea7b366bde9dc3f0b66a4cb0a8788f560c793

WORKDIR /workspace

RUN wget https://gist.githubusercontent.com/mreso/ec65015cbfbd395f0c2adc17147adf1f/raw/41070f1058820b9e89bae885968cc666a7d6aa59/custom_dataset.py

ENV HUGGINGFACE_TOKEN="YOUR_TOKEN"
ENV HUGGINGFACE_REPO="YOUR_USERNAME/YOUR_REPO"

COPY train.py ./

CMD [ "python", "train.py"]