
pattplatt/llm_dataset_creation_and_finetuning


The included scripts were created in a study project in which a dataset about Aalen University was generated using the ChatGPT API; a small LLM was then fine-tuned on that dataset. The final report is in the repo and linked here.

Use the data generator script

To use dataset_generator.ipynb you need an OpenAI API key; either import it into the script or set it as an environment variable.
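For example, a minimal sketch of reading the key from the conventional OPENAI_API_KEY environment variable, assuming the openai Python package (v1+); the notebook's actual client setup may differ:

import os
from openai import OpenAI

# Read the API key from the environment instead of hardcoding it in the notebook.
# OpenAI() also picks up OPENAI_API_KEY automatically if no argument is given.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])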

How to finetune with the dataset

Here we finetune TinyLlama-1.1B using the litgpt framework:

Create a new conda environment, activate it, and install the needed packages:

conda create -n finetuning python=3.10
conda activate finetuning
pip install -r requirements.txt

Next, download the model weights:

litgpt download --repo_id TinyLlama/TinyLlama-1.1B-Chat-v1.0

Run the following in the terminal to start fine-tuning with the dataset (adjust --checkpoint_dir to wherever the weights were downloaded):

litgpt finetune lora \
    --checkpoint_dir '/mnt/d/dev/llm_dataset_testing/checkpoints/TinyLlama/TinyLlama-1.1B-Chat-v1.0' \
    --data JSON \
    --data.json_path data/ \
    --train.micro_batch_size 1 \
    --train.global_batch_size 1 \
    --train.epochs 1
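The --data JSON option points litgpt at instruction-style records. As a rough sketch of the expected layout (the Alpaca-style instruction/input/output fields used by litgpt's JSON loader; the file name and contents here are illustrative, not the actual generated data):

import json

# Hypothetical record layout for a file under data/; the real dataset
# contains the Q&A pairs generated with the ChatGPT API.
records = [
    {
        "instruction": "A question generated with the ChatGPT API",
        "input": "",
        "output": "The expected answer",
    }
]

with open("data/dataset.json", "w") as f:
    json.dump(records, f, indent=2)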

The training took ~13 minutes on an RTX 3060 Ti and ~48 minutes on an M1 Pro, using 7.45 GB of memory.

You can use eval.ipynb to evaluate the fine-tuning. To do this, the weights first need to be converted into the Hugging Face Transformers format:

litgpt convert from_litgpt \
    --checkpoint_dir out/finetune/lora/final \
    --output_dir out/hf_checkpoint
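The converted weights can then be loaded with the transformers library, roughly as eval.ipynb does. A minimal sketch, assuming the conversion writes a PyTorch state dict named model.pth into the output directory (file name and paths are assumptions based on the commands above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the converted state dict into the matching HF architecture;
# the tokenizer comes from the original base checkpoint.
state_dict = torch.load("out/hf_checkpoint/model.pth")
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", state_dict=state_dict
)
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")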
