
pattplatt/llm_dataset_creation_and_finetuning


The included scripts were created in a study project in which a dataset about Aalen University was generated using the ChatGPT API; a small LLM was then fine-tuned on that dataset. The final report is in the repo and linked here.

Use the data generator script

To use dataset_generator.ipynb you need an OpenAI API key; either import it into the script or set it as an environment variable.
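For example, a minimal sketch of reading the key from the conventional OPENAI_API_KEY environment variable, assuming the openai Python package (v1+); the notebook's actual client setup may differ:

import os
from openai import OpenAI

# Read the API key from the environment instead of hardcoding it in the notebook.
# OpenAI() also picks up OPENAI_API_KEY automatically if no argument is given.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])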

How to finetune with the dataset

Here we finetune TinyLlama-1.1B using the litgpt framework:

Create a new conda environment, activate it, and install the needed packages:

conda create -n finetuning python=3.10
conda activate finetuning
pip install -r requirements.txt

Next, download the model weights:

litgpt download --repo_id TinyLlama/TinyLlama-1.1B-Chat-v1.0

Run the following in the terminal to start fine-tuning with the dataset (adjust --checkpoint_dir to wherever the weights were downloaded):

litgpt finetune lora \
    --checkpoint_dir '/mnt/d/dev/llm_dataset_testing/checkpoints/TinyLlama/TinyLlama-1.1B-Chat-v1.0' \
    --data JSON \
    --data.json_path data/ \
    --train.micro_batch_size 1 \
    --train.global_batch_size 1 \
    --train.epochs 1
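The --data JSON option points litgpt at instruction-style records. As a rough sketch of the expected layout (the Alpaca-style instruction/input/output fields used by litgpt's JSON loader; the file name and contents here are illustrative, not the actual generated data):

import json

# Hypothetical record layout for a file under data/; the real dataset
# contains the Q&A pairs generated with the ChatGPT API.
records = [
    {
        "instruction": "A question generated with the ChatGPT API",
        "input": "",
        "output": "The expected answer",
    }
]

with open("data/dataset.json", "w") as f:
    json.dump(records, f, indent=2)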

The training took ~13 minutes on an RTX 3060 Ti and ~48 minutes on an M1 Pro, using 7.45 GB of memory.

You can use eval.ipynb to evaluate the fine-tuning. To do this, the weights first need to be converted into the Hugging Face Transformers format:

litgpt convert from_litgpt \
    --checkpoint_dir out/finetune/lora/final \
    --output_dir out/hf_checkpoint
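The converted weights can then be loaded with the transformers library, roughly as eval.ipynb does. A minimal sketch, assuming the conversion writes a PyTorch state dict named model.pth into the output directory (file name and paths are assumptions based on the commands above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the converted state dict into the matching HF architecture;
# the tokenizer comes from the original base checkpoint.
state_dict = torch.load("out/hf_checkpoint/model.pth")
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", state_dict=state_dict
)
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")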
