A compact PyTorch implementation of a GPT-style language model for text generation. The project includes a training CLI, multiple tokenizer options, and a saved model/tokenizer pair you can use to generate text right away.
- Transformer-based autoregressive language model in PyTorch
- Training and generation commands in `main.py`
- Multiple tokenizer backends: character-level, BPE, and tiktoken-based
- Reference notes in `docs/` for both the neural-network foundations and the GPT architecture

Project layout:

- `main.py` - command-line entry point for training and generation
- `nanogpt/` - model, training loop, and tokenizer implementations
- `docs/` - technical background and mathematical notes
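To give a sense of what the tokenizer backends do, here is a minimal character-level tokenizer sketch. The class name and interface are illustrative, not the actual `nanogpt` API:

```python
# Minimal character-level tokenizer, in the spirit of the character-level
# backend. Names and interface are illustrative, not the project's API.
class CharTokenizer:
    def __init__(self, text):
        # The vocabulary is every distinct character in the training text.
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, s):
        # Map each character to its integer id.
        return [self.stoi[ch] for ch in s]

    def decode(self, ids):
        # Map ids back to characters and join into a string.
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("hello world")
ids = tok.encode("hello")
assert tok.decode(ids) == "hello"  # round-trip is lossless
```

BPE and tiktoken backends trade this per-character vocabulary for learned subword units, which shortens sequences at the cost of a larger vocabulary.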
Install the Python dependencies listed in `requirements.txt`.
To train a model on the complete works of Shakespeare:

```shell
mkdir input
curl -s https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt | \
  sed -E 's/\b[A-Z]+\b//g' | tail -n +241 | sed -E "s/^\s\s*//" > input/shakespeare.txt
```
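The pipeline above removes all-uppercase words (speaker names and stage cues), drops the first 240 lines of the file, and trims leading whitespace. For readers without `sed`, a rough Python equivalent of the same cleaning rules (the function is a sketch mirroring the pipeline, not code from this project):

```python
import re

def clean(text):
    # Remove all-uppercase words, mirroring: sed -E 's/\b[A-Z]+\b//g'
    text = re.sub(r"\b[A-Z]+\b", "", text)
    lines = text.splitlines()
    # Skip the first 240 lines of preamble, mirroring: tail -n +241
    lines = lines[240:]
    # Trim leading whitespace on each line, mirroring: sed -E "s/^\s\s*//"
    lines = [re.sub(r"^\s+", "", ln) for ln in lines]
    return "\n".join(lines)

sample = "\n" * 240 + "  HAMLET to be\n   or not"
print(clean(sample))  # -> "to be\nor not"
```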
```shell
python main.py training \
  --dataset input/shakespeare.txt \
  --model trained_model.pt \
  --tokenizer tokenizer.json \
  --tokenizer_type bpe
```

Tokenizer-specific training options are exposed with a `--<tokenizer>:<option>` prefix.
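The prefix convention means a flag such as the hypothetical `--bpe:vocab_size 1024` would be routed to the BPE backend while general flags pass through unchanged. A sketch of how such prefixed flags could be split out (the helper and the example flag names are illustrative assumptions, not the project's actual CLI code):

```python
def split_tokenizer_options(argv, tokenizer):
    """Separate --<tokenizer>:<option> flags from general flags.

    Illustrative helper, not the project's actual argument parsing.
    """
    prefix = f"--{tokenizer}:"
    general, specific = [], {}
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg.startswith(prefix):
            # The option's value is the next argv entry.
            specific[arg[len(prefix):]] = argv[i + 1]
            i += 2
        else:
            general.append(arg)
            i += 1
    return general, specific

general, bpe_opts = split_tokenizer_options(
    ["--dataset", "input/shakespeare.txt", "--bpe:vocab_size", "1024"], "bpe"
)
print(general)   # ['--dataset', 'input/shakespeare.txt']
print(bpe_opts)  # {'vocab_size': '1024'}
```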
```shell
python main.py generate \
  --model trained_model.pt \
  --tokenizer tokenizer.json \
  --tokens 200
```

See also:

- `docs/GPT_README.md` for the GPT architecture and training overview
- `docs/NeuralNetwork_README.md` for the underlying neural-network math
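Generation is autoregressive: each new token is sampled conditioned on what has been produced so far, appended to the sequence, and fed back in until the token budget is spent. A toy illustration of that loop, with a hard-coded bigram table standing in for the trained model (the table and names are made up for demonstration):

```python
# Toy autoregressive loop: a bigram lookup table stands in for the trained
# model. A real model predicts a distribution over the whole vocabulary
# from the full context and samples from it.
bigram = {"to": "be", "be": "or", "or": "not", "not": "to"}

def generate(start, n_tokens):
    out = [start]
    for _ in range(n_tokens):
        # Condition on the last token only (a real model sees the whole context).
        nxt = bigram.get(out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)

print(generate("to", 5))  # to be or not to be
```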
This project started as an example from a tutorial by Gabriel Merlo. The initial code can be found here.