A PyTorch implementation of a GPT-like language model with text preprocessing utilities.
This project implements a transformer-based language model similar to GPT, designed for character-level text generation. It includes utilities for vocabulary generation and dataset splitting.
In this example, I tested it on the fabulous book The Brothers Karamazov, downloaded from Project Gutenberg. Feel free to change the text file or even try training it on an established dataset (such as OpenWebText), though on larger datasets vocab.py and split.py might not work properly.
- Character-level language modeling
- Multi-head self-attention mechanism
- Memory-efficient data loading using memory mapping (see the sketch after this list)
- Text preprocessing utilities
- Configurable model architecture
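As a rough illustration of the memory-mapped loading mentioned above, batches can be drawn from a file without reading it fully into RAM. This is only a sketch under assumptions: the file names, the get_batch helper, and reading bytes directly as token ids are illustrative, not necessarily what the notebook does.

```python
import numpy as np
import torch

block_size = 128
batch_size = 32

def get_batch(split):
    # Hypothetical file names; the memmap keeps the full text on disk instead of in RAM.
    path = "train_split.txt" if split == "train" else "val_split.txt"
    data = np.memmap(path, dtype=np.uint8, mode="r")
    # Pick random starting offsets, then build input (x) and next-character target (y) batches.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x, y
```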
- Python 3.9+
- PyTorch
- Jupyter Notebooks
- CUDA (optional, for GPU acceleration on Windows)
- vocab.py - Generates vocabulary from input text
- split.py - Splits text data into training and validation sets
- GPT.ipynb - Main model implementation and training
Open a terminal in a directory of your choice.
Create a Python Virtual Environment and activate it:
python3 -m venv venv
source ./venv/bin/activate
Install the macOS requirements:
pip3 install -r requirements_macos.txt
Install Python on your system. If you have it already, skip this step.
Install Anaconda. Follow the steps from this link.
Once installed, run Anaconda Prompt in a directory of your choice.
Create a Python Virtual Environment and activate it:
python3 -m venv venv
venv\Scripts\activate
Install the Windows requirements:
pip3 install -r requirements_windows.txt
! These requirements are different: on Windows, PyTorch is installed with CUDA support, if available.
First, add your desired data file and generate the vocabulary from your text:
python3 vocab.py
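Conceptually, the vocabulary step collects the set of unique characters in the text. The snippet below is only a sketch of that idea; the file names (data.txt, vocab.txt) are assumptions and the actual vocab.py may differ.

```python
# Sketch of a character-level vocabulary builder (file names are assumptions).
chars = set()
with open("data.txt", "r", encoding="utf-8") as f:
    for line in f:
        chars.update(line)

with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("".join(sorted(chars)))
```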
Then, split your data into training and validation sets:
python3 split.py
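The split step divides the text into a training portion and a validation portion. A minimal sketch is shown below; the 90/10 ratio and file names are assumptions, not necessarily what split.py uses.

```python
# Sketch of a train/validation split (90/10 ratio and file names are assumptions).
with open("data.txt", "r", encoding="utf-8") as f:
    text = f.read()

n = int(0.9 * len(text))
with open("train_split.txt", "w", encoding="utf-8") as f:
    f.write(text[:n])
with open("val_split.txt", "w", encoding="utf-8") as f:
    f.write(text[n:])
```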
Install a new kernel to use in your Jupyter Notebook:
python3 -m ipykernel install --user --name=venv --display-name "GPTKernel"
Run Jupyter Notebook:
jupyter notebook
Open GPT.ipynb, select GPTKernel, and run the cells sequentially. The notebook contains:
- Model architecture implementation
- Training loop
- Text generation functionality
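To give an idea of the text generation step, a typical GPT-style sampling loop looks like the sketch below. It assumes a model whose forward pass returns logits of shape (batch, time, vocab_size); the notebook's actual function may differ (for example, it might also return a loss).

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=128):
    # idx holds the token indices of the current context, shape (batch, time).
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                      # crop context to the block size
        logits = model(idx_cond)                             # (batch, time, vocab_size)
        logits = logits[:, -1, :]                            # keep only the last position
        probs = torch.softmax(logits, dim=-1)                # turn logits into probabilities
        idx_next = torch.multinomial(probs, num_samples=1)   # sample the next token
        idx = torch.cat((idx, idx_next), dim=1)              # append it to the sequence
    return idx
```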
The default hyperparameters are:
- Batch size: 32
- Block size: 128
- Maximum training iterations: 300
- Learning rate: 2e-5
- Evaluation: every 50 iterations
- Embedding dimension: 300
- Number of heads: 4
- Number of layers: 4
- Dropout: 0.2
These can be adjusted based on your hardware capabilities and requirements.
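In the notebook these defaults typically appear as plain constants near the top; the variable names below are illustrative.

```python
# Default hyperparameters (adjust to your hardware).
batch_size = 32       # sequences per batch
block_size = 128      # context length in characters
max_iters = 300       # maximum training iterations
learning_rate = 2e-5
eval_interval = 50    # evaluate every 50 iterations
n_embd = 300          # embedding dimension
n_head = 4            # attention heads per block
n_layer = 4           # transformer blocks
dropout = 0.2
```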
The model implements a transformer architecture with:
- Multi-head self-attention
- Position embeddings
- Layer normalization
- Feed-forward networks
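For orientation, a single transformer block combining these components often looks like the PyTorch sketch below. This is not the code from the notebook: it uses nn.MultiheadAttention for brevity rather than a hand-written attention module, and the causal mask is left to the caller.

```python
import torch.nn as nn

class Block(nn.Module):
    """One transformer block: multi-head self-attention and a feed-forward
    network, each with layer normalization and a residual connection."""

    def __init__(self, n_embd, n_head, dropout):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x, attn_mask=None):
        # Pre-norm residual attention; a causal attn_mask keeps each position
        # from attending to future positions.
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + a
        # Pre-norm residual feed-forward network.
        x = x + self.ffwd(self.ln2(x))
        return x
```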
MIT