Distributed Training of GPT-2 with minGPT

This repo is an adaptation of Andrej Karpathy's minGPT project. It uses the @torchrun decorator together with @kubernetes on Metaflow to run distributed training of a minGPT model.

Many of the files in this example come directly from the minGPT project with minimal or no changes: gpt2_train_cfg.yaml, char_dataset.py, model.py, trainer.py, and main.py. flow.py and flow_oss.py run minGPT's CLI script through Metaflow's @torchrun decorator.
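The README does not show what happens inside the @torchrun decorator, so the following is only a rough sketch of what a multi-node torchrun launch of main.py generally looks like on each node. The helper name build_torchrun_command, the node counts, the address 10.0.0.1, and the port are all hypothetical and chosen for illustration; the flags themselves are standard torchrun options.

```python
import shlex


def build_torchrun_command(entrypoint, nnodes, nproc_per_node, node_rank,
                           master_addr, master_port=29500):
    """Return the argv list for launching `entrypoint` with torchrun on one node.

    Hypothetical helper for illustration only; it is not part of this repo
    or of Metaflow.
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # total number of nodes in the job
        f"--nproc_per_node={nproc_per_node}",  # worker processes (usually GPUs) per node
        f"--node_rank={node_rank}",            # index of this node, 0 .. nnodes-1
        f"--master_addr={master_addr}",        # address of the rank-0 node (assumed value)
        f"--master_port={master_port}",        # rendezvous port (assumed value)
        entrypoint,
    ]


# Print the command each node of a hypothetical 2-node job would run.
for rank in range(2):
    cmd = build_torchrun_command("main.py", nnodes=2, nproc_per_node=1,
                                 node_rank=rank, master_addr="10.0.0.1")
    print(shlex.join(cmd))
```

Every node runs the same command apart from its node rank, which is why a decorator that knows the node index and the address of the first node can take this bookkeeping off the user's hands.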

Running with open-source Metaflow on Kubernetes

python flow_oss.py run

Running on the Outerbounds Platform

python flow.py --environment=fast-bakery run
