# Pico: Tiny Language Models for Learning Dynamics Research

> 🚧 **Coming Soon!** Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released through our [HuggingFace organization](https://huggingface.co/pico-lm) in January 2025.

Pico is a framework designed to facilitate research into language model learning dynamics through a comprehensive suite of small- to medium-scale models (1M-1B parameters). Built on a LLAMA-style architecture, Pico emphasizes simplicity, modularity, and research accessibility.

The framework serves two key purposes:

1. **Pre-trained Model Suite**: Access our complete suite of models trained on 420B tokens
2. **Training Framework**: Easily train your own model suite from scratch with minimal setup

This dual-purpose design means researchers can either:

- Use our pre-trained models and checkpoints for immediate analysis
- Train their own suite of models to test specific hypotheses or explore different architectures

## 🔄 Training Philosophy

All models in a Pico suite (whether our pre-trained ones or your custom-trained ones):

- Share identical architectures and optimizers
- Train on the same tokens in identical order
- Save rich checkpoint data, including activations and gradients
- Enable direct comparisons across model scales

## 📦 Resources

All our pre-trained models and datasets are publicly available through our [HuggingFace organization](https://huggingface.co/pico-lm):

- Pre-trained models (1M to 1B parameters)
- Pre-tokenized training data derived from the DOLMA corpus
- Training checkpoints with activation and gradient information
- Basic evaluation (perplexity) metrics logged throughout training

## 🌟 Why Pico?

Unlike other model suites, Pico is specifically designed for learning dynamics research:

1. **Focused Scale Range**: Covers the critical 1M-1B parameter range where most learning dynamics research is feasible
2. **Consistent Training**: All models see identical data in identical order, enabling true cross-scale comparisons
3. **Rich Analytics**: Automatic saving of activations and gradients for mechanistic interpretability
4. **Research Ready**: Minimal, well-documented code designed to be forked and modified
5. **Clean Data**: Uses a curated, pre-shuffled version of the DOLMA corpus
6. **Train Your Own**: Simple pipeline for training your own suite of models with custom configurations

## 🔑 Key Features

- **SwiGLU activation** function
- **Residual connections** throughout

## 🚀 Quick Start

1. **Clone Project**

```bash
git clone https://github.com/rdiehlmartinez/pico.git && cd pico
```

2. **Configure Environment**

Create a `.env` file in the project root with your credentials:

```bash
export HF_TOKEN=your_huggingface_token
export WANDB_API_KEY=your_wandb_key
```

3. **Setup Dependencies**

```bash
source setup.sh
```

### Exploring the Codebase

The core implementation is organized into these key files:

- **`model.py`**: The heart of Pico
  - LLAMA-style transformer implementation
  - Attention mechanism with KV-cache
  - RoPE positional embeddings
  - Documentation references for each component
- **`train.py`**: Training pipeline
  - Distributed training setup
  - Checkpoint management
  - Logging configuration
- **`config.py`**: Model configuration
  - Hyperparameter definitions
  - Model architecture settings
  - Training parameters

### Common Starting Points

1. **Using Pre-trained Models** (see the generation sketch below)

```python
from transformers import AutoModelForCausalLM

# Load a specific model size
model = AutoModelForCausalLM.from_pretrained("pico-lm/[...]")
```

2. **Training Your Own Suite**

```bash
# Edit configs/train.yaml to customize your training
python train.py --config configs/train.yaml
```
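To expand on starting point 1, here is a minimal sanity check that loads a model together with its tokenizer and samples a short continuation. The repo name `pico-lm/pico-small` is a hypothetical placeholder (actual model names will be listed on the [HuggingFace organization](https://huggingface.co/pico-lm) page); the rest is the standard `transformers` generation API:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# NOTE: "pico-lm/pico-small" is a hypothetical placeholder -- substitute
# the name of a released model from https://huggingface.co/pico-lm
MODEL_NAME = "pico-lm/pico-small"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Encode a prompt and sample a short continuation
inputs = tokenizer("Language models learn", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```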
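The pre-tokenized training data listed under Resources can be explored the same way. The sketch below streams the corpus with the `datasets` library instead of downloading all 420B tokens; the dataset repo name and record schema are assumptions, so check the [HuggingFace organization](https://huggingface.co/pico-lm) for the actual names:

```python
from datasets import load_dataset

# NOTE: "pico-lm/pretokenized-dolma" is a hypothetical placeholder -- look up
# the actual pre-tokenized DOLMA dataset on https://huggingface.co/pico-lm
stream = load_dataset("pico-lm/pretokenized-dolma", split="train", streaming=True)

# Peek at the first record to inspect its schema (field names vary by dataset)
first = next(iter(stream))
print(first.keys())
```

Because every model in a suite trains on these tokens in identical order, streaming the corpus also lets you replay the exact data each checkpoint had seen.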
## 📊 Coming Soon: Pico Analysis

A companion framework for analyzing Pico checkpoints:

- Mechanistic interpretability tools
- Learning dynamics visualization
- Cross-scale model comparisons
- Training trajectory analysis

## 📚 References

Our implementation draws inspiration from and builds upon:

## 🤝 Contributing

We welcome contributions in:

- New features and improvements
- Documentation and tutorials
- Bug fixes and testing
- Research findings and analysis

## 📝 License

Apache 2.0 License

## 📫 Contact

- GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
- Author: [Richard Diehl Martinez](https://richarddiehlmartinez.com)

## 🔍 Citation

If you use Pico in your research, please cite:

```bibtex
@software{pico2024,
  author = {Martinez, Richard Diehl},
  title = {Pico: Framework for Training Tiny Language Models},
  year = {2024},
}
```