Adding Paloma Evaluation + Restructure of Model and Train Loops #13

Merged
merged 12 commits into from
Dec 11, 2024
2 changes: 2 additions & 0 deletions .gitignore
@@ -26,6 +26,8 @@ share/python-wheels/
*.egg
MANIFEST

poetry.lock

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
125 changes: 114 additions & 11 deletions README.md
@@ -1,6 +1,43 @@
# 🎯 Pico: Tiny Language Models for Learning Dynamics Research
# Pico: Tiny Language Models for Learning Dynamics Research

Pico is a framework for training and analyzing small language models, designed with clarity and educational purposes in mind. Built on a LLAMA-style architecture, Pico makes it easy to experiment with and understand transformer-based language models.
> 🚧 **Coming Soon!** Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released through our [HuggingFace organization](https://huggingface.co/pico-lm) in January 2025.

Pico is a framework designed to facilitate research into language model learning dynamics through a comprehensive suite of small to medium-scale models (1M-1B parameters). Built on a LLAMA-style architecture, Pico emphasizes simplicity, modularity, and research accessibility.

The framework serves two key purposes:
1. **Pre-trained Model Suite**: Access our complete suite of models trained on 420B tokens
2. **Training Framework**: Easily train your own model suite from scratch with minimal setup

This dual-purpose design means researchers can either:
- Use our pre-trained models and checkpoints for immediate analysis
- Train their own suite of models to test specific hypotheses or explore different architectures

## 🔄 Training Philosophy

All models in a Pico suite (whether our pre-trained ones or your custom-trained ones):
- Share identical architectures and optimizers
- Train on the same tokens in identical order
- Save rich checkpoint data including activations and gradients
- Enable direct comparisons across model scales

## 📦 Resources

All our pre-trained models and datasets are publicly available through our [HuggingFace organization](https://huggingface.co/pico-lm):
- Pre-trained models (1M to 1B parameters)
- Pre-tokenized training data derived from the DOLMA corpus (see the loading sketch after this list)
- Training checkpoints with activation and gradient information
- Basic evaluation (perplexity) metrics logged throughout training
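
For example, the pre-tokenized data listed above can be streamed directly from the Hub. This is only a sketch: the dataset identifier below is a placeholder (check the [HuggingFace organization](https://huggingface.co/pico-lm) for the released names), and the exact column layout of the examples is not documented here.

```python
from datasets import load_dataset

# Placeholder repo id -- see https://huggingface.co/pico-lm for the released datasets
dataset = load_dataset("pico-lm/[...]", split="train", streaming=True)

# Inspect the first pre-tokenized example without downloading the full corpus
print(next(iter(dataset)))
```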

## 🌟 Why Pico?

Unlike other model suites, Pico is specifically designed for learning dynamics research:

1. **Focused Scale Range**: Covers the critical 1M-1B parameter range where most learning dynamics research is feasible
2. **Consistent Training**: All models see identical data in identical order, enabling true cross-scale comparisons
3. **Rich Analytics**: Automatic saving of activations and gradients for mechanistic interpretability
4. **Research Ready**: Minimal, well-documented code designed to be forked and modified
5. **Clean Data**: Uses a curated, pre-shuffled version of the DOLMA corpus
6. **Train Your Own**: Simple pipeline for training your own suite of models with custom configurations

## 🔑 Key Features

@@ -18,6 +55,69 @@ Pico is a framework for training and analyzing small language models, designed w
- **SwiGLU activation** function (see the sketch after this list)
- **Residual connections** throughout
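
As a rough illustration of the SwiGLU activation mentioned above, here is a generic PyTorch sketch of the standard formulation. It is not the exact code in `src/model/pico.py`, and the module and layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Generic SwiGLU feed-forward block: down(silu(gate(x)) * up(x))."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # Illustrative layer names; the real module in pico.py may differ
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-gated linear unit, then project back to the model dimension
        return self.down(F.silu(self.gate(x)) * self.up(x))
```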

## 🚀 Quick Start

1. **Clone Project**
```bash
git clone https://github.com/rdiehlmartinez/pico.git && cd pico
```

2. **Configure Environment**
Create a `.env` file:
```bash
export HF_TOKEN=your_huggingface_token
export WANDB_API_KEY=your_wandb_key
```

3. **Set Up Dependencies**
```bash
source setup.sh
```

### Exploring the Codebase

The core implementation is organized into these key files and packages:

- **`src/model/pico.py`**: The heart of Pico
- LLAMA-style transformer implementation
- Attention mechanism with KV-cache
- RoPE positional embeddings (see the sketch after this list)
- Documentation references for each component

- **`src/training/trainer.py`**: Training pipeline
- Distributed training setup
- Checkpoint management
- Logging configuration

- **`src/config`**: Model configuration
- Hyperparameter definitions
- Model architecture settings
- Training parameters

### Common Starting Points

1. **Using Pre-trained Models**
```python
from transformers import AutoModelForCausalLM

# Load a specific model size
model = AutoModelForCausalLM.from_pretrained("pico-lm/[...]")
```

2. **Training Your Own Suite**
```bash
# Edit configs/train.yaml to customize your training
poetry run train --config_path configs/train.yaml
```
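
Continuing from step 1, the snippet below sketches text generation with the `model` loaded there. It assumes, without confirming, that each released model repository also ships a matching tokenizer; the repo id placeholder is kept from the example above.

```python
from transformers import AutoTokenizer

# Assumes a tokenizer is published alongside the model weights (not confirmed here)
tokenizer = AutoTokenizer.from_pretrained("pico-lm/[...]")

inputs = tokenizer("Language model learning dynamics", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```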


## 📊 Coming Soon: Pico Analysis

A companion framework for analyzing Pico checkpoints:
- Mechanistic interpretability tools
- Learning dynamics visualization
- Cross-scale model comparisons
- Training trajectory analysis

## 📚 References

Our implementation draws inspiration from and builds upon:
@@ -27,11 +127,12 @@ Our implementation draws inspiration from and builds upon:

## 🤝 Contributing

We welcome contributions! Whether it's:
- Adding new features
- Improving documentation
- Fixing bugs
- Sharing experimental results
We welcome contributions in:
- New features and improvements
- Documentation and tutorials
- Bug fixes and testing
- Research findings and analysis


## 📝 License

@@ -40,14 +141,16 @@ Apache 2.0 License
## 📫 Contact

- GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
- Author: Richard Diehl Martinez
- Author: [Richard Diehl Martinez](https://richarddiehlmartinez.com)

## 🔍 Citation

If you use Pico in your research, please cite:

```bibtex
@software{pico2024,
author = {Martinez, Richard Diehl},
title = {Pico: Framework for Training Tiny Language Models},
year = {2024},
author = {Diehl Martinez, Richard},
title = {Pico: Framework for Training Tiny Language Models},
year = {2024},
}
```
182 changes: 0 additions & 182 deletions config.py

This file was deleted.
