Adding Paloma Evaluation + Restructure of Model and Train Loops #13

Merged
merged 12 commits into from
Dec 11, 2024
2 changes: 2 additions & 0 deletions .gitignore
@@ -26,6 +26,8 @@ share/python-wheels/
*.egg
MANIFEST

poetry.lock

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
125 changes: 114 additions & 11 deletions README.md
@@ -1,6 +1,43 @@
# 🎯 Pico: Tiny Language Models for Learning Dynamics Research
# Pico: Tiny Language Models for Learning Dynamics Research

Pico is a framework for training and analyzing small language models, designed with clarity and educational purposes in mind. Built on a LLAMA-style architecture, Pico makes it easy to experiment with and understand transformer-based language models.
> 🚧 **Coming Soon!** Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released through our [HuggingFace organization](https://huggingface.co/pico-lm) in January 2025.

Pico is a framework designed to facilitate research into language model learning dynamics through a comprehensive suite of small to medium-scale models (1M-1B parameters). Built on a LLAMA-style architecture, Pico emphasizes simplicity, modularity, and research accessibility.

The framework serves two key purposes:
1. **Pre-trained Model Suite**: Access our complete suite of models trained on 420B tokens
2. **Training Framework**: Easily train your own model suite from scratch with minimal setup

This dual-purpose design means researchers can either:
- Use our pre-trained models and checkpoints for immediate analysis
- Train their own suite of models to test specific hypotheses or explore different architectures

## 🔄 Training Philosophy

All models in a Pico suite (whether our pre-trained ones or your custom-trained ones):
- Share identical architectures and optimizers
- Train on the same tokens in identical order
- Save rich checkpoint data including activations and gradients
- Enable direct comparisons across model scales

## 📦 Resources

All our pre-trained models and datasets are publicly available through our [HuggingFace organization](https://huggingface.co/pico-lm):
- Pre-trained models (1M to 1B parameters)
- Pre-tokenized training data derived from the DOLMA corpus (see the loading sketch after this list)
- Training checkpoints with activation and gradient information
- Basic evaluation (perplexity) metrics logged throughout training
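
For example, the pre-tokenized data listed above can be streamed directly from the Hub. This is only a sketch: the dataset identifier below is a placeholder (check the [HuggingFace organization](https://huggingface.co/pico-lm) for the released names), and the exact column layout of the examples is not documented here.

```python
from datasets import load_dataset

# Placeholder repo id -- see https://huggingface.co/pico-lm for the released datasets
dataset = load_dataset("pico-lm/[...]", split="train", streaming=True)

# Inspect the first pre-tokenized example without downloading the full corpus
print(next(iter(dataset)))
```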

## 🌟 Why Pico?

Unlike other model suites, Pico is specifically designed for learning dynamics research:

1. **Focused Scale Range**: Covers the critical 1M-1B parameter range where most learning dynamics research is feasible
2. **Consistent Training**: All models see identical data in identical order, enabling true cross-scale comparisons
3. **Rich Analytics**: Automatic saving of activations and gradients for mechanistic interpretability
4. **Research Ready**: Minimal, well-documented code designed to be forked and modified
5. **Clean Data**: Uses a curated, pre-shuffled version of the DOLMA corpus
6. **Train Your Own**: Simple pipeline for training your own suite of models with custom configurations

## 🔑 Key Features

@@ -18,6 +55,69 @@ Pico is a framework for training and analyzing small language models, designed w
- **SwiGLU activation** function (see the sketch after this list)
- **Residual connections** throughout
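
As a rough illustration of the SwiGLU activation mentioned above, here is a generic PyTorch sketch of the standard formulation. It is not the exact code in `src/model/pico.py`, and the module and layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Generic SwiGLU feed-forward block: down(silu(gate(x)) * up(x))."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # Illustrative layer names; the real module in pico.py may differ
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-gated linear unit, then project back to the model dimension
        return self.down(F.silu(self.gate(x)) * self.up(x))
```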

## 🚀 Quick Start

1. **Clone Project**
```bash
git clone https://github.com/rdiehlmartinez/pico.git && cd pico
```

2. **Configure Environment**
Create a `.env` file:
```bash
export HF_TOKEN=your_huggingface_token
export WANDB_API_KEY=your_wandb_key
```

3. **Set Up Dependencies**
```bash
source setup.sh
```

### Exploring the Codebase

The core implementation is organized into these key files and packages:

- **`src/model/pico.py`**: The heart of Pico
- LLAMA-style transformer implementation
- Attention mechanism with KV-cache
- RoPE positional embeddings (see the sketch after this list)
- Documentation references for each component

- **`src/training/trainer.py`**: Training pipeline
- Distributed training setup
- Checkpoint management
- Logging configuration

- **`src/config`**: Model configuration
- Hyperparameter definitions
- Model architecture settings
- Training parameters

### Common Starting Points

1. **Using Pre-trained Models**
```python
from transformers import AutoModelForCausalLM

# Load a specific model size
model = AutoModelForCausalLM.from_pretrained("pico-lm/[...]")
```

2. **Training Your Own Suite**
```bash
# Edit configs/train.yaml to customize your training
poetry run train --config_path configs/train.yaml
```
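
Continuing from step 1, the snippet below sketches text generation with the `model` loaded there. It assumes, without confirming, that each released model repository also ships a matching tokenizer; the repo id placeholder is kept from the example above.

```python
from transformers import AutoTokenizer

# Assumes a tokenizer is published alongside the model weights (not confirmed here)
tokenizer = AutoTokenizer.from_pretrained("pico-lm/[...]")

inputs = tokenizer("Language model learning dynamics", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```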


## 📊 Coming Soon: Pico Analysis

A companion framework for analyzing Pico checkpoints:
- Mechanistic interpretability tools
- Learning dynamics visualization
- Cross-scale model comparisons
- Training trajectory analysis

## 📚 References

Our implementation draws inspiration from and builds upon:
@@ -27,11 +127,12 @@ Our implementation draws inspiration from and builds upon:

## 🤝 Contributing

We welcome contributions! Whether it's:
- Adding new features
- Improving documentation
- Fixing bugs
- Sharing experimental results
We welcome contributions in:
- New features and improvements
- Documentation and tutorials
- Bug fixes and testing
- Research findings and analysis


## 📝 License

@@ -40,14 +141,16 @@ Apache 2.0 License
## 📫 Contact

- GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
- Author: Richard Diehl Martinez
- Author: [Richard Diehl Martinez](https://richarddiehlmartinez.com)

## 🔍 Citation

If you use Pico in your research, please cite:

```bibtex
@software{pico2024,
author = {Martinez, Richard Diehl},
title = {Pico: Framework for Training Tiny Language Models},
year = {2024},
author = {Diehl Martinez, Richard},
title = {Pico: Framework for Training Tiny Language Models},
year = {2024},
}
```
182 changes: 0 additions & 182 deletions config.py

This file was deleted.
