Commit fc87cd1: fleshing out README with more information
rdiehlmartinez committed Dec 3, 2024 (parent f2827c9)
Showing 1 changed file: README.md (114 additions, 11 deletions)

# Pico: Tiny Language Models for Learning Dynamics Research

> 🚧 **Coming Soon!** Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released through our [HuggingFace organization](https://huggingface.co/pico-lm) in January 2025.

Pico is a framework designed to facilitate research into language model learning dynamics through a comprehensive suite of small- to medium-scale models (1M-1B parameters). Built on a LLAMA-style architecture, Pico emphasizes simplicity, modularity, and research accessibility.

The framework serves two key purposes:
1. **Pre-trained Model Suite**: Access our complete suite of models trained on 420B tokens
2. **Training Framework**: Easily train your own model suite from scratch with minimal setup

This dual-purpose design means researchers can either:
- Use our pre-trained models and checkpoints for immediate analysis
- Train their own suite of models to test specific hypotheses or explore different architectures

## 🔄 Training Philosophy

All models in a Pico suite (whether our pre-trained ones or your own custom-trained ones):
- Share identical architectures and optimizers
- Train on the same tokens in identical order
- Save rich checkpoint data including activations and gradients
- Enable direct comparisons across model scales
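
As a rough illustration of the kind of cross-scale comparison this enables, the sketch below loads two differently sized models from the Hub and reports their parameter counts and perplexity on the same sentence. The repo IDs are placeholders rather than confirmed checkpoint names, and the snippet assumes the standard `transformers` causal-LM interface.

```python
# Illustrative sketch only: the model IDs below are placeholders -- substitute
# the actual names published under https://huggingface.co/pico-lm.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_IDS = ["pico-lm/[small-model]", "pico-lm/[large-model]"]
TEXT = "Language models learn surprisingly structured representations."

for model_id in MODEL_IDS:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id).eval()

    inputs = tokenizer(TEXT, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids gives the mean next-token cross-entropy;
        # exp(loss) is the perplexity on this text.
        loss = model(**inputs, labels=inputs["input_ids"]).loss

    n_params = sum(p.numel() for p in model.parameters())
    print(f"{model_id}: {n_params / 1e6:.1f}M params, perplexity {loss.exp():.2f}")
```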

## 📦 Resources

All our pre-trained models and datasets are publicly available through our [HuggingFace organization](https://huggingface.co/pico-lm):
- Pre-trained models (1M to 1B parameters)
- Pre-tokenized training data derived from the DOLMA corpus
- Training checkpoints with activation and gradient information
- Basic evaluation (perplexity) metrics logged throughout training
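
For example, the pre-tokenized data can be streamed directly from the Hub with the `datasets` library. The dataset repo ID below is an assumption for illustration; check the organization page for the actual identifier.

```python
# Illustrative sketch: stream the pre-tokenized training data from the Hub.
# The repo ID "pico-lm/pretokenized-dolma" is an assumed placeholder.
from itertools import islice

from datasets import load_dataset

dataset = load_dataset(
    "pico-lm/pretokenized-dolma",  # placeholder repo ID
    split="train",
    streaming=True,                # avoid downloading the full corpus
)

# Peek at the first couple of examples to see the available columns.
for example in islice(dataset, 2):
    print(example.keys())
```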

## 🌟 Why Pico?

Unlike other model suites, Pico is specifically designed for learning dynamics research:

1. **Focused Scale Range**: Covers the critical 1M-1B parameter range where most learning dynamics research is feasible
2. **Consistent Training**: All models see identical data in identical order, enabling true cross-scale comparisons
3. **Rich Analytics**: Automatic saving of activations and gradients for mechanistic interpretability
4. **Research Ready**: Minimal, well-documented code designed to be forked and modified
5. **Clean Data**: Uses a curated, pre-shuffled version of the DOLMA corpus
6. **Train Your Own**: Simple pipeline for training your own suite of models with custom configurations

## 🔑 Key Features

- **SwiGLU activation** function
- **Residual connections** throughout
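
For readers new to these components, here is a minimal PyTorch sketch of a SwiGLU feed-forward block wrapped in a residual connection. Layer names and dimensions are illustrative and are not taken from Pico's `model.py`.

```python
# Minimal SwiGLU feed-forward sketch (illustrative; not Pico's exact code).
# SwiGLU(x) = W_down(SiLU(W_gate(x)) * W_up(x)), used inside a residual block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 64)                        # (batch, seq, d_model)
ffn = SwiGLUFeedForward(d_model=64, d_hidden=256)
out = x + ffn(x)                                  # residual connection
print(out.shape)                                  # torch.Size([2, 16, 64])
```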

## 🚀 Quick Start

1. **Clone Project**
```bash
git clone https://github.com/rdiehlmartinez/pico.git && cd pico
```

2. **Configure Environment**
Create a `.env` file:
```bash
export HF_TOKEN=your_huggingface_token
export WANDB_API_KEY=your_wandb_key
```

3. **Set Up Dependencies**
```bash
source setup.sh
```

### Exploring the Codebase

The core implementation is organized into these key files:

- **`model.py`**: The heart of Pico
- LLAMA-style transformer implementation
- Attention mechanism with KV-cache
- RoPE positional embeddings (see the sketch after this list)
- Documentation references for each component

- **`train.py`**: Training pipeline
- Distributed training setup
- Checkpoint management
- Logging configuration

- **`config.py`**: Model configuration
- Hyperparameter definitions
- Model architecture settings
- Training parameters
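
As referenced in the `model.py` bullet above, the simplified sketch below shows the core rotation that rotary positional embeddings (RoPE) apply to query and key vectors. It is an illustration of the technique, not the exact implementation in `model.py`.

```python
# Simplified rotary positional embedding (RoPE) sketch -- an illustration of
# the technique, not the exact implementation in model.py.
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x (batch, seq, n_heads, head_dim) by
    position-dependent angles, encoding position directly in queries/keys."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2

    # One frequency per channel pair, one angle per (position, pair).
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos = angles.cos()[None, :, None, :]  # (1, seq, 1, half)
    sin = angles.sin()[None, :, None, :]

    x1, x2 = x[..., :half], x[..., half:]
    # Standard 2D rotation applied to each (x1, x2) channel pair.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 8, 4, 64)  # (batch, seq, heads, head_dim)
print(apply_rope(q).shape)    # torch.Size([1, 8, 4, 64])
```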

### Common Starting Points

1. **Using Pre-trained Models** (see also the generation sketch after this list)
```python
from transformers import AutoModelForCausalLM

# Load a specific model size
model = AutoModelForCausalLM.from_pretrained("pico-lm/[...]")
```

2. **Training Your Own Suite**
```bash
# Edit configs/train.yaml to customize your training
python train.py --config configs/train.yaml
```
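
Building on the first starting point above, a pre-trained checkpoint can also be used for text generation through the usual `transformers` API. The model ID below is a placeholder; substitute one of the released Pico models.

```python
# Illustrative generation sketch; "pico-lm/[...]" is a placeholder model ID.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pico-lm/[...]"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Learning dynamics research studies", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```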


## 📊 Coming Soon: Pico Analysis

A companion framework for analyzing Pico checkpoints:
- Mechanistic interpretability tools
- Learning dynamics visualization
- Cross-scale model comparisons
- Training trajectory analysis

## 📚 References

Our implementation draws inspiration from and builds upon:

## 🤝 Contributing

We welcome contributions in:
- New features and improvements
- Documentation and tutorials
- Bug fixes and testing
- Research findings and analysis


## 📝 License

Apache 2.0 License
## 📫 Contact

- GitHub: [rdiehlmartinez/pico](https://github.com/rdiehlmartinez/pico)
- Author: [Richard Diehl Martinez](https://richarddiehlmartinez.com)

## 🔍 Citation

If you use Pico in your research, please cite:

```bibtex
@software{pico2024,
  author = {Martinez, Richard Diehl},
  title = {Pico: Framework for Training Tiny Language Models},
  year = {2024},
}
```
