Pico: Tiny Language Models for Learning Dynamics Research

🚧 Coming Soon! Our complete suite of pre-trained models (1M to 1B parameters) is currently being trained and will be released through our HuggingFace organization in January 2025.

Pico is a framework designed to facilitate research into language model learning dynamics through a comprehensive suite of small to medium-scale models (1M-1B parameters). Built on a LLAMA-style architecture, Pico emphasizes simplicity, modularity, and research accessibility.

The framework serves two key purposes:

  1. Pre-trained Model Suite: Access our complete suite of models trained on 420B tokens
  2. Training Framework: Easily train your own model suite from scratch with minimal setup

This dual-purpose design means researchers can either:

  • Use our pre-trained models and checkpoints for immediate analysis
  • Train their own suite of models to test specific hypotheses or explore different architectures

🔄 Training Philosophy

All models in a Pico suite (whether our pre-trained ones or your custom-trained ones):

  • Share identical architectures and optimizers
  • Train on the same tokens in identical order
  • Save rich checkpoint data including activations and gradients
  • Enable direct comparisons across model scales
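Because every model in a suite sees the same tokens in the same order, checkpoints taken at the same training step are directly comparable across scales. The sketch below illustrates that kind of comparison; the checkpoint paths and file names are hypothetical placeholders, not Pico's actual on-disk layout.

import torch

# Hypothetical checkpoint paths -- adjust to wherever your suite's
# checkpoints actually live; the layout shown here is illustrative only.
CHECKPOINTS = {
    "pico-small": "checkpoints/pico-small/step_1000/model.pt",
    "pico-large": "checkpoints/pico-large/step_1000/model.pt",
}

# Same step, same data order => the snapshots are directly comparable.
for name, path in CHECKPOINTS.items():
    state_dict = torch.load(path, map_location="cpu")
    param_norm = sum(p.float().norm() ** 2 for p in state_dict.values()).sqrt().item()
    print(f"{name} @ step 1000: global parameter norm = {param_norm:.2f}")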

📦 Resources

All our pre-trained models and datasets are publicly available through our HuggingFace organization:

  • Pre-trained models (1M to 1B parameters)
  • Pre-tokenized training data derived from the DOLMA corpus
  • Training checkpoints with activation and gradient information
  • Basic evaluation (perplexity) metrics logged throughout training
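As a minimal sketch of pulling these resources from the Hub (assuming standard HuggingFace conventions; the repository names below are placeholders for a concrete model size and dataset):

from datasets import load_dataset
from transformers import AutoModelForCausalLM

# Placeholders -- substitute a concrete model / dataset repository from
# the pico-lm HuggingFace organization.
model = AutoModelForCausalLM.from_pretrained("pico-lm/[...]")

# The training data is pre-tokenized, so examples contain token ids rather
# than raw text; streaming avoids downloading the full corpus up front.
dataset = load_dataset("pico-lm/[...]", split="train", streaming=True)
print(next(iter(dataset)))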

🌟 Why Pico?

Unlike other model suites, Pico is specifically designed for learning dynamics research:

  1. Focused Scale Range: Covers the critical 1M-1B parameter range where most learning dynamics research is feasible
  2. Consistent Training: All models see identical data in identical order, enabling true cross-scale comparisons
  3. Rich Analytics: Automatic saving of activations and gradients for mechanistic interpretability
  4. Research Ready: Minimal, well-documented code designed to be forked and modified
  5. Clean Data: Uses a curated, pre-shuffled version of the DOLMA corpus
  6. Train Your Own: Simple pipeline for training your own suite of models with custom configurations

🔑 Key Features

  • Simple Architecture: Clean, modular implementation of core transformer components
  • Educational Focus: Well-documented code with clear references to academic papers
  • Research Ready: Built-in tools for storing and studying model learning dynamics
  • Efficient Training: Pre-tokenized dataset and optimized training loop
  • Modern Stack: Built with PyTorch Lightning, Wandb, and HuggingFace integrations

🏗️ Core Components

  • RMSNorm for stable layer normalization
  • Rotary Positional Embeddings (RoPE) for position encoding
  • Multi-head attention with KV-cache support
  • SwiGLU activation function
  • Residual connections throughout
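These components live in src/model/pico.py. For orientation, here is a condensed sketch of two of them, RMSNorm and SwiGLU; it paraphrases the standard formulations rather than reproducing Pico's exact code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: rescale by the RMS of the activations
    instead of subtracting a mean (Zhang & Sennrich, 2019)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: a SiLU-gated up-projection followed by a
    down-projection back to the model dimension (Shazeer, 2020)."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))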

🚀 Quick Start

  1. Clone Project
git clone https://github.com/rdiehlmartinez/pico.git && cd pico
  2. Configure Environment: create a .env file with your API keys:
export HF_TOKEN=your_huggingface_token
export WANDB_API_KEY=your_wandb_key
  3. Setup Dependencies
source setup.sh

Exploring the Codebase

The core implementation is organized into these key files and packages:

  • src/model/pico.py: The heart of Pico

    • LLAMA-style transformer implementation
    • Attention mechanism with KV-cache
    • RoPE positional embeddings
    • Documentation references for each component
  • src/training/trainer.py: Training pipeline

    • Distributed training setup
    • Checkpoint management
    • Logging configuration
  • src/config: Model configuration

    • Hyperparameter definitions
    • Model architecture settings
    • Training parameters
  • src/checkpointing: Checkpointing and State Management

    • Training state persistence (model, optimizer, scheduler)
    • Learning dynamics tracking (activations, weights, gradients)
    • Evaluation results storage
    • Automatic storage of a HuggingFace-compatible version of the model for downstream use
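To give a feel for how this stored learning-dynamics data might be consumed downstream, the sketch below walks a checkpoint directory and summarizes what was saved at each step. The directory layout and file names are hypothetical, not the exact format written by src/checkpointing.

from pathlib import Path

import torch

# Hypothetical layout: one sub-directory per saved step, each holding a
# file of captured activations / weights / gradients.
run_dir = Path("checkpoints/my_run")

for step_dir in sorted(run_dir.glob("step_*")):
    dynamics_file = step_dir / "learning_dynamics.pt"
    if not dynamics_file.exists():
        continue
    dynamics = torch.load(dynamics_file, map_location="cpu")
    # Print the name and shape of every captured tensor for this step.
    print(step_dir.name, {k: tuple(v.shape) for k, v in dynamics.items()})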

Common Starting Points

  1. Using Pre-trained Models
from transformers import AutoModelForCausalLM

# Load a specific model size
model = AutoModelForCausalLM.from_pretrained("pico-lm/[...]")
  2. Training Your Own Suite
# Create a config yaml file, e.g. `my_config.yaml`
# You can follow the provided demo template in configs/demo.yaml
# If no config file is provided, the default config values are used
poetry run train --config_path my_config.yaml
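Building on starting point 1 above, here is a short example of running a loaded model for generation; the repository name is a placeholder, and the tokenizer is assumed to be published alongside the model.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "pico-lm/[...]"  # placeholder -- pick a concrete model size
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Language models learn", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))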

📊 Coming Soon: Pico Analysis

A companion framework for analyzing Pico checkpoints:

  • Mechanistic interpretability tools
  • Learning dynamics visualization
  • Cross-scale model comparisons
  • Training trajectory analysis

📚 References

Our implementation draws inspiration from and builds upon:

🤝 Contributing

We welcome contributions in:

  • New features and improvements
  • Documentation and tutorials
  • Bug fixes and testing
  • Research findings and analysis

📝 License

Apache 2.0 License

📫 Contact

Citation

If you use Pico in your research, please cite:

@software{pico2024,
    author = {Diehl Martinez, Richard},
    title = {Pico: Framework for Training Tiny Language Models},
    year = {2024},
}
