Circuit Tracing for Small Language Models

This project implements the Sparse Crosscoder approach described in "Sparse Crosscoders for Cross-Layer Features and Model Diffing" for analyzing circuits within small language models such as Qwen2.5-3B.

Overview

Sparse Crosscoders enable us to understand how language models process information internally by:

  1. Identifying interpretable features across different layers of the model
  2. Tracking how these features interact to form computational circuits
  3. Visualizing attribution graphs showing information flow within the model

Unlike traditional approaches that analyze each layer independently, Crosscoders can discover features that span multiple layers, revealing how complex computations are distributed throughout the model.

Core Concepts

1. Cross-Layer Transcoder (CLT)

The Cross-Layer Transcoder is a sparse autoencoder that reads from the residual stream at one layer and reconstructs the MLP outputs of that layer and all subsequent layers. It learns interpretable features that track how information flows through the model; a minimal PyTorch sketch follows the key equations below.

Key equations:

  • Feature Activation: a_l = JumpReLU(W_enc_l · x_l)
  • Layer Reconstruction: ŷ_l = ∑_{l'=1}^{l} W_dec_{l'→l} · a_{l'}
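
A minimal PyTorch sketch of these two equations (the class name, module layout, and parameter names such as n_features and threshold are illustrative assumptions, not the repository's actual API):

import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    """Sketch of a cross-layer transcoder: one encoder per layer and one
    decoder per (source layer -> target layer) pair with source <= target."""

    def __init__(self, n_layers, d_model, n_features, threshold=0.03):
        super().__init__()
        self.n_layers = n_layers
        self.threshold = threshold  # JumpReLU threshold
        # W_enc_l : d_model -> n_features, one encoder per layer
        self.encoders = nn.ModuleList(
            [nn.Linear(d_model, n_features, bias=False) for _ in range(n_layers)]
        )
        # W_dec_{l'->l} : n_features -> d_model, one decoder per pair l' <= l
        self.decoders = nn.ModuleList(
            [nn.ModuleList([nn.Linear(n_features, d_model, bias=False) for _ in range(l + 1)])
             for l in range(n_layers)]
        )

    def jump_relu(self, x):
        # JumpReLU: keep the raw value only where it exceeds the threshold
        return x * (x > self.threshold)

    def forward(self, residual_streams):
        # residual_streams: list of n_layers tensors x_l with shape (batch, d_model)
        # Feature activations: a_l = JumpReLU(W_enc_l . x_l)
        acts = [self.jump_relu(enc(x)) for enc, x in zip(self.encoders, residual_streams)]
        # Layer reconstructions: y_hat_l = sum over l' <= l of W_dec_{l'->l} . a_{l'}
        recons = [
            sum(self.decoders[l][lp](acts[lp]) for lp in range(l + 1))
            for l in range(self.n_layers)
        ]
        return acts, recons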

2. Attribution Graphs

Attribution graphs reveal how features influence each other and ultimately contribute to the model's output; a sketch of the edge computation follows the list below:

  • Nodes: Individual features
  • Edges: Attributions (strength of influence) between features
  • Attribution calculation: A_s→t := a_s · w_s→t, i.e. the source feature's activation times the virtual weight connecting source to target
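
A small sketch of this edge computation, assuming the virtual weight w_s→t is the dot product of the source feature's decoder direction and the target feature's encoder direction (function and variable names are illustrative):

import torch

def attribution_edge(a_src, dec_vec_src, enc_vec_tgt):
    """A_{s->t} := a_s * w_{s->t}, where the virtual weight w_{s->t} is taken
    as the dot product of the source feature's decoder direction and the
    target feature's encoder direction (context-independent)."""
    w_virtual = torch.dot(dec_vec_src, enc_vec_tgt)  # w_{s->t}
    return a_src * w_virtual                         # A_{s->t}

# Example: a source feature that fired with activation 1.7
# attribution = attribution_edge(torch.tensor(1.7), w_dec_row, w_enc_col)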

Installation

# Clone the repository
git clone https://github.com/Light770/Open-Garcon.git
cd Open-Garcon

# Install dependencies
pip install torch transformers matplotlib networkx numpy tqdm

Usage

Training a Cross-Layer Transcoder

# Full training run
python example_usage.py --mode train \
    --model_name Qwen/Qwen2.5-3B \
    --clt_path ./models/clt_qwen25_3b \
    --train_samples 1000 \
    --data_path ./data/training_text.txt

# Smaller demo run with circuit_tracing_example.py (25 samples from the demo data)
python circuit_tracing_example.py --mode train \
    --model_name Qwen/Qwen2.5-3B \
    --clt_path models/clt_qwen25_3b \
    --train_samples 25 \
    --data_path training/demo/demo_training_data_25.txt

Analyzing Prompts

# Analyze a prompt with example_usage.py
python example_usage.py --mode analyze \
    --model_name Qwen/Qwen2.5-3B \
    --clt_path ./models/clt_qwen25_3b \
    --prompt "The capital of France is"

# Analysis with circuit_tracing_example.py
python circuit_tracing_example.py --mode analyze \
    --model_name Qwen/Qwen2.5-3B \
    --clt_path models/clt_qwen25_3b \
    --prompt "The capital of France is"

# Debug run with a timeout and a reduced feature count
python debug_circuit_tracing.py \
    --model_name Qwen/Qwen2.5-3B \
    --clt_path models/clt_qwen25_3b \
    --prompt "The capital of France is" \
    --timeout 600 \
    --reduced_features 5

Analyzing Features

python feature_analysis.py \
    --model_name Qwen/Qwen2.5-3B \
    --clt_path ./models/clt_qwen25_3b \
    --prompts_file ./data/analysis_prompts.txt \
    --output_dir ./results/feature_analysis \
    --feature_idx 42  # Optional: analyze a specific feature

# Feature analysis with feature_analysis_tool.py
python feature_analysis_tool.py \
    --model_name Qwen/Qwen2.5-3B \
    --clt_path ./models/clt_qwen25_3b \
    --prompts_file ./data/analysis_prompts.txt \
    --output_dir ./results/feature_analysis \
    --feature_idx 42  # Optional: analyze a specific feature

Code Structure

  • circuit_tracing.py: Core implementation of the Cross-Layer Transcoder and attribution analysis
  • example_usage.py: Example script demonstrating how to train a CLT and analyze prompts
  • feature_analysis.py: Tools for analyzing and interpreting specific features

Detailed Explanation

How the Circuit Tracing Process Works

  1. Training the Cross-Layer Transcoder:

    • Collect activations from the model's residual stream and MLP outputs
    • Train the CLT to reconstruct MLP outputs using sparse feature activations
    • The learned features represent interpretable concepts and computations
  2. Analyzing Prompts (see the code sketch after this list):

    • Run the model on a prompt and record activations
    • Use the CLT to extract feature activations
    • Compute attributions between features
    • Generate and visualize the attribution graph
  3. Interpreting Features:

    • Identify top-activating prompts for each feature
    • Analyze how features influence the model's predictions
    • Group features into "supernodes" representing higher-level concepts
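
As a rough illustration of steps 1 and 2, here is a hedged sketch of collecting activations with forward hooks and assembling a thresholded attribution graph (the function names, hook points, and model layout are assumptions about a standard decoder-only transformer, not the repository's actual code):

import torch
import networkx as nx
from transformers import AutoModelForCausalLM, AutoTokenizer

def collect_activations(model, tokenizer, prompt):
    """Record each layer's residual-stream input and MLP output for one prompt
    (assumes a decoder-only layout such as model.model.layers[i].mlp)."""
    residuals, mlp_outs, handles = [], [], []
    for layer in model.model.layers:
        handles.append(layer.register_forward_pre_hook(
            lambda mod, args: residuals.append(args[0].detach())))
        handles.append(layer.mlp.register_forward_hook(
            lambda mod, args, out: mlp_outs.append(out.detach())))
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        model(**inputs)
    for h in handles:
        h.remove()
    return residuals, mlp_outs

def build_attribution_graph(attributions, threshold=0.01):
    """Turn a {(source, target): attribution} dict into a directed graph,
    keeping only edges whose magnitude exceeds the threshold."""
    graph = nx.DiGraph()
    for (src, tgt), value in attributions.items():
        if abs(value) > threshold:
            graph.add_edge(src, tgt, weight=value)
    return graph

# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B")
# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
# residuals, mlp_outs = collect_activations(model, tokenizer, "The capital of France is")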

Technical Details

Key Implementation Aspects

  1. JumpReLU Activation: Improves feature sparsity and interpretability
  2. Layer Normalization: Ensures stable training across different layers
  3. Sparsity Penalty: Encourages the development of sparse, interpretable features (combined with reconstruction error in the loss sketch below)
  4. Virtual Weights: Capture context-independent interactions between features
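
A minimal sketch of how the reconstruction and sparsity terms might combine into a training loss (the L1-style penalty and the sparsity_coef weighting are common choices assumed here, not necessarily the repository's exact objective):

import torch

def clt_loss(mlp_outputs, reconstructions, feature_acts, sparsity_coef=1e-3):
    """Reconstruction error on every layer's MLP output plus a sparsity
    penalty on the feature activations."""
    recon = sum(
        torch.nn.functional.mse_loss(recon_l, target_l)
        for recon_l, target_l in zip(reconstructions, mlp_outputs)
    )
    sparsity = sum(a.abs().sum(dim=-1).mean() for a in feature_acts)
    return recon + sparsity_coef * sparsity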

Customization Options

  • Number of Features: Adjust based on model size and complexity
  • Sparsity Coefficient: Control the trade-off between reconstruction accuracy and feature interpretability
  • Layer Selection: Focus on specific layers of interest
  • Attribution Thresholds: Filter attributions to control graph complexity
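
These options could be collected into a small configuration object; a minimal sketch with illustrative field names and defaults:

from dataclasses import dataclass, field

@dataclass
class CircuitTracingConfig:
    n_features: int = 16384              # dictionary size; adjust to model size and complexity
    sparsity_coef: float = 1e-3          # reconstruction accuracy vs. interpretability trade-off
    layers: list = field(default_factory=list)   # layers to analyze; empty means all layers
    attribution_threshold: float = 0.01  # drop weaker edges to control graph complexity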

Example Results

When analyzing prompts like "The capital of France is", the attribution graph might show:

  • Features representing "capital" and "France" concepts
  • Their influence on features representing "Paris"
  • The flow of information to the final logit prediction

Limitations

  • Computationally Intensive: Training a CLT requires significant computational resources
  • Approximations: The attribution calculation involves approximations that may not perfectly represent the model's computation
  • Interpretability Challenges: Not all features may have clear interpretations
  • Limited to MLPs: Current implementation focuses on MLP computations, not attention mechanisms

Future Directions

  1. Attention Integration: Extend to model attention mechanisms
  2. Feature Clustering: Develop automated methods for grouping related features
  3. Circuit Extraction: Automatically identify common circuit motifs
  4. Safety Analysis: Apply to studying safety mechanisms in aligned models

References

  1. "Sparse Crosscoders for Cross-Layer Features and Model Diffing", Anthropic, October 2024
  2. "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning", Bricken et al., 2023
  3. "Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders", Rajamanoharan et al., 2024

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

An open-source pipeline for Mechanistic Interpretability
