PhilipQuirke/quanta_maths


Introduction

This library supports the goals and uses the terminology introduced in the paper Increasing Trust in Language Models through the Reuse of Verified Circuits. Please read the paper. In brief:

  • Given an existing transformer model with low loss, this library helps a researcher analyze and understand the algorithm the model implements.
  • The "useful" token positions, attention heads and MLP neurons that the model relies on for predictions are identified.
  • Various tools and techniques evaluate aspects of the model's "behavior" (e.g. attention patterns).
  • The researcher can extend the tools with model-specific searches and tests - searching for hypothesised model components that perform model-specific algorithm "sub-tasks" (e.g. Base Add in the Addition model).
  • Useful facts found in this way are stored as JSON (refer Useful_Tags for details) and can be visualized (refer Assets for samples).
  • A researcher can describe an algorithm hypothesis as a series of claims and evaluate those claims against the facts found. The resulting insights can be used to refine and/or extend both the algorithm sub-task tests and the algorithm hypothesis description, leading to a full description of the model's algorithm.
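As a rough illustration of this claims-against-facts workflow, a claim can be expressed as a predicate over the stored facts and checked mechanically. All tag names and the JSON shape below are invented for this sketch and are not the library's actual API; refer Useful_Tags for the real tag scheme.

```python
# Hypothetical facts dict, standing in for the JSON produced by the analysis.
# The node names and tags here are illustrative assumptions.
facts = {
    "head_0_1": {"position": 8, "tags": ["attends_to_digit", "base_add"]},
    "head_1_2": {"position": 9, "tags": ["make_carry"]},
}

def claim_some_node_has_tag(facts: dict, tag: str) -> bool:
    """Claim: at least one useful node carries the given sub-task tag."""
    return any(tag in node["tags"] for node in facts.values())

# A hypothesis is then a list of such claims, each checked against the facts.
print(claim_some_node_has_tag(facts, "base_add"))
```

A claim that fails points either at a gap in the sub-task tests or at a flaw in the hypothesis, which is what drives the refinement loop described above.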

Installation

From source

git clone https://github.com/PhilipQuirke/quanta_maths.git
cd quanta_maths
pip install .

Test bed

Much of this library is generic and can be applied to any transformer model. As a "real-world" testbed to help refine the library we use models trained to perform integer addition and subtraction (e.g. 133357+182243=+0315600 and 123450-345670=-0222220). Arithmetic-specific algorithm sub-task searches are defined (e.g. Base Add, Use Sum 9, Make Carry, Base Subtract, Borrow One). Addition and subtraction hypotheses are described and evaluated in the Colab notebook QuantaMathsAnalyse.ipynb. Arithmetic-specific Python code is in files like maths_config.py.
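The question/answer string format in the examples above appears to be a sign token followed by the answer's magnitude zero-padded to one digit more than the operands. The helper below is a sketch that reproduces that format; the function name and padding rule are assumptions inferred from the examples, not the library's actual API.

```python
# Sketch of the testbed's question/answer string format (assumed):
# "<a><op><b>=<sign><magnitude zero-padded to width digits>"
def format_example(a: int, b: int, op: str, width: int = 7) -> str:
    result = a + b if op == "+" else a - b
    sign = "+" if result >= 0 else "-"
    return f"{a}{op}{b}={sign}{abs(result):0{width}d}"

print(format_example(133357, 182243, "+"))  # 133357+182243=+0315600
```

The extra leading digit leaves room for a carry out of the most significant column in addition.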

Folders, Files and Classes

This library contains files:

  • Notebooks: Jupyter notebooks which are run in Google Colab or Jupyter:

    • Train: Colab QuantaMathsTrain.ipynb is used to train transformer arithmetic models.
      • Outputs pth and json files that are (manually) stored on HuggingFace
    • Analysis: Colab QuantaMathsAnalyse.ipynb is used to analyze the behavior and algorithm sub-tasks of transformer arithmetic models
      • Inputs pth files (generated above) from HuggingFace
      • Outputs *_behavior and *_algorithm json files that are (manually) stored on HuggingFace
    • Algorithm: Colab QuantaMathsAlgorithm.ipynb describes/tests an overall algorithm for a model (based on behavior and algorithm sub-tasks data)
      • Inputs *_behavior and *_algorithm json files (generated above) from HuggingFace
  • QuantaMechInterp: Python library code imported into the notebooks:

    • model_*.py: Contains the configuration of the transformer model being trained/analysed. Includes class ModelConfig
    • useful_*.py: Contains data on the useful token positions and useful nodes (attention heads and MLP neurons) that the model uses in predictions. Includes class UsefulConfig derived from ModelConfig. Refer Useful_Tags for more detail.
    • algo_*.py: Contains tools to support declaring and validating a model algorithm. Includes class AlgoConfig derived from UsefulConfig.
    • quanta_*.py: Contains categorisations of model behavior (aka quanta), with ways to detect, filter and graph them. Refer Filter for more detail.
    • ablate_*.py: Contains ways to "intervention ablate" the model and detect the impact of the ablation
    • maths_*.py: Contains specializations of the above specific to arithmetic (addition and subtraction) transformer models. Includes class MathsConfig derived from AlgoConfig.
  • Tests: Unit tests
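The configuration classes named above form a single inheritance chain, which the minimal sketch below makes explicit. Only the class names and the derivation order come from this README; the attributes shown are illustrative assumptions.

```python
# Minimal sketch of the configuration class hierarchy described above.
# Class names and inheritance come from the README; attributes are assumed.
class ModelConfig:
    """Transformer model configuration (model_*.py)."""
    n_layers: int = 2  # hypothetical example attribute

class UsefulConfig(ModelConfig):
    """Adds useful token positions and nodes (useful_*.py)."""
    def __init__(self):
        self.useful_positions: list[int] = []

class AlgoConfig(UsefulConfig):
    """Adds algorithm declaration and validation support (algo_*.py)."""

class MathsConfig(AlgoConfig):
    """Arithmetic-specific specialization (maths_*.py)."""
```

Each layer of the chain adds one concern, so generic analysis code can accept a UsefulConfig while maths-specific code works with the full MathsConfig.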

HuggingFace resources

The HuggingFace website holds the output files generated by the Colab notebooks for ~45 models.

For each model, the following output files are available:

  • the model's weights (model.pth),
  • the model's training details (training.json),
  • generic analysis facts (behavior.json), and
  • maths-specific results from searching for hypothesised algorithm features (features.json).

Refer Hugging_Models for more detail.
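A sketch of fetching those per-model files is below. The repo id and the per-model filename layout are assumptions for illustration only; refer Hugging_Models for the actual locations.

```python
# The four output files listed above, per model (from this README).
FILENAMES = ["model.pth", "training.json", "behavior.json", "features.json"]

def model_files(model_name: str) -> list[str]:
    """Build hypothetical per-model paths, e.g. 'add_d6_l2/model.pth'.
    The '<model_name>/<file>' layout is an assumption, not the real repo layout."""
    return [f"{model_name}/{name}" for name in FILENAMES]

# Downloading one file would use the huggingface_hub library, e.g.:
# from huggingface_hub import hf_hub_download
# path = hf_hub_download(repo_id="PhilipQuirke/quanta_maths",  # hypothetical repo id
#                        filename="add_d6_l2/model.pth")
print(model_files("add_d6_l2"))
```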

Papers

The papers associated with this content are:

Extending the code

Most exploratory work is done in Google Colab in the 'train' and 'analyse' notebooks. After new code has been successfully developed and tested in a notebook, it is migrated to the quanta_tools code folder.
