Description of the feature request:

I would like to request the addition of mechanistic interpretability tools for the Gemma 2B models in the Gemma Cookbook. Specifically, I propose incorporating two types of experiments that would help interpretability and alignment researchers understand the model's internal computations:
1. Logit Lens Analysis
This experiment would let researchers probe the hidden representations at different layers of the model to understand how information is transformed at each stage.
This can be achieved by decoding intermediate-layer activations through the model's final norm and unembedding matrix and comparing the resulting token distributions with the final predictions.
Implementing this for Gemma 2B would provide insight into how knowledge progressively forms and is retained across the model's layers.
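As a rough illustration (not a finished Cookbook recipe), a minimal logit-lens pass might look like the sketch below. It assumes the Hugging Face `transformers` library, the `google/gemma-2b` checkpoint, and the standard `GemmaForCausalLM` layout (inner model at `model.model`, final RMSNorm at `model.model.norm`, unembedding at `model.lm_head`); all of these names are assumptions rather than part of this request.

```python
# Minimal logit-lens sketch for Gemma 2B (illustrative only).
# Assumes: Hugging Face `transformers`, checkpoint "google/gemma-2b", and the
# standard GemmaForCausalLM layout (final norm at model.model.norm,
# unembedding at model.lm_head).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, output_hidden_states=True)
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple of (num_layers + 1) tensors of shape
# [batch, seq_len, hidden_size]: the embeddings plus each layer's output.
for layer_idx, hidden in enumerate(outputs.hidden_states):
    # Decode the residual stream at the final position with the model's own
    # final norm and unembedding -- the classic logit-lens projection.
    final_pos = model.model.norm(hidden[:, -1, :])
    logits = model.lm_head(final_pos)
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer_idx:2d} -> top prediction: {top_token!r}")
```

Plotting the rank or probability of the final answer token across layers would then show where the prediction "crystallizes" inside the model.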
2. Attention Analysis Experiments
Understanding how attention heads interact at different layers can reveal which tokens influence the model's decision-making.
The notebook could include the following (a rough code sketch of the heatmaps and ablations follows this list):
- Attention heatmaps for different layers and heads.
- Causal tracing to see how information flows through the attention mechanism.
- Ablation studies to test how removing certain attention heads affects output.
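As with the logit lens above, this is only a hedged sketch: it again assumes the `google/gemma-2b` checkpoint and the standard `GemmaForCausalLM` module layout in `transformers` (e.g. `model.model.layers[i].self_attn.o_proj`, `model.config.head_dim`, and the `attn_implementation="eager"` flag), and shows how the notebook might extract per-head attention weights for a heatmap and then ablate a single head via a forward pre-hook.

```python
# Sketch: attention heatmap and single-head ablation for Gemma 2B
# (illustrative only; module paths assume the standard GemmaForCausalLM layout).
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Eager attention is requested so that attention weights are actually returned.
model = AutoModelForCausalLM.from_pretrained(
    model_id, output_attentions=True, attn_implementation="eager"
)
model.eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    outputs = model(**inputs)

# --- 1. Attention heatmap for one layer/head -------------------------------
layer, head = 5, 0  # arbitrary choices for illustration
attn = outputs.attentions[layer][0, head].cpu().numpy()  # [seq_len, seq_len]
plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title(f"Gemma 2B attention: layer {layer}, head {head}")
plt.colorbar()
plt.tight_layout()
plt.show()

# --- 2. Ablate that head and compare next-token predictions ----------------
head_dim = model.config.head_dim  # assumed config attribute

def zero_head(module, args):
    # o_proj receives the concatenated head outputs [batch, seq, n_heads*head_dim];
    # zeroing one head's slice removes its contribution to the residual stream.
    hidden = args[0].clone()
    hidden[..., head * head_dim:(head + 1) * head_dim] = 0
    return (hidden,) + args[1:]

handle = model.model.layers[layer].self_attn.o_proj.register_forward_pre_hook(zero_head)
with torch.no_grad():
    ablated = model(**inputs)
handle.remove()

baseline_top = tokenizer.decode(outputs.logits[0, -1].argmax().item())
ablated_top = tokenizer.decode(ablated.logits[0, -1].argmax().item())
print(f"baseline: {baseline_top!r}  |  head {head} ablated: {ablated_top!r}")
```

The same hook-based pattern could be extended to causal tracing by patching activations from a "clean" run into a "corrupted" one instead of zeroing them.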
By integrating these experiments into the Gemma Cookbook, researchers and developers can gain deeper insights into the internal reasoning of Gemma 2B models.
What problem are you trying to solve with this feature?
Currently, there is limited documentation and tooling for interpreting intermediate activations specifically in Gemma 2B models.
Existing resources focus mainly on end-to-end outputs, but understanding the internal mechanisms can help with debugging and fine-tuning (e.g., diagnosing hallucinations or biases) as well as with transparency and trustworthiness in AI models.
These experiments have been successfully applied to other transformer-based models, and adding them to the Gemma Cookbook would make Gemma 2B more accessible for interpretability research.
Any other information you'd like to share?
Example of Logit Lens
Would love to ask the supervisors what they think of this!
cc/ @bebechien et al.