
Mechanistic Interpretability: Logit Lens & Attention Analysis for Gemma 2B Models #154

Open
tenseisoham opened this issue Mar 18, 2025 · 0 comments

@tenseisoham (Contributor)

Description of the feature request:

I would like to request the addition of mechanistic interpretability tools for the Gemma 2B models in the Gemma Cookbook. Specifically, I propose incorporating two types of experiments that would help interpretability and alignment researchers understand the model's internal computations:


1. Logit Lens Analysis

This experiment would let researchers probe the hidden representations at different layers of the model to see how information is transformed at each stage.
Concretely, intermediate-layer activations are decoded through the model's final normalization and unembedding, and the resulting token predictions are compared with the final output.
Implementing this for Gemma 2B would provide insight into how knowledge progressively forms and is retained within the model.
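
A minimal sketch of what the logit-lens part of the notebook could look like, assuming the Hugging Face `google/gemma-2b` checkpoint and the standard `GemmaForCausalLM` layout (`model.model.norm`, `model.lm_head`); exact module names may differ across transformers versions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
model.eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: embedding output plus one entry per transformer layer.
for layer_idx, hidden in enumerate(out.hidden_states):
    # Decode the last token's residual stream as if it were the final layer:
    # apply the final RMSNorm and the unembedding (lm_head), then take the top token.
    last_token = hidden[:, -1, :]
    logits = model.lm_head(model.model.norm(last_token))
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer_idx:2d} -> {top_token!r}")
```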


2. Attention Analysis Experiments

Understanding how attention heads interact at different layers can reveal which tokens are influential for the model’s decision-making.
The notebook could include:
  • Attention heatmaps for different layers and heads (a rough sketch follows this list).
  • Causal tracing to see how information flows through the attention mechanism.
  • Ablation studies to test how removing certain attention heads affects the output (sketched after the summary below).
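
As an illustration of the first item, a minimal sketch of extracting and plotting per-head attention maps, again assuming the `google/gemma-2b` checkpoint; eager attention is requested because fused attention kernels do not return the weights:

```python
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
# "eager" attention is needed so that attention weights are materialized and returned.
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="eager")
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

layer, head = 5, 0  # which layer/head to visualize (arbitrary example)
attn = out.attentions[layer][0, head]  # (seq_len, seq_len) attention weights
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

plt.imshow(attn.float().detach().numpy(), cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title(f"Gemma 2B attention - layer {layer}, head {head}")
plt.colorbar()
plt.tight_layout()
plt.show()
```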
By integrating these experiments into the Gemma Cookbook, researchers and developers can gain deeper insights into the internal reasoning of Gemma 2B models.
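
For the ablation item, a rough sketch that zeroes out a single attention head via a forward pre-hook on the attention output projection. This assumes a Llama-style `GemmaAttention` layout (`model.model.layers[i].self_attn.o_proj`) with per-head outputs laid out contiguously along the hidden dimension, which should be verified against the installed transformers version:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

layer_idx, head_idx = 5, 0               # head to ablate (arbitrary example)
attn = model.model.layers[layer_idx].self_attn
head_dim = model.config.head_dim         # per-head width of the attention output

def zero_head(module, args):
    # args[0] is the concatenated per-head attention output entering o_proj;
    # zero the slice belonging to the chosen head and pass the rest through.
    hidden = args[0].clone()
    hidden[..., head_idx * head_dim:(head_idx + 1) * head_dim] = 0.0
    return (hidden,)

inputs = tokenizer("The capital of France is", return_tensors="pt")

with torch.no_grad():
    baseline = model(**inputs).logits[0, -1]

handle = attn.o_proj.register_forward_pre_hook(zero_head)
with torch.no_grad():
    ablated = model(**inputs).logits[0, -1]
handle.remove()

print("baseline top token:", tokenizer.decode([baseline.argmax().item()]))
print("ablated top token: ", tokenizer.decode([ablated.argmax().item()]))
print("max logit shift:", (baseline - ablated).abs().max().item())
```

Causal tracing could reuse the same hook machinery, patching activations recorded from a clean run into a corrupted run instead of zeroing them.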

What problem are you trying to solve with this feature?

  • Currently, there is limited documentation and tooling for interpreting intermediate activations specifically in Gemma 2B models.
  • Existing resources focus mainly on end-to-end outputs, but understanding the internal mechanisms can help with:
    • Debugging and fine-tuning (e.g., diagnosing hallucinations or biases).
    • Transparency and trustworthiness in AI models.

These experiments have been successfully applied to other transformer-based models, and adding them to the Gemma Cookbook would make Gemma 2B more accessible for interpretability research.

Any other information you'd like to share?

Example of Logit Lens:

[Image: example logit lens visualization]

Would love to ask the supervisors what they think of this!
cc @bebechien et al.

@bebechien added the wishlist label on Apr 7, 2025