Description of the feature request:

I would like to request the addition of mechanistic interpretability tools for the Gemma 2B models in the Gemma Cookbook. Specifically, I propose incorporating two types of experiments that would help interpretability and alignment researchers understand the model's internal computations:
1. Logit Lens Analysis
This experiment would let researchers probe the hidden representations at different layers of the model to understand how information is transformed at each stage.
This can be achieved by decoding intermediate-layer activations through the model's final norm and unembedding matrix and comparing the resulting token distributions with the final predictions.
Implementing this for Gemma 2B would provide insight into how knowledge progressively forms and is retained across the model's layers.
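As a rough illustration (not a finished Cookbook recipe), a minimal logit-lens pass might look like the sketch below. It assumes the Hugging Face `transformers` library, the `google/gemma-2b` checkpoint, and the standard `GemmaForCausalLM` layout (inner model at `model.model`, final RMSNorm at `model.model.norm`, unembedding at `model.lm_head`); all of these names are assumptions rather than part of this request.

```python
# Minimal logit-lens sketch for Gemma 2B (illustrative only).
# Assumes: Hugging Face `transformers`, checkpoint "google/gemma-2b", and the
# standard GemmaForCausalLM layout (final norm at model.model.norm,
# unembedding at model.lm_head).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, output_hidden_states=True)
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple of (num_layers + 1) tensors of shape
# [batch, seq_len, hidden_size]: the embeddings plus each layer's output.
for layer_idx, hidden in enumerate(outputs.hidden_states):
    # Decode the residual stream at the final position with the model's own
    # final norm and unembedding -- the classic logit-lens projection.
    final_pos = model.model.norm(hidden[:, -1, :])
    logits = model.lm_head(final_pos)
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer_idx:2d} -> top prediction: {top_token!r}")
```

Plotting the rank or probability of the final answer token across layers would then show where the prediction "crystallizes" inside the model.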
2. Attention Analysis Experiments
Understanding how attention heads interact at different layers can reveal which tokens influence the model's decision-making.
The notebook could include the following (a rough code sketch of the heatmaps and ablations follows this list):
- Attention heatmaps for different layers and heads.
- Causal tracing to see how information flows through the attention mechanism.
- Ablation studies to test how removing certain attention heads affects output.
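As with the logit lens above, this is only a hedged sketch: it again assumes the `google/gemma-2b` checkpoint and the standard `GemmaForCausalLM` module layout in `transformers` (e.g. `model.model.layers[i].self_attn.o_proj`, `model.config.head_dim`, and the `attn_implementation="eager"` flag), and shows how the notebook might extract per-head attention weights for a heatmap and then ablate a single head via a forward pre-hook.

```python
# Sketch: attention heatmap and single-head ablation for Gemma 2B
# (illustrative only; module paths assume the standard GemmaForCausalLM layout).
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Eager attention is requested so that attention weights are actually returned.
model = AutoModelForCausalLM.from_pretrained(
    model_id, output_attentions=True, attn_implementation="eager"
)
model.eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    outputs = model(**inputs)

# --- 1. Attention heatmap for one layer/head -------------------------------
layer, head = 5, 0  # arbitrary choices for illustration
attn = outputs.attentions[layer][0, head].cpu().numpy()  # [seq_len, seq_len]
plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title(f"Gemma 2B attention: layer {layer}, head {head}")
plt.colorbar()
plt.tight_layout()
plt.show()

# --- 2. Ablate that head and compare next-token predictions ----------------
head_dim = model.config.head_dim  # assumed config attribute

def zero_head(module, args):
    # o_proj receives the concatenated head outputs [batch, seq, n_heads*head_dim];
    # zeroing one head's slice removes its contribution to the residual stream.
    hidden = args[0].clone()
    hidden[..., head * head_dim:(head + 1) * head_dim] = 0
    return (hidden,) + args[1:]

handle = model.model.layers[layer].self_attn.o_proj.register_forward_pre_hook(zero_head)
with torch.no_grad():
    ablated = model(**inputs)
handle.remove()

baseline_top = tokenizer.decode(outputs.logits[0, -1].argmax().item())
ablated_top = tokenizer.decode(ablated.logits[0, -1].argmax().item())
print(f"baseline: {baseline_top!r}  |  head {head} ablated: {ablated_top!r}")
```

The same hook-based pattern could be extended to causal tracing by patching activations from a "clean" run into a "corrupted" one instead of zeroing them.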
By integrating these experiments into the Gemma Cookbook, researchers and developers can gain deeper insights into the internal reasoning of Gemma 2B models.
What problem are you trying to solve with this feature?
Currently, there is limited documentation and tooling for interpreting intermediate activations specifically in Gemma 2B models.
Existing resources focus mainly on end-to-end outputs, but understanding the internal mechanisms can help with debugging and fine-tuning (e.g., diagnosing hallucinations or biases) as well as with transparency and trustworthiness in AI models.
These experiments have been successfully applied to other transformer-based models, and adding them to the Gemma Cookbook would make Gemma 2B more accessible for interpretability research.
Any other information you'd like to share?
Example of Logit Lens
Would love to ask the supervisors what they think of this!
cc/ @bebechien et al.