# Towards Monosemanticity: Decomposing Language Models With Dictionary Learning _(on your laptop! in 10 minutes or less!)_

Hello! You are probably aware of this [very cool research from Anthropic](https://transformer-circuits.pub/2023/monosemantic-features/index.html)
where they train an autoencoder to interpret the inner workings of a transformer.

(If you are not aware, feel free to read it -- it's quite fascinating.)

I was curious how this technique scales when you don't use very much data or compute. Does it still work?
At all?

So, I forked this lovely repo from @shehper (thank you, shehper!), and made some modifications.
As shehper notes in their original README, their results are excellent, but the code is a bit rough.

I cleaned up the code a bit, but since I don't have any GPUs, my results are worse. I still had a good time
and found it interesting. If you'd like to reproduce my results in a few minutes on your
(hopefully new-ish Apple Silicon) laptop, read on!

## Reproduction Instructions (+ commentary)

First, clone the repo. Then, run:

```
cd transformer/data/shakespeare_char
python prepare.py
```

This saves the encoded version of the dataset to disk, and trains a tokenizer.

_What? Training a tokenizer?_

That's right, reader -- we're adding additional complexity with dubious benefit, right out of the gate.

I didn't want to use the full GPT-2 vocabulary (trying to keep the model small), so I just used the
famous shakespeare_char dataset, which tokenizes per character and has a vocab size of less than 100.

Initial results with this were poor, and I wasn't able to find any clear link
between specific neuron activations and semantic content.
I hypothesized that this was partly due to the tokenization, so I added a custom tokenizer
with a vocabulary of 1024.

This tokenizer was trained on a custom dataset of 1M characters of shakespeare_char and 1M characters
of a Python dataset I found on HuggingFace. This is a notable detail: because of the low compute +
low data regime I'm working in, my idea was to train a model on a relatively bi-modal dataset, and see
if the neurons would nicely fall into two categories, either a "Shakespeare Neuron" or a "Python Neuron".
(Spoiler: they kinda did!)
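
If you're curious what training a tokenizer like that looks like, here's a minimal sketch using the HuggingFace `tokenizers` library. To be clear, this is not the code in `prepare.py` -- the file names, special token, and exact settings are just illustrative:

```
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical corpus files: ~1M characters of Shakespeare and ~1M of Python.
shakespeare = open("shakespeare.txt").read()[:1_000_000]
python_code = open("python_corpus.txt").read()[:1_000_000]

# Byte-level BPE with a small vocabulary, in the spirit of the custom tokenizer above.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(vocab_size=1024, special_tokens=["<|endoftext|>"])
tokenizer.train_from_iterator([shakespeare, python_code], trainer=trainer)

tokenizer.save("custom_tokenizer.json")
print(tokenizer.encode("def run(): pass").tokens)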

For context, the Anthropic paper has some great details on their training regime -- they train their
tiny transformer on 100B tokens, and their autoencoder on 8B activation vectors. I am obviously quite a
few orders of magnitude below that, sadly.

Okay, now that your dataset and tokenizer are ready to go, let's train some models!

```
# need MPS_FALLBACK because nn.Dropout isn't supported :(

# in the transformer/ directory
PYTORCH_ENABLE_MPS_FALLBACK=1 python train_transformer.py \
    config/train_shakespeare_char.py \
    --device=mps \
    --max_iters=7500 \
    --lr_decay_iters=7500 \
    --n_embd=192 \
    --out_dir=0730-shakespeare-python-custom-tok \
    --batch_size=24 \
    --compile=True
```

Here we train our tiny transformer. This is all ripped out of [nanoGPT](https://github.com/karpathy/nanoGPT) -- thanks, Karpathy!
This takes 5 minutes or so on my MacBook. There are a lot of hyperparams there that are ripe for tuning.

Once it's trained, it's good to check if it actually produces sensible outputs.

```
# in the autoencoder/ directory
# (try --prompt 'oh romeo' for the Shakespeare side)
python generate_tokens.py \
    --prompt 'def run' \
    --gpt_ckpt_dir=0730-shakespeare-python-custom-tok
```

When I run this, I get:

```
oh romeo!

First Servingman:
Yecond Servingman:
What, had, are in Pomprecy thyself.

First Servingman:
Good lady?
First Servingman:
Spirrairint sir:
Thour, sir, thou art 'tis
```

and:

```
def run_path( artist):
    app.0
    sum = 0
    sum += generate_cart = 0
    sum(input("Facters)
print(result)
###
"""
def num_sum(ates, b):
    sum = 0
    a = 0
    while b < 0:
        sum += sum1
    while 1:
        min = b
    for print(sum + min(num) + b += 1
    else
```

Which, hey, not too bad for a few minutes of local training! Vaguely reasonable Python and Shakespeare.

Now, it's time to train the autoencoder. First, prepare the dataset:

```
python prepare_autoencoder_dataset.py \
    --num_contexts=7500 \
    --num_sampled_tokens=200 \
    --dataset=shakespeare_char \
    --gpt_ckpt_dir=0730-shakespeare-python-custom-tok
```

This runs a bunch of forward passes on the trained transformer, and saves the activations to disk as `.pt` files for later training.
This is done as a memory optimization, and arguably isn't needed at this small a scale. The 7500 contexts are chosen so that they cover
the entirety of the dataset we saved to disk with the earlier `prepare.py` script. The Anthropic paper mentions that they see the best results
training the autoencoder without data re-sampling (i.e., not repeating data), so we follow that here.
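
Conceptually, all this step does is run the model forward, grab the MLP's hidden activations with a forward hook, and dump them to `.pt` files. Here's a self-contained sketch of the mechanism using a stand-in MLP -- the real script hooks the trained transformer and keeps only `--num_sampled_tokens` positions per context, and all names and shapes here are illustrative:

```
import torch
import torch.nn as nn

n_embd = 192                            # matches --n_embd=192 above
mlp = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),      # "c_fc"-style up-projection
    nn.GELU(),                          # the activations we want to capture
    nn.Linear(4 * n_embd, n_embd),      # "c_proj"-style down-projection
)

captured = []

def hook(module, inputs, output):
    captured.append(output.detach().cpu())

# hook the nonlinearity so we record the 768-dim hidden activations
handle = mlp[1].register_forward_hook(hook)

with torch.no_grad():
    for _ in range(10):                        # stand-in for forward passes over real contexts
        x = torch.randn(24, 256, n_embd)       # (batch, seq_len, n_embd)
        mlp(x)

handle.remove()
acts = torch.cat(captured).reshape(-1, 4 * n_embd)
torch.save(acts, "mlp_activations_demo.pt")    # hypothetical filename
print(acts.shape)                              # torch.Size([61440, 768])
```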

Let's train the autoencoder:

```
python train_autoencoder.py \
    --device=mps \
    --l1_coeff=3e-7 \
    --learning_rate=3e-4 \
    --gpt_ckpt_dir=0730-shakespeare-python-custom-tok \
    --dataset=shakespeare_char \
    --batch_size=2048 \
    --resampling_interval=500 \
    --resampling_data_size=100 \
    --save_interval=1000 \
    --n_features=1536
```

Anthropic's paper discusses a concept called "neuron resampling", where they revive dead neurons during training. Among other factors, one cause
of dead neurons is training for too long. Since I'm training for a very short period of time, I don't see any dead neurons, and that functionality
wasn't needed -- but if you'd like to train longer, you should be aware of this.
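
For the record, the bookkeeping is simple in principle: track which features never fire over a window of training steps and give them fresh weights. This is a rough sketch of the idea only -- parameter names are hypothetical, and Anthropic's actual procedure also rescales the new weights and resets optimizer state, which is omitted here:

```
import torch
import torch.nn.functional as F

n_features, d_mlp = 1536, 768
fire_counts = torch.zeros(n_features)

def record(feature_acts):
    # feature_acts: (batch, n_features) post-ReLU codes from the encoder
    fire_counts.add_((feature_acts > 0).sum(dim=0).float())

def resample_dead(W_enc, b_enc, W_dec):
    # W_enc: (d_mlp, n_features), b_enc: (n_features,), W_dec: (n_features, d_mlp)
    dead = fire_counts == 0
    n_dead = int(dead.sum())
    if n_dead > 0:
        W_enc.data[:, dead] = 0.01 * torch.randn(d_mlp, n_dead)
        b_enc.data[dead] = 0.0
        W_dec.data[dead, :] = F.normalize(torch.randn(n_dead, d_mlp), dim=-1)
    fire_counts.zero_()
```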

Also, notice `n_features=1536` -- this is a 2x-multiple autoencoder. Note that the embedding dimension of the transformer we trained is 192.
We're plucking the activations from inside the MLP, which projects up to 4x the embedding dimension (i.e., 768), and 1536 is twice that.
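
To make those dimensions concrete, here's a minimal sparse autoencoder with that shape -- 768-dimensional MLP activations in, 1536 features, and a reconstruction loss plus an L1 penalty on the feature activations. This is a generic sketch of the recipe, omitting details like the pre-encoder bias and decoder-norm constraints, not the exact contents of `train_autoencoder.py`:

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_mlp=768, n_features=1536):   # 4 * n_embd = 768, 2x multiple = 1536
        super().__init__()
        self.enc = nn.Linear(d_mlp, n_features)
        self.dec = nn.Linear(n_features, d_mlp)

    def forward(self, x):
        f = F.relu(self.enc(x))    # sparse feature activations
        x_hat = self.dec(f)        # reconstruction of the MLP activation vector
        return x_hat, f

sae = SparseAutoencoder()
x = torch.randn(2048, 768)         # a batch of stored MLP activation vectors
x_hat, f = sae(x)

l1_coeff = 3e-7                    # matches --l1_coeff above
loss = F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()
loss.backward()
```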

Anthropic mentions they train a family of models with multiples from 1x to 256x. For a low-compute reproduction, 2x seemed to work well. I tried
a few other multiples and didn't see a significant difference.

Note that training the autoencoder is quite fast. This is probably the weakest point of the pipeline, and more data would probably help. An
obvious TODO would be to scale this up to ~1 hour of training or so, and see how things change.

Once this is done, it's time to actually inspect the features by running repeated forward passes with certain neurons suppressed.

```
python build_website.py \
    --device=mps \
    --dataset=shakespeare_char \
    --gpt_ckpt_dir=0730-shakespeare-python-custom-tok \
    --sae_ckpt_dir=2024-07-31-0936 \
    --num_contexts=100 \
    --num_phases=96
```
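
Under the hood, "inspecting a feature" mostly boils down to finding the contexts on which that feature activates most strongly and rendering them as HTML. Here's a rough sketch of that core step with stand-in data, not the repo's real `build_website.py` pipeline:

```
import torch

# Stand-ins for the real data: SAE feature activations at each stored token
# position, plus the corresponding token ids.
n_positions, n_features, vocab_size = 20_000, 1536, 1024
feature_acts = torch.relu(torch.randn(n_positions, n_features) - 1.0)
tokens = torch.randint(0, vocab_size, (n_positions,))

feature_id = 42                                   # hypothetical feature of interest
top = torch.topk(feature_acts[:, feature_id], k=10)
for value, idx in zip(top.values, top.indices):
    # the real script would decode a window of tokens around idx instead
    print(f"position {idx.item():6d}  token id {tokens[idx].item():4d}  activation {value.item():.3f}")
```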

## Results

We do see some differentiation between neurons! But not a tremendous amount. The activations don't follow a nice power law as one might
hope. I hypothesize this is due to lack of data.
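
One quick way to look at this is each feature's activation density -- the fraction of token positions on which it fires -- sorted and plotted on log axes. A sketch with stand-in activations (the real numbers would come from running the SAE over the stored MLP activations):

```
import torch
import matplotlib.pyplot as plt

feature_acts = torch.relu(torch.randn(20_000, 1536) - 1.0)   # stand-in for real SAE activations
density = (feature_acts > 0).float().mean(dim=0)             # fraction of positions each feature fires on
density = density.sort(descending=True).values.clamp(min=1e-6)

plt.loglog(range(1, len(density) + 1), density.numpy())
plt.xlabel("feature rank")
plt.ylabel("activation density")
plt.savefig("feature_density.png")
```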

My hope was that there would be clearly delineated "Python" vs. "Shakespeare" neurons. This happened a bit, but not as much as I would have liked.
One obvious thing to try would be to train on multiple character sets, for example English and Japanese. This would create differences at the token
level that might allow for a clearer fragmentation of the internals.

Plenty of neurons seem to be "majority Python", but few seem to be "majority Shakespeare".

Here's an example of a Python neuron:

<p align="center">
  <img src="./assets/python.png" width="700" />
</p>

And here's an example of a (rare) Shakespeare neuron:

<p align="center">
  <img src="./assets/shakespeare.png" width="700" />
</p>

## Future Work

I'm really interested in the intersection of "low compute / efficient training" + "model interpretability". Anthropic mentions the challenges
of scaling this approach up to large models in their paper, and goes into more detail in their follow-up paper on extracting features
from Claude 3 Sonnet.

As models continue to get bigger, and open source continues to try to catch up, it seems valuable to have well-established,
off-the-shelf techniques for decomposing and interpreting the inner workings of a transformer. Among other things, the steerability and application
benefits are significant. The Golden Gate Bridge feature demo was a fun example, but the tactical benefits and possibilities make the
concept of a system prompt seem somewhat antiquated.