Peek Inside the Model's Brain: From Black Box to Glass Box.
I deployed a loan-approval model. High accuracy. Then a regulator asked: 'Why did you reject this applicant?' I had no answer. These three techniques change that.
Interpretability for LLMs is like neuroscience for the brain. You can reason about behavior, but mechanistic understanding is an active research field. Use these techniques as guides and evidence, not absolute ground truth.
Technique 1: Attention Maps (The "What are you looking at?" Test)
Attention weights show how strongly each position attends to earlier tokens while the model computes its next-token prediction. Using circuitsvis, we can render these weights as an interactive visualization in a notebook.
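# Visualize Attention with circuitsvis
import torch
import circuitsvis as cv
from transformers import AutoModelForCausalLM, AutoTokenizer

# "eager" attention is requested because fused attention kernels may not return weights
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "The doctor asked the nurse to help him with the patient."
inputs = tokenizer(text, return_tensors="pt")
# Token strings aligned with the input ids (GPT-2 marks a leading space as "Ġ")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape [batch, heads, seq, seq]
# Visualize every head in layer 0 for the first (and only) batch item
cv.attention.attention_heads(
    tokens=tokens,
    attention=outputs.attentions[0][0],
)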
- Induction Heads: Heads that find an earlier occurrence of the current token and attend to the token that followed it, so the model can continue the repeated pattern (thought to be important for in-context learning).
- Semantic Heads: Heads that attend to semantically related words (e.g., attending to "doctor" when processing "patient").
- Bias Check: At the position of "him", does a head attend more strongly to "doctor" than to "nurse"? A consistent gap is a hint of gender bias worth investigating, not proof of it.
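The bias check can be made quantitative rather than eyeballed. The following is a minimal sketch, assuming `attention` holds one layer's weights indexed as [head][query position][key position] (for example `outputs.attentions[0][0]` from the snippet above) and that each word of interest is a single token. `attention_gap` is a hypothetical helper written for this illustration, not part of circuitsvis or transformers.

```python
def attention_gap(attention, tokens, query, option_a, option_b):
    """Per head: attention from `query` to `option_a` minus attention to `option_b`.

    attention: indexable as attention[head][query_pos][key_pos]
               (nested lists or a [heads, seq, seq] tensor both work)
    tokens:    token strings aligned with the attention matrix
    """
    # GPT-2 tokens carry a leading "Ġ" for spaces; strip it before matching
    clean = [t.lstrip("Ġ").strip() for t in tokens]
    q, a, b = (clean.index(word) for word in (query, option_a, option_b))
    # Positive values mean the head favors option_a, negative values option_b
    return [float(head[q][a]) - float(head[q][b]) for head in attention]
```

Calling `attention_gap(outputs.attentions[0][0], tokens, "him", "doctor", "nurse")` would give one number per head in layer 0. A large gap in one head is only a starting point: attention is one line of evidence, and the same comparison should be repeated across layers and with the roles swapped in the sentence.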
Technique 2: Sparse Autoencoders (SAEs)
Models represent far more concepts than they have dimensions, a phenomenon called superposition. SAEs act as a decompressor, untangling these overlapping meanings into sparse "interpretable features."
- In the Model: A single neuron might fire for "colors," "legal text," and "the word 'Paris'."
- In the SAE: Each latent feature ideally represents a single concept. For example, a hypothetical Feature #4023 might fire almost exclusively for "deceptive intent."
# Conceptual SAE Implementation (PyTorch)
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_sae):
        super().__init__()
        # d_sae is typically much larger than d_model (an overcomplete dictionary)
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: model hidden states [batch, d_model]
        # latent: non-negative feature activations [batch, d_sae];
        # sparsity comes from a penalty on latent during training, not from ReLU alone
        latent = self.relu(self.encoder(x))
        reconstructed = self.decoder(latent)
        return reconstructed, latent
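The class above only defines the forward pass; what makes the latent sparse is the training objective. A common choice, assumed here rather than taken from any specific paper's recipe, is reconstruction error plus an L1 penalty on the latent activations. The sketch below uses plain Python lists so it runs without dependencies; `sae_loss` and `l1_coeff` are names invented for this illustration.

```python
def sae_loss(x, reconstructed, latent, l1_coeff=1e-3):
    """Reconstruction error plus an L1 sparsity penalty for one batch.

    x, reconstructed: lists of rows, each of length d_model
    latent:           list of rows, each of length d_sae (post-ReLU activations)
    """
    n = len(x)
    # Mean squared reconstruction error over every element in the batch
    mse = sum(
        (a - b) ** 2
        for row_x, row_r in zip(x, reconstructed)
        for a, b in zip(row_x, row_r)
    ) / (n * len(x[0]))
    # L1 penalty: batch mean of the summed absolute activations per example
    l1 = sum(sum(abs(v) for v in row) for row in latent) / n
    return mse + l1_coeff * l1
```

Raising `l1_coeff` pushes more latent activations to zero at the cost of worse reconstruction, so the coefficient is usually tuned until features are sparse but the reconstruction is still faithful. With tensors, the same quantity would be computed with vectorized operations so that gradients can flow.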
Using SAEs on Claude 3 Sonnet, Anthropic researchers found a "Golden Gate Bridge" feature. By artificially amplifying this feature's activation, they made the model bring up the bridge in response to almost any prompt, even unrelated ones.
Technique 3: Steering Vectors (Behavioral Control)
If concepts are directions in the model's activation space, we can "steer" the model by adding or subtracting these directions during inference.
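# Applying a Steering Vector during generation
# Assumes a GPT-2-style model (blocks at model.transformer.h) and a steering_vector of shape [d_model]
def generate_with_steering(model, tokenizer, prompt, steering_vector, alpha=1.0, layer=6):
    def add_vector(module, args, output):
        # The block's hidden states are output[0] if it returns a tuple, else output itself
        if isinstance(output, tuple):
            return (output[0] + alpha * steering_vector,) + output[1:]
        return output + alpha * steering_vector

    # Hook the chosen layer so every forward pass during generation is nudged
    handle = model.transformer.h[layer].register_forward_hook(add_vector)
    try:
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=40)
    finally:
        handle.remove()  # detach the hook so later calls run unsteered
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)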
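The function above takes `steering_vector` as given. One simple way to obtain such a vector, assumed here as an illustration and not the only option, is the difference of mean activations between two contrasting sets of prompts (for example, cheerful versus gloomy sentences) collected at the layer you plan to steer. `mean_difference_vector` is a hypothetical helper that works on plain lists of activation rows.

```python
def mean_difference_vector(positive_acts, negative_acts):
    """Steering direction = mean(positive activations) - mean(negative activations).

    Each argument is a list of activation rows, one row of length d_model
    per example, all collected at the same layer.
    """
    def column_means(rows):
        # zip(*rows) walks the rows dimension by dimension
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos_mean = column_means(positive_acts)
    neg_mean = column_means(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]
```

Adding this direction with a positive `alpha` pushes activations toward the "positive" set, and a negative `alpha` pushes them toward the "negative" set. The result would need converting to a tensor of the model's dtype before being added to hidden states.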
Steering in Action: Tone Control
Negative steering: "The weather is absolute trash. I hate everything about this day."
No steering (baseline): "The weather is overcast today with a slight chance of rain."
Positive steering: "What a magnificent, refreshing day! The clouds look like beautiful art."
Practical Workflow: Diagnose & Intervene
The Interpretability Cycle
1. Use Attention Maps to identify which tokens are driving a specific output.
2. Use SAE Features to see if a "bias" or "safety" feature is active.
3. Apply a Steering Vector in the opposite direction to neutralize the behavior.
4. Re-run the test to ensure the intervention worked without breaking other capabilities.
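The final re-run step can be sketched as a small before/after comparison. This is a minimal harness under stated assumptions: `baseline_fn` and `steered_fn` are any callables mapping a prompt to generated text (for instance, wrappers around unsteered and steered generation with greedy decoding), and `shows_behavior` is a user-supplied check for the unwanted behavior. All names are hypothetical.

```python
def evaluate_intervention(baseline_fn, steered_fn, target_prompts,
                          control_prompts, shows_behavior):
    """Measure whether steering removed a behavior and how much else it changed."""
    def behavior_rate(fn, prompts):
        # Fraction of prompts whose output exhibits the unwanted behavior
        return sum(bool(shows_behavior(fn(p))) for p in prompts) / len(prompts)

    # Control prompts should be unrelated to the behavior; changed outputs
    # there suggest the intervention has side effects
    changed = sum(baseline_fn(p) != steered_fn(p) for p in control_prompts)
    return {
        "behavior_rate_before": behavior_rate(baseline_fn, target_prompts),
        "behavior_rate_after": behavior_rate(steered_fn, target_prompts),
        "control_outputs_changed": changed / len(control_prompts),
    }
```

A useful intervention lowers the behavior rate on the target prompts while leaving most control outputs untouched. Exact string comparison is a blunt measure and only makes sense with deterministic decoding; a task-specific quality metric on the control set would be a stronger check.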
Key Takeaways
- Just because a model attends to a word doesn't mean that word is the 'reason' for the answer. Interpretability requires multiple, converging lines of evidence.
- Don't just look at what the model says (behavioral). Look at the internal activations (mechanistic) to find the 'why.'
- You can toggle a steering vector on or off per request. Fine-tuning changes the weights for every request and can lead to 'catastrophic forgetting' of other skills.