I deployed a loan-approval model. High accuracy. Fast predictions. Then a regulator asked: "Why did you reject this applicant?" I had no answer. The model was a black box. This article teaches three techniques that change that—letting you see what models focus on, discover hidden concepts they've learned, and control their behavior in real time.
Do You Actually Need Interpretability?
Before writing any code, be honest with yourself. Interpretability adds complexity—use it where it pays off.
- High-stakes decisions — loan approval, medical diagnosis, hiring: you need to explain why
- Bias detection — catch unintended discriminatory patterns before they cause harm
- Model alignment — understand if a model learned the right concepts, not shortcuts
- Debugging unexplained failures — find which features are causing wrong outputs
- Research — discover what representations models actually learn

You can usually skip it for:

- Low-stakes recommendations — Spotify playlists, TikTok feeds: nobody needs a justification
- Well-understood tasks — OCR, speech-to-text: model behavior is predictable and benchmarked
- Internal tooling — where failures don't affect users or compliance
- Rapid prototyping — interpretability is a production concern, not a Day 1 concern
Technique 1: Attention Maps
What Attention Actually Is
A transformer processes tokens by computing attention weights—for each position in the sequence, how much should it "attend to" every other position when producing output? These weights are learned, normalized (sum to 1), and computed across multiple "heads" per layer.
Input: "The cat sat on the mat"
↓
Tokenizer: ["The", "Ġcat", "Ġsat", "Ġon", "Ġthe", "Ġmat"]   (GPT-2 marks a leading space with "Ġ")
↓
Layer 0, Head 0 attention matrix (GPT-2 is causal, so each token
can only attend to itself and earlier positions):

        The    cat    sat    on     the    mat
The  [ 1.00    -      -      -      -      -    ]
cat  [ 0.35   0.65    -      -      -      -    ]
sat  [ 0.10   0.55   0.35    -      -      -    ]
on   [ 0.05   0.15   0.45   0.35    -      -    ]
the  [ 0.20   0.15   0.15   0.15   0.35    -    ]
mat  [ 0.10   0.25   0.15   0.10   0.15   0.25 ]

→ Read row i as: "When processing token i, how much weight did this
  head place on each visible token?" Each row sums to 1.
Different heads learn different roles: some copy the preceding token, some track named entities, some find subject-verb agreement. Visualizing many heads reveals the model's "reading strategy."
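Under the hood, those weights are just a row-wise softmax over scaled dot products. Here is a minimal NumPy sketch of that computation, using random Q/K matrices with illustrative shapes rather than real model weights:

```python
import numpy as np

def attention_weights(Q: np.ndarray, K: np.ndarray, causal: bool = True) -> np.ndarray:
    """softmax(Q @ K.T / sqrt(d_k)) with an optional causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # [seq, seq]
    if causal:
        # each query may only attend to itself and earlier positions
        mask = np.tril(np.ones_like(scores, dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 8))  # 6 tokens, head dimension 8
K = rng.normal(size=(6, 8))
W = attention_weights(Q, K)
print(np.allclose(W.sum(axis=-1), 1.0))  # True: every row is a distribution
print(np.allclose(W[0, 1:], 0.0))        # True: token 0 can't see the future
```

Real transformers learn the projections that produce Q and K per head; the masking and normalization are exactly as above.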
Setup and Code
pip install transformers torch matplotlib circuitsvis
# attention_maps.py
import torch
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
# GPT-2 requires no authentication — perfect for learning
# For Llama 3.2: run `huggingface-cli login` first, then use
# "meta-llama/Llama-3.2-1B-Instruct"
MODEL_NAME = "gpt2"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model.eval()
def get_attentions(text: str) -> tuple:
"""
Run a forward pass and return:
- tokens: list of token strings
- attentions: tuple of (n_layers,) each shape [batch, heads, seq, seq]
"""
inputs = tokenizer(text, return_tensors="pt")
token_strs = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
with torch.no_grad():
outputs = model(**inputs, output_attentions=True)
return token_strs, outputs.attentions
def plot_attention_head(tokens: list, attentions: tuple, layer: int = 0, head: int = 0):
"""Plot a single attention head as a heatmap."""
attn = attentions[layer][0, head].cpu().numpy() # [seq, seq]
fig, ax = plt.subplots(figsize=(8, 7))
im = ax.imshow(attn, cmap="Blues", vmin=0, vmax=attn.max())
ax.set_xticks(range(len(tokens)))
ax.set_yticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=45, ha="right", fontsize=9)
ax.set_yticklabels(tokens, fontsize=9)
ax.set_title(f"Layer {layer}, Head {head}\n(row = query, col = key)")
ax.set_xlabel("Key (attended-to token)")
ax.set_ylabel("Query (attending token)")
plt.colorbar(im, ax=ax, label="Attention weight")
plt.tight_layout()
plt.savefig(f"attn_L{layer}H{head}.png", dpi=150)
plt.show()
print(f"Saved: attn_L{layer}H{head}.png")
def find_head_roles(tokens: list, attentions: tuple) -> dict:
"""
Heuristically classify what each head in layer 0 seems to do.
Returns dict of head_index -> description.
"""
n_heads = attentions[0].shape[1]
roles = {}
for h in range(n_heads):
attn = attentions[0][0, h].cpu().numpy() # [seq, seq]
seq_len = len(tokens)
# Check: does this head primarily attend to the PREVIOUS token?
prev_token_weight = np.mean([attn[i, i-1] for i in range(1, seq_len)])
# Check: does this head primarily attend to the SAME token?
self_weight = np.mean(np.diag(attn))
# Check: does attention spread across many tokens (global)?
entropy = -np.sum(attn * np.log(attn + 1e-9), axis=-1).mean()
if prev_token_weight > 0.4:
roles[h] = f"Head {h}: previous-token copy (prev_weight={prev_token_weight:.2f})"
elif self_weight > 0.5:
roles[h] = f"Head {h}: self-attention (self_weight={self_weight:.2f})"
elif entropy > 2.0:
roles[h] = f"Head {h}: global/distributed (entropy={entropy:.2f})"
else:
roles[h] = f"Head {h}: mixed pattern"
return roles
if __name__ == "__main__":
text = "The cat sat on the mat and looked at the dog"
tokens, attentions = get_attentions(text)
print(f"Model: {MODEL_NAME}")
print(f"Tokens: {tokens}")
print(f"Layers: {len(attentions)}, Heads per layer: {attentions[0].shape[1]}")
# Plot layer 0, heads 0-3
for head in range(4):
plot_attention_head(tokens, attentions, layer=0, head=head)
# Classify head roles
print("\nHeuristic head roles in Layer 0:")
roles = find_head_roles(tokens, attentions)
for desc in roles.values():
print(f" {desc}")
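The entropy threshold in find_head_roles has a simple intuition: a head that spreads attention uniformly over n tokens has entropy ln(n), while a head locked onto a single token has entropy near zero. A quick standalone check:

```python
import numpy as np

def row_entropy(p: np.ndarray) -> float:
    """Shannon entropy of one attention row (natural log, as in find_head_roles)."""
    return float(-(p * np.log(p + 1e-9)).sum())

uniform = np.full(8, 1 / 8)   # attention spread evenly over 8 tokens
one_hot = np.eye(8)[0]        # attention locked onto a single token

print(round(row_entropy(uniform), 2))       # 2.08 (= ln 8)
print(abs(round(row_entropy(one_hot), 2)))  # 0.0
```

So the `entropy > 2.0` cutoff in the script roughly means "spread over more than seven or so tokens"; tune it to your sequence lengths.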
Interactive Visualization with CircuitsVis
For richer, interactive exploration (great in Jupyter notebooks):
from circuitsvis.attention import attention_heads
tokens, attentions = get_attentions("The quick brown fox jumps over the lazy dog")
# CircuitsVis renders an interactive HTML widget
# Each head is clickable; hover to see attention per token pair
attention_heads(
tokens=tokens,
attention=attentions[0][0].detach().numpy(), # Layer 0, all heads: [heads, seq, seq]
)
Patterns you'll commonly see across GPT-2's heads:

- Previous-token head: Strongly attends to the token just before — used for copying patterns
- BOS head: Always attends to the beginning-of-sequence token — provides global context
- Syntactic heads: Track subject-verb pairs, articles and nouns, prepositional phrases
- Semantic heads: Group semantically related tokens (e.g., "cat" ↔ "dog")
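BOS-style heads are easy to screen for numerically: measure how much mass every query row puts on position 0. A toy sketch, where bos_head_score is a hypothetical helper in the spirit of find_head_roles:

```python
import numpy as np

def bos_head_score(attn: np.ndarray) -> float:
    """Mean attention mass that all queries place on the first position."""
    return float(attn[:, 0].mean())

# toy head: every later token sends ~80% of its attention to position 0
toy = np.zeros((4, 4))
toy[:, 0] = 0.8
toy[np.arange(4), np.arange(4)] += 0.2
toy[0, 0] = 1.0  # the first token can only attend to itself
print(round(bos_head_score(toy), 2))  # 0.85
```

A score above ~0.5 is a reasonable (heuristic) flag for a BOS head; apply it per head per layer just like the other checks.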
Technique 2: Sparse Autoencoders (SAEs)
The Superposition Problem
Language models pack more concepts than they have neurons by using superposition—concepts are encoded as directions in activation space, not individual neurons. A single neuron fires for "cat," "doctor," "the color blue," and "negative financial news" depending on context.
Sparse autoencoders tackle this by learning a larger, sparser representation in which each direction ideally corresponds to a single interpretable concept.
Dense activations (512-dim):
[0.3, -0.7, 0.1, 0.9, ...] ← most neurons fire, meaning is entangled
SAE latent (2048-dim, sparse):
[0, 0, 0, 0, 0, 0, 1.8, 0, 0, 0, 0, 0, 0.5, 0, ...]
↑ ↑
"color concept" "comparison"
→ Only 2-5% of latent neurons fire. Each active neuron = one concept.
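You can see why superposition is possible at all with a toy experiment: random unit directions in even a modest-dimensional space are nearly orthogonal, so a model can pack far more concepts than it has dimensions, at the cost of small interference between them:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_concepts = 16, 50                   # 50 "concepts" in only 16 dimensions
dirs = rng.normal(size=(n_concepts, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

overlaps = np.abs(dirs @ dirs.T)         # |cosine similarity| between concepts
np.fill_diagonal(overlaps, 0.0)          # ignore self-similarity
print(f"mean interference: {overlaps.mean():.2f}")  # small
print(f"max interference:  {overlaps.max():.2f}")   # well below 1.0
```

The SAE's expanded latent space gives each of those crowded directions its own dedicated, mostly-zero coordinate.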
Training an SAE on GPT-2 Activations
# sae.py
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
import numpy as np
# ── Step 1: Extract activations from a specific layer ──────────
def extract_activations(
model_name: str = "gpt2",
layer_idx: int = 6, # Middle layer of GPT-2 (12 layers)
n_samples: int = 5000,
batch_size: int = 8,
) -> torch.Tensor:
"""
Extract residual stream activations from a given layer.
Returns tensor of shape [n_tokens, hidden_dim].
"""
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model.eval()
# Load a text corpus (using Wikipedia for diverse vocabulary)
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = [t for t in dataset["text"] if len(t.strip()) > 50][:n_samples]
all_activations = []
hook_output = []
def hook_fn(module, input, output):
# output is (hidden_states, ...) — we want hidden_states
hidden = output[0] if isinstance(output, tuple) else output
hook_output.append(hidden.detach().cpu())
# Register hook on the target layer's output
hook = model.transformer.h[layer_idx].register_forward_hook(hook_fn)
for i in range(0, min(len(texts), n_samples), batch_size):
batch_texts = texts[i : i + batch_size]
inputs = tokenizer(
batch_texts, return_tensors="pt", truncation=True,
max_length=128, padding=True
)
        with torch.no_grad():
            model(**inputs)
        # Flatten [batch, seq, hidden] → [n_real_tokens, hidden],
        # dropping padding positions so they don't pollute the dataset
        mask = inputs.attention_mask.reshape(-1).bool()
        for act in hook_output:
            flat = act.reshape(-1, act.shape[-1])
            all_activations.append(flat[mask])
        hook_output.clear()
if i % 100 == 0:
print(f" Extracted {i}/{min(len(texts), n_samples)} samples")
hook.remove()
activations = torch.cat(all_activations, dim=0)
print(f"Activation shape: {activations.shape}")
return activations
# ── Step 2: Define SAE architecture ────────────────────────────
class SparseAutoencoder(nn.Module):
"""
Sparse Autoencoder for mechanistic interpretability.
Key design choices:
    - Encoder: single linear layer + ReLU (this minimal variant omits the
      encoder bias; many SAE implementations add a pre-encoder bias term)
- Decoder: linear layer with L2-normalized columns
- Loss: MSE reconstruction + L1 sparsity penalty
"""
def __init__(self, input_dim: int, latent_dim: int):
super().__init__()
self.input_dim = input_dim
self.latent_dim = latent_dim
self.encoder = nn.Linear(input_dim, latent_dim, bias=False)
self.decoder = nn.Linear(latent_dim, input_dim, bias=True)
# Initialize decoder columns to be unit vectors
with torch.no_grad():
self.decoder.weight.data = F.normalize(
self.decoder.weight.data, dim=0
)
def encode(self, x: torch.Tensor) -> torch.Tensor:
return F.relu(self.encoder(x))
def decode(self, z: torch.Tensor) -> torch.Tensor:
return self.decoder(z)
def forward(self, x: torch.Tensor) -> tuple:
z = self.encode(x)
x_hat = self.decode(z)
return x_hat, z
def normalize_decoder(self):
"""Renormalize decoder columns after each optimizer step."""
with torch.no_grad():
self.decoder.weight.data = F.normalize(
self.decoder.weight.data, dim=0
)
def sae_loss(
x: torch.Tensor,
x_hat: torch.Tensor,
z: torch.Tensor,
sparsity_coeff: float = 1e-3,
) -> tuple:
recon_loss = F.mse_loss(x_hat, x)
sparsity_loss = z.abs().mean()
total_loss = recon_loss + sparsity_coeff * sparsity_loss
return total_loss, recon_loss, sparsity_loss
# ── Step 3: Train SAE ───────────────────────────────────────────
def train_sae(
activations: torch.Tensor,
input_dim: int,
expansion_factor: int = 4, # latent_dim = 4 × input_dim
n_epochs: int = 10,
batch_size: int = 512,
lr: float = 1e-3,
sparsity_coeff: float = 1e-3,
) -> SparseAutoencoder:
latent_dim = input_dim * expansion_factor
sae = SparseAutoencoder(input_dim, latent_dim)
optimizer = torch.optim.Adam(sae.parameters(), lr=lr)
# Normalize activations (important for stable training)
mean = activations.mean(dim=0, keepdim=True)
std = activations.std(dim=0, keepdim=True) + 1e-8
activations_norm = (activations - mean) / std
dataset = TensorDataset(activations_norm)
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
for epoch in range(n_epochs):
total_loss_sum = recon_sum = sparse_sum = 0
for (batch,) in loader:
optimizer.zero_grad()
x_hat, z = sae(batch)
loss, recon, sparse = sae_loss(batch, x_hat, z, sparsity_coeff)
loss.backward()
optimizer.step()
sae.normalize_decoder()
total_loss_sum += loss.item()
recon_sum += recon.item()
sparse_sum += sparse.item()
        sparsity_rate = (z.abs() > 0.01).float().mean().item()  # measured on the last batch only
print(
f"Epoch {epoch+1}/{n_epochs} | "
f"Loss: {total_loss_sum/len(loader):.4f} | "
f"Recon: {recon_sum/len(loader):.4f} | "
f"Sparsity rate: {sparsity_rate:.2%}"
)
return sae
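To build intuition for the sparsity_coeff trade-off, note what the L1 term actually measures: mean absolute activation. Among codes that reconstruct equally well, the optimizer prefers the one with less total mass, which in practice concentrates activity into a few latents. A standalone toy comparison:

```python
import torch

# two latent codes with the same shape [1, 32]
dense_z = torch.full((1, 32), 0.25)   # 32 small activations spread everywhere
sparse_z = torch.zeros(1, 32)
sparse_z[0, :2] = 1.0                 # only 2 active latents

print(dense_z.abs().mean().item())   # 0.25
print(sparse_z.abs().mean().item())  # 0.0625
```

Raising sparsity_coeff shifts the balance toward codes like the second one; push it too far and reconstruction quality collapses, so watch both terms during training.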
Discovering What Each Feature Represents
Once trained, you can find the most activating text segments for any latent feature:
def discover_features(
sae: SparseAutoencoder,
activations: torch.Tensor,
tokens_per_sample: list, # list of token strings for each activation row
top_k_features: int = 10,
top_k_samples: int = 5,
):
"""
For each of the top-K most-used features, print the text segments
that most strongly activate it.
"""
sae.eval()
with torch.no_grad():
z = sae.encode(activations)
    # Rank features by mean activation strength across all rows
mean_activations = z.mean(dim=0)
top_feature_ids = mean_activations.argsort(descending=True)[:top_k_features]
print(f"\nTop {top_k_features} features and their activating tokens:\n")
for feat_id in top_feature_ids:
feat_id = feat_id.item()
# Get samples where this feature is strongest
feature_acts = z[:, feat_id]
top_sample_ids = feature_acts.argsort(descending=True)[:top_k_samples]
print(f"Feature {feat_id} (mean activation: {mean_activations[feat_id]:.3f}):")
for sample_id in top_sample_ids:
act_val = feature_acts[sample_id].item()
token = tokens_per_sample[sample_id] if sample_id < len(tokens_per_sample) else "?"
print(f" [{act_val:.2f}] '{token}'")
print()
Features that typically surface from this kind of analysis include:

- Features corresponding to specific people (Elon Musk, Abraham Lincoln)
- Features for abstract concepts (power, justice, freedom)
- Features for syntactic roles (subject of a sentence, verb phrase)
- Multi-token "concept features" that span several words
- Safety-relevant features (deception, harm, political bias)
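One diagnostic worth running before you trust any feature labels: check for dead features. Latents that never activate on any input are wasted capacity and a common symptom of too aggressive a sparsity_coeff. A standalone sketch, using synthetic latents as a stand-in for sae.encode(activations):

```python
import torch

torch.manual_seed(0)
# stand-in for z = sae.encode(activations): the shift makes firing rare,
# mimicking an over-penalized SAE latent matrix
z = torch.relu(torch.randn(5000, 2048) - 4.0)

fires = (z > 0).any(dim=0)        # did each latent ever activate?
n_dead = int((~fires).sum())
print(f"dead features: {n_dead} / {z.shape[1]}")
```

If a large fraction of latents are dead, lower sparsity_coeff, train longer, or resample dead features by reinitializing their encoder rows.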
Technique 3: Steering Vectors
The Concept
Representation engineering shows that many high-level concepts (positive sentiment, formality, honesty) are encoded as directions in the model's activation space. If you can find that direction, you can add a vector along it to push the model's outputs in that direction—without touching the weights.
Activation space (simplified to 2D):

     informal ←──────────────────────────→ formal
  "Hey! So pumped                 "I am delighted
   for this!"                      to be here."

  The DIRECTION from informal/negative → formal/positive
  is the "positive formality" steering vector.

  Add (vector × multiplier) to the activations:
    multiplier =  0.0 → original response
    multiplier =  1.0 → noticeably more formal/positive
    multiplier =  2.0 → exaggeratedly formal/positive
    multiplier = -1.0 → steered toward informal/negative
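Under the hood, contrastive steering-vector training reduces to something simple: average the activation difference between positive and negative examples at a chosen layer. A synthetic NumPy sketch, where a planted "concept" direction stands in for real model activations:

```python
import numpy as np

rng = np.random.default_rng(0)
concept = rng.normal(size=64)
concept /= np.linalg.norm(concept)        # the "true" trait direction

pos_acts = rng.normal(size=(50, 64)) + 2.0 * concept  # trait present
neg_acts = rng.normal(size=(50, 64)) - 2.0 * concept  # trait absent

# mean difference recovers the planted direction despite per-example noise
steering = (pos_acts - neg_acts).mean(axis=0)
cos = steering @ concept / np.linalg.norm(steering)
print(f"cosine(steering, concept) = {cos:.2f}")  # high: direction recovered
```

This is why pair count matters: the per-example noise averages out, so 50–200 pairs recover a much cleaner direction than a handful.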
Implementation with the steering-vectors Library
The steering-vectors library (by David Chanin) provides a clean API for this:
pip install steering-vectors
# steering.py
from steering_vectors import train_steering_vector, SteeringVector
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
MODEL_NAME = "gpt2"
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model.eval()
# ── Step 1: Prepare contrastive training pairs ─────────────────
#
# Each pair is (positive_example, negative_example).
# Positive = the trait you want to steer TOWARD.
# Negative = the opposite trait.
# More pairs → more robust vector (50-200 is ideal).
SENTIMENT_PAIRS = [
("I absolutely loved the conference. The speakers were inspiring and insightful.",
"I absolutely hated the conference. The speakers were boring and pointless."),
("This is one of the best books I've ever read. Couldn't put it down.",
"This is one of the worst books I've ever read. Couldn't get through it."),
("The team did outstanding work this quarter. I'm incredibly proud.",
"The team did terrible work this quarter. I'm incredibly disappointed."),
("After the dinner yesterday, I felt satisfied and energized.",
"After the dinner yesterday, I felt sick and drained."),
("I'm genuinely excited about this project's potential.",
"I'm genuinely worried about this project's future."),
("The new feature works flawlessly. Users will love it.",
"The new feature is broken. Users will hate it."),
("My week was productive and fulfilling.",
"My week was exhausting and pointless."),
("I'm optimistic about the results we'll see next month.",
"I'm pessimistic about the results we'll see next month."),
]
# ── Step 2: Train the steering vector ─────────────────────────
print("Training sentiment steering vector...")
steering_vector: SteeringVector = train_steering_vector(
model=model,
tokenizer=tokenizer,
training_samples=SENTIMENT_PAIRS,
layers=list(range(6, 10)), # Extract from middle layers (6-9 of 12)
show_progress=True,
)
print(f"Steering vector trained on {len(SENTIMENT_PAIRS)} pairs")
# ── Step 3: Test with different multipliers ────────────────────
def generate_with_steering(
prompt: str,
multiplier: float = 0.0,
max_new_tokens: int = 50,
) -> str:
"""Generate text with optional steering vector applied."""
inputs = tokenizer(prompt, return_tensors="pt")
if multiplier == 0.0:
# Baseline: no steering
with torch.no_grad():
output = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=False,
pad_token_id=tokenizer.eos_token_id,
)
else:
# Apply steering vector during generation
with steering_vector.apply(model, multiplier=multiplier):
with torch.no_grad():
output = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=False,
pad_token_id=tokenizer.eos_token_id,
)
    # Return only the newly generated tokens. Slicing by token ids is more
    # robust than string slicing, since decode() may not reproduce the
    # prompt string exactly
    new_tokens = output[0, inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
if __name__ == "__main__":
prompt = "After the dinner yesterday, I felt"
for multiplier in [-1.5, -1.0, 0.0, 1.0, 1.5, 2.0]:
response = generate_with_steering(prompt, multiplier=multiplier)
print(f"multiplier={multiplier:+.1f} | ...{response[:80]}")
Example output (greedy decoding; exact completions vary with model and training pairs):
multiplier=-1.5 | ...terrible. My stomach was in knots. I couldn't sleep at all.
multiplier=-1.0 | ...uneasy. Something about the evening just didn't sit right.
multiplier= 0.0 | ...tired. It had been a long day and I needed rest.
multiplier=+1.0 | ...satisfied and content. The food was excellent.
multiplier=+1.5 | ...wonderful and grateful. What an extraordinary evening!
multiplier=+2.0 | ...absolutely ecstatic! The most incredible meal of my life!
Building Your Own Steering Vectors
The same approach works for other traits—just swap the contrastive pairs. For example (the positive halves here are illustrative counterparts):

- Formality — Positive: "I would be pleased to assist you with this matter." / Negative: "Sure thing, happy to help out!"
- Calibrated honesty — Positive: "I'm not certain, but the evidence suggests..." / Negative: "I know for a fact that..." [incorrect claim]
- Refusal — Positive: "I can't help with that request." / Negative: "Here are the step-by-step instructions..."
- Technical precision — Positive: "Attention computes weighted sums over token representations..." / Negative: "AI looks at words and figures out what's important..."
Practical Workflow: Combining All Three
Here's how the three techniques fit together in a real debugging scenario:
Scenario: Your loan model is rejecting minority applicants at higher rates.
Step 1 — ATTENTION MAPS
→ Visualize what the model attends to for rejected vs. approved cases
→ Finding: Model heavily attends to zip code and name fields
→ Signal: Potential proxy discrimination via location/name
Step 2 — SPARSE AUTOENCODERS
→ Train SAE on loan model's layer activations
→ Discover features
→ Finding: Feature #847 strongly activates on neighborhood names
associated with minority communities
→ Signal: Model learned a "demographic proxy" feature
Step 3 — STEERING VECTORS
→ Train a "demographic-neutral" steering vector
→ Contrastive pairs: applications where only demographic proxies differ
→ Apply vector to push model away from demographic-influenced activations
→ Validate: rejection rates equalize across demographic groups
This observe → diagnose → intervene loop mirrors the workflow used in mechanistic interpretability research at labs like Anthropic.
Limitations: What Interpretability Can and Can't Tell You
- Attention ≠ explanation. High attention weight between two tokens doesn't mean the model "used" that relationship to produce its output—correlation, not causation. (See: Jain & Wallace, 2019, "Attention is not Explanation.")
- SAE features are statistical, not semantic. A feature labeled "colors" might actually encode something more subtle. Human interpretation of feature labels is inherently approximate.
- Steering vectors can be unstable at high multipliers. Beyond ±2.0 you often get degenerate outputs—repetition, incoherence. Always validate on a held-out set.
- These techniques don't fully generalize. SAEs trained on GPT-2 won't transfer to Llama without retraining. Even same-architecture models at different scales have different feature geometries.
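A cheap way to operationalize that degeneracy check: flag steered generations whose n-grams collapse into repetition before you ever read them. A minimal sketch (repeated_trigram_fraction is an illustrative helper, not part of any library here):

```python
def repeated_trigram_fraction(text: str) -> float:
    """Fraction of trigrams that are duplicates; near 1.0 means degenerate."""
    words = text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    return 1.0 - len(set(trigrams)) / len(trigrams)

ok = "The dinner was lovely and we stayed late talking about the trip."
broken = "great great great great great great great great great great"
print(repeated_trigram_fraction(ok))      # 0.0
print(repeated_trigram_fraction(broken))  # 0.875
```

Run it over a batch of steered generations at each multiplier; a sudden jump in the average is a good signal you've pushed the vector too hard.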
Key Takeaways

- Attention maps: pass output_attentions=True in a forward pass, not in generate(). Use circuitsvis for interactive exploration in notebooks. GPT-2 works with no authentication.
- Sparse autoencoders: extract middle-layer activations, train with an expansion factor of ~4 and an L1 sparsity penalty, then label features by their top-activating tokens.
- Steering vectors: use the steering-vectors library. Collect 50–200 contrastive pairs, apply to middle layers (40–70% of model depth), and keep multipliers in the ±1.5 range to avoid instability.