
Comprehensive Mechanistic Module

Battle-tested skill providing guidance on mechanistic interpretability. Includes structured workflows, validation checks, and reusable patterns for AI research.

Overview

Mechanistic interpretability is the subfield of AI safety and explainability research that aims to reverse-engineer neural networks by identifying the computational algorithms implemented by their internal components. Rather than treating models as opaque black boxes, mechanistic interpretability researchers decompose model behavior into human-understandable circuits, features, and mechanisms. The field has grown rapidly since 2022, with dedicated workshops at NeurIPS and ICML, major research programs at Anthropic, Google DeepMind, and independent organizations like MATS and Redwood Research, and a rich ecosystem of open-source tools. Key techniques include activation patching, sparse autoencoders (SAEs), circuit analysis, logit lens, probing classifiers, and causal scrubbing. This comprehensive module provides an overview of the entire field: its core concepts, major techniques, the tool ecosystem, research workflows, and practical guidance for getting started with interpretability research.

When to Use

  • Understanding model behavior: When you need to explain why a model produces specific outputs, not just what it produces.
  • AI safety research: Identify features related to deception, sycophancy, harmful content generation, or other safety-relevant behaviors.
  • Circuit discovery: Find the minimal subnetwork (circuit) responsible for a specific capability, like indirect object identification or modular arithmetic.
  • Feature engineering insights: Understand what representations models learn to inform better architectures and training procedures.
  • Model auditing: Verify that models are using appropriate features for decisions rather than spurious correlations.
  • Research career development: Build foundational knowledge for a career in mechanistic interpretability research.

Quick Start

Tool Installation

# Core libraries
pip install transformer-lens   # Hooked transformer models
pip install nnsight            # Any PyTorch model access
pip install sae-lens           # Sparse autoencoders
pip install circuitsvis        # Attention visualization

# Supporting libraries
pip install torch transformers einops fancy_einsum
pip install plotly matplotlib seaborn

First Experiment: Logit Lens

from transformer_lens import HookedTransformer
import torch

model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")
prompt = "The capital of France is"
tokens = model.to_tokens(prompt)

# Run model and cache all activations
_, cache = model.run_with_cache(tokens)

# Logit lens: project each layer's residual stream to vocabulary
for layer in range(model.cfg.n_layers):
    residual = cache["resid_post", layer]  # [batch, pos, d_model]
    # Apply final layer norm and unembedding
    normed = model.ln_final(residual)
    logits = normed @ model.W_U  # [batch, pos, vocab]
    probs = torch.softmax(logits[0, -1], dim=-1)
    top_token = model.to_str_tokens(probs.argmax().unsqueeze(0))[0]
    top_prob = probs.max().item()
    print(f"Layer {layer:2d}: {top_token:>12s} ({top_prob:.3f})")

Core Concepts

The Residual Stream View

Modern transformers are best understood as a residual stream that each layer reads from and writes to:

Embedding --> [+ Attn_0 + MLP_0] --> [+ Attn_1 + MLP_1] --> ... --> LayerNorm --> Unembed
              ^^^^^^^^^^^^^^^^        ^^^^^^^^^^^^^^^^
              Layer 0 residual         Layer 1 residual
              contribution             contribution

Each attention head and MLP layer adds to the residual stream. The final prediction depends on the cumulative sum of all contributions.
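This decomposition is directly checkable. A minimal sketch, assuming the model and cache from the Quick Start example above, that confirms the final residual stream equals the embeddings plus every attention and MLP contribution:

import torch

# Token + positional embeddings form the initial residual stream
resid = cache["hook_embed"] + cache["hook_pos_embed"]
# Each layer's attention and MLP outputs are added in
for layer in range(model.cfg.n_layers):
    resid = resid + cache["attn_out", layer] + cache["mlp_out", layer]

# Should match the post-final-layer residual stream (up to float tolerance)
final_resid = cache["resid_post", model.cfg.n_layers - 1]
print(torch.allclose(resid, final_resid, atol=1e-4))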

Key Techniques Overview

# 1. ACTIVATION PATCHING (schematic pseudocode; see the runnable sketch below)
# Swap activations between runs to test causal hypotheses
clean_cache = run_with_cache("The Eiffel Tower is in")
with corrupt_run("The Colosseum is in"):
    layer_8_output[:] = clean_cache["resid_post", 8]
# If the prediction changes from "Rome" to "Paris",
# layer 8 is causally important

# 2. SPARSE AUTOENCODERS
# Decompose activations into interpretable features
from sae_lens import SAE
sae = SAE.from_pretrained(release="gpt2-small-res-jb", ...)
features = sae.encode(activations)
# Each feature is a monosemantic concept

# 3. PROBING CLASSIFIERS
# Train linear probes on activations to test for information
from sklearn.linear_model import LogisticRegression
probe = LogisticRegression()
probe.fit(activations_at_layer_5, labels)
# High accuracy = information is linearly represented

# 4. ATTENTION PATTERN ANALYSIS
# Visualize which tokens attend to which
attn_patterns = cache["attn", layer]  # [batch, head, q_pos, k_pos]
# Look for induction heads, copy heads, previous-token heads
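To make technique 1 concrete, here is a runnable activation-patching sketch in TransformerLens. The prompts and the choice to patch only the final position are illustrative (patching the last position sidesteps the two prompts having different token lengths):

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
clean_tokens = model.to_tokens("The Eiffel Tower is in the city of")
corrupt_tokens = model.to_tokens("The Colosseum is in the city of")

# Cache the clean run's activations
_, clean_cache = model.run_with_cache(clean_tokens)

def patch_final_pos(activation, hook):
    # Overwrite the final position's residual stream with the clean value
    activation[:, -1, :] = clean_cache[hook.name][:, -1, :]
    return activation

paris_id = model.to_single_token(" Paris")
for layer in range(model.cfg.n_layers):
    logits = model.run_with_hooks(
        corrupt_tokens,
        fwd_hooks=[(f"blocks.{layer}.hook_resid_post", patch_final_pos)],
    )
    prob = torch.softmax(logits[0, -1], dim=-1)[paris_id].item()
    print(f"Layer {layer:2d}: P(' Paris') = {prob:.3f}")

Layers where the patch recovers the clean answer are causally implicated in the prediction.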

Circuit Analysis Workflow

from transformer_lens import HookedTransformer
import torch

model = HookedTransformer.from_pretrained("gpt2-small", device="cuda")

def find_important_components(model, prompt, target_token_id):
    """Identify which components matter for a prediction."""
    tokens = model.to_tokens(prompt)
    _, cache = model.run_with_cache(tokens)

    # Baseline logit for target token
    baseline_logits = model(tokens)
    baseline = baseline_logits[0, -1, target_token_id].item()

    # Test each attention head
    head_importance = {}
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            def ablation_hook(activation, hook, layer=layer, head=head):
                activation[:, :, head, :] = 0
                return activation

            ablated_logits = model.run_with_hooks(
                tokens,
                fwd_hooks=[(f"blocks.{layer}.attn.hook_z", ablation_hook)]
            )
            ablated = ablated_logits[0, -1, target_token_id].item()
            head_importance[(layer, head)] = baseline - ablated

    # Sort by importance
    sorted_heads = sorted(
        head_importance.items(), key=lambda x: abs(x[1]), reverse=True
    )
    return sorted_heads[:10]  # Top 10 most important heads

paris_id = model.to_single_token(" Paris")
important = find_important_components(
    model, "The capital of France is", paris_id
)
for (layer, head), importance in important:
    print(f"L{layer}H{head}: importance={importance:.3f}")
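One design note: zero-ablation pushes activations off-distribution, so mean ablation is often preferred. A hedged variant of the hook above (the averaging axes are a common choice, not the only one):

def mean_ablation_hook(activation, hook, head=0):
    # Replace the head's output with its mean over batch and positions,
    # rather than zeroing it outright
    activation[:, :, head, :] = activation[:, :, head, :].mean(
        dim=(0, 1), keepdim=True
    )
    return activation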

Induction Head Detection

def detect_induction_heads(model, seq_len=100):
    """Detect induction heads by measuring copying behavior."""
    # Create repeated random token sequence: [A B C D ... A B C D ...]
    random_tokens = torch.randint(1000, 10000, (1, seq_len))
    repeated = torch.cat([random_tokens, random_tokens], dim=1)
    _, cache = model.run_with_cache(repeated)

    scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
    for layer in range(model.cfg.n_layers):
        attn = cache["attn", layer][0]  # [heads, q_pos, k_pos]
        for head in range(model.cfg.n_heads):
            # Induction heads attend seq_len - 1 positions back: to the token
            # that followed the previous occurrence of the current token
            induction_stripe = attn[head].diagonal(offset=-(seq_len - 1))
            scores[layer, head] = induction_stripe.mean().item()
    return scores

scores = detect_induction_heads(model)
print("Top induction heads:")
for idx in scores.flatten().topk(5).indices:
    layer = idx.item() // model.cfg.n_heads
    head = idx.item() % model.cfg.n_heads
    print(f"  L{layer}H{head}: score={scores[layer, head].item():.3f}")
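Once candidate heads are identified, a quick visual check is worthwhile. A minimal CircuitsVis sketch for a notebook environment, assuming the model from earlier (the layer index is illustrative):

import circuitsvis as cv

tokens = model.to_tokens("The cat sat on the mat. The cat sat on the mat.")
_, cache = model.run_with_cache(tokens)

# Renders an interactive per-head attention-pattern widget
cv.attention.attention_patterns(
    tokens=model.to_str_tokens(tokens[0]),
    attention=cache["attn", 5][0],  # layer 5, all heads
)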

Configuration Reference

Tool Ecosystem

Tool | Purpose | Best For
TransformerLens | Hooked transformers with cache | Activation analysis, circuit discovery
NNsight | Any PyTorch model access | Architecture-agnostic research, remote execution
SAELens | Sparse autoencoder training/analysis | Feature discovery, steering
pyvene | Declarative interventions | Shareable, reproducible experiments
CircuitsVis | Attention visualization | Visual inspection of attention patterns
Neuronpedia | Feature browser | Exploring pre-trained SAE features
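For contrast with TransformerLens, a minimal NNsight sketch showing architecture-agnostic access via standard module paths (the tracing API shown follows recent nnsight releases; exact accessors vary by version):

from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="auto")

with model.trace("The capital of France is"):
    # Save block 8's output (a tuple; element 0 is the hidden states)
    hidden = model.transformer.h[8].output[0].save()

print(hidden.shape)  # on older nnsight versions, use hidden.value.shape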

Research Resources

Resource | Type | Link
ARENA 3.0 | Curriculum | arena3-chapter1-transformer-interp
Neel Nanda's YouTube | Tutorials | Concrete Steps to Get Started
200 Concrete Problems | Problem Set | alignmentforum.org
Anthropic Circuits | Research | transformer-circuits.pub
MATS Program | Research Mentorship | matsprogram.org

Best Practices

  1. Start with TransformerLens on GPT-2: GPT-2 Small is the workhorse model for interpretability research. It is small enough to run locally, well-studied with known circuits, and has pre-trained SAEs available.

  2. Use the logit lens as your first diagnostic: Before any complex experiment, check what the model predicts at each layer. This immediately tells you where the important computation happens.

  3. Think in terms of circuits, not individual neurons: Single neurons are polysemantic. Focus on identifying circuits (subnetworks of attention heads and MLPs) that implement specific functions.

  4. Always include proper controls: When patching activations, include random patching controls and corruption baselines. Without controls, you cannot distinguish causal effects from noise.

  5. Validate SAE features before trusting them: A feature with a high activation is not automatically meaningful. Check that it activates consistently across semantically similar inputs and does not activate for unrelated inputs; a minimal check is sketched after this list.

  6. Read the foundational papers: "A Mathematical Framework for Transformer Circuits" (Anthropic), "Towards Monosemanticity" (Anthropic), and "Interpretability in the Wild" (Wang et al.) provide the conceptual foundation.

  7. Join the community: The Alignment Forum, EleutherAI Discord, and MATS community are active venues for discussing research, getting feedback, and finding collaborators.

  8. Reproduce existing results first: Before pursuing novel research, reproduce a known result (like the IOI circuit or induction heads). This builds skills and validates your experimental setup.

  9. Track your experiments systematically: Use Weights & Biases or MLflow to log experimental parameters, results, and visualizations. Interpretability research involves many experiments with subtle variations.

  10. Be honest about limitations: Report negative results and failed hypotheses. The field advances faster when researchers share what does not work alongside what does.
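As promised in practice 5, a minimal SAE feature sanity check. This is a sketch, not a full validation protocol: the release and hook names follow the SAELens pre-trained GPT-2 SAEs referenced above, the return signature follows the SAELens tutorials, and the feature index is a hypothetical placeholder.

from transformer_lens import HookedTransformer
from sae_lens import SAE

model = HookedTransformer.from_pretrained("gpt2-small")
sae, _, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb", sae_id="blocks.8.hook_resid_pre"
)

FEATURE = 123  # hypothetical feature index; substitute one you are studying
for text in ["The Eiffel Tower is in Paris", "A recipe for banana bread"]:
    _, cache = model.run_with_cache(model.to_tokens(text))
    acts = sae.encode(cache[sae.cfg.hook_name])
    print(f"{text!r}: max activation = {acts[0, :, FEATURE].max().item():.3f}")
# A trustworthy feature should fire on semantically related inputs only.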

Troubleshooting

TransformerLens model not loading: Verify the model name matches the HuggingFace model ID. Not all models have TransformerLens wrappers; check the supported-model list in the TransformerLens documentation.

Activation cache consuming too much memory: Use model.run_with_cache(tokens, names_filter=lambda name: "resid_post" in name) to cache only specific activation types. On GPU, use torch.float16 to halve memory usage.

Attention patterns look uniform or random: This is normal for many heads in many layers. Most heads do not have easily interpretable attention patterns. Focus on heads identified as important through ablation studies.

SAE features all look the same: The L1 coefficient may be too low, producing dense features. Increase sparsity or use a TopK architecture. Also verify the hook point is correct.

Probing classifier has high accuracy on random labels: Your probe may be overfitting. Use proper train/test splits and regularization, and check baseline accuracy with shuffled labels. Linear probes should use logistic regression, not deep networks.
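A minimal shuffled-label control, assuming acts is an (n_samples, d_model) array of layer activations and labels its binary targets (both hypothetical names):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("real labels:    ", probe.score(X_test, y_test))

# Control: the same probe trained on shuffled labels should score near chance
rng = np.random.default_rng(0)
control = LogisticRegression(max_iter=1000).fit(X_train, rng.permutation(y_train))
print("shuffled labels:", control.score(X_test, y_test))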
