Mechanistic Interpretability

Reverse-engineering neural networks to understand the algorithms they implement

Lead Summary

Mechanistic interpretability is a research program that attempts to reverse-engineer neural networks—to understand not just what they output but how they compute it. Rather than treating a trained model as a black box and probing it with inputs, mechanistic interpretability examines the internal computational machinery: which neurons activate, how attention heads select tokens, how information flows through layers, and which subnetworks are causally responsible for specific behaviors.

The approach is motivated by a core inadequacy of behavioral methods. Even extraordinary behavioral competence does not establish what internal computational processes are actually running—a model producing contextually appropriate, philosophically sophisticated responses could be implementing those responses through mechanisms entirely unlike human reasoning. Mechanistic interpretability aims to close this gap by moving from behavioral evidence to causal evidence derived from controlled interventions on model internals.

The field is less a single unified program than a space of distinct interpretability paradigms (circuit discovery, sparse autoencoders, representation engineering, developmental tracking, and causal abstraction), each making different commitments about what constitutes a valid explanation, what level of granularity matters, and what causal guarantees are required. These paradigms operate in different explanatory frames, and the choice of method is itself an implicit claim about what kind of answer is being sought.

Core Concepts

The Residual Stream

The foundational insight enabling mechanistic analysis of transformers is the structure of the residual stream. In transformer architectures, all components—embedding, attention heads, MLP layers, and the unembedding—read from and write to a shared communication channel in an additive, non-destructive way. Each component projects its input from the residual stream, performs computation, and adds the result back rather than replacing the existing state.

This additive structure has a critical consequence: the residual stream is approximately linear, so differences in output logits can be attributed directly to individual components. Because the final logits are, up to the final layer normalization, a linear function of the residual stream, each component's additive write maps to an additive change in the logits, making the residual stream a reliable substrate for causal analysis. Without this property, reverse engineering would be substantially harder.

The residual stream is not a passive conduit but an active communication medium. Every component reads what prior components have written and adds its own contribution to the shared state, making multi-step computation traceable.
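
The additive attribution described in this section can be illustrated with a minimal NumPy sketch. The sizes, the component names, and the random weights are all hypothetical, and the final layer normalization is omitted so that the readout is exactly linear.

```python
import numpy as np

# Hypothetical toy sizes: residual width 4, vocabulary of 3 tokens.
rng = np.random.default_rng(0)
d_model, vocab_size = 4, 3

# Each component writes a vector into the residual stream; the stream is
# simply the running sum of those writes (nothing is overwritten).
writes = {
    "embed": rng.normal(size=d_model),
    "attn_head": rng.normal(size=d_model),
    "mlp": rng.normal(size=d_model),
}
residual = sum(writes.values())

# Assumed unembedding matrix (final layer norm left out for simplicity).
W_U = rng.normal(size=(vocab_size, d_model))
logits = W_U @ residual

# Because the readout is linear, each component's share of every logit can
# be computed separately, and the shares add up to the full logits.
contributions = {name: W_U @ vec for name, vec in writes.items()}
total_from_parts = sum(contributions.values())
```

In a real model the layer normalization before the unembedding makes this decomposition approximate rather than exact, which is why the text speaks of approximate linearity.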

Circuits

A circuit in mechanistic interpretability is formally defined as a computational subgraph: a subset of nodes (attention heads and MLP neurons) and edges (information flows through the residual stream) that collectively implement an identifiable algorithm sufficient to perform a specific computation. The graph-theoretic formulation enables systematic isolation and analysis of functional components across layers.

A circuit must be sufficient for the computation: running the circuit alone, with the rest of the model ablated, should largely preserve the target behavior. Analyses usually also test necessity, meaning that ablating or patching the circuit measurably disrupts the behavior. Multi-layer circuits communicate exclusively through the residual stream: information written by an early layer becomes input for later layers, for example via K-composition, where an earlier head's output changes the keys that subsequent heads attend to.

QK/OV Decomposition

Attention heads can be decomposed into two largely independent circuits. The QK (query-key) circuit determines the attention pattern—which tokens a query token attends to. The OV (output-value) circuit determines how attended-to tokens influence the model's output. This decomposition is mathematically meaningful and enables more precise mechanistic analysis of attention head behavior, treating attention as two separable computations rather than one monolithic operation.
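
A small NumPy sketch can make the decomposition concrete. It assumes a single head with random weights, hypothetical dimensions, and no causal mask. The combined matrices W_QK and W_OV are names chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 6, 2, 4
X = rng.normal(size=(seq_len, d_model))  # residual stream, one row per position

W_Q = rng.normal(size=(d_head, d_model))
W_K = rng.normal(size=(d_head, d_model))
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Standard computation: queries and keys give the pattern, values and the
# output projection give what is written back to the residual stream.
scores = (X @ W_Q.T) @ (X @ W_K.T).T / np.sqrt(d_head)
head_out = softmax(scores) @ (X @ W_V.T) @ W_O.T

# Decomposed computation: one low-rank matrix per circuit.
W_QK = W_Q.T @ W_K   # where to attend: scores are x_i^T W_QK x_j
W_OV = W_O @ W_V     # what to move: each attended token contributes W_OV x_j
scores_qk = X @ W_QK @ X.T / np.sqrt(d_head)
head_out_ov = softmax(scores_qk) @ (X @ W_OV.T)
```

Both paths give the same numbers. The point of the second form is that W_QK and W_OV act directly on residual stream vectors, so each can be studied without reference to the other.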

Layer Specialization

Transformers exhibit layer specialization where different layers develop functionally distinct roles: shallow layers extract low-level features, intermediate layers perform complex logical operations, and deep layers integrate information for final predictions. This implicit computational pipeline supports multi-step reasoning. Interpretability work has identified intermediate conceptual steps in this process, revealing how models decompose reasoning into implicit chains of thought visible to layer-granularity analysis.

Mechanisms and Phenomena

Superposition

The central challenge for mechanistic interpretability is superposition: the phenomenon where neural networks encode more features than they have neurons by assigning features to an overcomplete set of directions in activation space. Rather than one neuron encoding one concept, each feature is represented as a linear combination of neuron activations, and the feature directions overlap because there are more of them than there are dimensions.

Superposition is a fundamental mechanism by which high-dimensional networks achieve compression of semantic information. The consequence for interpretability is severe: as models scale, superposition becomes more prevalent, making it increasingly difficult to extract sparse, human-interpretable circuits because the boundary between "feature" and "circuit" blurs in high-dimensional, distributed representations.
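
As a hedged illustration, the following toy packs three features into a two-dimensional activation space. The 120-degree layout and the ReLU readout are assumptions chosen to keep the arithmetic exact; they are not taken from any particular model.

```python
import numpy as np

# Three feature directions in a 2-dimensional space, 120 degrees apart.
angles = np.deg2rad([90.0, 210.0, 330.0])
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3, 2)

def encode(features):
    """Write each feature along its direction: 3 values -> 2 activations."""
    return features @ directions

def decode(activation):
    """Read each feature back with a dot product, then clip with a ReLU."""
    return np.maximum(directions @ activation, 0.0)

# One active feature: interference on the other two is negative and the
# ReLU removes it, so recovery is exact.
sparse_recovered = decode(encode(np.array([1.0, 0.0, 0.0])))

# Two active features: they interfere with each other, so both read low.
dense_recovered = decode(encode(np.array([1.0, 1.0, 0.0])))
```

The contrast between the two cases shows why superposition relies on sparsity: recovery is clean only when few features are active at once.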

Polysemanticity

Related to superposition but distinct from it, polysemanticity is the observation that individual neurons respond to multiple, unrelated semantic concepts. A single neuron may activate for "Python code," "serpents," and "sneak previews"—making it impossible to attribute a single interpretable meaning to individual neural units. While superposition implies polysemanticity (more features than neurons forces multiple concepts per neuron), the reverse is not true: polysemanticity can arise through other mechanisms, including non-neuron-aligned orthogonal features and compositional representations.

Polysemanticity is a primary roadblock to mechanistic interpretability: it prevents researchers from identifying concise, human-understandable explanations for what specific components encode.

Induction Heads and In-Context Learning

One of the most studied circuits in mechanistic interpretability is the induction head: a multi-head circuit formed by the composition of at least two attention heads that implements in-context learning by identifying and copying repeated patterns. The circuit works through the residual stream: an earlier previous-token head writes information about the preceding token into each position, and a later induction head reads it. Together they carry out a multi-step algorithm: locate earlier tokens matching the current token, identify their successors, and output those successors.
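
The match-and-copy algorithm itself, as distinct from its attention-based implementation, can be written as a few lines of plain Python. The function name and the choice to prefer the most recent match are illustrative assumptions.

```python
def induction_predict(tokens):
    """Predict the next token by match-and-copy.

    Looks for the most recent earlier occurrence of the current (last)
    token and returns the token that followed it. Returns None when the
    current token has not appeared before.
    """
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions from most recent to oldest.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

prediction = induction_predict(["The", "cat", "sat", ".", "The"])
```

Here the final "The" matches the first one, so the sketch copies its successor, "cat". A trained model implements a soft version of this lookup with attention weights rather than an explicit loop.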

Induction heads emerge through an abrupt phase transition during training rather than gradually. This emergence correlates with a visible discontinuity in the training loss curve and coincides precisely with a sudden sharp increase in in-context learning ability. Models without induction heads show minimal few-shot performance; models that have developed them show substantial improvement.

Induction Heads and Larger Models

Induction heads are a foundational mechanism for in-context learning (ICL) in smaller models, though their role as the dominant mechanism is qualified: their importance varies by task and with the presence of function vector (FV) heads. FV heads compute latent task representations rather than performing explicit pattern matching. In larger models they become more causally important, and their importance for ICL performance increases with scale. Many FV heads appear to emerge as elaborations of earlier induction head implementations.

Transformers typically implement multiple induction head circuits rather than one, showing partial redundancy and specialization. The presence of multiple heads provides robustness, though the full division of labor among them is not yet understood.

Beyond simple token copying, induction circuits can implement fuzzy matching — using semantic similarity rather than exact token identity — and hierarchical circuits that operate on tree-structured dependencies rather than just linear sequences. The basic match-and-copy algorithm scales from literal pattern copying to abstract pattern completion, providing a mechanistic grounding for a range of in-context learning behavior.

Negation and Semantic Suppression

Mechanistic interpretability has begun characterizing how models process linguistic negation — a particularly important case for AI safety, since negative prompts that seem to prohibit behaviors may actually activate the very concepts they seek to suppress.

Two competing hypotheses are under active investigation. The construction hypothesis holds that negated representations are built positively — the model constructs a representation for the negated content before suppressing it, meaning the content is activated as an intermediate step. The suppression hypothesis holds that models directly suppress activation of negated content. Evidence suggests construction is more prominent than suppression: late-layer attention heads in several models implement negation by first activating concept representations and then modulating them, with causal ablation experiments confirming this sequence. This finding has direct implications for prompt engineering: instructing a model "do not discuss X" may paradoxically increase the salience of X in internal representations before suppression occurs.

Methods

Activation Patching

Activation patching (also called interchange intervention, causal tracing, or resample ablation) is the primary method for establishing causal attribution. The procedure: replace internal activations in a model run with activations from a different input, then measure how this substitution changes downstream outputs. When combined with multivariate path patching, this technique can isolate which components causally contribute to specific behaviors, distinguishing correlation from causation.

Activation patching exploits the residual stream's approximate linearity. The technique has confirmed the causal importance of induction heads for in-context learning and of specific attention heads for tasks like indirect object identification. However, it rests on a realism assumption that can be violated: some patching experiments multiply feature values by up to 15x, creating model states far outside the distribution of natural activations, which may invalidate the causal claims derived from them.
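
A minimal sketch of the procedure on a toy two-layer network follows. The weights, inputs, and the choice to patch clean activations into a corrupted run are all assumptions made for illustration; real experiments patch attention heads or residual stream positions in a transformer.

```python
import numpy as np

# Toy network: 2 inputs -> 3 ReLU hidden units -> 1 scalar output.
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
w2 = np.array([2.0, 0.0, -1.0])  # hidden unit 1 has zero output weight

def run(x, patch=None):
    """Forward pass; `patch` maps hidden-unit index -> replacement value."""
    hidden = np.maximum(W1 @ x, 0.0)
    for idx, value in (patch or {}).items():
        hidden[idx] = value
    return float(w2 @ hidden), hidden

clean_x = np.array([1.0, 2.0])
corrupt_x = np.array([3.0, 0.5])

clean_out, clean_hidden = run(clean_x)
corrupt_out, _ = run(corrupt_x)

# Patch one hidden unit at a time from the clean run into the corrupted
# run and record how far the output moves.
effects = {}
for unit in range(len(w2)):
    patched_out, _ = run(corrupt_x, patch={unit: clean_hidden[unit]})
    effects[unit] = patched_out - corrupt_out
```

Unit 1 carries different values on the two inputs, yet patching it changes nothing because nothing downstream reads it. This is the distinction between an activation that correlates with the input and one that is causally used.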

Probing Classifiers and Their Limits

Probing classifiers train a small linear classifier on model activations to detect whether specific information (linguistic features, factual content) is encoded in representations. The method is widely used but has a fundamental correlation-causation gap: probes indicate only correlations between representations and properties, not causal relationships. A probe can succeed even when the probed feature plays no causal role in the model's behavior. Additionally, a complex enough probe can learn the task independently, making it unclear whether the probe reveals model knowledge or its own learning capacity.
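
The correlation-causation gap can be shown with a small synthetic example. The two-dimensional activations, the least-squares probe, and the toy readout are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

used = rng.normal(size=n)                 # dimension the toy model reads
unused = rng.choice([-1.0, 1.0], size=n)  # property encoded but never read
H = np.stack([used, unused], axis=1)      # "activations", shape (n, 2)

readout = np.array([1.0, 0.0])            # the model ignores dimension 1
model_out = H @ readout

# Linear probe for the unused property, fit by least squares with a bias.
A = np.hstack([H, np.ones((n, 1))])
w, *_ = np.linalg.lstsq(A, unused, rcond=None)
probe_accuracy = float(np.mean(np.sign(A @ w) == unused))

# Causal check: erase the probed dimension and rerun the model.
H_ablated = H.copy()
H_ablated[:, 1] = 0.0
ablated_out = H_ablated @ readout
```

The probe decodes the property perfectly, while ablating it leaves the model's output unchanged. Probe accuracy alone therefore says nothing about whether the model uses the information.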

Logit Lens

The logit lens is an analysis tool that probes the residual stream at different layers by applying the model's unembedding matrix to intermediate residual stream states. This technique reveals how the model's token-level interpretations evolve across layers, showing the residual stream progressively refining representations toward task-appropriate predictions. The technique applies beyond the global residual stream to individual attention heads and SAE features, enabling detailed inspection of intermediate computations.
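
A minimal sketch of the idea follows, using a hypothetical three-token vocabulary, an identity unembedding, and hand-picked per-layer writes. In practice the final layer normalization is usually applied before unembedding; it is omitted here.

```python
import numpy as np

vocab = ["cat", "dog", "fish"]
W_U = np.eye(3)  # assumed unembedding: one residual direction per token

# Assumed writes into the residual stream at one position, layer by layer.
layer_writes = np.array([
    [0.2, 1.0, 0.1],   # early layers lean toward "dog"
    [0.9, -0.3, 0.0],  # middle layer shifts toward "cat"
    [1.5, -0.5, 0.2],  # late layer commits to "cat"
])

# The residual state after each layer is the cumulative sum of the writes.
residual_by_layer = np.cumsum(layer_writes, axis=0)

# Logit lens: decode every intermediate state with the same unembedding.
lens_logits = residual_by_layer @ W_U.T
lens_tokens = [vocab[i] for i in lens_logits.argmax(axis=1)]
```

Reading `lens_tokens` from first to last layer shows the prediction moving from "dog" to "cat", the kind of layer-by-layer refinement the technique is meant to expose.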

Sparse Autoencoders

Sparse autoencoders (SAEs) are the central methodological response to the polysemanticity problem. The core insight: if neurons are polysemantic because the model uses superposition to pack more features than neurons, then training an auxiliary network to find a larger, sparser representation should recover the underlying monosemantic features.

Architecture

An SAE employs a specific architectural design with a hidden layer substantially larger than the input (typically 4x to 256x the input dimension), trained with an objective combining MSE reconstruction loss and an L1 penalty on the hidden activations to encourage sparsity. The L1 penalty pushes most hidden units toward zero activation for any given input, while the overcomplete hidden layer provides enough directions to represent all features separately. This can be grounded in information theory: the L1+MSE objective approximates Minimum Description Length, minimizing code length plus reconstruction error.
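
The forward pass and objective can be sketched in NumPy as follows. The 4x expansion, the random untrained weights, the L1 coefficient, and the convention of subtracting the decoder bias before encoding are assumptions for illustration. Training, which would minimize this loss by gradient descent, is not shown.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (hidden features, reconstruction) for a batch of activations."""
    # Assumed convention: centre the input with the decoder bias first.
    features = np.maximum((x - b_dec) @ W_enc.T + b_enc, 0.0)
    x_hat = features @ W_dec.T + b_dec
    return features, x_hat

def sae_loss(x, features, x_hat, l1_coeff):
    """MSE reconstruction error plus an L1 sparsity penalty."""
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(features), axis=1))
    return mse + l1_coeff * l1

rng = np.random.default_rng(0)
d_in, expansion = 8, 4
d_hidden = d_in * expansion  # overcomplete hidden layer

W_enc = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_dec = rng.normal(scale=0.1, size=(d_in, d_hidden))
b_enc, b_dec = np.zeros(d_hidden), np.zeros(d_in)

x = rng.normal(size=(16, d_in))  # stand-in for model activations
features, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, x_hat, l1_coeff=1e-3)
```

The ReLU keeps every feature activation non-negative, and the L1 term gives the optimizer a reason to keep most of them at exactly zero.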

What SAEs Find

SAEs successfully disentangle superimposed polysemantic features into monosemantic units, where each unit activates reliably for semantically coherent concepts. Anthropic's Scaling Monosemanticity paper (May 2024) applied SAEs to Claude 3 Sonnet, extracting tens of millions of interpretable features—demonstrating that SAE scaling follows predictable scaling laws and revealing abstract, multilingual, multimodal features that generalize across contexts. Safety-relevant features were identified, including features related to deception, sycophancy, bias, and dangerous content.

Features learned by SAEs trained on different architectures show significant similarity to each other, suggesting they capture model-agnostic linguistic structure. Universal Sparse Autoencoders create shared feature spaces across model families (Gemma, Llama, Qwen), enabling model-agnostic interpretability frameworks.

SAE-Based Circuits

Sparse feature circuits combine SAEs with circuit analysis to map fine-grained causal interactions between individual SAE features across layers. Linear approximations identify which SAE features are causally implicated in specific outputs, enabling construction of explicit causal graphs where nodes are interpretable features and edges represent activation dependencies.

Feature Steering

Because SAEs decompose activations into monosemantic features, researchers can identify features corresponding to desired behavioral outcomes and directly manipulate their activations to steer model behavior. Applications include mitigating hallucinations by zeroing features causally implicated in hallucination production, and suppressing harmful outputs through gradient-free inference-time interventions.
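
A toy sketch of clamping a single feature follows. It assumes an orthonormal dictionary so that a feature's activation can be read with a plain dot product; real SAE dictionaries are overcomplete and non-orthogonal, so the edit would be made through the trained encoder and decoder instead. The function names are hypothetical.

```python
import numpy as np

# Hypothetical dictionary: each row is one feature's decoder direction.
decoder = np.array([
    [1.0, 0.0, 0.0],  # feature 0
    [0.0, 1.0, 0.0],  # feature 1 (the feature to suppress, by assumption)
    [0.0, 0.0, 1.0],  # feature 2
])

def feature_activations(x):
    """Read all feature activations from an activation vector."""
    return np.maximum(decoder @ x, 0.0)

def clamp_feature(x, feature, target):
    """Shift x along one decoder direction so that feature reads `target`."""
    current = feature_activations(x)[feature]
    return x + (target - current) * decoder[feature]

activation = np.array([0.5, 2.0, 0.3])
steered = clamp_feature(activation, feature=1, target=0.0)
```

Setting `target` to zero suppresses the feature, while a large positive value amplifies it. Either way the intervention happens at inference time and needs no gradient computation.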

Alternative Interpretability Paradigms

Circuit analysis and SAEs are not the only approaches to mechanistic interpretability. Each paradigm makes different tradeoffs between causal explanatory strength, scalability, computational cost, and result portability.

Representation Engineering

Representation engineering identifies and manipulates high-level concepts in distributed activation space through contrastive input pairs, without requiring circuit-level mechanistic understanding. Rather than tracing causal circuits, it uses matched-pair trial designs—contrasting inputs that isolate target concepts—to detect semantic directions and perform surgical edits. The approach operationalizes the Linear Representation Hypothesis: that high-level concepts are encoded as linear directions in activation space, with concept intensity determined by activation magnitude along those directions.

Representation engineering scales better to large models than circuit discovery and enables concept control (boosting honesty, suppressing harmful outputs) without enumerating circuits. The tradeoff is weaker causal guarantees.
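
One common way to extract a concept direction from contrastive pairs is a difference of means, sketched below. The hidden states are made-up numbers, and difference of means is only one of several possible extraction methods; the source does not specify which is used.

```python
import numpy as np

# Hypothetical hidden states for matched prompts with and without a concept.
with_concept = np.array([[2.0, 1.0, 0.0],
                         [2.2, 0.8, 0.1],
                         [1.8, 1.2, -0.1]])
without_concept = np.array([[0.0, 1.0, 0.0],
                            [0.2, 0.8, 0.1],
                            [-0.2, 1.2, -0.1]])

# Concept direction: difference of class means, normalised to unit length.
diff = with_concept.mean(axis=0) - without_concept.mean(axis=0)
direction = diff / np.linalg.norm(diff)

def concept_score(h):
    """Concept intensity: projection of a hidden state onto the direction."""
    return float(h @ direction)

def steer(h, alpha):
    """Add alpha units of the concept direction to a hidden state."""
    return h + alpha * direction

steered = steer(without_concept[0], alpha=3.0)
```

Because the pairs differ only along the first coordinate on average, the recovered direction is that coordinate axis, and steering by `alpha` raises the concept score by exactly `alpha`.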

Transcoders

Transcoders are a successor architecture to SAEs that has been reported to achieve superior interpretability metrics. Unlike SAEs, which reconstruct the same activations they receive as input, transcoders are trained to reconstruct the output of a neural network component given its input. Skip transcoders, which add an affine skip connection, achieve both lower reconstruction loss and higher average interpretability scores. Cross-layer transcoders enable circuit tracing by creating a continuous feature-space mapping where upstream feature activations causally contribute to downstream activations, allowing construction of complete attribution graphs for individual predictions.

A distinctive advantage of transcoders is their factorization of MLP layers into input-dependent and input-invariant components — cleanly separating context-sensitive feature interactions from fixed knowledge stored in weights.
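
The difference from an SAE lies in the reconstruction target, which the following NumPy sketch tries to make explicit. The stand-in "MLP" is purely linear, the weights are untrained, and the L1 sparsity term is an assumption (some transcoder variants use other sparsity mechanisms).

```python
import numpy as np

def transcoder_forward(mlp_in, W_enc, b_enc, W_dec, b_dec, W_skip=None):
    """Predict a component's output from its input via sparse features."""
    features = np.maximum(mlp_in @ W_enc.T + b_enc, 0.0)
    prediction = features @ W_dec.T + b_dec
    if W_skip is not None:
        # Skip transcoder: a linear path from input straight to the output.
        prediction = prediction + mlp_in @ W_skip.T
    return features, prediction

def transcoder_loss(mlp_out, features, prediction, l1_coeff):
    """Error against the component's OUTPUT, plus an assumed L1 penalty."""
    mse = np.mean(np.sum((mlp_out - prediction) ** 2, axis=1))
    return mse + l1_coeff * np.mean(np.sum(np.abs(features), axis=1))

rng = np.random.default_rng(0)
d_model, d_features = 4, 16
mlp_in = rng.normal(size=(10, d_model))
W_lin = rng.normal(size=(d_model, d_model))
mlp_out = mlp_in @ W_lin.T  # stand-in component that is purely linear

W_enc = rng.normal(scale=0.1, size=(d_features, d_model))
W_dec = np.zeros((d_model, d_features))  # untrained decoder
b_enc, b_dec = np.zeros(d_features), np.zeros(d_model)

_, no_skip = transcoder_forward(mlp_in, W_enc, b_enc, W_dec, b_dec)
feats, with_skip = transcoder_forward(mlp_in, W_enc, b_enc, W_dec, b_dec, W_skip=W_lin)
```

In this contrived case the skip path alone reproduces the component's output, which suggests why adding it can lower reconstruction loss: the sparse features are left to explain only the nonlinear part.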

Developmental Interpretability

Developmental interpretability tracks how neural network competencies emerge across training phases, rather than analyzing static trained models. The paradigm uses metrics like the Local Learning Coefficient to detect phase transitions—abrupt reorganizations of internal representations that manifest as discontinuous capability jumps despite smooth external loss curves. This shifts focus from the final trained model to the learning trajectory itself, revealing computational circuit formation in ways post-hoc circuit analysis cannot.

During LLM training, these internal reorganizations correlate with sudden capability emergence (for example, few-shot arithmetic at 13B parameters and chain-of-thought at 100B). Standard training metrics obscure these dynamics.

Linguistic / Grammar-Based Interpretability

The linguistic interpretability paradigm investigates whether hidden states and attention patterns encode recursive grammatical structures defined by context-free grammars. Research demonstrates that transformer models implicitly represent CFG structure through boundary-to-boundary attention patterns resembling dynamic programming, place tree node information on subtree boundaries, and develop specialized attention heads for specific dependency relations. This paradigm asks what formal linguistic properties are captured, rather than tracing computational pathways—a fundamentally different explanatory frame.

Lossy-Compression View

The lossy-compression view reframes LLM training through Information Bottleneck theory: pre-training optimally compresses input information while retaining only task-relevant features. This paradigm operates at whole-model scale rather than circuit or feature granularity, allowing interpretability analysis to predict downstream performance while grounding explanations in established learning compression theories. It challenges the implicit assumption that interpretable circuits are the primary unit of analysis.

Figure 1. Paradigm trade-offs: causal strength vs. scalability.

Controversies and Debates

Causal Abstraction: Power and Triviality

Causal abstraction provides a theoretical foundation that unifies activation patching, path patching, circuit analysis, and concept erasure in a common formal language. It formalizes when a high-level causal model constitutes a faithful simplification of low-level neural computation.

However, recent critiques demonstrate that unrestricted causal abstraction becomes mathematically trivial: under reasonable assumptions, any neural network can be mapped to any algorithm, rendering the framework uninformative without additional constraints. The non-uniqueness of identified algorithms creates "interpretability illusions"—multiple distinct algorithms can satisfy the same causal abstraction constraints, limiting interpretive power. The field lacks clear standards for when an abstraction constitutes a valid causal model versus a post-hoc description that happens to predict behavior.

Scalability Barriers

Circuit discovery methods face fundamental scalability barriers at frontier scale: existing algorithms require independent inferences per iteration, making them prohibitively expensive; interventions on high-dimensional SAE feature spaces are computationally intractable; and learned circuits remain difficult to interpret due to polysemantic component behavior. The canonical circuit analyses (such as the indirect object identification circuit in GPT-2 small) study models far smaller and architecturally simpler than modern deployed systems.

Architecture and Scale Dependency

Mechanistic interpretability findings exhibit high architecture and scale dependency: circuit-level findings from smaller models with single-headed attention often fail to generalize to modern multi-headed architectures or scaled models. This raises the question of whether circuit analyses uncover fundamental computational principles or merely describe implementation details of specific models at specific scales. The prevalence and importance of different mechanisms (e.g., induction heads vs. function vector heads) varies systematically with architecture.

Confirmation Bias and Replication

Leading practitioners have documented a pattern of initially celebrated findings being substantially weakened or contradicted by subsequent research. The field exhibits confirmation bias, toy-problem focus, and oversimplification. Neel Nanda, one of the field's most prominent researchers, has stated that "the most ambitious vision of mechanistic interpretability I once dreamed of is probably dead" and that he does "not see a path to deeply and reliably understanding what AIs are thinking" based on accumulated limitations.

Standardization Gap

The field lacks standardized metrics for evaluating SAE quality and interpretability. Multiple competing definitions of key concepts—"superposition," "faithfulness," "in-context learning"—exist across venues. Formal evaluation frameworks for circuit fidelity remain under development. This absence of appropriate metrics is acknowledged as a main open problem.

Applications and Extensions

AI Safety

The primary motivation for mechanistic interpretability research is AI safety. Behavioral evidence alone is insufficient to determine whether a model has properties like deception, value misalignment, or hidden goals—these require examining internal computational structures. Feature steering demonstrations on frontier models (identifying and manipulating features related to deception, sycophancy, and dangerous content in Claude 3 Sonnet) represent concrete steps toward mechanistic safety evaluation.

The opacity of neural networks persists even at the expert level, meaning that both users and developers face fundamental limits on accessing the computational justifications for outputs. Mechanistic interpretability attempts to close this gap, though current methods remain far from providing the kind of reliable, comprehensive understanding that safety-critical deployment would require.

Multimodal and Non-Language Models

SAEs and circuit tracing have been extended beyond language-only models. SAE techniques have been applied to CLIP and DinoV2, demonstrating that superposition and SAE-based interpretability generalize to visual representations. Circuit tracing has been extended to vision-language models, where transcoders decompose multimodal representations into interpretable cross-modal features. Applications to protein language models have extracted biologically meaningful features, suggesting that mechanistic interpretability methods may generalize well beyond their original domain.

Hallucination Detection and Mitigation

Mechanistic interpretability enables inference-time feature steering to mitigate hallucinations: by identifying SAE features or intermediate representations causally implicated in hallucination production, practitioners can suppress erroneous outputs without retraining models. Intermediate Representation Injection reinforces visual grounding by intervening on cross-modal representations. Feature-level detoxification methods can zero features associated with toxicity at inference time.

Current Status

Mechanistic interpretability is a young field with substantial open problems. The January 2025 paper Open Problems in Mechanistic Interpretability formalized goals but revealed that core concepts like "feature" still lack rigorous definitions.

Current momentum is focused on three directions: (1) developing scalable circuit discovery algorithms that can address frontier-scale models, (2) building unified feature-space frameworks—universal SAEs, cross-layer transcoders—that achieve cross-model generalization, and (3) establishing formal, standardized metrics for interpretability evaluation. The developmental interpretability paradigm is emerging as a complement to post-hoc analysis, tracking capability formation during training rather than analyzing static endpoints.

Key Takeaways

Mechanistic interpretability seeks causal explanations of neural computation. Rather than treating models as black boxes, the field examines internal computational structures through circuit analysis, activation patching, and feature decomposition to understand how models implement their outputs.
The residual stream enables causal analysis through approximate linearity. Transformer architectures use an additive communication channel where components read from and write to shared state, making individual contributions traceable and enabling attribution of output changes to specific components.
Superposition and polysemanticity present the central interpretability challenge. Neural networks encode more features than they have neurons, causing individual units to respond to multiple unrelated concepts. Sparse autoencoders address this by finding larger, sparser representations that recover monosemantic features.
Multiple competing interpretability paradigms make different causal and scalability tradeoffs. Circuit analysis, sparse autoencoders, representation engineering, developmental interpretability, and linguistic approaches each prioritize different explanatory strengths, with circuit-level work offering stronger causal claims but worse scalability.
The field faces significant limitations in causal abstraction, replicability, and generalization. Causal abstraction can become mathematically trivial, scalability barriers prevent circuit discovery on frontier models, and findings exhibit high dependency on architecture and scale, raising questions about whether methods capture fundamental principles or implementation details.

Further Exploration

Foundational Papers

A Mathematical Framework for Transformer Circuits — Elhage et al. (2021) — establishes residual stream formalism
In-context Learning and Induction Heads — Olsson et al. (2022) — definitive account of induction circuits
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning — Bricken et al. (2023) — introduces sparse autoencoders
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet — Templeton et al. (2024) — frontier model feature extraction

Recent Surveys and Reviews

Open Problems in Mechanistic Interpretability — Sharkey et al. (2025) — field-wide assessment
A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models — Rai et al. (2024) — comprehensive survey
A Survey on Sparse Autoencoders — Covers architectures, training, evaluation across modalities

Methods and Mechanisms

Which Attention Heads Matter for In-Context Learning — Yin & Steinhardt (2025) — function vector heads vs. induction heads
Transcoders Beat Sparse Autoencoders for Interpretability — Skip transcoders as SAE successors
Tracing the thoughts of a large language model — Circuit tracing applied to Claude

Alternative Paradigms

Representation Engineering — High-level concept identification and manipulation
Developmental Interpretability — Tracking competency emergence across training
Linguistic / Grammar-Based Interpretability — Formal grammatical structure in hidden states

Foundational Theory and Critique

Causal Abstraction Framework — Unifies activation patching and circuit analysis
Triviality of Unrestricted Causal Abstraction — Formal limits of causal abstraction without constraints
Confirmation Bias in Mechanistic Interpretability — Neel Nanda on field limitations
Neel Nanda's Mechanistic Interpretability Glossary — Practical reference for terminology