Large Language Models

What transformers actually do, what they cannot do, and what we are still trying to find out

Lead Summary

Large language models (LLMs) are neural networks trained on massive text corpora to predict the next token in a sequence. Through this deceptively simple objective — applied at enormous scale — they acquire capabilities that span language translation, code generation, scientific hypothesis formation, and multi-step reasoning. They are now among the most widely deployed AI systems in history.

Yet they remain deeply puzzling. They achieve human-level performance on analogical reasoning benchmarks without any mechanism designed for analogy. They implement something resembling symbolic logic without any explicit symbolic machinery. They produce text indistinguishable from human writing without possessing communicative intent. They encode structured semantic representations while remaining fundamentally committed to statistical prediction. Understanding what LLMs are — not just what they do — is one of the central intellectual projects of contemporary AI research.

This article does not attempt a survey of capabilities. Instead, it examines what is structurally known about LLMs: how their internal representations are organized, what happens to performance at scale, where they systematically fail, how they encode and reproduce cultural biases, and what the interpretability research programme has revealed about their computational internals.


How LLMs Represent Knowledge

LLMs do not store facts the way a database does. They encode knowledge as geometric structure in high-dimensional activation spaces — a form of distributed, implicit representation that is fundamentally different from symbolic storage.

Parametric versus Non-Parametric Memory

The knowledge available to an LLM comes from two places. The first is parametric memory: facts, associations, and patterns encoded in the model's learned weights during training. These weights are not updated at inference time; what the model knows parametrically is fixed when training ends. The second is non-parametric memory: external information provided at inference through the context window or retrieved from an external index.

Retrieval-augmented generation (RAG), formally introduced by Lewis et al. (2020), makes this distinction structural. RAG couples a pre-trained LLM (parametric memory) with a dense vector index of documents (non-parametric memory), creating a hybrid knowledge system more accurate than either component alone. Lewis et al. report that RAG models generate more specific, diverse, and factual text than purely parametric baselines, and that they set the state of the art on open-domain question answering tasks at the time. The distinction matters practically: parametric knowledge can be stale or incorrect, while non-parametric knowledge can be updated without retraining.
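The retrieve-then-generate pattern can be sketched in a few lines. This is a minimal illustration, not the method of Lewis et al.: a bag-of-words vector stands in for a learned dense encoder, the final prompt would be handed to an LLM that is not shown, and the names `DOCS`, `embed`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
from collections import Counter

# Toy document store standing in for the non-parametric memory.
DOCS = [
    "The Eiffel Tower is in Paris.",
    "Mount Everest is the highest mountain on Earth.",
    "Python is a programming language.",
]

def embed(text):
    """Bag-of-words vector (assumption: a real system uses a dense encoder)."""
    cleaned = text.lower().replace(".", "").replace("?", "")
    return Counter(cleaned.split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, docs, k=1):
    """Place retrieved text in the context window ahead of the question."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("Where is the Eiffel Tower?", DOCS))
```

Updating the model's knowledge in this setup means editing `DOCS`; no weights change.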

The Latent Space as Probability Field

Internally, an LLM's understanding of language is organized as a geometric structure in its latent space. This space is not flat: it has structured manifold geometry where probability concentrations define natural clusters of related outputs. When models reason about complex or recursive problems, they leverage this geometry to manage combinatorial possibilities — effectively performing distributed search across the probability field rather than generating one token at a time in isolation.

Latent reasoning encodes multiple potential next steps simultaneously, with the model representing several competing reasoning paths before committing to an output. This means what appears externally as a single response may involve implicit branching and pruning in internal representations.

The Linear Representation Hypothesis

A growing body of mechanistic interpretability research supports the Linear Representation Hypothesis: that high-level concepts are encoded as linear directions in activation space, with concept intensity determined by activation magnitude along those directions. This structure enables representation engineering — a technique that identifies concept directions through contrastive input pairs (inputs that differ only in the presence or absence of a target concept) and then performs direct activation editing to boost or suppress those concepts.

The practical consequence is significant: if concepts are linearly encoded, then model behavior can be surgically modified without retraining. Honesty can be boosted, harmful outputs suppressed, and conceptual biases identified and adjusted through direct manipulation of internal activation directions.
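A minimal sketch of the idea, assuming the simplest possible recipe: take the difference between mean activations on inputs with and without the concept, normalize it, and add a multiple of it to a new activation. The three-dimensional vectors are synthetic, and the helper names (`concept_direction`, `project`, `steer`) are hypothetical; real work operates on hidden states from a specific layer of a specific model.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def concept_direction(with_concept, without_concept):
    """Unit vector along the difference of mean activations."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    diff = [a - b for a, b in zip(mu_pos, mu_neg)]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]

def project(activation, direction):
    """Concept intensity: magnitude of the activation along the direction."""
    return sum(a * d for a, d in zip(activation, direction))

def steer(activation, direction, alpha):
    """Boost (alpha > 0) or suppress (alpha < 0) the concept."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Synthetic contrastive pairs: only the first coordinate differs.
with_concept = [[2.0, 0.1, 1.0], [2.2, -0.1, 1.0]]
without_concept = [[0.0, 0.1, 1.0], [0.2, -0.1, 1.0]]

direction = concept_direction(with_concept, without_concept)
activation = [0.5, 0.3, 0.2]
boosted = steer(activation, direction, alpha=3.0)
print(project(activation, direction), project(boosted, direction))
```

Because the direction has unit length, steering by `alpha` shifts the projected concept intensity by exactly `alpha` and leaves the orthogonal components untouched.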


What Transformers Actually Compute

Grammar and Hierarchical Structure

Research using the transformer-grammar interpretability paradigm has demonstrated that transformer models implicitly represent context-free grammar structures through boundary-to-boundary attention patterns resembling dynamic programming. Attention heads develop specializations for particular dependency relations, and tree node information is placed on subtree boundaries. The models do not merely learn surface statistics; they develop structured representations that capture recursive grammatical dependencies.

This finding has a sharp limit: transformers are not yet fully reliable at structural recursion. While they handle many syntactic structures fluently, there are cases where recursive structures expose architectural limitations. The gap between linguistic competence on standard tasks and principled recursive processing remains an open problem.

Implicit Symbolic Reasoning

LLMs contain no innate structures designed for analogical or symbolic reasoning. They were trained on next-word prediction, not on structured relational tasks. Yet they achieve human-level zero-shot performance on analogical reasoning benchmarks — including Raven's Standard Progressive Matrices — without explicit training on such tasks.

Mechanistic interpretability offers an explanation. Transformers implement depth-bounded recurrent mechanisms within hidden representations that perform backward-chaining symbolic logic without natural language chain-of-thought, constructing intermediate bridge entity representations during multi-hop reasoning. These implicit symbolic-like computations emerge through grokking: early in training, circuits memorize; later, they are replaced by circuits that generalize. The resulting architecture performs something structurally analogous to explicit symbolic reasoning, distributed across connection weights.

LLMs achieve analogical reasoning not through explicit symbolic mechanisms but through emergent computation within a distributed connectionist substrate. The question shifts from "can connectionism work for analogy?" to "does the implicit computation in transformers recapitulate structure-mapping principles — just in distributed form?"

Whether this implicit symbolic-like computation constitutes genuine symbolic reasoning or sophisticated pattern completion remains contested. Models are sensitive to distributional shifts in a way that explicit symbolic systems are not. The philosophical question is whether the difference matters.

In-Context Learning and Induction

One of the most consequential emergent properties of transformers is in-context learning (ICL): the ability to adapt behavior based on examples provided at inference time, without weight updates. The computational substrate of ICL is primarily induction heads — two-layer circuits that implement a match-and-copy algorithm. Induction heads search context for previous occurrences of the current token, then predict that what followed the previous occurrence will follow again.
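The match-and-copy rule can be written as an explicit lookup. This is a sketch of the algorithm's input-output behavior only: an actual induction head realizes it softly through attention weights composed across two layers, not through an exact search, and the function name `induction_predict` is hypothetical.

```python
def induction_predict(tokens):
    """Predict the next token by match-and-copy.

    Scan backwards for the most recent earlier occurrence of the
    current (last) token and return whatever followed it.
    Returns None when the current token has not appeared before.
    """
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

print(induction_predict(["A", "B", "C", "A"]))  # prints B
print(induction_predict(["x", "y"]))            # prints None
```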

Beyond simple token matching, transformers implement hierarchical induction-like circuits that operate on nested or tree-structured dependencies — suggesting ICL extends to the recognition and prediction of hierarchical relationships, not just surface token sequences.

A significant limitation: induction head architecture is highly model-specific. Circuit decompositions from small models (GPT-2 small) often fail to generalize to larger or more recent architectures. Mechanistic findings about specific models do not constitute universal principles about transformers. The importance of induction heads relative to other ICL mechanisms (such as function vector heads) varies systematically with model family and scale.


Emergent Capabilities and Phase Transitions

Capability Emergence

Certain capabilities, such as few-shot arithmetic and chain-of-thought reasoning, appear to emerge discontinuously at specific model scales. Few-shot arithmetic appears at around 13 billion parameters; chain-of-thought reasoning becomes reliable at around 100 billion parameters.

Mechanistic research on training dynamics reveals that these capability jumps correspond to internal phase transitions: abrupt reorganizations of internal representations that are invisible in external loss curves. Standard training metrics (loss, perplexity) can remain smooth while the model is internally reconfiguring its computational circuits. This means monitoring loss alone does not reveal critical learning events.

The Local Learning Coefficient is a tool from developmental interpretability that can detect these transitions by measuring the effective dimensionality of the loss landscape — offering a window into the model's internal developmental stages rather than only its final performance.

Multi-Agent Emergence

When multiple LLMs operate together in a system, emergent coordination appears that exceeds the sum of individual behaviors. Multi-agent LLM systems exhibit aligned behavioral patterns without explicit communication protocols: simply making goal-sharing relationships visible between agents can produce coordination nearly equivalent to explicit collective deliberation. This suggests that stigmergic coordination principles — coordination through shared environmental state — apply to systems of sophisticated agents, not only to simpler biological systems.


Systematic Limitations

Context Rot

Long-context performance is one of the most consequential practical limitations of current transformer architectures. All 18 frontier models tested in 2025 — including GPT-4.1, Claude Opus 4, and Gemini 2.5 — exhibit measurable performance degradation as input length increases. This is not a bug in any specific model; it is a fundamental property of how transformer attention allocates computation across variable-length contexts.

The degradation has three compounding mechanisms:

  1. The lost-in-the-middle effect: accuracy drops 30% or more when relevant information is positioned in the middle of the context window compared to start or end positions. This has been replicated across architectures.
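  2. Attention dilution: a 100K-token context creates 10 billion pairwise token relationships (100,000 squared). Because each query's attention weights must sum to one, the share available to any single token shrinks as the context grows.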
  3. Distractor interference: semantically similar but irrelevant content in context misleads the model even when the correct information is present.
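The arithmetic behind attention dilution is easy to check. The sketch below counts query-key pairs for one full, unmasked attention head (a causal mask would roughly halve the count) and shows the weight each token would receive if a query spread its attention evenly; the function names are hypothetical.

```python
def pairwise_scores(n_tokens: int) -> int:
    """Query-key pairs scored by one full (unmasked) attention head."""
    return n_tokens * n_tokens

def uniform_weight(n_tokens: int) -> float:
    """Per-token weight if a query spread its attention mass evenly."""
    return 1.0 / n_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: {pairwise_scores(n):>14,} pairs, "
          f"uniform weight {uniform_weight(n):.0e}")
```

A tenfold increase in context length multiplies the number of pairs by one hundred while cutting the evenly spread weight per token by ten.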

Performance collapse does not wait for the context window limit. A critical threshold occurs at approximately 40–50% of advertised maximum context length, where sharp degradation of 45%+ occurs over a narrow range of ~10% additional tokens.

The effective usable context length — where models reliably process and utilize information — is typically 30–60% of the advertised window, and critically depends on task type:

Fig 1. Effective context window by task complexity

Task complexity      Effective tokens
Simple retrieval     3-5K
Multi-doc QA         ~1-2K
Sorting/Summary      100-400

This 5–10x variance between task types reflects fundamental differences in how models allocate attention. Retrieval tasks concentrate attention narrowly on target matching; complex tasks require distributed attention across the full context.

Techniques such as MInference's dynamic sparse attention can achieve up to 10x speedup for long-context prefilling by exploiting sparsity patterns in attention matrices, without requiring model retraining. The approach has been extended to vision-language models (MMInference), with a reported speedup of 8.3x.

Hallucination

Hallucination in LLMs may be structurally inevitable given current transformer architectures. The training objective — maximizing likelihood of the next token — contains no mechanism for enforcing external constraints on factuality. Mitigation approaches (licensing oracles, behavioral calibration, reinforcement learning from human feedback) have been proposed but none has achieved consensus adoption as of 2026. Hallucination persists across model scales and training approaches, suggesting it has architectural roots rather than being addressable by scale or data quality alone.
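Written out, the standard autoregressive training objective makes the point concrete. This is the textbook form, not a formula quoted from a specific source: \theta denotes the model weights, x_1, ..., x_T a training sequence, and x_{<t} the tokens preceding position t.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

Every term rewards matching the training distribution; no term refers to whether a continuation is true.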

RAG architectures partially address this by grounding generation in retrieved external documents, but the combination of high semantic similarity with factual falsity creates a "semantic illusion" problem for embedding-based hallucination detection. Natural Language Inference and symbolic grounding approaches provide complementary detection mechanisms that can identify logically unentailed claims even when embedding similarity is high.

Pragmatic and Theory of Mind Gaps

LLMs lack Theory of Mind: an internal model of other minds that would allow genuine reasoning about speakers' beliefs, intentions, and knowledge states. As a result, pragmatic inference — resolving what a speaker means beyond what they literally say — is handled through surface pattern matching rather than intentional reasoning.

The performance gap is measurable: ChatGPT achieves approximately 60% accuracy on implicature resolution tasks, while the human baseline is 86%. The gap is attributable to the absence of communicative intention reasoning, not to lack of linguistic exposure.

Chain-of-thought prompting substantially improves pragmatic performance, but these improvements represent shallow pattern adaptation — not the acquisition of principled pragmatic reasoning mechanisms. They do not generalize to novel pragmatic scenarios outside the training distribution.


Mechanistic Interpretability

Mechanistic interpretability is the research programme that attempts to reverse-engineer the computational structure of neural networks from their weights and activations. For LLMs, it asks: what algorithms are actually implemented? Where? How do they compose?

Sparse Autoencoders

A central tool is the sparse autoencoder (SAE): a neural network trained to decompose model activations into sparsely-activating, human-interpretable features. The motivation is superposition: polysemantic neurons in LLMs represent multiple unrelated concepts simultaneously (because there are more concepts than dimensions), making individual neurons uninterpretable.

SAEs learn an overcomplete basis (a dictionary with more features than the activation space has dimensions) and penalize feature activations so that only a few are active for any given input. The result is a set of largely monosemantic features that correspond to recognizable concepts. Anthropic's 2024 Scaling Monosemanticity research applied this approach to Claude 3 Sonnet, extracting tens of millions of interpretable features, including features related to safety-relevant concepts such as deception, sycophancy, bias, and dangerous content.
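A minimal sketch of an SAE forward pass and loss, assuming the common formulation of a ReLU encoder, a linear decoder, and an L1 sparsity penalty. The weights below are hand-picked for illustration, not trained; the exact architecture and loss used in any particular published SAE may differ, and all names are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into sparse features, then reconstruct them."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU zeroes inactive features
    recon = [r + b for r, b in zip(matvec(W_dec, features), b_dec)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff):
    """Squared reconstruction error plus an L1 penalty on feature activations."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon))
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity

# Overcomplete dictionary: 4 features for a 2-dimensional activation.
W_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

x = [0.5, -2.0]
features, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
print(features, recon, sae_loss(x, features, recon, l1_coeff=0.1))
```

Here only two of the four features fire for the input, and the reconstruction is exact, so the loss consists entirely of the sparsity penalty.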

Crucially, features learned by SAEs trained on different model architectures show significant cross-model similarity — more similar to each other than to any individual model's neurons. This suggests SAEs discover model-agnostic linguistic features reflecting shared underlying semantic structure, not model-specific artifacts. Universal Sparse Autoencoders create shared feature spaces across model families (Gemma, Llama, Qwen), enabling model-agnostic interpretability frameworks.

The technique has extended beyond language-only models. SAEs have been applied to vision and vision-language models (DINOv2, CLIP), protein language models, and multimodal architectures, demonstrating that the superposition phenomenon and SAE-based interpretability generalize across modalities.

Representation Engineering

Representation engineering is a complementary paradigm that works at a higher level of abstraction than circuit tracing. Rather than identifying specific computational pathways, it detects semantic concept directions in activation space through contrastive pair experiments and performs direct edits on those directions to alter model behavior. The method scales better to large models than circuit-level analysis, though with weaker causal guarantees.

The linear representation hypothesis underpins this approach: concepts are encoded as directions in activation space, and interventions work by addition or subtraction along those directions. Recent work has established empirical foundations for why this surgical editing produces robust behavioral changes across multiple model families and task types.

Paradigm Trade-offs

According to a practical review of mechanistic interpretability methods, no single interpretability paradigm dominates on all dimensions. Circuit-level analysis offers high causal fidelity but poor scalability to large models. Representation engineering scales better but provides weaker causal guarantees. Developmental interpretability tracks learning dynamics but sacrifices mechanistic precision. The choice of paradigm is therefore an explicit commitment about what kind of explanation is being sought.

Circuit Scalability Limits

Existing circuit discovery algorithms face fundamental scalability barriers when applied to frontier-scale LLMs. Algorithms like ACDC require independent inferences per iteration, making them prohibitively expensive at billion-parameter scale. Interventions on high-dimensional SAE feature spaces are computationally intractable. And circuits remain difficult to interpret because component behavior is polysemantic.

Mechanistic findings exhibit high architecture and scale dependency: circuit analyses from small models (GPT-2 small) frequently fail to generalize to larger or more recent architectures. The canonical IOI (Indirect Object Identification) circuit was identified in GPT-2 small, a model far simpler than deployed systems. Whether these findings describe fundamental computational principles or merely implementation details specific to particular architectures remains an open question.

Phase Transitions and Training Dynamics

Developmental interpretability tracks how capabilities emerge during training rather than analyzing static trained models. Internal phase transitions manifest as abrupt representational reorganizations despite smooth external loss curves — discontinuous capability jumps that are invisible to standard monitoring. These transitions reveal circuit formation in ways post-hoc analysis cannot, because they are events in time, not properties of the final model.


LLMs and Scientific Discovery

LLMs have found substantive application in scientific research, with results that are simultaneously impressive and sobering.

On the side of genuine capability: LLM-Prop achieves 3–8% higher accuracy than state-of-the-art graph neural networks on crystalline material property prediction tasks. Hybrid LLM-GNN architectures achieve up to 25% performance improvement over GNN-only models in materials discovery. GPT-4 generated three synergistic drug combinations from literature synthesis, all subsequently validated in laboratory experiments with synergy scores above positive controls — one of the first documented cases of AI-generated hypotheses validated experimentally.

RAG-based literature synthesis substantially reduces researcher time on initial literature reviews while maintaining factual accuracy, grounding outputs in retrievable source documents rather than relying on potentially stale parametric knowledge.

On the side of systematic limitation: benchmarks reveal consistent performance gaps in scientific discovery tasks — particularly in hypothesis generation, experimental design, and result interpretation. Across chemistry, biology, physics, and materials science, top-tier models show shared systematic weaknesses. LLMs struggle with reasoning verification and factual consistency in scientific discovery workflows; they generate plausible-sounding hypotheses that are not always grounded in existing knowledge.


Cultural Bias and Epistemic Impact

The training data of major LLMs is overwhelmingly English-dominated. Approximately 93% of GPT-3 training tokens and approximately 90% of LLaMA 2 training tokens are in English. More than 75% of major LLM benchmarks are designed for English-language tasks. Multilingual models use English as an internal pivot language — they effectively "think in English and translate outward" — which means all other languages receive secondary treatment through a translation layer rather than native representation.

This structural English-centrism has measurable downstream effects. AI translation systems exhibit systematic gender bias: gender-neutral professional descriptions in languages like Finnish and Turkish are consistently translated into English with stereotypical gender assignments (male for professional roles, female for domestic ones), reflecting Western gender assumptions embedded in English-dominant training data. State-of-the-art models including GPT-4o align more closely with Western cultural values — particularly those of English-speaking and Protestant European countries — than with non-Western value systems as measured against the World Values Survey.

Of roughly 7,000 languages in the world, fewer than 5% are meaningfully represented online. Training data underrepresentation is therefore a structural condition of the current digital landscape rather than a data collection problem that can simply be fixed.

Beyond language bias, LLMs introduce broader epistemic risks. LLM outputs decouple words from genuine human thought and intention, creating semantic drift — a gradual shift in meaning conventions — and obscuring questions of authorship and intent. Because LLM outputs can be behaviorally indistinguishable from human communication while lacking communicative intent, they erode accountability structures built around author intentions. What it means to communicate meaningfully is being reshaped.


Philosophical Status

The Chinese Room Revisited

Searle's Chinese Room argument — that symbol manipulation according to syntactic rules does not constitute semantic comprehension — retains philosophical force when applied to transformer-based LLMs. The critique holds that LLMs follow statistical patterns without grasping semantic content. Some philosophers argue the problem is not solved by scaling computational capacity; the syntax-semantics distinction persists, merely "shifted one level deeper" in the neural computation.

The challenge to Searle from LLMs takes two forms. First, the semantic deference account: LLMs can participate in linguistic communities through metasemantic practices, and their outputs have genuine semantic content through derivative pathways — as vehicles through which human intentionality is expressed. Second, the mechanistic challenge: mechanistic interpretability reveals that LLMs develop context-sensitive semantic comprehension and conceptual reasoning beyond surface-level statistical matching. Models internalize relational, emotional, and semantic regularities, creating structured implicit representations enabling flexible reasoning about concepts and their relationships.

Neither response is conclusive. The question of whether this constitutes intentionality — or only its functional analog — remains genuinely open.

A Semiotic Reframing

A third approach avoids the intentionality question entirely. A semiotic reframing treats LLMs as sophisticated sign-systems rather than potential agents: they manipulate signs, reflect discursive norms, and reshape textual meaning without possessing the intentional states traditionally attributed to language users. Under this view, LLM outputs are meaningful without requiring that the meaning originate from LLM intentions. The question of agency is bracketed; what matters is the model's role in epistemic and cultural dimensions of language use — interpretation, positioning, symbolic transformation.

Consciousness and Moral Status

Current LLMs (as of 2025–2026) likely lack phenomenal consciousness: they lack the global workspace mechanisms, recurrent processing loops, and unified agency that several theories of consciousness require. However, expert consensus treats AI consciousness as a serious possibility requiring research attention, not a settled question. These architectural obstacles are not inherent to LLM architectures generally and may be overcome through architectural innovations. Researchers working on responsible AI consciousness research assign a non-negligible probability to consciousness in successor systems.


Environmental Costs

Training energy consumption scales superlinearly with model parameter count, following power laws rather than linear relationships. Doubling model size requires substantially more than double the training energy. Across the lifecycle of a deployed model, training accounts for approximately 30% of total energy consumption, while inference accounts for approximately 60% and model preparation and fine-tuning for the remaining 10%. This allocation reflects that training is typically a one-time event while inference occurs repeatedly at scale.

The inference-dominance of energy consumption means efficiency improvements in inference have greater total impact than improvements to training alone — and that the environmental cost of an LLM is most meaningfully evaluated over its deployment lifetime, not just its initial training run.
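The lifecycle shares quoted above make the comparison a one-line calculation. The sketch below assumes those approximate shares (30% training, 60% inference, 10% preparation and fine-tuning) and asks how much total lifecycle energy a given fractional reduction in one phase would save; the names are hypothetical.

```python
# Approximate lifecycle energy shares as stated in the text.
LIFECYCLE_SHARE = {
    "training": 0.30,
    "inference": 0.60,
    "preparation_and_fine_tuning": 0.10,
}

def total_saving(phase: str, phase_reduction: float) -> float:
    """Fraction of total lifecycle energy saved by cutting one phase's
    energy use by `phase_reduction` (for example 0.2 for a 20% cut)."""
    return LIFECYCLE_SHARE[phase] * phase_reduction

for phase in LIFECYCLE_SHARE:
    print(f"20% cut in {phase}: {total_saving(phase, 0.2):.0%} of total")
```

Under these shares, the same 20% efficiency gain saves twice as much total energy when applied to inference as when applied to training.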
