Engineering

Chain-of-Thought: Evidence and Limits

What the research actually says about when CoT works, when it fails, and why its output cannot be trusted as a debugging signal

Learning Objectives

By the end of this module you will be able to:

  • Describe the scale-dependency of CoT gains and identify model/task combinations where CoT is unlikely to help.
  • Explain the CoT faithfulness problem and its consequences for using visible reasoning as a debugging signal.
  • Diagnose intermediate step error propagation in multi-step reasoning chains.
  • Distinguish tasks where CoT provides genuine reasoning support from tasks where it amplifies pattern matching.

Core Concepts

What chain-of-thought prompting is

Chain-of-thought (CoT) prompting instructs a model to produce explicit intermediate reasoning steps before arriving at a final answer. In its simplest form, adding "think step by step" to a prompt is enough; more structured variants supply few-shot examples that demonstrate what a reasoning trace should look like.
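The two forms described above can be sketched as simple prompt builders. This is an illustrative sketch only; the helper names and the worked example are hypothetical, not part of any particular library.

```python
# Illustrative sketch of the two common CoT prompt forms.
# Helper names and the demonstration example are hypothetical.

def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: append a reasoning trigger to the bare question."""
    return f"{question}\nLet's think step by step."

def few_shot_cot(question: str) -> str:
    """Few-shot CoT: prepend a worked example whose trace demonstrates
    the desired reasoning format before the target question."""
    demo = (
        "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
        "How many balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
        "5 + 6 = 11. The answer is 11.\n\n"
    )
    return f"{demo}Q: {question}\nA:"

prompt = few_shot_cot("A store had 23 apples and sold 9. How many remain?")
```

The few-shot variant trades prompt tokens for control over the trace format, which matters later when you want to parse intermediate steps out of the output.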

The technique attracted wide attention after Wei et al. (NeurIPS 2022) showed substantial gains on arithmetic, commonsense, and symbolic reasoning benchmarks. That paper is the primary source most practitioners cite — but its conditions matter more than its headline numbers.

Scale-dependent emergence

The gains observed by Wei et al. are not uniform across models. CoT benefits are an emergent property that only manifests above a critical scale threshold of roughly 100 billion parameters. Below that threshold, models show minimal or inconsistent improvement. The transition is not gradual: it is a qualitative shift in capability, not a smooth upward curve.

Small model trap

Applying CoT to a model below the ~100B parameter threshold is not a neutral choice. The visible reasoning steps consume tokens and add latency, and they may fail to improve accuracy or, in some cases, actively hurt it. The technique was characterized on large models; its costs apply at any scale.

This scale-dependence also limits cross-model transfer. Each model family tends to develop its own reasoning patterns shaped by its training data and architecture. When you attempt to port a CoT prompt from one model family to another, architectural and stylistic compatibility — not just model size — determines whether the transfer works. Smaller or architecturally mismatched models exhibit limited and inconsistent CoT benefits even when the prompt is technically valid.

The faithfulness problem

The most consequential and under-appreciated finding in CoT research is that the visible reasoning trace is frequently unfaithful to the model's actual computational process.

Models generate plausible-sounding intermediate steps that do not reflect the factors actually driving their predictions. Training objectives do not incentivize models to accurately report the reasons for their behavior. The result is post-hoc narrative construction: a coherent-looking explanation that was written after the answer was, in some sense, already determined by internal activations.

"Current CoT techniques are often over-trusted despite producing coherent-sounding but unfaithful reasoning." — Barez & Wu, Chain-of-Thought Is Not Explainability

Specific evidence for unfaithfulness:

  • Models' explanations can be predictably influenced by biasing features in the input that the model never explicitly mentions in its reasoning trace. The bias shapes the output, but the reasoning pretends it doesn't exist.
  • Models produce answers consistent with their own intermediate structures yet fail to update when those structures are explicitly modified. The reasoning and the answer are not coupled the way they appear to be.
  • An Anthropic study on measuring faithfulness in CoT confirmed that the surface text of a reasoning chain is an unreliable guide to what the model is computing.

The practical consequence: you cannot use CoT output as a debugging signal in the way you would use a stack trace or a log file. A clean-looking chain of thought does not mean the model reasoned correctly, and a messy chain does not always mean it reasoned incorrectly.

Consistency failures

A related phenomenon is CoT consistency failure: a model's answers and reasoning paths can diverge when the same question is asked in slightly different framings, even when each individual trace looks sound. Two types are well-documented:

  • Hypothetical consistency failures: the answer should remain invariant under a slightly altered framing of the same question, but it changes.
  • Compositional consistency failures: feeding the model's own sub-step output back into a subsequent prompt should yield the same final answer, but the answer changes.

These failures confirm that correct answers are not always supported by sound reasoning processes. The surface coherence of a chain of thought masks internal inconsistency.
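A hypothetical-consistency check is mechanically simple to run: ask the same question under several reframings and flag disagreement. The sketch below assumes an `ask_model` callable standing in for whatever inference call your stack provides; the toy stand-in model exists only to make the example runnable.

```python
# Sketch of a hypothetical-consistency check. `ask_model` is a stand-in
# for your actual inference call; the lambda below is a toy model that
# is (deliberately) sensitive to surface framing.
from collections import Counter

def consistency_check(ask_model, framings: list[str]) -> dict:
    """Return each distinct answer with its frequency across framings.
    More than one distinct answer signals a consistency failure."""
    answers = [ask_model(f) for f in framings]
    counts = Counter(answers)
    return {"consistent": len(counts) == 1, "answers": dict(counts)}

fake = lambda q: "42" if "total" in q else "41"
result = consistency_check(fake, [
    "What is the total count?",
    "How many items are there altogether?",
])
# result["consistent"] is False: the two framings disagree.
```

In practice you would sample each framing several times and normalize answer strings before counting, but the core signal is the same: any framing-dependence in the answer set.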

Intermediate step error propagation

In a multi-step reasoning chain, errors at early steps do not stay local — they propagate and amplify. Research on CoT perturbation shows that errors early in reasoning chains grow through subsequent steps, with degradation present across open-weight models from 7B to 120B parameters. Intermediate reasoning steps effectively pollute the context window, making chains vulnerable to cumulative failure.

Two error types are particularly common in mathematical CoT:

  • Calculation errors: arithmetic mistakes despite a structurally correct problem setup.
  • Logical errors: violations of inference rules that compound downstream.

Aggregation strategies that only track final answers — such as majority voting across samples — will miss these intermediate failures entirely. A model can arrive at the right answer for wrong reasons, and aggregation makes this invisible.
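The masking effect is easy to demonstrate with toy data. In the sketch below, the sampled chains and their intermediate values are fabricated for illustration: a majority vote over final answers looks healthy even though only one of three chains is actually sound.

```python
# Toy demonstration: majority voting over final answers hides
# intermediate-step errors. All data here is fabricated.
from collections import Counter

samples = [
    # Each sampled chain: its intermediate step values and final answer.
    {"steps": [6, 11], "final": 11},   # sound chain
    {"steps": [5, 11], "final": 11},   # wrong intermediate, "right" answer
    {"steps": [6, 12], "final": 12},   # early error propagated to the end
]

def majority_vote(samples):
    """Final-answer aggregation: most common final answer wins."""
    return Counter(s["final"] for s in samples).most_common(1)[0][0]

def steps_valid(sample, expected_steps):
    """Independent check of every intermediate value, not just the answer."""
    return sample["steps"] == expected_steps

answer = majority_vote(samples)                        # 11: looks fine
sound = [s for s in samples if steps_valid(s, [6, 11])]
# Only 1 of 3 chains is actually sound; the vote makes that invisible.
```

The vote returns the right answer, so a final-answer metric reports success, while two of the three chains contain exactly the error classes listed above.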

Pattern matching versus genuine reasoning

There is active contention in the research community about whether CoT enables genuine abstract reasoning or whether it reflects sophisticated pattern matching against reasoning templates from pretraining.

The evidence favors a sobering interpretation: models appear to execute CoT by memorizing a map of reasoning schemata from training data and interpolating between known patterns to solve new problems. Key indicators:

  • Models treat mathematically equivalent sequences as statistically different tokens — surface form, not logical structure, governs their behavior.
  • When reasoning paths are longer than those seen in training, performance drops noticeably.
  • Minor format changes cause significantly more mistakes than the changes' logical content would warrant.
  • Token-level analysis of reasoning chains reveals explicit memorization of reasoning templates, not generalizable logical inference.

This framing explains distribution shift failures: CoT reasoning becomes unreliable when pushed beyond the distribution of patterns present in pretraining. The technique is not a reasoning engine — it is a pattern interpolation engine that happens to produce reasoning-shaped output when the problem is close enough to something it has seen before.

The mirage problem

One framing from recent literature describes CoT as a "brittle mirage" under out-of-distribution conditions. Benchmarks that show impressive CoT performance may be measuring how well the model matches training-time reasoning templates, not how well it generalizes to novel logical structures.

CoT and pragmatic reasoning

From the pragmatics framing introduced in Module 02, CoT can be understood as a multi-step speech act sequence. Instructing a model to reason step-by-step substantially improves performance on pragmatic tasks — resolving implicatures, handling indirect speech acts — but these improvements represent shallow pattern adaptation, not principled pragmatic inference. They are context-dependent and may not generalize to novel pragmatic scenarios. The mechanism is the same: pattern interpolation against training data, not structural reasoning.


Common Misconceptions

"CoT gives me visibility into how the model thinks." The reasoning trace shows you what the model wrote, not what it computed. These are different things. Unfaithful rationales are well-documented: the trace can be systematically misleading about the actual factors driving the output. CoT is not an interpretability tool.

"If the reasoning trace looks correct, the answer is reliable." Consistency failures show that a model can produce a correct-looking reasoning trace and still produce different answers when the same problem is slightly reframed. A coherent trace is not sufficient evidence of sound reasoning.

"Adding 'think step by step' always helps." It only reliably helps on large models (roughly 100B+ parameters), on tasks that are well-represented in training data, and on problems whose reasoning length falls within the distribution of training examples. Below the scale threshold, or on out-of-distribution problems, CoT can hurt or have no effect.

"A correct final answer means the intermediate steps were valid." Intermediate step errors are a distinct failure mode from final answer errors. A model can reach the right answer via incorrect intermediate steps, and those errors may cause failures on structurally similar problems later. Tracking only final answers misses this class of failure.


Annotated Case Study

Scenario: debugging a CoT prompt for a multi-step financial calculation

A team is using a ~70B parameter model to perform multi-step financial calculations: given a set of quarterly figures, derive year-over-year growth rates, adjust for inflation, and produce a final comparison. They add CoT with the instruction "reason through each step before giving your answer."

What happens:

The model produces fluent, internally coherent reasoning traces. Spot checks of a few examples look correct. They ship the feature.

In production, they discover that roughly 15% of outputs have incorrect final answers — but the reasoning traces look fine. They cannot identify the failure pattern by reading traces.

What the research explains:

  1. Scale threshold: At ~70B parameters, the model is below the ~100B threshold for reliable emergent CoT gains. The technique provides inconsistent benefits on this model class.

  2. Unfaithful rationales: The reasoning traces are coherent but do not reflect what drove the calculation. The model is constructing a plausible narrative after the numerical output is already shaped by its internal process.

  3. Intermediate step error propagation: Calculation errors in early steps (e.g., an arithmetic mistake in the inflation adjustment) propagate through subsequent steps. Because each step builds on the last, the error compounds. The final answer is wrong, but the trace describes correct-sounding reasoning because the model narrates what "should" happen, not what it actually computed.

  4. Pattern matching: The specific format of the financial data (column headers, decimal formatting, currency symbols) does not closely match the distribution of financial reasoning problems in training data. The model is interpolating rather than reasoning, and the interpolation is noisy.

The diagnostic failure:

The team's instinct — read the reasoning trace to find the bug — is precisely the wrong approach. Unfaithful rationales mean the trace cannot be trusted to locate the computation error. The correct approach is to evaluate intermediate outputs independently: extract each step's numerical output and validate it against a known-correct calculation, treating the chain of thought as untrusted output rather than as a reliable process log.

Debugging CoT failures

When CoT produces wrong answers, do not start by reading the trace. Instead: (1) isolate each intermediate numerical output and verify it independently, (2) test whether the same prompt with minor reformatting produces different answers (consistency check), and (3) test whether the model produces correct answers on similar problems with shorter reasoning chains (distribution check).
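Diagnostic (1) can be sketched as a small validation harness: treat the trace as untrusted output, extract its numeric values, and compare each against a known-correct reference computation. The regex extraction and tolerance here are assumptions, not a robust parser; real traces need per-step structure to extract reliably.

```python
# Sketch of diagnostic (1): validate each intermediate value in a trace
# against a reference calculation. Regex extraction and the tolerance
# are simplifying assumptions for illustration.
import re

def extract_numbers(trace: str) -> list[float]:
    """Pull numeric values out of a reasoning trace, in order."""
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", trace)]

def first_bad_step(trace: str, reference: list[float], tol: float = 1e-6):
    """Index of the first intermediate value that deviates from the
    reference calculation, or None if all extracted values match."""
    values = extract_numbers(trace)
    for i, (got, want) in enumerate(zip(values, reference)):
        if abs(got - want) > tol:
            return i
    return None

trace = "YoY growth is 8.0 percent; inflation-adjusted that is 5.1 percent."
bad = first_bad_step(trace, [8.0, 4.9])
# bad == 1: the inflation-adjustment step deviates from the reference.
```

This localizes the failing step directly from the numbers, without trusting anything the narrative text claims about how those numbers were produced.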


Boundary Conditions

When CoT is likely to help:

  • Models at or above ~100B parameters.
  • Tasks that are well-represented in training data (arithmetic, code, commonsense reasoning on common patterns).
  • Problems whose required reasoning chain length falls within the distribution of training examples.
  • Tasks where aggregating across multiple sampled reasoning paths (self-consistency) is feasible.

When CoT is likely to hurt or be neutral:

  • Models below the ~100B parameter scale threshold.
  • Novel problem structures not well-represented in training data — the interpolation map doesn't cover them.
  • Reasoning chains longer than training examples — performance drops noticeably at the distribution boundary.
  • Adversarial or reformatted inputs — semantically irrelevant perturbations cause accuracy drops of over 42% in code-reasoning tasks.
  • Contexts where you need a reliable debugging signal — the trace is not one.

When CoT gives false confidence:

  • Any time you are reading a trace to verify correctness. Consistency failures and unfaithful rationales mean a clean trace does not validate the answer.
  • Production systems that only track final answer accuracy. Intermediate step errors are invisible to this metric and can cause correlated failures on structurally similar problems.

Architectural considerations: CoT fragility to perturbations is not a small-model problem. It is present across open-weight models from 7B to 120B parameters, indicating a fundamental property of how transformers implement multi-step reasoning — not a scale artifact. Do not assume that a larger model solves the fragility problem.


Compare & Contrast

CoT versus standard prompting

| Dimension | Standard prompting | Chain-of-thought |
| --- | --- | --- |
| Output length | Short, direct | Longer, includes reasoning trace |
| Performance on complex tasks (large models) | Lower | Higher |
| Performance on simple tasks | Comparable or better (less noise) | No advantage; added token cost |
| Debuggability | Output is the output | Trace looks like a debug log but is unreliable |
| Robustness to reformatting | Moderate | More fragile; multi-step chains are disrupted by perturbations |
| Scale requirements | Works across scales | Reliable gains only at ~100B+ |

CoT versus self-consistency

Self-consistency (sampling multiple reasoning paths and taking a majority vote) addresses one CoT failure mode — variance in individual traces — but does not address the faithfulness problem or intermediate step errors. Aggregating over unfaithful traces does not make them faithful. Self-consistency improves final answer accuracy on mathematical tasks despite individual path inconsistencies, but this masks the intermediate error patterns rather than fixing them.

CoT as interpretability tool (what it is not)

CoT is sometimes positioned alongside mechanistic interpretability work as a way to understand model reasoning. This conflates two different things. Mechanistic interpretability examines internal activations and attention weights to infer what the model is computing. CoT produces a text sequence that the model writes alongside its answer. These are not the same thing. Barez & Wu make this distinction explicit: chain-of-thought is not explainability.

Key Takeaways

  1. CoT gains are scale-gated. The technique reliably improves performance only on models at approximately 100B+ parameters. Below that threshold, CoT provides inconsistent or no benefit, while still consuming tokens and adding latency.
  2. The reasoning trace is not the computation. CoT rationales are frequently unfaithful: models produce coherent-sounding explanations that do not reflect the actual factors driving their predictions. Do not use a CoT trace as a debugging signal in the way you would use a stack trace.
  3. Intermediate errors propagate and compound. Mistakes in early reasoning steps amplify through subsequent steps. This failure mode is invisible to metrics that only track final answer accuracy. Validating intermediate outputs independently is the only reliable diagnostic.
  4. CoT is pattern interpolation, not abstract reasoning. Models interpolate between reasoning templates memorized from training data. Performance degrades when problems require reasoning chains longer than training examples, when formatting shifts, or when the problem structure is out-of-distribution.
  5. Consistency failures are common. A model can produce the correct answer via internally inconsistent reasoning paths. A clean-looking trace is not evidence of sound reasoning — it is evidence that the model wrote a clean-looking trace.

Further Exploration

Primary research