
Context Engineering and Chunking

The information you give a model is as engineered as the instruction you write

Learning Objectives

By the end of this module you will be able to:

  • Select an appropriate chunking strategy for a given content type and retrieval task.
  • Explain position bias in long-context windows and structure prompts to mitigate it.
  • Compare recursive and hierarchical retrieval approaches on cost, context quality, and implementation complexity.
  • Apply decomposition and expert chunking principles to organize context for multi-step prompts.

Core Concepts

Context is half the prompt

When engineers think about prompts, they tend to focus on the instruction side: the system prompt, the task wording, the output format. But the retrieved or injected context that accompanies those instructions is equally load-bearing. Garbage in, garbage out applies just as much to the documents you surface as to the directions you write.

Context engineering is the practice of deciding what information to supply to a model, how to segment it, and where to place it. The decisions compound: a wrong chunking strategy can render good retrieval useless; correct retrieval placed in the wrong position can be ignored anyway.

Chunking: the unit-of-retrieval problem

Chunking is text segmentation — splitting source documents into retrievable units before they enter a vector store or are injected into a prompt. The core tension is between two competing goods:

  • Retrieval precision: smaller units match queries more cleanly.
  • Generation quality: the model needs enough surrounding context to reason correctly.

Research on small chunks makes this concrete. Semantic chunking frequently produces fragments averaging 43 tokens. Those fragments retrieve accurately, but at generation time the LLM lacks the context to synthesise cross-paragraph reasoning — producing the paradox of high recall and wrong answers. In one benchmark, semantic chunking achieved higher retrieval recall but only 54% overall accuracy versus 69% for recursive chunking.

The working memory analogy

Humans group information into meaningful units to overcome strict working-memory limits. Models have a structural version of the same constraint — but applied to position, not just capacity.

Cognitive science research frames chunking as the mechanism that lets humans hold more in working memory: by treating complex structures (functions, design patterns, modules) as single units, experts dramatically expand their effective capacity. Expert engineers apply explicit problem decomposition to break hard problems into coherent sub-problems — a skill that transfers directly to partitioning long documents into retrieval-ready units.

The analogy is imperfect but instructive: chunk at the level of meaningful unit, not at a fixed byte count.

Position bias: the lost-in-the-middle problem

Long-context models do not process all positions equally. MIT and ACL research demonstrates a U-shaped performance curve: information at the beginning (primacy bias) and end (recency bias) of the context window performs significantly better than information in the middle. Performance on middle-positioned content degrades by over 30%.

This is not a memory failure. It is an architectural property:

  • RoPE decay: rotary position embeddings introduce long-term decay, causing models to systematically de-emphasise middle tokens.
  • Attention sinks: certain attention heads disproportionately attend to the first token, amplifying primacy bias.
  • Longer context windows from larger chunks increase exposure to middle-position degradation.

Position bias compounds with chunk size

Retrieving a larger chunk to preserve context may push critical information into the middle of the window — trading one problem for another. Knowing about position bias should influence both chunking strategy and how you assemble the final context window.


Compare & Contrast

Chunking strategy decision matrix

| Strategy | Computation at ingest | Typical size | Strengths | Weaknesses |
| --- | --- | --- | --- | --- |
| Fixed / recursive | Zero embedding calls | 512 tokens + 50–100 token overlap | Fast, scalable, benchmark-validated default | Splits at arbitrary token boundaries, not meaning boundaries |
| Sentence-level | Zero embedding calls | ~1–3 sentences | Preserves natural boundaries, competitive with semantic up to ~5,000 tokens | Poor on long-narrative tasks (NarrativeQA: 4.2% recall@1) |
| Semantic | 200–300 embedding calls per 10K-word doc | ~43 tokens avg | Theoretically ideal boundary alignment | High ingest cost; fragmented chunks hurt generation quality; recall gains ≤9% |
| Hierarchical (parent-child) | Zero embedding calls | Child: sentence; parent: paragraph/section | Balances precision and context without semantic analysis | Implementation complexity; framework support required |

Key finding from NAACL 2025: fixed-size recursive chunking at 512 tokens matched or exceeded semantic chunking on realistic documents while requiring zero embedding overhead.

Recursive vs. hierarchical retrieval

These are not mutually exclusive, but they solve different problems.

Recursive chunking is about how you segment. You split on natural delimiters — paragraphs, headings, code blocks — and fall back to token limits. The result is structurally aware without semantic analysis.
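The splitting logic above can be sketched in a few lines of Python. This is a minimal illustration, not a production splitter: token counts are approximated by whitespace-separated words, overlap is omitted for brevity, and the delimiter list is an assumption about the source format.

```python
# Recursive chunking sketch: try natural delimiters in order of coarseness,
# and hard-split on the token limit only when no delimiter remains.
DELIMITERS = ["\n\n", "\n", ". "]  # paragraphs, then lines, then sentences

def recursive_split(text: str, max_tokens: int = 512, delimiters=None) -> list[str]:
    delimiters = DELIMITERS if delimiters is None else delimiters
    if len(text.split()) <= max_tokens:
        return [text]
    if not delimiters:
        # No structural delimiter left: fall back to a hard token-limit split.
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
    head, *rest = delimiters
    parts = [p for p in text.split(head) if p.strip()]
    if len(parts) == 1:
        # This delimiter did not divide the text; try the next, finer one.
        return recursive_split(text, max_tokens, rest)
    chunks = []
    for part in parts:
        chunks.extend(recursive_split(part, max_tokens, delimiters))
    return chunks
```

A real pipeline would count tokens with the embedding model's tokenizer and re-attach a 50–100 token overlap between adjacent chunks.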

Hierarchical retrieval is about how you retrieve and assemble. You index small child chunks (sentences) for query matching, but when a child chunk is retrieved, you surface the broader parent chunk to the LLM. The child chunk finds the right place; the parent chunk gives the model enough context to reason.

Parent-child retrieval directly resolves the precision-context tradeoff that semantic chunking tries to solve with embedding analysis — but without the ingestion cost.
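A minimal sketch of the parent-child pattern, with keyword overlap standing in for vector similarity (a real system would embed the child chunks). The `build_index` and `retrieve` names are illustrative, not from any particular framework.

```python
# Parent-child retrieval sketch: sentences (children) are matched against
# the query, but the unit surfaced to the LLM is always the paragraph (parent).

def build_index(paragraphs: list[str]) -> list[tuple[str, str]]:
    """Return (child, parent) pairs: each sentence points back to its paragraph."""
    index = []
    for para in paragraphs:
        for sent in para.split(". "):
            if sent.strip():
                index.append((sent.strip(), para))
    return index

def retrieve(query: str, index: list[tuple[str, str]], k: int = 2) -> list[str]:
    # Score children by word overlap with the query (stand-in for cosine similarity).
    q = set(query.lower().split())
    scored = sorted(index,
                    key=lambda cp: len(q & set(cp[0].lower().split())),
                    reverse=True)
    # De-duplicate: several children may share one parent.
    parents, seen = [], set()
    for child, parent in scored:
        if parent not in seen:
            seen.add(parent)
            parents.append(parent)
        if len(parents) == k:
            break
    return parents
```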


Key Principles

1. Match chunk size to query type and embedding model, not to convention.

Research across six datasets and five embedding models shows that optimal chunk size varies dramatically. Fact-based QA with entity embeddings (Snowflake-style) performs best at 64–128 tokens. Complex analytical retrieval with contextual embeddings (Stella-style) performs best at 512–1,024 tokens. There is no universal optimum. Embedding model selection alone influences retrieval quality as much as the chunking configuration does.
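The ranges above can be encoded as starting defaults. In this sketch the category labels and the midpoint choice are assumptions to be tuned against your own evaluation set, not prescriptions.

```python
# Starting chunk sizes keyed by (query type, embedding style), per the
# ranges quoted in the text. Unknown combinations fall back to 512 tokens.
CHUNK_SIZE_DEFAULTS = {
    ("factoid", "entity"): (64, 128),           # Snowflake-style entity embeddings
    ("analytical", "contextual"): (512, 1024),  # Stella-style contextual embeddings
}

def starting_chunk_size(query_type: str, embedding_style: str) -> int:
    low, high = CHUNK_SIZE_DEFAULTS.get((query_type, embedding_style), (512, 512))
    return (low + high) // 2  # midpoint of the band as an initial value
```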

2. Default to recursive 512-token chunking with overlap unless you have evidence to do otherwise.

Benchmark data from February 2026, testing seven strategies across 50 academic papers, found recursive 512-token splitting to be the top performer at 69% accuracy. It requires zero embedding calls, is deterministic, and scales linearly. It is the right null hypothesis.

3. Use hierarchical retrieval when precision-context tradeoffs matter.

If retrieval recall is the goal but generation quality keeps degrading, parent-child chunking is the highest-leverage intervention. Match small units to queries; surface large units to the model.

4. Place critical information at the beginning or end of the context window.

Given position bias, the assembly order of retrieved chunks matters as much as which chunks you retrieve. Put the highest-relevance content first or last. If the model must reason across multiple retrieved passages, the most important one should not be sandwiched in the middle.
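One way to operationalise this placement rule is to alternate the highest-relevance chunks between the front and the back of the window, so that the weakest material lands in the degraded middle. This is a sketch of one reasonable policy, not the only one.

```python
# Position-aware assembly: input is sorted by relevance (best first).
# Even-ranked chunks fill the front of the window, odd-ranked chunks fill
# the back, leaving the least relevant content in the middle.
def assemble_window(chunks_by_relevance: list[str]) -> list[str]:
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

For `["a", "b", "c", "d", "e"]` (most to least relevant), the top chunk leads the window and the second-best closes it, with "d" and "b" guarding the end against recency-biased truncation of relevance.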

5. Decompose documents the way experts decompose problems.

Expert engineers use explicit problem decomposition to partition complexity into coherent sub-problems before solving. The same discipline applied to documents — chunking at logical, meaningful boundaries rather than byte boundaries — is what separates principled context engineering from naive text splitting.


Worked Example

Scenario: You are building a RAG system over a technical API reference (~200,000 tokens of markdown). Users ask a mix of questions: "What parameters does CreateUser accept?" (factoid) and "How should I handle rate limiting across multiple endpoints?" (analytical, cross-document).

Step 1 — Audit query types. You have two distinct query patterns requiring different retrieval behaviour. Factoid queries want short, precise chunks. Analytical queries need richer context.

Step 2 — Choose a hierarchical strategy. Split the reference at the heading level for parent chunks (full endpoint sections, ~500–800 tokens) and at the paragraph level for child chunks (~100–150 tokens). Index only child chunks for vector retrieval. When a child chunk is retrieved, return the parent.

Step 3 — Apply recursive splitting as a fallback. For sections that exceed 1,024 tokens (e.g., a long authentication overview), apply recursive splitting at 512 tokens with 100-token overlap. This caps the maximum context injected and keeps generation quality stable.
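Steps 2 and 3 can be sketched for a markdown reference. The heading regex and the blank-line paragraph delimiter are assumptions about how this particular document is structured; an oversized parent section would still be handed to a recursive splitter afterwards.

```python
import re

# Worked-example sketch: headings delimit parent sections; the paragraphs
# inside each section become the child chunks that get indexed for retrieval.
def split_api_reference(markdown: str) -> list[dict]:
    # Split *before* every markdown heading line, keeping the heading
    # with the section it introduces.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    records = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        children = [p.strip() for p in section.split("\n\n") if p.strip()]
        records.append({"parent": section, "children": children})
    return records
```

Each record maps every indexed child back to its parent, so retrieval of any paragraph surfaces the full endpoint section.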

Step 4 — Assemble the context window deliberately. Sort retrieved chunks by relevance score. Place the highest-scoring chunk first. If the final answer requires synthesis across three chunks, do not interleave boilerplate — put the three relevant sections at the top, followed by any supporting material.

Step 5 — Test across query types. Run representative queries from each category against your chunking config. If factoid queries degrade, your child chunks may be too large. If analytical queries produce shallow answers, your parent chunks may be too small or are being dropped.

No single configuration survives contact with all query types

Plan for iterative calibration. The right approach is to start with a defensible default (recursive 512-token), instrument retrieval recall and generation quality separately, then adjust.
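Instrumenting retrieval recall separately from generation quality can be as simple as the sketch below. The evaluation format — each query paired with the ids of its labelled relevant chunks — is a hypothetical structure for illustration.

```python
# Retrieval-side metric only: recall@k against labelled relevant chunk ids,
# measured independently of any grading of the generated answer.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def evaluate_retrieval(queries: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    # `queries` pairs each query's retrieved ids with its known-relevant ids.
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in queries]
    return sum(scores) / len(scores)
```

If this number is high while answer quality is low, the problem is on the generation side — chunk fragmentation or window assembly — not retrieval.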


Boundary Conditions

When sentence-level chunking breaks down. Simple sentence splitting performs well for documents under approximately 5,000 tokens but degrades severely on narrative or long-form content. On NarrativeQA, sentence-level chunks (64–128 tokens) achieve only 4.2% recall@1 — the text requires multi-paragraph context that sentences cannot provide. Do not use sentence chunking on long-form documents where answers span paragraphs.

When semantic chunking is worth its cost. Semantic chunking can improve recall by up to 9% in scenarios where semantic coherence is critical and document structure does not map cleanly to meaningful units (e.g., continuous prose without headings, transcriptions, or legal text with complex cross-references). If you have offline ingestion pipelines where the embedding overhead is a one-time cost and retrieval precision matters more than speed, semantic chunking is a legitimate choice.

When position bias mitigation is not enough. Placing important chunks at the start or end of the context window helps, but architectural factors (RoPE decay, attention sinks) cannot be fully compensated by reordering alone. For critical analytical tasks requiring tight cross-reference across many retrieved chunks, consider decomposing the retrieval into multiple focused model calls rather than loading everything into one window.

When hierarchy adds cost without benefit. Parent-child retrieval requires framework support (LangChain, Dify, or custom implementation). For simpler applications where queries are narrow and factoid, the implementation overhead of hierarchical retrieval is not justified. Recursive chunking at 512 tokens remains the appropriate default for most workloads.

Key Takeaways

  1. Chunking strategy determines whether retrieved content is usable, not just retrievable. Very small chunks improve retrieval recall but degrade generation quality; the optimal chunk size depends on the embedding model, dataset, and query type.
  2. Recursive 512-token chunking with 50–100 token overlap is the benchmark-validated default. It requires zero embedding overhead at ingest and consistently outperforms semantic chunking in overall accuracy, not just retrieval recall.
  3. Hierarchical parent-child retrieval resolves the precision-context tradeoff without semantic analysis. Index small chunks for matching; surface parent chunks to the model. This is the highest-leverage structural improvement to retrieval quality.
  4. Position bias is architectural, not accidental. Performance degrades by over 30% for content in the middle of the context window. Assemble the context window deliberately: critical information at the beginning or end.
  5. Context engineering is decomposition work. The same expert discipline that breaks complex engineering problems into coherent sub-problems applies to partitioning documents. Chunk at the level of meaningful unit, not at a fixed byte count.
