Retrieval-Augmented Generation

How grounding language models in external knowledge became the backbone of production AI

Lead Summary

Retrieval-Augmented Generation (RAG) is an AI architecture that couples a language model with an external knowledge store, allowing the model to ground its outputs in retrieved documents rather than relying solely on parameters learned during training. Introduced by Lewis et al. in 2020 as a solution to knowledge-intensive NLP tasks, RAG has become the dominant pattern for building AI systems that need access to fresh, domain-specific, or verifiable information.

The appeal is direct: a parametric model's knowledge is frozen at training time, whereas RAG systems can consult a living corpus at inference time. In the original evaluation, RAG models generated more specific, diverse, and factual language than parametric-only baselines and set state-of-the-art results on multiple open-domain question-answering benchmarks at the time of introduction. By 2025-2026, RAG is widely described as a key architecture for maintaining accuracy and authority in AI-assisted literature synthesis across clinical and academic domains.

Core Concepts

The Three-Stage Pipeline

The RAG pipeline is fundamentally structured as three sequential stages, a decomposition established in the original Lewis et al. paper and still the core conceptual framework across industry and academic implementations:

  1. Retrieve — fetch relevant documents or passages from an external corpus in response to the user query.
  2. Augment — inject the retrieved documents into the model's prompt as additional context.
  3. Generate — produce a grounded response that draws on both parametric knowledge and the retrieved material.

This architecture couples neural retrieval with generative models, grounding output in non-parametric memory while retaining the semantic generalization ability of the underlying language model.
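
The three stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of the original paper: the word-overlap scorer, the tiny `CORPUS`, and the `echo_llm` stand-in for a model call are all hypothetical placeholders chosen so the example runs without external services.

```python
from typing import Callable, List

# Hypothetical three-passage corpus used only for illustration.
CORPUS = [
    "RAG couples a language model with an external knowledge store.",
    "BM25 is a sparse keyword ranking function.",
    "Chunking splits documents into smaller retrieval units.",
]


def tokenize(text: str) -> set:
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(query: str, corpus: List[str], top_k: int = 2) -> List[str]:
    """Stage 1: rank passages by word overlap with the query (toy scorer)."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:top_k]


def augment(query: str, passages: List[str]) -> str:
    """Stage 2: inject the retrieved passages into the prompt as context."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def generate(prompt: str, llm: Callable[[str], str]) -> str:
    """Stage 3: call the language model on the augmented prompt."""
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model: returns the top-ranked context line."""
    return prompt.splitlines()[1]


def rag_answer(query: str, corpus: List[str] = CORPUS,
               llm: Callable[[str], str] = echo_llm) -> str:
    return generate(augment(query, retrieve(query, corpus)), llm)
```

Swapping `echo_llm` for a real model client and `retrieve` for a vector search changes the components but not the shape of the pipeline.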

Dense Passage Retrieval

The Lewis et al. paper used Dense Passage Retrieval (DPR), introduced separately by Karpukhin et al. in 2020, as its retrieval component. DPR uses a dual-encoder architecture: independent transformer models encode queries and documents into dense vectors, and retrieval is a maximum inner product search over the precomputed document vector index, usually served by an approximate nearest-neighbor index at scale. While newer embedding models have superseded the original DPR weights, the dual-encoder pattern remains the architectural blueprint for most contemporary dense retrieval systems.
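
The dual-encoder pattern can be sketched as follows. The encoders here are toy bag-of-words functions standing in for trained transformers, and the search is an exact brute-force scan rather than an approximate index; `PASSAGES`, `VOCAB`, and all function names are assumptions made for the example. What carries over to real systems is the structure: passages are encoded once offline, the query is encoded at request time, and ranking is by inner product.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical passages; a real index would hold millions of vectors.
PASSAGES = [
    "dense retrieval encodes text into vectors",
    "sparse retrieval matches exact keywords",
    "the generator writes the final answer",
]

# Toy vocabulary; a trained encoder learns its representation instead.
VOCAB: Dict[str, int] = {
    w: i for i, w in enumerate(sorted({w for p in PASSAGES for w in p.split()}))
}


def _bag_of_words(text: str) -> List[float]:
    """Unit-length word-count vector over VOCAB (unknown words are ignored)."""
    vec = [0.0] * len(VOCAB)
    for word in text.lower().split():
        if word in VOCAB:
            vec[VOCAB[word]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def encode_passage(text: str) -> List[float]:
    """Passage tower (toy). In DPR this is its own transformer."""
    return _bag_of_words(text)


def encode_query(text: str) -> List[float]:
    """Query tower (toy). Kept separate to mirror the two-encoder design."""
    return _bag_of_words(text)


# Offline step: encode every passage once and keep the vectors.
INDEX = [encode_passage(p) for p in PASSAGES]


def search(query: str, top_k: int = 1) -> List[Tuple[str, float]]:
    """Online step: encode the query, then rank passages by inner product."""
    q = encode_query(query)
    scored = [(sum(a * b for a, b in zip(q, d)), i) for i, d in enumerate(INDEX)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(PASSAGES[i], score) for score, i in scored[:top_k]]
```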

Mechanism & Process

Chunking

Before a corpus can be indexed for retrieval, documents must be segmented into smaller retrieval units through a process called chunking. Chunking strategy is a critical determinant of RAG system quality: fixed-size, semantic, adaptive, and hierarchical strategies each produce different retrieval accuracy profiles across query types and domains.
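
As a point of reference, the naive fixed-size baseline can be sketched as below. The word-level splitting, the default sizes, and the name `chunk_fixed` are assumptions for illustration; production systems typically count model tokens rather than whitespace-separated words.

```python
from typing import List


def chunk_fixed(text: str, size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk so sentences cut at a boundary stay retrievable."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Semantic and adaptive strategies replace the fixed `size` boundary with boundaries chosen from the content itself, which this sketch does not attempt.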

Advanced chunking methods can substantially outperform naive baselines. A study comparing chunking strategies for RAG in clinical decision-support applications reported 87% accuracy for semantic and adaptive chunking versus only 13% for basic fixed-size approaches.

Hierarchical chunking

One particularly effective pattern is parent-child chunking: small child chunks (individual sentences) are used to match user queries with high precision, while parent chunks (paragraphs or sections) are fed to the language model to provide broader narrative context. This balances retrieval precision with contextual completeness without requiring semantic analysis of the document structure. Modern RAG frameworks like Dify implement this as a first-class feature.
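
A minimal sketch of the parent-child pattern, assuming paragraphs are separated by blank lines and sentences end in `.`, `!`, or `?`. The word-overlap matcher and the function names are hypothetical; a real system would match child chunks by embedding similarity, and this is not a description of how any particular framework implements the feature.

```python
import re
from typing import List, Tuple


def build_parent_child_index(document: str) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Return (parents, children); each child is (sentence, parent_id)."""
    parents = [p.strip() for p in document.split("\n\n") if p.strip()]
    children = []
    for parent_id, parent in enumerate(parents):
        # Split after sentence-ending punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", parent):
            if sentence:
                children.append((sentence, parent_id))
    return parents, children


def _words(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_parent(query: str, parents: List[str],
                    children: List[Tuple[str, int]]) -> str:
    """Match the query against small child chunks, return the parent chunk."""
    q = _words(query)
    _, parent_id = max(children, key=lambda c: len(q & _words(c[0])))
    return parents[parent_id]
```

The query is scored against single sentences, but the caller receives the whole paragraph, which is the text that would be placed in the model's prompt.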

Hybrid Retrieval and Rank Fusion

Production RAG systems frequently combine dense vector search (embedding-based) with sparse keyword search (BM25-style) to capture both semantic and lexical signals. The challenge is combining rankings from two disparate scoring systems whose raw scores are not directly comparable.

Reciprocal Rank Fusion (RRF) solves this by operating on rank positions rather than raw scores. Each document receives a score of 1 / (k + rank) from each retriever that returns it, where k is a smoothing constant (typically 60), and the scores are summed across retrievers to produce a unified ranking. RRF needs no training and has only this one constant to set. Because it never compares raw scores, it avoids the calibration problems that arise when normalizing and combining disparate scoring systems such as cosine similarity and TF-IDF or BM25, which is why it has become a standard fusion method in production hybrid search systems.
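
The fusion rule is short enough to write out directly. The sketch below follows the 1 / (k + rank) formula with ranks starting at 1; the function name and the alphabetical tie-break are choices made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60) -> List[Tuple[str, float]]:
    """Fuse several ranked lists of document ids into one ranking.

    Each list contributes 1 / (k + rank) for every document it contains,
    where rank is the 1-based position in that list. Documents missing
    from a list simply receive nothing from it.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_ranking = ["a", "b", "c"]   # e.g. from embedding search
sparse_ranking = ["b", "d", "a"]  # e.g. from keyword search
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

In this example document `b` ranks first in the fused list because it is near the top of both inputs, even though it is not first in the dense ranking.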

Variants & Subtypes

Modular and Agentic RAG

The original RAG design is a static linear pipeline. By 2025-2026, the frontier has shifted to modular and agentic architectures. Modular agentic RAG decomposes the pipeline into specialized, interchangeable components — query planners, retrievers, rerankers, answer generators — orchestrated by a central controller or agent. This enables dynamic routing, conditional retrieval strategies, and compositional reasoning, making it better suited to complex, knowledge-intensive tasks where a single retrieval pass is insufficient.
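
One way to picture the modular decomposition is a controller that holds each component as a swappable callable. The sketch below is an assumption-laden illustration, not a reference design: the `Pipeline` class, its field names, and the toy components built on a two-entry `KB` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Pipeline:
    """Controller that wires together interchangeable components."""
    plan: Callable[[str], List[str]]               # question -> sub-queries
    retrieve: Callable[[str], List[str]]           # sub-query -> passages
    rerank: Callable[[str, List[str]], List[str]]  # reorder the evidence
    generate: Callable[[str, List[str]], str]      # write the answer

    def run(self, question: str, max_passages: int = 3) -> str:
        evidence: List[str] = []
        for sub_query in self.plan(question):
            for passage in self.retrieve(sub_query):
                if passage not in evidence:
                    evidence.append(passage)
        ranked = self.rerank(question, evidence)[:max_passages]
        return self.generate(question, ranked)


# Toy components over a hypothetical two-topic knowledge base.
KB: Dict[str, List[str]] = {
    "rrf": ["RRF sums 1/(k+rank) across retrievers."],
    "bm25": ["BM25 is a sparse keyword scorer."],
}


def toy_plan(question: str) -> List[str]:
    return [topic for topic in KB if topic in question.lower()]


def toy_retrieve(sub_query: str) -> List[str]:
    return KB.get(sub_query, [])


def toy_rerank(question: str, passages: List[str]) -> List[str]:
    return sorted(passages, key=len)  # placeholder ordering: shortest first


def toy_generate(question: str, passages: List[str]) -> str:
    return " ".join(passages) if passages else "No evidence found."


pipeline = Pipeline(toy_plan, toy_retrieve, toy_rerank, toy_generate)
```

Because the controller only depends on the callables' signatures, a planner or reranker can be replaced without touching the rest of the pipeline.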

Self-RAG: Adaptive Retrieval

Self-RAG is a variant in which the model itself decides whether to retrieve at all during generation. Rather than applying retrieval uniformly to every input, the model can skip retrieval for queries that do not require external knowledge, and can retrieve multiple times within a single generation. This adaptive mechanism optimizes for factuality when retrieval is needed and for efficiency when it is not — making Self-RAG more suitable for diverse downstream queries than static retrieval approaches.
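
The control flow of adaptive retrieval can be sketched as a loop that asks, before each generated segment, whether retrieval is needed. In Self-RAG that decision comes from the model itself; here `decide`, `retrieve`, and `write_segment` are hypothetical callables supplied by the caller, and the toy versions below are rule-based stand-ins, so this shows the loop's shape rather than the published method.

```python
from typing import Callable, List, Tuple


def adaptive_generate(
    query: str,
    decide: Callable[[str, str], bool],
    retrieve: Callable[[str, str], List[str]],
    write_segment: Callable[[str, str, List[str]], str],
    max_segments: int = 3,
) -> Tuple[str, int]:
    """Generate segment by segment, retrieving only when `decide` says so.

    Returns the full output and the number of retrieval calls made.
    """
    output: List[str] = []
    retrieval_calls = 0
    for _ in range(max_segments):
        draft = " ".join(output)
        passages: List[str] = []
        if decide(query, draft):
            passages = retrieve(query, draft)
            retrieval_calls += 1
        segment = write_segment(query, draft, passages)
        if not segment:
            break  # the writer signals it has nothing more to add
        output.append(segment)
    return " ".join(output), retrieval_calls


# Toy stand-ins: retrieve once, and only for queries about "latest" facts.
def toy_decide(query: str, draft: str) -> bool:
    return "latest" in query.lower() and draft == ""


def toy_retrieve(query: str, draft: str) -> List[str]:
    return ["Release notes: version 2 shipped."]


def toy_write(query: str, draft: str, passages: List[str]) -> str:
    if draft:
        return ""
    return passages[0] if passages else "Answered from parametric knowledge."
```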

Cache-Augmented Generation

Cache-Augmented Generation (CAG) is an alternative approach that inverts the traditional RAG design. Instead of retrieving relevant chunks at query time, CAG preloads the entire knowledge base into the model's context window and caches the resulting key-value (KV) representations. During inference, only the new query is processed; the model attends to the pre-cached knowledge base representations without any retrieval step.

CAG was introduced in a paper accepted at the ACM Web Conference 2025, establishing it as a peer-reviewed contribution to the retrieval and generation literature.

Controversies & Debates

The Retrieval Bottleneck

A significant practical tension in RAG systems is latency. Research from Redis quantifies the cost: retrieval operations — vector database queries, ranking, reranking — account for 41% of total end-to-end latency in RAG systems and 45-47% of time-to-first-token (TTFT) latency. A typical vector database query adds 50–300ms to the response pipeline.

This bottleneck is precisely what CAG is designed to eliminate. By precomputing and caching the KV representations of the entire knowledge base, CAG implementations avoid retrieval entirely at inference time, trading memory and storage for dramatically lower latency.
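
The 41% figure implies a ceiling on what removing retrieval can buy. The arithmetic below is an Amdahl-style upper bound under a strong assumption: the retrieval share disappears entirely and nothing else gets slower. In practice attending over a long cached context has its own cost, so real gains may be smaller. The function names are illustrative.

```python
def retrieval_ms(total_ms: float, retrieval_share: float = 0.41) -> float:
    """Milliseconds spent on retrieval for a given end-to-end latency."""
    return total_ms * retrieval_share


def speedup_upper_bound(retrieval_share: float) -> float:
    """Best-case end-to-end speedup if retrieval time dropped to zero."""
    if not 0.0 <= retrieval_share < 1.0:
        raise ValueError("retrieval_share must be in [0, 1)")
    return 1.0 / (1.0 - retrieval_share)


# With retrieval at 41% of total latency, the remaining 59% still has to run,
# so the bound is 1 / 0.59, roughly a 1.69x speedup.
bound = speedup_upper_bound(0.41)
```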

Fig. 1. Latency composition in a RAG pipeline: retrieval 41%, generation 38%, other 21%.

Retrieval Errors and Accuracy

Retrieval is not only slow — it can be wrong. Traditional RAG systems can select irrelevant chunks, miss relevant documents, or produce ranking errors that cause the model to hallucinate based on incomplete context. The original CAG paper argues that for knowledge bases small enough to fit in the context window, full-context attention beats retrieval on accuracy because the model sees relationships that RAG's chunking destroys.

The Limits of RAG Alone

Enterprise practitioners describe a characteristic tension: they "cannot live without RAG, yet remain unsatisfied." Conventional RAG provides no mechanism to verify whether retrieved context is actually useful, no ability to retry if retrieval misses the mark, and no capacity to pull from multiple sources or use external tools. Achieving stable, accurate results on complex queries requires extensive fine-tuning, increasing total cost of ownership. These limitations motivate the shift toward agentic RAG designs.

Notable Examples

CBR-RAG: Case-Based Reasoning Meets RAG

Researchers have integrated classical Case-Based Reasoning (CBR) with RAG frameworks to address tasks requiring structured domain knowledge and accountable decision-making. CBR-RAG systems retrieve contextually relevant past cases using similarity-based indexing, augmenting LLM queries with structured case examples rather than raw document passages. CaseGPT and similar systems combine LLM generation with structured case retrieval, demonstrating that RAG is not restricted to unstructured text corpora.
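
Retrieval over structured cases can be sketched as attribute matching followed by prompt construction. Everything in the example is invented for illustration: the `CASES` records, the feature names, and the exact-match similarity are assumptions, and they do not describe how CaseGPT or any published CBR-RAG system represents or scores cases.

```python
from typing import Dict, List, Sequence

# Hypothetical case base; real systems store far richer case structures.
CASES: List[Dict[str, str]] = [
    {"id": "c1", "domain": "contract", "issue": "late delivery", "outcome": "damages awarded"},
    {"id": "c2", "domain": "contract", "issue": "non-payment", "outcome": "payment ordered"},
    {"id": "c3", "domain": "tort", "issue": "negligence", "outcome": "claim dismissed"},
]


def similarity(query_case: Dict[str, str], stored_case: Dict[str, str],
               features: Sequence[str] = ("domain", "issue")) -> float:
    """Fraction of the chosen features on which the two cases agree."""
    matches = sum(query_case.get(f) == stored_case.get(f) for f in features)
    return matches / len(features)


def retrieve_cases(query_case: Dict[str, str], cases: List[Dict[str, str]],
                   top_k: int = 2) -> List[Dict[str, str]]:
    """Return the top_k most similar stored cases."""
    ranked = sorted(cases, key=lambda c: similarity(query_case, c), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, query_case: Dict[str, str],
                 cases: List[Dict[str, str]]) -> str:
    """Augment the question with structured case summaries, not raw passages."""
    lines = [
        f"- {c['id']}: {c['issue']} ({c['domain']}) -> {c['outcome']}"
        for c in retrieve_cases(query_case, cases)
    ]
    return "Similar past cases:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"
```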

Implementation Notes

CAG: Practical Patterns

For knowledge bases that fit within the model's context window, CAG is operationally simpler than RAG: there is no vector database to maintain, no embedding model to run, no chunking strategy to tune, and no retrieval ranking system to configure. The knowledge base is processed once to produce KV cache tensors, which are saved to disk for reuse. Inference loads the cached tensors directly into the model's KV cache.
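
The compute-once, persist, and reload cycle can be illustrated with a toy single-head attention layer. This is a sketch under heavy simplifications: real implementations cache per-layer tensors produced by the model's own forward pass, whereas here the projection matrices `W_K` and `W_V`, the JSON file format, and all function names are assumptions chosen so the example runs with the standard library alone.

```python
import json
import math
import os
import tempfile
from typing import Dict, List

Vector = List[float]

# Hypothetical fixed projection weights (rows are input dimensions).
W_K = [[1.0, 0.0], [0.0, 1.0]]
W_V = [[0.5, 0.0], [0.0, 2.0]]


def project(x: Vector, w: List[Vector]) -> Vector:
    """Multiply vector x by weight matrix w."""
    return [sum(xi * wi for xi, wi in zip(x, column)) for column in zip(*w)]


def build_cache(kb_embeddings: List[Vector]) -> Dict[str, List[Vector]]:
    """One-time step: compute keys and values for every knowledge token."""
    return {
        "keys": [project(x, W_K) for x in kb_embeddings],
        "values": [project(x, W_V) for x in kb_embeddings],
    }


def save_cache(cache: Dict[str, List[Vector]], path: str) -> None:
    with open(path, "w") as handle:
        json.dump(cache, handle)


def load_cache(path: str) -> Dict[str, List[Vector]]:
    with open(path) as handle:
        return json.load(handle)


def attend(query_vec: Vector, cache: Dict[str, List[Vector]]) -> Vector:
    """Inference step: only the query is new; keys and values come from cache."""
    scale = math.sqrt(len(query_vec))
    logits = [sum(q * k for q, k in zip(query_vec, key)) / scale
              for key in cache["keys"]]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]  # numerically stable softmax
    total = sum(weights)
    dim = len(cache["values"][0])
    return [sum(w * v[j] for w, v in zip(weights, cache["values"])) / total
            for j in range(dim)]


def roundtrip_demo():
    """Build a cache, persist it, reload it, and attend with both copies."""
    cache = build_cache([[1.0, 0.0], [0.0, 1.0]])
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "kv_cache.json")
        save_cache(cache, path)
        reloaded = load_cache(path)
    query = [1.0, 0.0]
    return attend(query, cache), attend(query, reloaded)
```

`roundtrip_demo` returns the same attention output from the in-memory cache and from the copy reloaded from disk, which is the property that lets the expensive step run only once.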

The constraint is hard: the entire knowledge base must fit within the LLM's maximum context window. CAG is unsuitable for very large or continuously growing knowledge bases that exceed this limit. For those cases, traditional RAG or agentic RAG remains the appropriate architecture.
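
The fit check can be expressed as a simple budget comparison. The sketch below assumes a crude whitespace word count in place of the model's real tokenizer and a hypothetical `reserved_for_query` allowance for the question and answer; both should be replaced with measured values before relying on the result.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def choose_architecture(documents: List[str], context_window: int,
                        reserved_for_query: int = 512) -> str:
    """Return "CAG" if the whole knowledge base fits the budget, else "RAG"."""
    budget = context_window - reserved_for_query
    total = sum(estimate_tokens(d) for d in documents)
    return "CAG" if total <= budget else "RAG"
```

A knowledge base that passes today can fail later as it grows, so the check is worth rerunning whenever the corpus changes.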

Multiple open-source CAG implementations exist, including the reference implementation by Huang et al. and standalone systems like the dakshjain-1616 CAG System, providing working code for one-time cache computation and disk-based persistence.

Key Takeaways

  1. RAG couples language models with external knowledge stores to ground outputs in retrieved documents rather than relying solely on parameters learned during training. This solves the fundamental problem that parametric model knowledge is frozen at training time, whereas RAG systems can consult a living corpus at inference time.
  2. The RAG pipeline has three sequential stages: Retrieve relevant documents from an external corpus, Augment the prompt with retrieved context, and Generate a response grounded in both parametric and retrieved knowledge. This decomposition, established in the original Lewis et al. paper, remains the core conceptual framework across industry and academic implementations.
  3. Chunking strategy is a critical determinant of RAG system quality, with semantic and adaptive approaches substantially outperforming naive fixed-size baselines. In clinical settings, semantic and adaptive chunking achieves 87% accuracy versus only 13% for basic fixed-size approaches, demonstrating the importance of retrieval unit design.
  4. Retrieval operations account for 41% of total end-to-end latency in RAG systems, creating a significant bottleneck that drives alternative architectures like Cache-Augmented Generation. CAG eliminates runtime retrieval by precomputing and caching the KV representations of the entire knowledge base, trading memory for dramatically lower latency.
  5. Traditional RAG alone provides no mechanism to verify retrieved context usefulness or retry on misses, motivating the shift toward agentic RAG designs with dynamic routing and reasoning. Achieving stable, accurate results on complex queries requires extensive fine-tuning under conventional RAG, increasing total cost of ownership for enterprise systems.
