Retrieval-Augmented Generation
A technique that enhances large language model outputs by retrieving relevant documents from an external knowledge base at inference time, grounding responses in up-to-date and domain-specific information.
Retrieval-Augmented Generation (RAG) is an AI framework that combines the parametric knowledge stored in a large language model (LLM) with non-parametric information retrieved from an external corpus at query time. Rather than relying solely on patterns learned during pre-training, a RAG system dynamically fetches relevant documents or passages and includes them in the prompt context provided to the model, allowing it to generate responses grounded in specific, current, or proprietary data.[^1]
Background
Large language models acquire knowledge by training on vast corpora, but this knowledge is frozen at the training cutoff date. Models may also hallucinate, producing plausible-sounding but factually incorrect statements, particularly when asked about niche topics, recent events, or proprietary information not present in the training data. RAG mitigates these limitations without the expense and complexity of retraining or fine-tuning the model.
The foundational paper by Lewis et al. at Meta AI (then Facebook AI Research), published in 2020, demonstrated that augmenting a pre-trained language model with a dense retrieval component substantially improved performance on open-domain question answering tasks.[^2]
How RAG Works
A RAG pipeline operates in four broad stages.
Ingestion involves processing a document corpus into retrievable chunks. Documents are split into segments (often a few hundred tokens each), and each segment is converted into a dense vector representation — an embedding — using a sentence encoder or another embedding model. These vectors are stored in a vector database alongside the original text.
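The ingestion stage can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: chunks are measured in whitespace-separated words rather than model tokens, `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, and a plain list stands in for the vector database.

```python
import hashlib
import math


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping windows of whitespace-separated words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def ingest(documents: dict[str, str]) -> list[dict]:
    """Build an in-memory index: one record per chunk, holding its text and vector."""
    index = []
    for doc_id, text in documents.items():
        for i, chunk in enumerate(chunk_words(text, chunk_size=50, overlap=10)):
            index.append({"doc_id": doc_id, "chunk": i, "text": chunk, "vector": toy_embed(chunk)})
    return index
```

The overlap between consecutive chunks is a common choice for reducing the chance that a sentence relevant to a query is split across a chunk boundary; the sizes used here are arbitrary.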
Retrieval occurs when a user submits a query. The query is encoded into the same vector space as the stored documents. The vector database performs an approximate nearest-neighbour search, returning the top-k document chunks whose embeddings are most similar to the query embedding. With a suitable index, this search typically completes in milliseconds even across millions of stored vectors.
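The core of the retrieval step is a similarity ranking. The sketch below uses exact, brute-force cosine similarity over every stored vector, which is not what a vector database does at scale (those use approximate indexes), but it returns the ranking that an approximate search tries to reproduce. The `(text, vector)` index layout and the function names are assumptions made for illustration.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 3):
    """Return the k (score, text) pairs with the highest cosine similarity to the query."""
    scored = ((cosine(query_vec, vec), text) for text, vec in index)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


index = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
results = top_k([1.0, 0.0], index, k=2)
print([text for _, text in results])  # → ['a', 'c']
```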
Augmentation combines the retrieved document chunks with the original user query into a structured prompt. This prompt is passed to the LLM, providing it with relevant context. The amount of retrieved content is limited by the model's context window size, though this constraint has relaxed significantly as some models now support context windows of 128,000 tokens or more.
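A minimal sketch of prompt assembly under a context budget follows. The prompt wording, the `build_prompt` name, and the use of word counts as a crude proxy for tokens are all assumptions; a real system would count tokens with the model's own tokenizer.

```python
def build_prompt(question: str, chunks: list[str], max_context_tokens: int = 1000) -> str:
    """Pack best-first chunks into a numbered context block until the budget is spent."""
    selected = []
    used = 0
    for chunk in chunks:  # assumed to be sorted from most to least relevant
        cost = len(chunk.split())  # word count as a rough stand-in for token count
        if used + cost > max_context_tokens:
            break
        selected.append(chunk)
        used += cost
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(selected, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the passages in the prompt gives the model a stable way to refer back to its sources.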
Generation is the final step in which the LLM reads the augmented prompt and produces a response, drawing on both the retrieved context and its trained knowledge. Citations can be produced by identifying which retrieved passages the response relies on.
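One simple way to recover citations, assuming the prompt numbered its passages and instructed the model to cite them as `[1]`, `[2]`, and so on, is to parse those markers out of the response. This is only a sketch of that convention; it reports what the model claims to have used, not a measurement of which passages actually influenced the output.

```python
import re


def extract_citations(answer: str, passages: list[str]) -> list[tuple[int, str]]:
    """Map [n] markers in the answer to passages, in order of first appearance."""
    cited = []
    seen = set()
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        # Ignore markers that point outside the supplied passages, and repeats.
        if 1 <= n <= len(passages) and n not in seen:
            seen.add(n)
            cited.append((n, passages[n - 1]))
    return cited


print(extract_citations("X is Y [2]. Also Z [1][2] and [7].", ["p1", "p2"]))
# → [(2, 'p2'), (1, 'p1')]
```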
Retrieval Strategies
Several retrieval approaches exist, each with different trade-offs:
| Strategy | Description | Best for |
|----------|-------------|----------|
| Dense retrieval | Nearest-neighbour search in embedding space | Semantic similarity, paraphrase matching |
| Sparse retrieval | BM25 keyword-based ranking | Exact term matching, named entities |
| Hybrid retrieval | Combining dense and sparse scores | Balanced precision and recall |
| Reranking | Cross-encoder scoring of top-k candidates | High-stakes accuracy, small latency budget |
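To make the sparse retrieval row concrete, here is a compact sketch of BM25 scoring. It assumes whitespace tokenisation, the commonly used defaults `k1 = 1.5` and `b = 0.75`, and the non-negative IDF variant `log(1 + (N - n + 0.5) / (n + 0.5))`; search engines differ in these details.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with BM25 (one score per document)."""
    tokenised = [d.lower().split() for d in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(t) for t in tokenised) / n_docs if n_docs else 0.0
    # Document frequency: the number of documents containing each term at least once.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalisation: long documents need more matches for the same score.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because scoring depends on exact token matches, an identifier such as an error code scores only on documents that contain it verbatim, which is the behaviour the table attributes to sparse retrieval.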
Hybrid retrieval, which combines vector similarity with keyword search, has become a common default in production RAG systems because it handles both semantically phrased queries and specific technical terms.
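One widely used way to merge dense and sparse results is reciprocal rank fusion (RRF), sketched below. Note that RRF combines ranks rather than raw scores, which sidesteps the problem that cosine similarities and BM25 scores live on different scales. The constant `k = 60` is a conventional default, and the document IDs are made up for the example.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings; each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["a", "b", "c"]   # hypothetical result of an embedding search
sparse = ["b", "d", "a"]  # hypothetical result of a keyword search
print(reciprocal_rank_fusion([dense, sparse]))  # → ['b', 'a', 'd', 'c']
```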
Advanced RAG Patterns
Beyond the basic pipeline, several design patterns have emerged for more demanding applications. Corrective RAG (CRAG) evaluates the quality of retrieved documents and falls back to web search if the local retrieval is insufficient. Self-RAG introduces a reflection step where the model decides whether to retrieve, judges the relevance of retrieved documents, and critiques its own generated output. GraphRAG, developed by Microsoft Research, constructs a knowledge graph over the document corpus rather than relying solely on raw vector embeddings, enabling multi-hop reasoning over structured relationships.[^3]
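The corrective fallback can be illustrated as control flow. In this sketch a plain similarity threshold stands in for whatever quality evaluator a real system would use, and `local_search` and `web_search` are hypothetical callables supplied by the caller; none of this reflects a specific library's API.

```python
from typing import Callable


def corrective_retrieve(
    query: str,
    local_search: Callable[[str], list[tuple[str, float]]],
    web_search: Callable[[str], list[str]],
    min_score: float = 0.5,
    min_hits: int = 1,
) -> tuple[list[str], str]:
    """Return (passages, source), falling back to web search when local hits look weak."""
    hits = local_search(query)
    # Keep only local passages whose score clears the (assumed) quality threshold.
    good = [text for text, score in hits if score >= min_score]
    if len(good) >= min_hits:
        return good, "local"
    return web_search(query), "web"
```

Returning the source alongside the passages lets downstream code treat web results with more caution than passages from the curated corpus.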
Comparison with Fine-Tuning
RAG and fine-tuning are complementary rather than competing approaches to specialising an LLM for a domain.
Fine-tuning updates the model's weights to encode domain knowledge, improving the model's general behaviour in that domain but requiring periodic retraining as knowledge evolves. It is effective for learning communication styles, domain terminology, and consistent output formats.
RAG keeps the base model unchanged and provides knowledge at inference time, making it easier to update the knowledge base without touching the model. It is more suitable when the underlying information changes frequently, when strict source attribution is required, or when the knowledge base is too large to encode into model weights.
In practice, many production systems employ both: a fine-tuned model paired with a RAG retrieval layer.
Infrastructure Requirements
A production RAG system requires several components: an embedding model to vectorise documents and queries, a vector database (such as Pinecone, Weaviate, Qdrant, or pgvector) for storage and retrieval, a chunking strategy to segment documents appropriately, and an LLM capable of synthesising retrieved context. Orchestration frameworks such as LangChain and LlamaIndex provide pre-built abstractions for wiring these components together.
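How these components fit together can be sketched with a few structural interfaces. The `Embedder`, `VectorStore`, `Generator`, and `RagPipeline` names below are hypothetical and are not the abstractions exposed by LangChain, LlamaIndex, or any vector database client; the point is only that each component can be swapped independently behind a narrow interface.

```python
from dataclasses import dataclass
from typing import Protocol


class Embedder(Protocol):
    def embed(self, text: str) -> list[float]: ...


class VectorStore(Protocol):
    def search(self, vector: list[float], k: int) -> list[str]: ...


class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...


@dataclass
class RagPipeline:
    """Wires an embedder, a vector store, and an LLM into a single answer() call."""
    embedder: Embedder
    store: VectorStore
    llm: Generator
    k: int = 3

    def answer(self, question: str) -> str:
        # Embed the query, fetch the top-k passages, then prompt the model with both.
        passages = self.store.search(self.embedder.embed(question), self.k)
        context = "\n".join(passages)
        return self.llm.generate(f"Context:\n{context}\n\nQuestion: {question}")
```

Any object with matching method signatures satisfies these protocols, so a test double and a hosted service can be used interchangeably.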
References
- IBM. (2024). What is Retrieval-Augmented Generation (RAG)? IBM Think. https://www.ibm.com/think/topics/retrieval-augmented-generation
- Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33.
- Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., & Larson, J. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Microsoft Research.
- Pinecone. (2025). Retrieval-Augmented Generation: A Technical Overview. https://www.pinecone.io/learn/retrieval-augmented-generation/