Retrieval-Augmented Generation (RAG)

Also known as: RAG, Grounded generation

A technique that retrieves relevant documents at query time and inserts them into the LLM's prompt, so the model can answer from your data rather than from what it memorised during training.

Retrieval-augmented generation (RAG) is the standard pattern for letting an LLM answer questions about private or up-to-date content without retraining. At index time, documents are split into chunks, each chunk is embedded, and the resulting vectors are stored in a vector database. At query time, the user's question is embedded with the same embedding model, the nearest chunks are retrieved, and those chunks are inserted into the prompt as context.
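The index-time and query-time steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the "embedding" is a toy bag-of-words count standing in for a real embedding model, the "vector database" is an in-memory list, and the names `chunk`, `embed`, `build_index`, `retrieve`, and `build_prompt` are hypothetical helpers invented for this example.

```python
import math
import re
from collections import Counter


def chunk(text, size=40, overlap=10):
    """Split text into windows of `size` words that overlap by `overlap` words.

    Assumes size > overlap, so the window always moves forward.
    """
    words = text.split()
    step = size - overlap
    pieces = []
    for start in range(0, len(words), step):
        pieces.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return pieces


def embed(text):
    """Toy embedding (assumption): lowercase word counts, not a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents, size=40, overlap=10):
    """Index time: chunk every document and store (chunk text, vector) pairs."""
    return [(piece, embed(piece))
            for doc in documents
            for piece in chunk(doc, size, overlap)]


def retrieve(index, question, k=2):
    """Query time: embed the question and return the k most similar chunks."""
    query_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Insert the retrieved chunks into the prompt as numbered context."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


docs = [
    "Refunds are issued within 14 days of a return request.",
    "The office is closed on public holidays.",
    "Support is available by email from 9am to 5pm.",
]
index = build_index(docs)
question = "How many days do refunds take?"
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

The resulting `prompt` string is what would be sent to the LLM. In a real system `embed` would call an embedding model and the index would live in a vector store, but the flow of chunk, embed, retrieve, and insert into the prompt stays the same.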

RAG is usually cheaper than fine-tuning, easier to update (just re-index the documents), and more transparent, because you can show the sources the answer was based on. The main failure modes are bad chunking, weak embeddings, and stuffing too much irrelevant context into the prompt.

Modern RAG systems often add re-ranking of the retrieved chunks, hybrid keyword + vector search, query rewriting, and retrieval exposed as a tool that the model calls inside an agent loop.
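As an illustration of hybrid keyword + vector search, the sketch below merges two ranked result lists with reciprocal rank fusion (RRF). RRF is one common way to combine rankings, but this entry does not say which fusion method a given system uses, so treat it as an assumption. The function name, the chunk IDs, and the constant `k=60` (a conventional default) are illustrative choices.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list.

    Each list contributes 1 / (k + rank) to a chunk's score, so chunks that
    rank well in more than one list rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from a keyword search and a vector search.
keyword_ranking = ["c", "a", "b"]
vector_ranking = ["a", "d", "c"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Because RRF uses only rank positions, it avoids having to put keyword scores and vector similarities on a common scale. A re-ranking step, if used, would then reorder the top of the fused list.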

Related terms