Retrieval-Augmented Generation (RAG)
Also known as: RAG, Grounded generation
A technique that retrieves relevant documents at query time and feeds them into the LLM's prompt, so the model can answer from your data instead of memorising it.
Retrieval-augmented generation (RAG) is the standard pattern for letting an LLM answer questions about private or up-to-date content without retraining. At index time, documents are split into chunks, each chunk is embedded, and the resulting vectors are stored in a vector database. At query time, the user's question is embedded with the same model, the nearest chunks are retrieved, and those chunks are inserted into the prompt as context.
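The index-time and query-time steps can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, a plain list stands in for the vector database, and every function name here (`chunk`, `build_index`, `retrieve`, `build_prompt`) is hypothetical rather than taken from any library.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def chunk(text, size=12):
    """Split a document into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(documents, size=12):
    """Index time: chunk every document and store (chunk, vector) pairs."""
    index = []
    for doc in documents:
        for piece in chunk(doc, size):
            index.append((piece, embed(piece)))
    return index


def retrieve(index, question, k=2):
    """Query time: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Insert the retrieved chunks into the prompt as context."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up example documents, for illustration only.
documents = [
    "The refund window is 30 days from the delivery date.",
    "Support is available by email on weekdays.",
]
index = build_index(documents)
question = "How many days is the refund window?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real system the `embed` function would call an embedding model and the list would be replaced by a vector store with approximate nearest-neighbour search; the overall shape of the pipeline stays the same.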
RAG is usually cheaper than fine-tuning, easier to update (just re-index the documents), and more transparent, because you can show the sources the answer was based on. The main failure modes are bad chunking, weak embeddings, and stuffing too much irrelevant context into the prompt.
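One possible guard against stuffing irrelevant context into the prompt is to drop low-scoring chunks and cap the total size. The sketch below is an assumption, not a prescribed method: `select_context`, the `min_score` threshold, and the `max_words` budget are hypothetical names and values, and the scores are made up for illustration.

```python
def select_context(scored_chunks, min_score=0.3, max_words=50):
    """Keep the best-scoring chunks that clear a threshold and fit a word budget.

    `scored_chunks` is a list of (similarity_score, chunk_text) pairs.
    """
    selected, used = [], 0
    # Visit chunks from highest to lowest score.
    for score, text in sorted(scored_chunks, reverse=True):
        if score < min_score:
            break  # everything after this point scores even lower
        words = len(text.split())
        if used + words > max_words:
            continue  # too large for the remaining budget; try the next one
        selected.append(text)
        used += words
    return selected


# Made-up scores, for illustration only.
scored = [
    (0.82, "Refunds are accepted within 30 days."),
    (0.12, "Our office has a new coffee machine."),
    (0.55, "Refunds go back to the original payment method."),
]
context = select_context(scored)
print(context)
```

A word count is only a rough proxy here; a production system would more likely budget in tokens against the model's context window.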
Modern RAG systems add re-ranking, hybrid keyword + vector search, query rewriting, and tool-style retrieval inside an agent loop.
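As an illustration of hybrid search, the sketch below merges a keyword ranking and a vector ranking with reciprocal rank fusion (RRF). The entry does not name a fusion method, so RRF is a swapped-in choice here; the function name, the document ids, and both input rankings are hypothetical, and `k=60` is simply the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids: each list adds 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword search and a vector search.
keyword_ranking = ["doc_b", "doc_a", "doc_c"]
vector_ranking = ["doc_a", "doc_c", "doc_b"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Because RRF uses only rank positions, it avoids having to put keyword scores and vector similarities on a common scale. A re-ranker could then be applied to the top of the fused list.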
Related terms
- Large Language Model (LLM)
A neural network trained on massive text corpora to predict the next token, used for chat, coding, reasoning and as the brain inside AI agents.
- Context Window
The maximum number of tokens an LLM can read in a single request, including the prompt, retrieved documents and the model's own reply.
- Fine-Tuning
Continuing to train a pre-trained model on your own data to specialise its behaviour, tone or domain knowledge.