Context Window
The maximum number of tokens an LLM can process in a single request, including the prompt, retrieved documents, and the model's own reply.
The context window is the hard limit on how much text an LLM can attend to at once. It is measured in tokens and shared between input and output: a 200k-context model given a 195k-token prompt has room for only about 5k tokens of reply.
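The arithmetic above can be sketched in a few lines. This is only an illustration: the function name `max_reply_tokens` is hypothetical, and real APIs often impose a separate cap on output length in addition to the shared window.

```python
def max_reply_tokens(context_window: int, prompt_tokens: int) -> int:
    """Return how many tokens remain for the reply (hypothetical helper).

    Assumes input and output share one window, as described above.
    """
    if context_window < 0 or prompt_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # A prompt that already fills or exceeds the window leaves no room to reply.
    return max(context_window - prompt_tokens, 0)


# The example from the text: a 200k window with a 195k-token prompt.
print(max_reply_tokens(200_000, 195_000))  # 5000
```

A prompt longer than the window simply yields 0 here; in practice a provider would usually reject such a request rather than truncate it silently.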
Context windows have grown fast: from 4k (GPT-3.5 in 2022) to 200k (Claude), 400k (GPT-5) and 1M+ (Gemini 2.5 Pro, some open models). A larger context lets you feed entire codebases or long documents directly, often replacing RAG for medium-sized corpora.
Longer is not always better: cost and latency scale roughly linearly with input tokens, and models still suffer from the 'lost in the middle' effect, where facts buried in the middle of a long prompt are recalled less reliably than facts near the start or end.
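To make the linear cost scaling concrete, here is a rough sketch. The function name and the price of 3.00 USD per million input tokens are placeholders chosen for illustration, not any provider's actual rate.

```python
def estimate_input_cost(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Estimate input cost under a simple linear pricing assumption."""
    if input_tokens < 0 or usd_per_million_tokens < 0:
        raise ValueError("inputs must be non-negative")
    return input_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical price, for illustration only.
PRICE = 3.00

for tokens in (4_000, 200_000, 1_000_000):
    cost = estimate_input_cost(tokens, PRICE)
    print(f"{tokens:>9,} input tokens -> ${cost:.3f}")
```

Under this assumption, filling a 1M-token window costs 250 times as much per request as filling a 4k one, which is one reason to send only the context a task needs.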
See also on SoftPerceptron
Related terms
- Large Language Model (LLM)
A neural network trained on massive text corpora to predict the next token, used for chat, coding, reasoning and as the brain inside AI agents.
- Tokens
The atomic units that LLMs read and write — sub-word pieces produced by a tokenizer. Pricing and context limits are measured in tokens, not words.
- Retrieval-Augmented Generation (RAG)
A technique that retrieves relevant documents at query time and feeds them into the LLM's prompt, so the model can answer from your data instead of memorising it.
More to explore
Other wiki entries that touch on Context Window.
- AI Agent
An LLM-based system that can plan, use tools and take multi-step actions toward a goal — not just answer a single prompt.
- Multimodal AI
AI models that natively accept and/or produce more than one modality — text, image, audio, video — in a single model.
- Model Context Protocol (MCP)
An open standard from Anthropic for connecting LLMs to external tools and data sources through a uniform server interface.