Context Window

The maximum number of tokens an LLM can process in a single request, including the prompt, retrieved documents and the model's own reply.

The context window is the hard limit on how much text an LLM can attend to at once. It is measured in tokens and shared between input and output: a model with a 200k-token context window and a 195k-token prompt has room for only ~5k tokens of reply.
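
The shared budget can be checked with simple arithmetic before a request is sent. The sketch below is illustrative only: the function name `remaining_output_budget` and its parameters are hypothetical, and it assumes the prompt's token count is already known (for example, from the model's own tokenizer).

```python
def remaining_output_budget(context_window: int, prompt_tokens: int) -> int:
    """Return how many tokens are left for the reply.

    Hypothetical helper: assumes input and output share one window,
    as described above, and that prompt_tokens was counted with the
    same tokenizer the model uses.
    """
    if prompt_tokens < 0 or context_window < 0:
        raise ValueError("token counts must be non-negative")
    if prompt_tokens > context_window:
        raise ValueError("prompt does not fit in the context window")
    return context_window - prompt_tokens


# The example from the text: a 200k window with a 195k-token prompt.
print(remaining_output_budget(200_000, 195_000))  # 5000
```

A check like this is cheap to run on every request and makes it obvious when a prompt has to be trimmed to leave room for the answer.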

Context windows have grown fast: from 4k tokens (GPT-3.5 in 2022) to 200k (Claude), 400k (GPT-5) and 1M+ (Gemini 2.5 Pro, some open models). A larger context lets you feed entire codebases or long documents directly, often replacing RAG for medium-sized corpora.

Longer is not always better. Cost and latency scale roughly linearly with input tokens, and models still suffer from the 'lost in the middle' effect: facts buried deep in a long prompt are recalled less reliably than facts near the start or end.
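
To make the roughly linear cost scaling concrete, here is a minimal sketch that estimates input cost for a few prompt sizes. The price constant is a made-up placeholder, not any provider's real rate, and `estimate_input_cost` is a hypothetical helper that assumes flat per-token pricing.

```python
def estimate_input_cost(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Estimate input cost under flat per-token pricing (an assumption)."""
    return input_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder price for illustration only; substitute your provider's rate.
HYPOTHETICAL_USD_PER_MILLION = 3.0

for tokens in (4_000, 200_000, 1_000_000):
    cost = estimate_input_cost(tokens, HYPOTHETICAL_USD_PER_MILLION)
    print(f"{tokens:>9,} input tokens -> ${cost:.3f}")
```

Under this assumption, a prompt that fills a 1M-token window costs 250 times as much as one that fills a 4k-token window, which is why sending only the relevant text can still pay off even when everything would fit.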
