Transformer
The neural-network architecture (Vaswani et al., 2017) that powers virtually every modern LLM, based on self-attention instead of recurrence.
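The Transformer, introduced in the 2017 paper "Attention Is All You Need", replaced the step-by-step recurrence of earlier sequence models with a self-attention mechanism that lets every token in a sequence attend to every other token in parallel. Because all positions are processed at once rather than one after another, training maps efficiently onto parallel hardware such as GPUs, which is what made training at the scale of modern LLMs possible.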
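As a rough illustration of the self-attention step, here is a minimal sketch of single-head scaled dot-product attention in plain Python. The function names (`softmax`, `self_attention`) and the toy input are assumptions made for this example. Real models derive queries, keys and values from learned projections and compute everything as batched matrix operations, not Python loops.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Each argument is a list of per-token vectors (lists of floats).
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input (hypothetical): 3 tokens, 2 dimensions, used directly as Q, K and V.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each iteration of the loop over `queries` is independent of the others, which is why the whole computation can run in parallel. Decoder-only models additionally mask the scores so that a token cannot attend to later positions; that mask is omitted here.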
A Transformer is a stack of blocks, each pairing a self-attention sub-layer with a feed-forward sub-layer, with residual connections and layer normalisation around them. The main variants are decoder-only (GPT, Claude, Llama), encoder-only (BERT) and encoder-decoder (T5). Today's frontier LLMs are almost all decoder-only Transformers, often combined with mixture-of-experts (MoE) layers, which activate only a subset of the parameters for each token to reduce compute.
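The wiring of one block can be sketched as follows, again in plain Python. This is an illustrative sketch under several assumptions: it uses the pre-norm arrangement (normalise, apply the sub-layer, add the residual), whereas the original paper normalised after the residual; attention has a single head and no learned projections; layer normalisation has no learned scale or shift; and the weights `w_in` and `w_out` are made-up toy values.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def attention(xs):
    """Single-head self-attention with Q = K = V = xs (no learned projections)."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, xs)) for j in range(d)])
    return out

def feed_forward(x, w_in, w_out):
    """Position-wise MLP: linear -> ReLU -> linear, applied to one token."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, unit))) for unit in w_in]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_out]

def transformer_block(xs, w_in, w_out):
    # Sub-layer 1: x + Attention(LayerNorm(x))
    attended = attention([layer_norm(x) for x in xs])
    xs = [[a + b for a, b in zip(x, y)] for x, y in zip(xs, attended)]
    # Sub-layer 2: x + FeedForward(LayerNorm(x))
    ff = [feed_forward(layer_norm(x), w_in, w_out) for x in xs]
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ff)]

# Toy weights (hypothetical): model width 2, hidden width 4.
w_in = [[0.5, -0.5], [0.1, 0.2], [-0.3, 0.4], [0.2, 0.2]]
w_out = [[0.1, 0.2, 0.3, 0.4], [-0.1, 0.0, 0.1, 0.2]]

# Stack two blocks; the output keeps the input shape (3 tokens x 2 dims).
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for _ in range(2):
    h = transformer_block(h, w_in, w_out)
```

Because every block maps a sequence of vectors to a sequence of the same shape, blocks can be stacked to any depth. For brevity the sketch reuses one set of weights across both blocks; in a real model each block has its own parameters.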
Related terms
- Large Language Model (LLM)
A neural network trained on massive text corpora to predict the next token, used for chat, coding, reasoning and as the brain inside AI agents.
- Tokens
The atomic units that LLMs read and write — sub-word pieces produced by a tokenizer. Pricing and context limits are measured in tokens, not words.