Transformer

The neural-network architecture (Vaswani et al., 2017) that powers virtually every modern LLM, based on self-attention instead of recurrence.

The Transformer, introduced in the 2017 paper "Attention Is All You Need", replaced recurrent networks with a self-attention mechanism that lets every token in a sequence attend to every other token. Because attention over the whole sequence is computed at once rather than one step at a time, training parallelises well on modern hardware, and this is what made training at the scale of modern LLMs practical.
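To make the mechanism concrete, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. The function names, the toy dimensions, and the random weights are assumptions chosen for illustration; real models use multiple heads, learned weights, and batched tensors.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # project tokens to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (seq_len, seq_len) pairwise similarities
    weights = softmax(scores, axis=-1)         # each row is a distribution over tokens
    return weights @ v, weights                # weighted mix of value vectors

# Toy input: 4 tokens with 8-dimensional embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

All pairwise scores come out of a single matrix product, with no loop over positions. That is the property that lets the computation run in parallel across the sequence.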

A Transformer is built from a stack of blocks, each pairing a self-attention layer with a feed-forward layer and wrapping both in residual connections and layer normalisation. Variants include decoder-only (GPT, Claude, Llama), encoder-only (BERT) and encoder-decoder (T5). Today's frontier LLMs are almost all decoder-only Transformers, often combined with mixture-of-experts (MoE) layers for efficiency.
