Transformer

The Transformer is the deep learning architecture introduced in Google's 2017 paper "Attention Is All You Need." Through self-attention, every element of an input sequence references every other to build context. Every major LLM in 2026 — GPT, Claude, Gemini, Llama — runs on a variant of the Transformer.

Why It Matters

The RNNs and LSTMs that preceded the Transformer lost context over long sequences and were hard to parallelize, which limited large-scale training. The Transformer fixed both problems at once and opened the era of "AI scaling." Today's ChatGPT and Claude search experiences exist because of it. Understanding the architecture is the foundation for grasping why LLMs cite some content well and miss other content.

Core Mechanics

Self-attention: Every word in a sentence computes a relevance score with every other word. In "The company picked inblog, and they tripled their blog traffic," self-attention figures out that "they" refers to "the company," not "inblog."
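The relevance scoring above can be sketched in a few lines of numpy. This is a minimal illustration, not a production implementation: the projection matrices are random stand-ins for learned weights, and masking, dropout, and batching are omitted.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n): every token scored against every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax: each row sums to 1
    return weights @ V, weights                       # output mixes values by relevance

rng = np.random.default_rng(0)
n, d = 5, 8                                           # toy sizes for illustration
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
```

Each row of `attn` is a probability distribution saying how much that token "looks at" every other token; in a trained model, the row for "they" would put high weight on "the company."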

Multi-head attention: Multiple attention heads run in parallel, each learning a different type of relationship (syntactic, semantic, positional).
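To show what "multiple heads in parallel" means mechanically, here is a simplified sketch that splits the model dimension into subspaces and attends in each. Real implementations use learned per-head query/key/value projections and an output projection; those are deliberately left out here.

```python
import numpy as np

def multi_head_attention(X, num_heads):
    """Split the model dimension into num_heads subspaces, attend in each, then concat."""
    n, d = X.shape
    d_head = d // num_heads
    heads = []
    for h in range(num_heads):
        sub = X[:, h * d_head:(h + 1) * d_head]       # each head sees its own slice
        scores = sub @ sub.T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)            # per-head softmax
        heads.append(w @ sub)
    return np.concatenate(heads, axis=-1)             # back to (n, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 12))
Y = multi_head_attention(X, num_heads=3)
```

Because each head computes its own attention pattern, one head can specialize in syntax while another tracks long-range topical links.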

Positional encoding: Because attention itself has no ordering, position vectors are injected so the model knows word order.
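The original paper's sinusoidal scheme is easy to reproduce; here is a short sketch of those position vectors (assuming an even model dimension).

```python
import numpy as np

def sinusoidal_positions(n, d):
    """Sinusoidal position vectors from 'Attention Is All You Need' (d must be even)."""
    pos = np.arange(n)[:, None]                   # (n, 1) token positions
    i = np.arange(d // 2)[None, :]                # (1, d/2) frequency indices
    angles = pos / (10000 ** (2 * i / d))         # each dimension oscillates at a different rate
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)                  # even dims: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dims: cosine
    return pe

pe = sinusoidal_positions(10, 16)
```

These vectors are simply added to the token embeddings, giving attention a way to distinguish "dog bites man" from "man bites dog."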

Feed-forward layers: Each position's representation is enriched through non-linear transformations.

Layer stacking: Dozens to hundreds of Transformer blocks are stacked to learn deep contextual representations.
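The feed-forward and stacking steps can be sketched together. This toy version stubs out the attention sublayer and keeps only LayerNorm, a position-wise feed-forward network, and residual connections, just to show how blocks compose; the weights are random placeholders.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean, unit variance
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def ffn(x, W1, b1, W2, b2):
    # Position-wise feed-forward: expand, apply ReLU, project back down
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def stack(x, layers):
    # Each simplified "block" is LayerNorm -> FFN with a residual connection;
    # a real block also contains a self-attention sublayer before the FFN
    for W1, b1, W2, b2 in layers:
        x = x + ffn(layer_norm(x), W1, b1, W2, b2)
    return x

rng = np.random.default_rng(1)
d, hidden, depth, n = 8, 32, 4, 6
layers = [(rng.normal(size=(d, hidden)) * 0.1, np.zeros(hidden),
           rng.normal(size=(hidden, d)) * 0.1, np.zeros(d)) for _ in range(depth)]
out = stack(rng.normal(size=(n, d)), layers)
```

The residual connections are what make very deep stacks trainable: each block refines its input rather than replacing it.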

Main Variants

Encoder-only (BERT, RoBERTa): Bidirectional understanding of the input. Strong for classification and embeddings. Google Search's BERT ranking is in this family.

Decoder-only (GPT, Claude, Llama): Left-to-right next-token prediction, optimized for generation. Most 2026 LLMs are decoder-only.

Encoder-decoder (T5, BART): Good for tasks that first understand the input, then generate a new output — translation, summarization.

Sparse attention and Mixture-of-Experts: Reduce the compute cost of long contexts and large models by computing only a subset. Used in frontier models like Claude Opus 4.6 and Gemini 3.

Limitations

Quadratic complexity: Standard self-attention is O(n²) in sequence length. At 1M-token contexts the memory and compute cost explodes, which is why optimizations like FlashAttention and linear attention exist.
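A quick back-of-envelope calculation makes the quadratic blow-up concrete. The assumption here is a naive implementation that materializes the full n × n score matrix at 2 bytes per entry (fp16) for a single head; techniques like FlashAttention exist precisely to avoid ever storing this matrix.

```python
def attn_matrix_gib(n, bytes_per_score=2):
    """GiB needed to store one head's naive n x n attention score matrix."""
    return n * n * bytes_per_score / 2**30

for n in (1_000, 100_000, 1_000_000):
    print(f"n={n:>9,}: ~{attn_matrix_gib(n):,.2f} GiB of scores")
```

Going from 100K to 1M tokens multiplies the score matrix by 100×, which is why long-context models cannot rely on the vanilla formulation.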

Lost in the middle: Very long contexts weaken the model's attention on middle content. That's why you front- and back-load key information in your writing.

Hallucinations: Because the Transformer generates from learned patterns, it can confidently answer outside the training distribution.

Black-box nature: Attention scores are partially interpretable, but real decision processes remain hard to explain.

GEO Implications

Transformer-based LLMs process content differently from how classic SEO thinks about it.

Contextual consistency: Because attention learns word-to-word relationships, paragraphs with clearly linked pronouns, referents, and topic words get understood better.

Explicit topic words: Attention rewards consistent self-reference of key terms. Natural repetition of the main keyword throughout a section sharpens the topic signal.

Start and end matter: Given the "lost in the middle" effect, put key information at the start and end of a post.

Structural markers: Attention uses ### headings, lists, and tables as semantic boundaries. Well-structured content parses better.
