Context Rot

Context rot is the gradual decline in an LLM's accuracy, instruction following, and citation faithfulness as the input context gets longer. Even with context windows reaching 1M tokens, the practically usable accuracy collapses well before then — the difference between 32k, 128k, and 1M is far smaller than the marketing implies.

Why It Matters

Benchmarks advertise million-token windows, but empirical research from 2025 onward paints a different picture — evaluations from Chroma, Anthropic, and Databricks consistently show the same model dropping from 95% accuracy at 8k to roughly 60% at 64k on identical tasks. In retrieval-augmented generation (RAG), dumping 30 chunks at once typically uses only the first and last few while ignoring the middle (lost-in-the-middle), and the model may even claim to have "consulted" content it never actually used. Context rot is the largest hidden trap in GEO and RAG system design, and it directly contradicts the intuition that "bigger context = better answers."

The Symptoms

Middle information ignored: Critical facts placed in the middle of the context don't make it into the answer, while content at the start and end survives.

Instruction drift: System-prompt directives stop being followed after a long user message — tone, format, and prohibitions all erode.

Citation hallucination: The model says "according to the fifth paragraph above..." but no such paragraph exists, or the content came from a different document.

Retention collapse: In multi-turn conversations, early context is effectively forgotten. After 4–5 turns, the model loses track of prior agreements.

Tool-call dropout: Tools defined in long contexts get used less often, or get called with the wrong arguments.

Why It Happens

Attention dilution: Every token has to attend to every other token, so the per-token signal weakens as the sequence lengthens.

Positional encoding limits: Beyond the trained length, position information loses meaning. RoPE and ALiBi help, but don't fully solve it.

Training data distribution: Most documents seen during training are short. A 1M-token window doesn't mean the model was trained on 1M-token documents.

Needle-in-haystack limits: Simple lookup tasks pass even at long context, but reasoning, synthesis, and multi-fact integration degrade much faster.
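
The attention-dilution point can be made concrete with a toy softmax calculation (illustrative only — real transformer attention is learned, and `relevant_share` is a hypothetical name): give one relevant token a fixed score advantage over uniform noise, and its share of attention still shrinks as the sequence grows.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relevant_share(seq_len, advantage=3.0):
    """Attention share of one token with a fixed score advantage
    over (seq_len - 1) zero-score noise tokens."""
    scores = [advantage] + [0.0] * (seq_len - 1)
    return softmax(scores)[0]

for n in (1_000, 10_000, 100_000):
    print(n, relevant_share(n))  # share keeps shrinking as n grows
```

The advantage stays constant, but the denominator grows with every extra token — the same fact gets a weaker effective signal purely because the context is longer.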

Implications for GEO

Answer engines retrieve, chunk, and synthesize, stacking the retrieved chunks into the LLM's context to generate the answer. Context rot means:

Top-ranked chunks dominate: If your chunk doesn't make it into the top 1–3 after reranking, it effectively isn't cited even though it's "in the context."

Short, self-contained chunks win: Longer chunks dilute attention. 100–300 words is the sweet spot.

Direct-answer openings matter: A first paragraph that answers the question survives regardless of where it sits in the context.

Citation faithfulness needs verification: Answers can hallucinate citations that look grounded; post-processing checks are necessary.
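
The chunk-length guidance above is easy to check mechanically. A minimal audit sketch (function name and thresholds follow the 100–300 word range stated above; everything else is illustrative):

```python
def audit_chunks(chunks, lo=100, hi=300):
    """Flag chunks outside the 100-300 word window."""
    report = []
    for i, text in enumerate(chunks):
        n = len(text.split())
        if n < lo:
            report.append((i, n, "too short"))
        elif n > hi:
            report.append((i, n, "too long"))
        else:
            report.append((i, n, "ok"))
    return report
```

Running this over a content inventory surfaces the chunks most likely to dilute attention before they ever reach a retriever.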

Mitigation Strategies

Context compression: Don't drop raw documents into the context — use query-aware summarization to extract just the relevant parts.
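
A crude sketch of query-aware compression, using term overlap as a stand-in for an LLM summarizer (all names are illustrative):

```python
import re

def compress(doc, query, max_sentences=3):
    """Keep only the sentences sharing the most terms with the query,
    instead of passing the raw document into the context."""
    terms = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
    def overlap(s):
        return len(terms & set(re.findall(r"\w+", s.lower())))
    best = sorted(sentences, key=overlap, reverse=True)[:max_sentences]
    # Preserve original order so the extract still reads coherently.
    return " ".join(s for s in sentences if s in best)
```

In production the scoring function would be an LLM or cross-encoder, but the shape is the same: the context receives an extract, not the document.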

Aggressive reranking: Retrieve 30–50 candidates, rerank to the top 5–10, then put those in the context.
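
The two-stage shape can be sketched as follows (term overlap stands in for the cheap retriever, and `rerank_fn` is a placeholder for a real cross-encoder):

```python
def retrieve_then_rerank(query, corpus, rerank_fn, retrieve_k=40, keep_k=8):
    """Stage 1: cheap lexical retrieval over the whole corpus.
    Stage 2: a (stubbed) reranker over the candidates only."""
    q_terms = set(query.lower().split())
    cheap = lambda doc: len(q_terms & set(doc.lower().split()))
    candidates = sorted(corpus, key=cheap, reverse=True)[:retrieve_k]
    return sorted(candidates, key=lambda d: rerank_fn(query, d), reverse=True)[:keep_k]
```

The expensive model only ever scores `retrieve_k` documents, and the LLM only ever sees `keep_k`.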

Position critical info deliberately: Place the most important chunks at the beginning or end (avoid the middle).
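
A small placement helper (name illustrative): alternate ranked chunks between the front and the back of the context so the weakest material lands in the middle, where attention is poorest.

```python
def place_for_attention(ranked_chunks):
    """Given chunks sorted best-first, interleave them so the strongest
    sit at the start and end of the context, weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```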

Hierarchical synthesis: Map-reduce style — synthesize sub-groups of chunks, then synthesize the summaries.
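
The map-reduce recursion is short enough to sketch directly (`summarize` is a placeholder for an LLM call; the function name is illustrative):

```python
def map_reduce_synthesize(chunks, summarize, group_size=5):
    """Summarize fixed-size groups of chunks, then recursively
    summarize the summaries until one synthesis remains."""
    if len(chunks) <= group_size:
        return summarize(chunks)
    groups = [chunks[i:i + group_size] for i in range(0, len(chunks), group_size)]
    return map_reduce_synthesize([summarize(g) for g in groups], summarize, group_size)
```

No single LLM call ever sees more than `group_size` pieces of text, which keeps each step well inside the reliable part of the context window.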

Set a context budget: Cap context at, say, 8k tokens deliberately and optimize within that.
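
Enforcing the budget is a greedy loop (a sketch; `len(s) // 4` is a rough chars-per-token heuristic, not a real tokenizer):

```python
def pack_to_budget(ranked_chunks, budget_tokens=8000,
                   est_tokens=lambda s: len(s) // 4):
    """Greedily add best-first chunks until the deliberate token cap is hit."""
    packed, used = [], 0
    for chunk in ranked_chunks:
        cost = est_tokens(chunk)
        if used + cost > budget_tokens:
            break
        packed.append(chunk)
        used += cost
    return packed
```

Swap in a real tokenizer for `est_tokens` in practice; the point is that the cap is a design decision, not whatever the retriever happened to return.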

Automated RAG evaluation: Verify factual alignment between answers and source chunks via LLM-as-judge or embedding similarity.
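
As a floor, grounding can be approximated with term overlap between the answer and its cited chunks (a crude stand-in for LLM-as-judge or embedding similarity; the function name is illustrative):

```python
def grounding_score(answer, source_chunks):
    """Fraction of answer terms that appear somewhere in the cited sources.
    Low scores flag answers that may be hallucinating their citations."""
    answer_terms = set(answer.lower().split())
    source_terms = set(" ".join(source_chunks).lower().split())
    return len(answer_terms & source_terms) / max(len(answer_terms), 1)
```

A threshold on this score won't catch subtle misattribution, but it cheaply flags the worst offenders for human or LLM-judge review.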

Common Misconceptions

"Bigger context is always better": Advertised window ≠ usable window. The reliable practical limit is roughly 10–30% of the stated capacity.

"Passing needle-in-a-haystack means long context works": Single-fact lookup is easy. Multi-fact reasoning collapses much earlier.

"Fine-tuning fixes it": Fine-tuning helps somewhat but the structural limits remain. System design is a more effective workaround.

"New models have solved it": As of 2026, even frontier models still measurably degrade past 32k–64k tokens.

Common Mistakes

Dumping all retrieval results into context: Pasting top-30 chunks raw guarantees lost-in-the-middle.

Putting the system prompt at the end: System instructions placed after a long user message get ignored. Put them at the start.

Trusting context-window marketing: A 1M-token ad does not mean 1M usable tokens.

Skipping RAG validation: If "looks grounded" is the bar, hallucinations accumulate.

Uniform chunk sizes: Cutting all documents to identical length breaks meaning. Use semantic chunking.
