Lost in the Middle
"Lost in the middle" is the empirical finding — documented by Liu et al. in a 2023 Stanford/Samaya AI paper — that LLMs perform best when key information is at the very beginning or very end of a long context, and noticeably worse when the same information sits in the middle. Even models with 100K+ token windows still exhibit this U-shaped attention curve.
"Lost in the middle" is the empirical finding — documented by Liu et al. in a 2023 Stanford/Samaya AI paper — that LLMs perform best when key information is at the very beginning or very end of a long context, and noticeably worse when the same information sits in the middle. Even models with 100K+ token windows still exhibit this U-shaped attention curve.
Why It Matters
"Large context window" is not the same as "reads everything equally." A model with a 200K context can technically ingest an entire book, but the practical accuracy on a question whose answer sits on page 300 of a 500-page PDF is much worse than the same question answered on page 5 or page 495. For builders, this has concrete consequences: how you order context inside a prompt changes answer quality dramatically, often more than how much context you provide. Most production RAG failures caused by "the model ignored the retrieved passage" are actually lost-in-the-middle failures in disguise.
The Original Finding
Liu et al.'s 2023 paper "Lost in the Middle: How Language Models Use Long Contexts" tested GPT-3.5, GPT-4, Claude, and several open models on multi-document question answering. For each question, they moved the single relevant document to position 1, 5, 10, 15, or 20 in a context of 20 total documents. Results:
- Accuracy was highest when the relevant document was first (at the top of the context).
- Accuracy was nearly as high when it was last (at the bottom).
- Accuracy dropped by 20–30 points when the relevant document sat in the middle positions.
The shape looks like a U: strong at both ends, weak in the middle. Subsequent work has shown this pattern holds on Claude, Gemini, and Llama models even as their context windows grew.
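The setup is easy to reproduce in miniature. The sketch below (a hypothetical helper, not the authors' code) builds one prompt per probe position, moving the single relevant document through a fixed list of distractors:

```python
# Minimal sketch of the multi-document QA setup: one relevant document,
# 19 distractors, and one prompt per probe position.
# Positions are 0-indexed here, so 0/4/9/14/19 correspond to the paper's
# document positions 1/5/10/15/20.

def build_prompts(question, relevant_doc, distractors, positions=(0, 4, 9, 14, 19)):
    """Return {position: prompt} with relevant_doc inserted at each position."""
    assert len(distractors) == 19, "19 distractors + 1 relevant = 20 documents"
    prompts = {}
    for pos in positions:
        docs = list(distractors)
        docs.insert(pos, relevant_doc)  # slide the relevant doc through the pile
        context = "\n\n".join(f"Document [{i+1}]: {d}" for i, d in enumerate(docs))
        prompts[pos] = f"{context}\n\nQuestion: {question}\nAnswer:"
    return prompts
```

Running each prompt against the same model and scoring the answers yields accuracy as a function of position, which is exactly the U-curve plot.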
Why It Happens
There are several hypotheses, all likely partially true:
Training data distribution: Training data tends to put important information at beginnings (headlines, topic sentences) and ends (conclusions, TL;DRs). The model learns those positional priors.
Attention decay: Self-attention's effective range degrades over very long sequences even with techniques like RoPE or ALiBi — distant middle tokens get less attention mass than nearby ends.
Positional encoding limits: Extended context models inherit position encodings that were tuned for shorter sequences, so middle positions are relatively under-trained.
Recency bias: Models weight recent tokens more, which amplifies the strong end but doesn't help the middle.
How to Design Around It
1. Put the most important context first or last: For RAG, place the top-ranked retrieved passage at the very start or very end of the context block.
2. Reranking after retrieval: Use a reranker to sort retrieved chunks by relevance, then put the top one at the edge.
3. Reorder by relevance, not retrieval order: Vector search typically returns results ordered by embedding distance; reorder them so the most relevant chunks land in the high-attention positions at the edges.
4. Summarize the middle: Instead of dumping raw middle context, summarize it and place the summary at the top. A compressed middle survives better than a raw one.
5. Shorten the context: The U-curve gets worse as length grows. Fewer, more-relevant chunks beat many marginal ones.
6. Repeat critical facts: Putting the same key fact at both the top and bottom exploits the U-curve instead of fighting it.
7. Task instruction at both ends: Some prompts benefit from repeating the question at the top and bottom of the context, sandwiching the evidence.
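Points 1–3 can be combined into a single reordering step. The sketch below (assuming generic `(chunk, score)` pairs from your retriever, not any specific library API) places the best chunk first, the second-best last, and pushes the weakest chunks into the low-attention middle:

```python
# "U-shaped" placement: alternate ranked chunks between the front and the
# back of the context so relevance decays toward the middle from both ends.

def lost_in_middle_reorder(chunks_with_scores):
    """chunks_with_scores: list of (chunk, score) pairs.
    Returns chunks ordered best-first, second-best-last,
    with the lowest-scoring chunks in the middle."""
    ranked = sorted(chunks_with_scores, key=lambda cs: cs[1], reverse=True)
    front, back = [], []
    for i, (chunk, _) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

For five chunks ranked A > B > C > D > E, this yields the order A, C, E, D, B: the two strongest chunks sit at the edges and the weakest sits in the middle, which is the same idea behind reorderers like LangChain's `LongContextReorder`.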
Does This Still Apply in 2026?
Newer long-context models (Gemini 1.5 / 2.0, Claude 3.5+/4.x, GPT-4 Turbo and o-series) have improved middle-of-context recall considerably. Needle-in-a-haystack tests on Gemini 2.0 show near-perfect retrieval across the whole window. But in real-world multi-fact tasks with complex reasoning, the U-shape still shows up — just less dramatically. The practical advice hasn't changed much: shorter, well-ordered context still beats long, randomly ordered context.
Common Mistakes
Assuming bigger context = better answers: Only true up to a point; middle degradation kicks in.
Dumping retrieved passages in vector-search order: Vector distance doesn't equal positional importance.
Skipping reranking: Retrieval + rerank is more effective than longer context with naive retrieval.
Not testing with needles in realistic positions: Toy "needle in haystack" tests often place the needle in uniform random positions, which hides the U-curve. Test on realistic use cases.
Believing the marketing: "1M token context" doesn't mean the model treats all 1M tokens equally.
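The fourth mistake above is easy to avoid with a deliberate position sweep. A hedged sketch (`model_answer` is a placeholder you would back with your actual LLM call): instead of dropping one needle at a random depth, probe fixed relative depths so the U-curve cannot hide.

```python
# Position-sweep needle test: insert the same needle at several relative
# depths and record whether the model recovers it at each depth.

def needle_position_sweep(filler_paragraphs, needle, question, model_answer,
                          expected, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return {depth: bool} — True where the model's answer contains `expected`."""
    results = {}
    for depth in depths:
        docs = list(filler_paragraphs)
        docs.insert(round(depth * len(docs)), needle)  # 0.0 = start, 1.0 = end
        prompt = "\n\n".join(docs) + f"\n\nQuestion: {question}"
        results[depth] = expected.lower() in model_answer(prompt).lower()
    return results
```

A model that only "reads the edges" will pass at depths 0.0 and 1.0 and fail around 0.5, which is precisely the failure mode a single random placement can miss.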