Lost in the Middle: Position Bias in Long-Context LLMs
Liu et al.'s "Lost in the Middle" paper (2023; published in TACL 2024) showed that language models given long contexts use information best when it sits at the start or end of the input, with accuracy tracing a U-shaped curve as the relevant passage moves toward the middle. The effect appears across GPT-3.5, Claude, LongChat, and MPT and persists in extended-context variants. Proposed explanations include long-term decay in rotary position embeddings and the recency bias of causal attention. The finding drove practical changes in RAG pipelines: re-ranking to place top hits at the edges, repeating key instructions, and using tests like Needle in a Haystack to measure how well models actually use their advertised context windows.
"Lost in the Middle: How Language Models Use Long Contexts," a 2023 paper by Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang (published in TACL 2024), documented a now-canonical failure mode of large language models: when relevant information is buried in the middle of a long input, accuracy drops sharply compared with placing the same information at the very beginning or end. Plotted as a function of the relevant document's position, accuracy traces a characteristic U-shape: high at the edges, sagging in the middle.
The authors evaluated open and closed models, including GPT-3.5, Claude, LongChat, and MPT, on two tasks: multi-document question answering, where a single answer-bearing passage is placed among distractors in contexts of up to 30 documents, and synthetic key-value retrieval. For multi-document QA, performance could fall by more than 20 percentage points as the gold document moved from the first or last position into the middle of the context. The effect appeared even in base models without instruction tuning, suggesting it is not primarily an artifact of RLHF or chat fine-tuning.
Crucially, simply enlarging the context window does not fix the problem. Extended-context variants (e.g., 16K versions of models with 4K bases) showed the same U-shape, and follow-up work has reported that the pattern persists in models advertising 100K+ token windows.
Proposed mechanisms include long-term decay in Rotary Position Embedding (RoPE), attention sinks that anchor on the first tokens, and recency bias from causal attention, all of which privilege the start and end of the sequence over its middle. The findings reshaped practical guidance for Retrieval-Augmented Generation (RAG) pipelines and long-document summarization.
Recommended mitigations include re-ranking retrieved chunks so the highest-scoring ones sit at the edges of the prompt, repeating critical instructions at both the top and bottom, keeping contexts shorter when possible, and using "lost-in-the-middle-aware" prompt templates. More involved approaches include calibrating positional attention bias (the "Found in the Middle" method, 2024) and specialized continued training such as information-intensive (IN2) training. Tests like Needle in a Haystack have become standard tools for tracking whether new long-context models actually use their advertised window; most models still show measurable position sensitivity even as average scores improve.
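As a rough illustration of the first two mitigations, the Python sketch below reorders retrieved chunks so that the strongest ones land at the two ends of the prompt and the weakest in the middle, then repeats the instruction at the top and bottom. The function names `reorder_for_edges` and `build_prompt`, the `(text, score)` input format, and the `Document [i]` template are assumptions made for this example; they are not taken from the paper or from any particular RAG library.

```python
from typing import List, Tuple


def reorder_for_edges(chunks_with_scores: List[Tuple[str, float]]) -> List[str]:
    """Place high-scoring chunks at the edges and low-scoring ones in the middle.

    Chunks are ranked by descending score, then dealt alternately to the
    front and the back of the final order (rank 1 first, rank 2 last, ...).
    """
    ranked = sorted(chunks_with_scores, key=lambda cs: cs[1], reverse=True)
    front: List[str] = []
    back: List[str] = []
    for i, (chunk, _score) in enumerate(ranked):
        # Even ranks fill the front of the prompt, odd ranks fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    # Reverse the back half so the second-best chunk ends up in last position.
    return front + back[::-1]


def build_prompt(
    instruction: str,
    chunks_with_scores: List[Tuple[str, float]],
    question: str,
) -> str:
    """Assemble a prompt with edge-ordered documents and a repeated instruction.

    The template is hypothetical; adapt it to the target model's format.
    """
    docs = reorder_for_edges(chunks_with_scores)
    body = "\n\n".join(f"Document [{i + 1}]: {d}" for i, d in enumerate(docs))
    # The instruction appears both before and after the documents.
    return f"{instruction}\n\n{body}\n\nQuestion: {question}\n\n{instruction}"


if __name__ == "__main__":
    retrieved = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6), ("e", 0.5)]
    print(reorder_for_edges(retrieved))  # ['a', 'c', 'e', 'd', 'b']
```

With five chunks ranked a through e, the best chunk comes first, the second-best comes last, and the weakest sits in the middle. Whether this ordering helps a given model is an empirical question, so it is worth measuring against a plain descending-score baseline.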