Needle in a Haystack Benchmark
Long-context stress test popularised by Greg Kamradt in late 2023 that hides a single out-of-place fact (the needle) at varying depths inside a much larger document (the haystack) and asks a model to retrieve it. By sweeping depth and total context length, the benchmark produces a heatmap of retrieval accuracy that directly visualises the lost-in-the-middle effect. Now standard for long-context launches, but criticised as single-hop and extractive; related benchmarks such as RULER, LongBench, and Multi-Needle add multi-hop and reasoning components.
The Needle in a Haystack (NIAH) benchmark is a widely used stress test for the long-context abilities of large language models, popularised by independent researcher Greg Kamradt in late 2023. The test inserts a short, out-of-place statement (the "needle") at a chosen depth inside a much larger body of unrelated text (the "haystack," typically essays by Paul Graham), then asks the model a question whose answer is the needle. By sweeping the needle through many depths (0% to 100% of the document) and many total context lengths (from a few thousand tokens up to the model's stated maximum), the benchmark produces a 2D heatmap of retrieval accuracy, with one cell per (context length, depth) pair. Strong models show a uniformly green grid; failure modes appear as bands of lower accuracy, often near the middle of long contexts, which makes the heatmap a direct visualisation of the lost-in-the-middle effect.

NIAH became the de facto demonstration accompanying long-context launches by Anthropic, OpenAI, and Google DeepMind, but it has well-known limitations. The task is purely extractive and single-hop, so high scores can mask much weaker multi-document reasoning. Follow-ups and related benchmarks such as Multi-Needle in a Haystack (LangChain), RULER, LongBench, and InfiniteBench introduce multiple needles, distractors, aggregation, and reasoning to give a more realistic picture of how models use long inputs. NIAH is now best understood as a necessary but far-from-sufficient probe of long-context capability.
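
The depth-by-length sweep described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Kamradt's implementation: the needle text, the repeated filler haystack, the word-count proxy for tokens, the substring scoring rule, and every function name (`make_haystack`, `insert_needle`, `run_sweep`, `toy_model`, `render`) are hypothetical. `toy_model` is a stand-in that deliberately fails when the needle sits in the middle third of a long context, so that the grid shows a lost-in-the-middle band; a real run would replace it with a call to an actual model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical needle, question, and expected answer (illustrative only).
NEEDLE = "The secret passphrase is 'violet-anchor-42'."
QUESTION = "What is the secret passphrase?"
ANSWER = "violet-anchor-42"


def make_haystack(n_words: int) -> List[str]:
    """Build filler text of n_words words (words stand in for tokens here)."""
    filler = "the quick brown fox jumps over the lazy dog".split()
    return [filler[i % len(filler)] for i in range(n_words)]


def insert_needle(haystack: List[str], needle: str, depth_pct: float) -> str:
    """Insert the needle at depth_pct (0 = start, 100 = end) of the haystack."""
    pos = round(len(haystack) * depth_pct / 100)
    return " ".join(haystack[:pos] + needle.split() + haystack[pos:])


def build_prompt(context: str, question: str) -> str:
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def run_sweep(
    model: Callable[[str], str],
    context_lengths: Sequence[int],
    depths: Sequence[float],
) -> Dict[Tuple[int, float], bool]:
    """Return one pass/fail cell per (context length, depth) pair."""
    grid: Dict[Tuple[int, float], bool] = {}
    for n in context_lengths:
        haystack = make_haystack(n)
        for d in depths:
            prompt = build_prompt(insert_needle(haystack, NEEDLE, d), QUESTION)
            # Simplified scoring: exact substring match on the expected answer.
            grid[(n, d)] = ANSWER in model(prompt)
    return grid


def toy_model(prompt: str) -> str:
    """Stand-in model: misses needles in the middle third of long contexts."""
    words = prompt.split("\n\nQuestion:")[0].split()
    rel = words.index("secret") / len(words)  # relative needle position
    if len(words) > 1000 and 1 / 3 < rel < 2 / 3:
        return "I don't know."
    return f"The passphrase is {ANSWER}."


def render(grid, context_lengths, depths) -> str:
    """Text heatmap: rows are depths, columns are lengths, '#' = retrieved."""
    rows = []
    for d in depths:
        cells = "".join("#" if grid[(n, d)] else "." for n in context_lengths)
        rows.append(f"{d:>5.0f}% {cells}")
    return "\n".join(rows)


if __name__ == "__main__":
    lengths, depths = [500, 2000], [0, 50, 100]
    print(render(run_sweep(toy_model, lengths, depths), lengths, depths))
```

Substring matching is the simplest possible scoring rule and is chosen here only to keep the sketch self-contained; runs that grade free-form answers would need a more tolerant scorer. Counting words instead of tokens is likewise a simplification, since a real harness would measure context length with the target model's tokenizer.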