Context Rot in Long AI Coding Sessions: Why Agents Get Worse as Context Fills

Context rot is the documented degradation of language models as their context fills, driven by softmax-normalized attention spreading thinner over more tokens, and it is sharpest in agentic coding workloads with heavy tool-call churn.

Context rot is the term Anthropic uses for the documented degradation of model output quality as the context window fills with tokens. The mechanism is mechanical: transformer attention is softmax-normalized across all in-context tokens, so as the context grows, the weight assigned to any single token shrinks and the effective noise floor of attention rises.

Rot shows up worst in agentic coding workloads with heavy tool-call churn. GitHub issues anthropics/claude-code #34685 and #35296 document Claude Code users on Opus 4.6 seeing real degradation at 20 to 40% context usage, with the model self-reporting ineffectiveness around 48% and recommending a session restart. The combination of accumulated tool outputs, repeated re-attention to the same files, and stateful artifacts that must be tracked across many iterations stresses long-context attention much harder than a flat read.

Research and synthesis workloads are structurally different. Tool calls are additive, since web searches and file reads return bounded chunks. Each output section draws on a local slice of context, and there is no requirement to maintain a complex stateful artifact across fifty turns. Users running long research sessions can sit in the favorable zone of 1M context where degradation is mild; users running long agentic coding sessions hit the rot wall earlier.

Mitigations: keep context short when possible, use multi-agent architectures (per Anthropic's research, an Opus lead with Sonnet subagents outperformed a single agent by ~90% on research tasks), clear stale tool outputs, compact the conversation, and restart with a written summary when the model starts reintroducing trimmed content or contradicting earlier verdicts in the same session.

See Claude 1M Context Performance: Opus vs Sonnet vs Competitors for benchmark numbers and Tool Calling Loop: How a Coding Harness Drives a Stateless Model for why tool-heavy sessions stress attention.
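The softmax dilution described above can be illustrated with a toy calculation. This is a minimal sketch, not a model of any real attention head: it assumes a single relevant token with a fixed score and treats every other token as an equal-scoring distractor. The function names and the score values (4.0 and 0.0) are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relevant_token_weight(context_len, relevant_score=4.0, distractor_score=0.0):
    """Attention weight of one relevant token among (context_len - 1) distractors.

    Both score values are illustrative assumptions, not measured quantities.
    """
    scores = [relevant_score] + [distractor_score] * (context_len - 1)
    return softmax(scores)[0]


# The relevant token's score never changes, yet its share of attention
# shrinks as more tokens compete in the same softmax.
for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> weight {relevant_token_weight(n):.6f}")
```

In this toy setup the relevant token keeps the same raw score at every context length, yet its normalized weight falls roughly in proportion to the number of competing tokens. Real models learn sharper score distributions than this, so the sketch shows only the direction of the effect, not its size.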

