Why Asking an LLM to Check Its Own Answer Often Fails
Asking a {{large language model}} to double-check its own answer rarely catches real errors and can degrade accuracy. The critique pass runs on the same weights with the same gaps, and a soft challenge like "are you sure?" often flips a correct answer rather than fixing a wrong one. Self-critique pays off mainly when the model already had the knowledge but executed sloppily, when external information enters the loop, or when a different verifier checks the work.
The intuition behind self-critique is appealing: ask the model to reread its work, list possible errors, and revise. In practice, multiple controlled studies have found that intrinsic self-correction on reasoning tasks does not reliably improve answers and frequently makes them worse. Huang and colleagues (2023) showed that earlier gains attributed to self-correction usually came from oracle labels telling the model when it was wrong, not from genuine introspection. Stechly and colleagues (2024) reported similar results across arithmetic reasoning, plan generation, and graph coloring — performance was flat or dropped after the critique pass. The mechanism is unsurprising once stated. A critique pass invokes the same weights, the same training distribution, and the same blind spots that produced the first answer. If the original mistake was a missing fact, re-asking returns the same confabulation, often expressed with higher confidence because the model now sees its own prior text as evidence. Worse, the FlipFlop effect shows that a mild challenge such as "are you sure?" can flip an initially correct answer because RLHF-trained models learn to defer to apparent user doubt. Sycophancy and self-critique pull in the same direction: away from the model's first guess, regardless of whether that guess was right. Self-critique does help in narrower regimes. When the underlying knowledge is present but the first pass was sloppy — a dropped negation, a skipped step in arithmetic, an unchecked constraint — a careful second look can recover the right answer. Madaan and colleagues' Self-Refine (LLM) reported gains on open-ended generation tasks like dialog and style transfer, where "better" is a soft target rather than a single correct answer. Iterative agent frameworks like Reflexion (LLM Agent) work when an environment provides a real signal (a test failing, a tool erroring), turning the loop into something closer to reinforcement learning than pure introspection. The cleaner fix is separation of roles. The Verifier-Generator Gap literature shows that a different model — or even an ensemble of weak verifiers, as in Weaver — catches errors a single model misses, because their failure modes don't fully overlap. Practically: trust self-critique for polish on tasks the model can already do; reach for external tools, retrieval, or a second model when correctness actually matters.