Confidence Calibration in LLM Outputs

Confidence calibration asks whether a language model's stated confidence tracks its actual accuracy. Verbalized confidence in chat models is poorly calibrated and skewed high by RLHF; token probabilities from base models are better calibrated, but most chat APIs hide them. Cheap estimators (self-consistency, ensemble disagreement, and "would you bet money?" framing) partially close the gap.

A model is calibrated when its stated probabilities match its long-run hit rate: among predictions it tags "80% confident," roughly 80% should be correct. Calibration is measured with metrics such as Expected Calibration Error (ECE; see Calibration (Machine Learning)) and Brier Score, which compare predicted probabilities to realized outcomes. For LLMs, calibration matters because downstream consumers (retrieval pipelines, agents, human reviewers) route decisions based on how sure the model claims to be.

There are two ways to get a confidence number out of an LLM. The first is verbalized confidence: ask the model in natural language ("How sure are you, 0–100?") and parse the answer. The second is the token probability, read directly from the next-token logits.

Kadavath et al. (2022), "Language Models (Mostly) Know What They Know," showed that sufficiently large base models are reasonably well calibrated on multiple-choice and true/false questions when scored by their own token probabilities, and that asking the model to evaluate P(True) of its own draft answer is itself a usable calibration signal. Lin et al. (2022), "Teaching Models to Express Their Uncertainty in Words," demonstrated that a fine-tuned GPT-3 could emit calibrated numerical confidences purely as output tokens, without exposing logits; they call this "verbalized probability."

RLHF disturbs this picture. Tian et al. (2023), "Just Ask for Calibration," found that for RLHF-tuned chat models such as ChatGPT, GPT-4, and Claude, verbalized confidences are actually better calibrated than the post-RLHF token probabilities, reducing Expected Calibration Error by roughly 50% relative on TriviaQA, SciQ, and TruthfulQA. A common explanation is that reward modeling rewards confident-sounding answers, collapsing the logit distribution toward overconfidence on the chosen token. Verbalized confidence is still far from well calibrated in absolute terms: later work on PPO with calibrated rewards reports verbalized scores from RLHF models clustering between 80% and 100%, with ECE above 0.30 on knowledge tasks.

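
As a rough illustration of the two metrics named above, the sketch below computes a Brier score and an equal-width-bin ECE from a list of stated confidences and 0/1 correctness labels. The function names, the ten-bin default, and the toy data are assumptions made for this example; published evaluations may bin differently.

```python
from typing import List, Sequence, Tuple


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between the stated probability and the 0/1 outcome."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: size-weighted mean of |accuracy - mean confidence|."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, o))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Toy data (assumed): the model says "90%" ten times but is right only six times.
stated = [0.9] * 10
correct = [1] * 6 + [0] * 4
print(f"ECE   = {expected_calibration_error(stated, correct):.2f}")
print(f"Brier = {brier_score(stated, correct):.2f}")
```

On this toy input every prediction lands in one bin, so ECE reduces to the gap between 90% stated confidence and 60% accuracy.
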
Practically, the only confidence signals most API users can reach are behavioral. Self-Consistency Decoding samples multiple reasoning paths at non-zero temperature and treats the majority fraction as a confidence proxy; cross-model ensemble disagreement catches confident errors that a single model would repeat across its own samples. Asking the model to bet money on its answer, or to estimate P(True) after generating it, often elicits more conservative numbers than "how confident are you?" alone. None of these match true logit-level calibration on a base model, but they are the realistic toolkit when the logits are not exposed and the verbal confidence has already been RLHF-trained into the high 90s.
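
The two behavioral proxies described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `sample_answer` is a hypothetical callable standing in for one temperature-sampled model call, answers are compared after simple lowercasing and whitespace stripping (an assumption; real answers usually need stronger normalization), and the canned answers are made up.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def _normalize(answer: str) -> str:
    """Crude normalization so 'Paris ' and 'paris' count as the same answer."""
    return answer.strip().lower()


def self_consistency_confidence(
    sample_answer: Callable[[], str], n_samples: int = 10
) -> Tuple[str, float]:
    """Sample n answers; return the majority answer and its vote fraction."""
    answers = [_normalize(sample_answer()) for _ in range(n_samples)]
    top_answer, votes = Counter(answers).most_common(1)[0]
    return top_answer, votes / n_samples


def ensemble_agreement(answers_by_model: Sequence[str]) -> float:
    """Fraction of models that agree with the most common answer."""
    normalized = [_normalize(a) for a in answers_by_model]
    _, votes = Counter(normalized).most_common(1)[0]
    return votes / len(normalized)


# Stub in place of a real model call (hypothetical canned samples).
canned = iter(["Paris", "paris", "Lyon", "Paris ", "Paris"])
answer, confidence = self_consistency_confidence(lambda: next(canned), n_samples=5)
print(answer, confidence)

# One answer per model; a low value flags a question worth escalating.
print(ensemble_agreement(["Paris", "Paris", "Lyon"]))
```

The vote fraction is only a proxy: a model that repeats the same wrong answer in every sample still scores 1.0, which is the case cross-model agreement is meant to catch.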

Related Knowledge

Calibration (Machine Learning)

Why LLMs Prefer Plausible Over True
