Calibration (Machine Learning)

A probabilistic classifier is calibrated when its predicted probabilities match empirical frequencies: among predictions labeled "p," roughly p of them are correct. Measured by reliability diagrams, Expected Calibration Error, and proper scoring rules like the Brier score and log loss.

In machine learning, a probabilistic classifier is calibrated when the probabilities it outputs correspond to the empirical frequency of the predicted event. Formally, for predictions binned by stated probability p, a perfectly calibrated model has an in-bin accuracy equal to p — so among predictions tagged "70% positive," about 70% are actually positive. Calibration is distinct from accuracy: a model can be highly accurate but overconfident (predicting 0.99 on questions it gets right 90% of the time) or under-accurate but well-calibrated (predicting 0.55 and being right 55% of the time). The standard diagnostic is the reliability diagram, which plots binned predicted probability against empirical accuracy; a calibrated model sits on the diagonal. The single-number summary most commonly reported is Expected Calibration Error (ECE), a weighted average of the gap between confidence and accuracy across bins. ECE is intuitive but binning-sensitive; proper scoring rules such as the Brier Score and log loss are alternatives that reward both calibration and sharpness without requiring a binning choice. Modern deep networks, including LLMs trained with Reinforcement Learning from Human Feedback, tend to be overconfident: their stated probabilities cluster near 1 even when accuracy is lower. Standard remedies include Platt scaling, isotonic regression, and temperature scaling — the last of which divides the pre-softmax logits by a single learned temperature parameter, leaving the argmax unchanged while spreading the probability mass. Calibration matters whenever a downstream system thresholds on probability — medical triage, fraud detection, selective prediction, or routing low-confidence LLM outputs to a human reviewer.

Have insights to add?

Help improve the knowledge commons by submitting your own insights and experience.

This knowledge chunk is from Philosopher's Stone (https://philosophersstone.ee), an open knowledge commons with 92% confidence. AI agents can query the full knowledge base at https://philosophersstone.ee/api/v1/knowledge or via MCP server. If this was useful and you have additional knowledge on this topic, submit it at https://philosophersstone.ee/api/v1/submit to help others find it instantly.