Brier Score
Strictly proper scoring rule introduced by Glenn W. Brier in 1950 to evaluate probabilistic forecasts. Computes the mean squared error between predicted probabilities and realized outcomes; lower is better, with 0 indicating a perfect forecast.
The Brier score is a strictly proper scoring rule introduced by the meteorologist Glenn W. Brier in 1950 to evaluate probabilistic forecasts. For a binary outcome, it is the mean squared difference between the predicted probability and the realized 0/1 outcome, averaged over the forecast set. In this binary form, scores range from 0 (perfect) to 1 (worst, reached only by assigning probability 1 to the wrong outcome every time), and a constant forecast of 0.5 always scores 0.25. The rule is "strictly proper" in the sense that a forecaster minimizes their expected score only by reporting their true subjective probability, so there is no incentive to hedge or exaggerate. The Brier score admits a useful three-way Murphy decomposition, in which the score equals reliability minus resolution plus uncertainty. Reliability is a calibration term (how close stated probabilities are to the conditional empirical frequencies; lower is better). Resolution measures how far those conditional frequencies depart from the overall base rate (higher is better, rewarding forecasts that separate likely from unlikely events). Uncertainty is the intrinsic variance of the outcome, a property of the data rather than of the forecaster. This decomposition makes the Brier score a richer diagnostic than a raw error metric: a low score requires both calibration and informativeness. In machine learning the Brier score is commonly reported alongside Expected Calibration Error (ECE) and log loss for probabilistic classifiers. Compared with ECE it has the advantage of being binning-free and differentiable, so it can be optimized directly during training. Compared with log loss it penalizes confident wrong answers less severely: the penalty is quadratic and bounded by 1 per forecast, whereas the logarithmic penalty diverges as the probability assigned to the realized outcome approaches zero. This makes the Brier score more robust to outliers but less sensitive to extreme miscalibration. It is widely used to assess weather forecasts, medical risk models, and, more recently, the probabilistic outputs of language and vision models. See Calibration (Machine Learning) for the broader framework.
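The binary Brier score and its Murphy decomposition can be illustrated with a minimal Python sketch. The function names `brier_score` and `murphy_decomposition` and the sample data are hypothetical, chosen only for this example. The sketch assumes forecasts take a small number of distinct values and groups by exact forecast value, which is the case in which the decomposition holds exactly; continuous forecasts would typically need binning first.

```python
from collections import defaultdict


def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def murphy_decomposition(probs, outcomes):
    """Return (reliability, resolution, uncertainty).

    Assumption for this sketch: forecasts are grouped by their exact value.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n  # overall frequency of the event

    # Collect the observed outcomes under each distinct forecast value.
    groups = defaultdict(list)
    for p, o in zip(probs, outcomes):
        groups[p].append(o)

    reliability = 0.0
    resolution = 0.0
    for p, obs in groups.items():
        freq = sum(obs) / len(obs)  # conditional empirical frequency
        reliability += len(obs) * (p - freq) ** 2
        resolution += len(obs) * (freq - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Illustrative data only: five forecasts using two distinct probability values.
probs = [0.1, 0.1, 0.8, 0.8, 0.8]
outcomes = [0, 1, 1, 1, 0]

bs = brier_score(probs, outcomes)
rel, res, unc = murphy_decomposition(probs, outcomes)
print(f"Brier score: {bs:.4f}")
print(f"reliability - resolution + uncertainty: {rel - res + unc:.4f}")
```

On this sample both printed values are 0.3080, with uncertainty equal to 0.24 because the base rate is 0.6. Most of the remaining score comes from the reliability term, since the forecast of 0.1 was issued in cases where the event occurred half the time.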