Probability Calibration Theory
Calibration ensures that predicted probabilities reflect true likelihoods: when a model predicts an outcome with 70% confidence, that prediction should be correct 70% of the time.
Why Calibration Matters
Miscalibrated Models
Prediction: 90% confident it's a cat
Reality: Only 60% of 90%-confident predictions are cats
Consequences:
- Decision-making based on wrong probabilities
- Risk underestimation in safety-critical systems
- Ensemble weighting fails
Calibrated Models
Prediction: 70% confident it's a cat
Reality: 70% of 70%-confident predictions are cats
Measuring Calibration
Reliability Diagram
Plot predicted probability vs actual frequency:
Accuracy │              ·
         │           ·
         │        ·    Perfect calibration (diagonal)
         │     ·
         │  ·
         └────────────────
              Confidence
Expected Calibration Error (ECE)
ECE = Σ_{b=1..B} (n_b / N) · |acc(b) − conf(b)|
Where:
- B = number of bins
- N = total number of samples
- n_b = number of samples in bin b
- acc(b) = accuracy in bin b
- conf(b) = mean confidence in bin b
Maximum Calibration Error (MCE)
MCE = max_b |acc(b) - conf(b)|
Worst-case miscalibration.
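A minimal sketch of both metrics with equal-width bins, assuming `confidences` holds the predicted top-class probabilities and `correct` is a boolean array marking whether each prediction was right (both names are placeholders):

```python
import numpy as np

def ece_mce(confidences, correct, n_bins=15):
    """Expected and maximum calibration error over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce, n = 0.0, 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.sum() / n * gap   # weight the gap by bin size
            mce = max(mce, gap)           # track the worst bin
    return ece, mce
```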
Brier Score
BS = (1/N) Σᵢ (pᵢ - yᵢ)²
A proper scoring rule that decomposes into calibration and refinement components.
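A minimal sketch for the binary case, assuming `y_true` holds 0/1 outcomes and `p_pos` the predicted probability of the positive class (equivalent to `sklearn.metrics.brier_score_loss`):

```python
import numpy as np

def brier_score(y_true, p_pos):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    y_true = np.asarray(y_true, dtype=float)
    p_pos = np.asarray(p_pos, dtype=float)
    return np.mean((p_pos - y_true) ** 2)
```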
Calibration Methods
Temperature Scaling
Simple and effective post-hoc calibration:
p_calibrated = softmax(logits / T)
Optimize T on validation set:
T* = argmin_T NLL(softmax(logits/T), y_val)
Typically T > 1 (softens overconfident predictions).
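A minimal sketch of fitting T by minimizing validation NLL with a bounded scalar search; `val_logits` (an N×K array) and integer labels `val_y` are assumed held-out data, and the search bounds are placeholders:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def fit_temperature(val_logits, val_y):
    def nll(T):
        logp = log_softmax(val_logits / T)
        return -logp[np.arange(len(val_y)), val_y].mean()
    res = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return res.x  # T*; apply softmax(logits / T*) at test time
```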
Platt Scaling
Logistic regression on model outputs:
P(y=1|x) = σ(a · f(x) + b)
Learn a, b on validation set.
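A minimal sketch using scikit-learn's logistic regression on held-out raw scores f(x); the variable and function names are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(val_scores, val_y):
    # Learns the slope a and intercept b of sigma(a * f(x) + b).
    return LogisticRegression().fit(np.reshape(val_scores, (-1, 1)), val_y)

def platt_calibrate(platt_model, scores):
    # Calibrated probability of the positive class.
    return platt_model.predict_proba(np.reshape(scores, (-1, 1)))[:, 1]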
Isotonic Regression
Non-parametric calibration:
Learn a monotonic (isotonic) mapping from predicted probability to calibrated probability.
No parametric assumptions, but needs more data.
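A minimal sketch with scikit-learn, fitting the monotone map on held-out predictions; `val_probs`, `val_y`, and `test_probs` are assumed data:

```python
from sklearn.isotonic import IsotonicRegression

def fit_isotonic(val_probs, val_y):
    # Clip out-of-range inputs and keep outputs in [0, 1].
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    return iso.fit(val_probs, val_y)

# calibrated = fit_isotonic(val_probs, val_y).predict(test_probs)
```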
Histogram Binning
For each confidence bin [a, b):
    calibrated_prob = empirical_accuracy_in_bin
Simple but discontinuous.
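A minimal sketch with equal-width bins; falling back to the bin midpoint for empty bins is an assumption, not part of the method's definition:

```python
import numpy as np

def histogram_binning(val_probs, val_y, test_probs, n_bins=10):
    val_probs, val_y = np.asarray(val_probs), np.asarray(val_y, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    val_bin = np.clip(np.digitize(val_probs, edges) - 1, 0, n_bins - 1)
    test_bin = np.clip(np.digitize(np.asarray(test_probs), edges) - 1, 0, n_bins - 1)
    # Empirical accuracy per bin; midpoint fallback for empty bins.
    bin_acc = np.array([
        val_y[val_bin == b].mean() if np.any(val_bin == b) else (edges[b] + edges[b + 1]) / 2
        for b in range(n_bins)
    ])
    return bin_acc[test_bin]
```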
Beta Calibration
P_calibrated = 1 / (1 + 1/(exp(c) · p^a / (1-p)^b))
Three-parameter model (a, b, c); handles asymmetric miscalibration.
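A hedged sketch: the beta map above is equivalent to a logistic regression on the features ln(p) and −ln(1−p), whose coefficients give a and b and whose intercept gives c (Kull et al., 2017); the clipping and helper names here are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_beta_calibration(val_probs, val_y, eps=1e-6):
    p = np.clip(np.asarray(val_probs, dtype=float), eps, 1 - eps)
    X = np.column_stack([np.log(p), -np.log(1 - p)])  # coefficients -> a, b; intercept -> c
    return LogisticRegression().fit(X, val_y)

def beta_calibrate(model, probs, eps=1e-6):
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    return model.predict_proba(np.column_stack([np.log(p), -np.log(1 - p)]))[:, 1]
```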
When Models Miscalibrate
Overconfidence
Modern neural networks are typically overconfident:
| Model | ECE (before) | ECE (after temp scaling) |
|---|---|---|
| ResNet-110 | 4.5% | 1.2% |
| DenseNet-40 | 3.8% | 0.9% |
Causes:
- Cross-entropy loss encourages extreme predictions
- Batch normalization
- Overparameterization
Underconfidence
Less common, but occurs with:
- Heavy regularization
- Ensemble disagreement
- Out-of-distribution inputs
Calibration for Multi-Class
Per-Class Calibration
P(y=k|x) = calibrator_k(f_k(x))
Fit a separate calibrator for each class (one-vs-rest), then renormalize so the calibrated probabilities sum to 1, as in the sketch below.
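A minimal one-vs-rest sketch using an isotonic calibrator per class followed by renormalization; all names are placeholders:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_per_class(val_probs, val_y, n_classes):
    # One calibrator per class, trained one-vs-rest on validation data.
    cals = []
    for k in range(n_classes):
        iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        cals.append(iso.fit(val_probs[:, k], (np.asarray(val_y) == k).astype(float)))
    return cals

def apply_per_class(cals, probs):
    cal = np.column_stack([c.predict(probs[:, k]) for k, c in enumerate(cals)])
    return cal / cal.sum(axis=1, keepdims=True)  # renormalize to a distribution
```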
Focal Calibration
L = -Σᵢ (1-pᵢ)^γ log(pᵢ)
Training with the focal loss (γ > 0) down-weights already-confident examples and tends to produce better-calibrated networks.
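A minimal PyTorch sketch of the multi-class focal loss, where p_i is the predicted probability of the true class and γ is the focusing parameter:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    logp = F.log_softmax(logits, dim=-1)
    logp_t = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log p of the true class
    p_t = logp_t.exp()
    return (-(1.0 - p_t) ** gamma * logp_t).mean()
```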
Calibration Under Distribution Shift
Challenge: Calibration degrades on OOD data.
Domain-Aware Calibration
Adjust the base temperature with a correction factor estimated for each target domain:
T_domain = T_base · domain_adjustment
Ensemble Temperature
Average predictions taken at several temperatures Tₖ, weighted by wₖ:
p = Σₖ wₖ · softmax(logits/Tₖ)
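A minimal sketch; the temperatures and weights below are placeholders that would be chosen on validation data:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def temperature_ensemble(logits, temps=(1.0, 1.5, 2.0), weights=(0.3, 0.4, 0.3)):
    # Weighted average of softmax outputs at different temperatures.
    return sum(w * softmax(logits, T) for w, T in zip(weights, temps))
```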
Conformal Prediction
Provide prediction sets with coverage guarantee:
C(x) = {y : s(x,y) ≤ τ}
Where s(x, y) is a nonconformity score and τ is chosen on a held-out calibration set so that:
P(y* ∈ C(x)) ≥ 1 - α
Properties:
- Distribution-free
- Finite-sample guarantee
- No model assumptions
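A minimal split-conformal sketch using the score s(x, y) = 1 − p_y(x) and the standard finite-sample quantile correction; `cal_probs` and `cal_y` are assumed held-out calibration data:

```python
import numpy as np

def conformal_threshold(cal_probs, cal_y, alpha=0.1):
    n = len(cal_y)
    scores = 1.0 - cal_probs[np.arange(n), cal_y]          # nonconformity scores
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)   # finite-sample corrected quantile level
    return np.quantile(scores, level, method="higher")

def prediction_sets(test_probs, tau):
    # All labels whose score falls below the threshold.
    return [np.flatnonzero(1.0 - p <= tau) for p in test_probs]
```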
Selective Prediction
Abstain when uncertain:
if max(p) < threshold:
    return "I don't know"
Trade-off: coverage vs accuracy on non-abstained predictions.
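A minimal sketch: predict the arg-max class but flag abstentions when the top probability falls below a threshold (the 0.8 default is a placeholder):

```python
import numpy as np

def selective_predict(probs, threshold=0.8):
    conf = probs.max(axis=1)
    preds = probs.argmax(axis=1)
    abstain = conf < threshold   # abstained examples
    return preds, abstain        # coverage = 1 - abstain.mean()
```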
References
- Guo, C., et al. (2017). "On Calibration of Modern Neural Networks." ICML.
- Platt, J. (1999). "Probabilistic Outputs for Support Vector Machines."
- Niculescu-Mizil, A., & Caruana, R. (2005). "Predicting Good Probabilities with Supervised Learning." ICML.
- Angelopoulos, A., & Bates, S. (2021). "A Gentle Introduction to Conformal Prediction." arXiv.