Mitigating bias in calibration error estimation

R. Roelofs, Nicholas Cain, Jonathon Shlens, M. Mozer

Published 2020 in International Conference on Artificial Intelligence and Statistics

ABSTRACT

Building reliable machine learning systems requires that we correctly understand their level of confidence. Calibration focuses on measuring the degree of accuracy in a model's confidence and most research in calibration focuses on techniques to improve an empirical estimate of calibration error, ECE_bin. Using simulation, we show that ECE_bin can systematically underestimate or overestimate the true calibration error depending on the nature of model miscalibration, the size of the evaluation data set, and the number of bins. Critically, ECE_bin is more strongly biased for perfectly calibrated models. We propose a simple alternative calibration error metric, ECE_sweep, in which the number of bins is chosen to be as large as possible while preserving monotonicity in the calibration function. Evaluating our measure on distributions fit to neural network confidence scores on CIFAR-10, CIFAR-100, and ImageNet, we show that ECE_sweep produces a less biased estimator of calibration error and therefore should be used by any researcher wishing to evaluate the calibration of models trained on similar datasets.
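The two estimators named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes equal-mass binning, binary correctness labels, and a sweep that stops at the first bin count whose per-bin accuracies stop being non-decreasing. The function names `equal_mass_bins`, `ece_bin`, and `ece_sweep` are hypothetical and chosen only for this example.

```python
def equal_mass_bins(confidences, correct, num_bins):
    """Return (weight, mean confidence, accuracy) for each non-empty bin."""
    # Sort samples by confidence so each bin holds a contiguous score range.
    pairs = sorted(zip(confidences, correct), key=lambda p: p[0])
    n = len(pairs)
    bins = []
    for b in range(num_bins):
        # Integer arithmetic splits n samples into nearly equal-sized bins.
        lo, hi = b * n // num_bins, (b + 1) * n // num_bins
        if hi > lo:
            chunk = pairs[lo:hi]
            mean_conf = sum(c for c, _ in chunk) / len(chunk)
            accuracy = sum(y for _, y in chunk) / len(chunk)
            bins.append((len(chunk) / n, mean_conf, accuracy))
    return bins


def ece_bin(confidences, correct, num_bins):
    """Binned ECE: weighted mean of |accuracy - confidence| over the bins."""
    return sum(
        weight * abs(accuracy - mean_conf)
        for weight, mean_conf, accuracy in equal_mass_bins(
            confidences, correct, num_bins
        )
    )


def ece_sweep(confidences, correct):
    """Sweep the bin count upward while per-bin accuracy stays monotone.

    Assumption: the sweep stops at the first bin count that violates
    monotonicity and uses the last bin count that passed.
    """
    best = 1
    for num_bins in range(2, len(confidences) + 1):
        accs = [a for _, _, a in equal_mass_bins(confidences, correct, num_bins)]
        if any(left > right for left, right in zip(accs, accs[1:])):
            break
        best = num_bins
    return ece_bin(confidences, correct, best), best


# Toy data: four predictions, one low-confidence prediction happens to be right.
confidences = [0.1, 0.2, 0.8, 0.9]
correct = [1, 0, 1, 1]
estimate, bins_used = ece_sweep(confidences, correct)
print(bins_used, estimate)
```

With a fixed bin count, `ece_bin` depends on a hyperparameter the evaluator must pick by hand; the sweep instead lets the data determine the count, which is the property the abstract credits for the reduced bias.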

PUBLICATION RECORD

Publication year: 2020
Venue: International Conference on Artificial Intelligence and Statistics
Publication date: 2020-12-15
Fields of study: Mathematics, Computer Science
Identifiers: arXiv 2012.08668
Source metadata: Semantic Scholar

REFERENCES

Soft Calibration Objectives for Neural Networks (2021; cited by this paper)
Calibration of Neural Networks using Splines (2020; influential reference)
Improving model calibration with accuracy versus uncertainty optimization (2020; cited by this paper)
Calibrating Deep Neural Networks using Focal Loss (2020; cited by this paper)
Mix-n-Match: Ensemble and Compositional Methods for Uncertainty Calibration in Deep Learning (2020; influential reference)
Measuring Calibration in Deep Learning (2019; cited by this paper)
nuScenes: A Multimodal Dataset for Autonomous Driving (2019; cited by this paper)
Scalability in Perception for Autonomous Driving: Waymo Open Dataset (2019; cited by this paper)
A guide to deep learning in healthcare (2019; cited by this paper)
Calibration tests in multi-class classification: A unifying framework (2019; cited by this paper)
Verified Uncertainty Calibration (2019; influential reference)
Benchmarking Neural Network Robustness to Common Corruptions and Perturbations (2019; cited by this paper)
Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift (2019; cited by this paper)
Do ImageNet Classifiers Generalize to ImageNet? (2019; cited by this paper)
Why do deep convolutional networks generalize so poorly to small image transformations? (2018; cited by this paper)
Trainable Calibration Measures For Neural Networks From Kernel Mean Embeddings (2018; influential reference)
Focal Loss for Dense Object Detection (2017; cited by this paper)
Dermatologist-level classification of skin cancer with deep neural networks (2017; cited by this paper)
Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning (2017; cited by this paper)
On Calibration of Modern Neural Networks (2017; influential reference)
Wide Residual Networks (2016; cited by this paper)
Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs (2016; cited by this paper)
Deep Networks with Stochastic Depth (2016; influential reference)
Densely Connected Convolutional Networks (2016; influential reference)
Calibration of medical diagnostic classifier scores to the probability of disease (2016; cited by this paper)
Calibrated Structured Prediction (2015; cited by this paper)
Deep Residual Learning for Image Recognition (2015; influential reference)
Obtaining Well Calibrated Probabilities Using Bayesian Binning (2015; influential reference)
Intriguing properties of neural networks (2013; cited by this paper)
Vision meets robotics: The KITTI dataset (2013; cited by this paper)
I-spline Smoothing for Calibrating Predictive Models (2012; cited by this paper)
Estimating reliability and resolution of probability forecasts through decomposition of the empirical score (2012; influential reference)
A bias‐corrected decomposition of the Brier score (2012; influential reference)
Machine learning - a probabilistic perspective (2012; cited by this paper)
On the convexity of ROC curves estimated from radiological test results (2010; cited by this paper)
Bayesian data analysis (2010; cited by this paper)
Learning Multiple Layers of Features from Tiny Images (2009; influential reference)
ImageNet: A large-scale hierarchical image database (2009; cited by this paper)
Receiver Operating Characteristic (ROC) Curves (2006; cited by this paper)
Transforming classifier scores into accurate multiclass probability estimates (2002; influential reference)
Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers (2001; influential reference)
Probabilistic Outputs for Support vector Machines and Comparisons to Regularized Likelihood Methods (1999; cited by this paper)
Gradient-based learning applied to document recognition (1998; cited by this paper)
Signal detection theory and psychophysics (1966; cited by this paper)
VERIFICATION OF FORECASTS EXPRESSED IN TERMS OF PROBABILITY (1950; cited by this paper)

CITED BY

MUSE: Multi-Tenant Model Serving With Seamless Model Updates (2026; cites this paper)
The Confidence Trap: Gender Bias and Predictive Certainty in LLMs (2026; cites this paper)
A Variational Estimator for $L_p$ Calibration Errors (2026; cites this paper)
Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review (2025; cites this paper)
How Well Calibrated are Extreme Multi-label Classifiers? An Empirical Analysis (2025; cites this paper)
Can a calibration metric be both testable and actionable? (2025; cites this paper)
Impact of pectoral muscle removal on deep-learning-based breast cancer risk prediction (2025; cites this paper)
The Capabilities and Limitations of Weak-to-Strong Generalization: Generalization and Calibration (2025; cites this paper)
Rethinking Early Stopping: Refine, Then Calibrate (2025; cites this paper)
Contrast-Aware Calibration for Fine-Tuned CLIP: Leveraging Image-Text Alignment (2025; influential citation)
Understanding the Capabilities and Limitations of Weak-to-Strong Generalization (2025; cites this paper)
Random Forest Calibration (2025; cites this paper)
A High-Precision Calibration and Evaluation Method Based on Binocular Cameras and LiDAR for Intelligent Vehicles (2025; cites this paper)
Understanding Model Calibration - A gentle introduction and visual exploration of calibration and the expected calibration error (ECE) (2025; influential citation)
A Survey on Confidence Calibration of Deep Learning-Based Classification Models Under Class Imbalance Data (2025; cites this paper)
Stochastic Explicit Calibration Algorithm for Survival Models (2025; cites this paper)
Aligning NLP Models with Target Population Perspectives using PAIR: Population-Aligned Instance Replication (2025; cites this paper)
Multi-Objective Optimization for Deep Neural Network Calibration (2025; cites this paper)
Calibrating LLM Judges: Linear Probes for Fast and Reliable Uncertainty Estimation (2025; cites this paper)
Transfer learning for assessing Parkinson’s disease: Analysis of wrist-worn sensors data and time-series imaging (2025; cites this paper)
Analysis and Discussion on the Generalization Ability of Radial Basis Function Network Model (2025; cites this paper)
Efficient Calibration for Decision Making (2025; cites this paper)
Attention-Gated CNN and Discrete Wavelet Transform based Ensemble Framework for Brain Hemorrhage Classification (2025; cites this paper)
Let the Experts Speak: Improving Survival Prediction & Calibration via Mixture-of-Experts Heads (2025; cites this paper)
Scalable Utility-Aware Multiclass Calibration (2025; cites this paper)
meval: A Statistical Toolbox for Fine-Grained Model Performance Analysis (2025; cites this paper)
MCGrad: Multicalibration at Web Scale (2025; cites this paper)
Failure Prediction Is a Better Performance Proxy for Early-Exit Networks Than Calibration (2025; cites this paper)
Mitigating the risk of health inequity exacerbated by large language models (2024; cites this paper)
Combining Priors with Experience: Confidence Calibration Based on Binomial Process Modeling (2024; influential citation)
PERFEX-I: confidence scores for image classification using decision trees (2024; cites this paper)
Labels in Extremes: How Well Calibrated are Extreme Multi-label Classifiers? (2024; cites this paper)
Enhancing Security through Intelligent Threat Detection and Response: The Integration of Artificial Intelligence in Cyber-Physical Systems (2024; cites this paper)
Optimizing Estimators of Squared Calibration Errors in Classification (2024; cites this paper)
Open-Vocabulary Calibration for Fine-tuned CLIP (2024; cites this paper)
Calibrating Expressions of Certainty (2024; cites this paper)
A Survey on the Honesty of Large Language Models (2024; influential citation)
How Flawed is ECE? An Analysis via Logit Smoothing (2024; cites this paper)
Machine learning framework to extract the biomarker potential of plasma IgG N-glycans towards disease risk stratification (2024; cites this paper)
Conformal Prediction for Natural Language Processing: A Survey (2024; cites this paper)
Improving Deep Learning Model Calibration for Cardiac Applications using Deterministic Uncertainty Networks and Uncertainty-aware Training (2024; cites this paper)
Information-theoretic Generalization Analysis for Expected Calibration Error (2024; cites this paper)
Reassessing How to Compare and Improve the Calibration of Machine Learning Models (2024; cites this paper)
PAC-Bayes Analysis for Recalibration in Classification (2024; cites this paper)
Trustworthy and Practical AI for Healthcare: A Guided Deferral System with Large Language Models (2024; cites this paper)
Two fundamental limits for uncertainty quantification in predictive inference (2024; cites this paper)
Calibration methods in imbalanced binary classification (2024; cites this paper)
Improving Predictor Reliability with Selective Recalibration (2024; cites this paper)
LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs (2024; cites this paper)
On the Within-Group Discrimination of Screening Classifiers (2023; cites this paper)
Evaluating Probabilistic Classifiers: The Triptych (2023; influential citation)
On the Within-Group Fairness of Screening Classifiers (2023; cites this paper)
An Operational Perspective to Fairness Interventions: Where and How to Intervene (2023; cites this paper)
Calibrating a Deep Neural Network with Its Predecessors (2023; cites this paper)
On (assessing) the fairness of risk score models (2023; influential citation)
Uncertainty Estimation by Fisher Information-based Evidential Deep Learning (2023; cites this paper)
Online Platt Scaling with Calibeating (2023; cites this paper)
Calibration Error Estimation Using Fuzzy Binning (2023; cites this paper)
Document Understanding Dataset and Evaluation (DUDE) (2023; cites this paper)
Minimum-Risk Recalibration of Classifiers (2023; cites this paper)
Dual Focal Loss for Calibration (2023; cites this paper)
Perception and Semantic Aware Regularization for Sequential Confidence Calibration (2023; cites this paper)
Uncertainty aware training to improve deep learning model calibration for classification of cardiac MR images (2023; cites this paper)
Set Learning for Accurate and Calibrated Models (2023; cites this paper)
Mitigating Calibration Bias Without Fixed Attribute Grouping for Improved Fairness in Medical Imaging Analysis (2023; cites this paper)
Non‐parametric inference on calibration of predicted risks (2023; cites this paper)
Model Calibration in Dense Classification with Adaptive Label Perturbation (2023; cites this paper)
Calibration in Deep Learning: A Survey of the State-of-the-Art (2023; cites this paper)
How Image Corruption and Perturbation Affect Out-of-Distribution Generalization and Calibration (2023; cites this paper)
A Benchmark Study on Calibration (2023; cites this paper)
Research on indoor positioning method based on LoRa-improved fingerprint localization algorithm (2023; cites this paper)
Smooth ECE: Principled Reliability Diagrams via Kernel Smoothing (2023; cites this paper)
Towards a data-driven debt collection strategy based on an advanced machine learning framework (2023; cites this paper)
A User-Focused Approach to Evaluating Probabilistic and Categorical Forecasts (2023; cites this paper)
Comparing the Robustness of ResNet, Swin-Transformer, and MLP-Mixer under Unique Distribution Shifts in Fundus Images (2023; influential citation)
Consistent and Asymptotically Unbiased Estimation of Proper Calibration Errors (2023; cites this paper)
Optimal bin number for histogram binning method to calibrate binary probabilities (2023; cites this paper)
Honest calibration assessment for binary outcome predictions (2022; cites this paper)
T-Cal: An optimal test for the calibration of predictive models (2022; cites this paper)
MBCT: Tree-Based Feature-Aware Binning for Individual Uncertainty Calibration (2022; cites this paper)
Detection and Mitigation of Algorithmic Bias via Predictive Parity (2022; influential citation)
Interpolation-aware models for train-test consistency in mixup (2022; cites this paper)
Annealing Double-Head: An Architecture for Online Calibration of Deep Neural Networks (2022; cites this paper)
A Unifying Theory of Distance from Calibration (2022; cites this paper)
AdaFocal: Calibration-aware Adaptive Focal Loss (2022; influential citation)
Predicting Inter-annotator Agreements to Improve Calibration and Performance of Speech Emotion Classifiers (2022; cites this paper)
Human alignment of neural network representations (2022; cites this paper)
Beyond calibration: estimating the grouping loss of modern neural networks (2022; cites this paper)
A Consistent and Differentiable Lp Canonical Calibration Error Estimator (2022; cites this paper)
ImageNet-Cartoon and ImageNet-Drawing: Two domain shift datasets for ImageNet (2022; cites this paper)
Better Uncertainty Calibration via Proper Scores for Classification and Beyond (2022; cites this paper)
Calibrated Selective Classification (2022; cites this paper)
Sample-dependent Adaptive Temperature Scaling for Improved Calibration (2022; cites this paper)
A Reduction to Binary Approach for Debiasing Multiclass Datasets (2022; cites this paper)
Metrics of calibration for probabilistic predictions (2022; cites this paper)
Calibration Error for Heterogeneous Treatment Effects (2022; cites this paper)
On the usefulness of the fit-on-test view on evaluating calibration of classifiers (2022; cites this paper)
Trustworthy Deep Learning via Proper Calibration Errors: A Unifying Approach for Quantifying the Reliability of Predictive Uncertainty (2022; cites this paper)
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time (2022; cites this paper)
Distribution-free calibration guarantees for histogram binning without sample splitting (2021; cites this paper)