Assessment of predictive performance in incomplete data by combining internal validation and multiple imputation

S. Wahl, A. Boulesteix, A. Zierer, B. Thorand, Mark A. van de Wiel

Published 2016 in BMC Medical Research Methodology

ABSTRACT

Background: Missing values are a frequent issue in human studies. In many situations, multiple imputation (MI) is an appropriate missing data handling strategy, whereby missing values are imputed multiple times, the analysis is performed in every imputed data set, and the obtained estimates are pooled. If the aim is to estimate (added) predictive performance measures, such as (change in) the area under the receiver-operating characteristic curve (AUC), internal validation strategies become desirable in order to correct for optimism. It is not fully understood how internal validation should be combined with multiple imputation.

Methods: In a comprehensive simulation study and in a real data set based on blood markers as predictors for mortality, we compare three combination strategies: Val-MI, internal validation followed by MI on the training and test parts separately; MI-Val, MI on the full data set followed by internal validation; and MI(-y)-Val, MI on the full data set omitting the outcome followed by internal validation. Different validation strategies, including bootstrap and cross-validation, different (added) performance measures, and various data characteristics are considered, and the strategies are evaluated with regard to bias and mean squared error of the obtained performance estimates. In addition, we elaborate on the number of resamples and imputations to be used, and adapt a strategy for confidence interval construction to incomplete data.

Results: Internal validation is essential in order to avoid optimism, with the bootstrap 0.632+ estimate representing a reliable method to correct for optimism. While estimates obtained by MI-Val are optimistically biased, those obtained by MI(-y)-Val tend to be pessimistic in the presence of a true underlying effect. Val-MI provides largely unbiased estimates, with a slight pessimistic bias with increasing true effect size, increasing number of covariates, and decreasing sample size. In Val-MI, accuracy of the estimate is more strongly improved by increasing the number of bootstrap draws than by increasing the number of imputations. With a simple integrated approach, valid confidence intervals for performance estimates can be obtained.

Conclusions: When prognostic models are developed on incomplete data, Val-MI represents a valid strategy to obtain estimates of predictive performance measures.
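
The three combination strategies named in the abstract differ only in the order of resampling and imputation, and in whether the imputation step sees the outcome. The following is a minimal sketch of that ordering, not the authors' implementation. Everything in it is an illustrative assumption: the toy data, the hot-deck `impute` function standing in for a real MI procedure, the threshold classifier, accuracy in place of AUC, and the plain out-of-bag average in place of the 0.632+ estimate.

```python
import random
from statistics import mean

def make_toy_data(n=60, seed=1):
    """Hypothetical data: rows of (x, y); about 20% of x values are missing."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        y = i % 2
        x = rng.gauss(2.0 * y, 1.0)
        rows.append((None if rng.random() < 0.2 else x, y))
    return rows

def impute(rows, rng, use_outcome=True):
    """Toy stochastic hot-deck imputation (a stand-in for real MI).
    With use_outcome=True, donors are restricted to rows with the same y."""
    observed = [(x, y) for x, y in rows if x is not None]
    completed = []
    for x, y in rows:
        if x is None:
            donors = [dx for dx, dy in observed if dy == y] if use_outcome else []
            donors = donors or [dx for dx, _ in observed] or [0.0]
            x = rng.choice(donors)
        completed.append((x, y))
    return completed

def fit(rows):
    """Toy classifier: threshold halfway between the two class means."""
    m0 = mean([x for x, y in rows if y == 0] or [0.0])
    m1 = mean([x for x, y in rows if y == 1] or [1.0])
    return (m0 + m1) / 2

def accuracy(threshold, rows):
    """Share of rows where 'x above threshold' agrees with y == 1."""
    return mean(int((x > threshold) == (y == 1)) for x, y in rows)

def bootstrap_split(n, rng):
    """Return in-bag indices (drawn with replacement) and out-of-bag indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    return in_bag, [i for i in range(n) if i not in chosen]

def val_mi(data, impute_fn, n_boot=20, n_imp=5, seed=0):
    """Val-MI: resample first, then impute training and test parts separately."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        in_bag, oob = bootstrap_split(len(data), rng)
        if not oob:
            continue
        train = [data[i] for i in in_bag]
        test = [data[i] for i in oob]
        for _ in range(n_imp):
            model = fit(impute_fn(train, rng))
            scores.append(accuracy(model, impute_fn(test, rng)))
    return mean(scores)

def mi_val(data, impute_fn, n_boot=20, n_imp=5, seed=0):
    """MI-Val: impute the full data set first, then resample each completed set."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_imp):
        completed = impute_fn(data, rng)
        for _ in range(n_boot):
            in_bag, oob = bootstrap_split(len(completed), rng)
            if not oob:
                continue
            model = fit([completed[i] for i in in_bag])
            scores.append(accuracy(model, [completed[i] for i in oob]))
    return mean(scores)

data = make_toy_data()
est_val_mi = val_mi(data, impute)
est_mi_val = mi_val(data, impute)
# MI(-y)-Val: same ordering as MI-Val, but the imputation ignores the outcome.
est_mi_noy_val = mi_val(data, lambda rows, rng: impute(rows, rng, use_outcome=False))
print(est_val_mi, est_mi_val, est_mi_noy_val)
```

In `mi_val`, the imputation has already used information from rows that later land in the test part, which is the mechanism the abstract associates with optimistic bias. In `val_mi`, the training and test parts never share imputation information. The toy setup is too small to reproduce the biases reported in the paper and is meant only to make the order of operations concrete.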

PUBLICATION RECORD

Publication year
2016
Venue
BMC Medical Research Methodology
Publication date
2016-10-26
Fields of study
Medicine, Computer Science
Identifiers
DOI 10.1186/s12874-016-0239-7 PMID 27782817 PMCID 5080703
Source metadata
Semantic Scholar, PubMed

REFERENCES

Multiple Imputation For Nonresponse In Surveys
2016influential reference
Pitfalls of hypothesis tests and model selection on bootstrap samples: Causes and consequences in biometrical applications
2016cited by this paper
A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization
2015cited by this paper
The Net Reclassification Index (NRI): A Misleading Measure of Prediction Improvement Even with Independent Test Data Sets
2015cited by this paper
The estimation and use of predictions for the assessment of model performance using large samples with multiply imputed data
2015influential reference
Correcting for Optimistic Prediction in Small Data Sets
2014cited by this paper
R: A language and environment for statistical computing.
2014cited by this paper
Validation of prediction models based on lasso regression with multiply imputed data
2014cited by this paper
Impute vs. Ignore: Missing values for prediction
2013cited by this paper
Myeloperoxidase is associated with incident coronary heart disease independently of traditional risk factors: results from the MONICA/KORA Augsburg study
2012cited by this paper
Multiple Imputation Using SAS Software
2011cited by this paper
Effect of Serum 25-Hydroxyvitamin D on Risk for Type 2 Diabetes May Be Partially Mediated by Subclinical Inflammation
2011cited by this paper
Calculating Confidence Intervals for Prediction Error in Microarray Classification Using Resampling
2011cited by this paper
Immunological and Cardiometabolic Risk Factors in the Prediction of Type 2 Diabetes and Coronary Events: MONICA/KORA Augsburg Case-Cohort Study
2011cited by this paper
MICE: Multivariate Imputation by Chained Equations in R
2011cited by this paper
PredictABEL: an R package for the assessment of risk prediction models
2011cited by this paper
pROC: an open-source package for R and S+ to analyze and compare ROC curves
2011cited by this paper
Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers
2011influential reference
State of the Multiple Imputation Software.
2011cited by this paper
Assessing the Performance of Prediction Models: A Framework for Traditional and Novel Measures
2010cited by this paper
Development and validation of a prediction model with missing predictor data: a practical approach.
2010cited by this paper
Improvement of risk prediction by genomic profiling: reclassification measures versus the area under the receiver operating characteristic curve.
2010cited by this paper
Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study
2010cited by this paper
The search for stable prognostic models in multiple imputed data sets
2010cited by this paper
Testing the prediction error difference between 2 predictors.
2009cited by this paper
Computation of Multivariate Normal and t Probabilities
2009cited by this paper
Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
2008cited by this paper
Classifier performance prediction for computer-aided diagnosis using a limited dataset.
2008cited by this paper
Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond
2008cited by this paper
The Performance of Risk Prediction Models
2008cited by this paper
Incomplete Data in Epidemiology and Medical Statistics
2007cited by this paper
Evaluating Prediction Rules for t-Year Survivors With Censored Regression Models
2007cited by this paper
Regression with missing Ys: An improved strategy for analyzing multiply imputed data
2007cited by this paper
Variable Selection under Multiple Imputation Using the Bootstrap in a Prognostic Study
2007cited by this paper
Using the outcome for imputation of missing predictor values was preferred
2006cited by this paper
Prognostic/Clinical Prediction Models: Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors
2005influential reference
KORA - A Research Platform for Population Based Health Research
2005cited by this paper
A Comparison of Nonparametric Error Rate Estimation Methods in Classification Problems
2004cited by this paper
Is cross-validation valid for small-sample microarray classification?
2004cited by this paper
A multivariate technique for multiply imputing missing values using a sequence of regression models
2001cited by this paper
Internal validation of predictive models: efficiency of some procedures for logistic regression analysis.
2001cited by this paper
Time‐Dependent ROC Curves for Censored Survival Data and a Diagnostic Marker
2000cited by this paper
Multiple imputation of missing blood pressure covariates in survival analysis.
1999cited by this paper
Improvements on Cross-Validation: The 632+ Bootstrap Method
1997cited by this paper
Bootstrap for Imputed Survey Data
1996cited by this paper
Linear Combinations of Multiple Diagnostic Markers
1993cited by this paper
Validation techniques for logistic regression models.
1991cited by this paper
Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation
1983cited by this paper
Evaluating the yield of medical tests.
1982cited by this paper
VERIFICATION OF FORECASTS EXPRESSED IN TERMS OF PROBABILITY
1950cited by this paper

CITED BY

Development of a nomogram to predict the need for more than one Nuss bar in pectus excavatum.
2026cites this paper
Identification of key cardiovascular disease predictive factors from the China Health and Retirement Longitudinal Study dataset using machine learning-based algorithms.
2026cites this paper
Universal vs selective tranexamic acid use for preventing blood loss after cesarean delivery: a reanalysis of the TRAAP-2 trial.
2026cites this paper
Missing Values in Empirical Research: Theory and Practice. Part 39 of a Series on the Evaluation of Scientific Publications.
2026cites this paper
Nailfold video capillaroscopy predicts severe progression at three years in systemic sclerosis: Results from SCLEROCAP study.
2025cites this paper
Prediction of chronic limb threatening ischemia by clinical data: the PREDICCMI study.
2025cites this paper
Combining multiple imputation with internal model validation in clinical prediction modeling: a systematic methodological review.
2025cites this paper
A comparison of modeling approaches for static and dynamic prediction of central-line bloodstream infections using electronic health records (part 1): regression models
2025cites this paper
LASSO-Based Survival Prediction Modeling with Multiply Imputed Data: A Case Study in Tuberculosis Mortality Prediction
2025cites this paper
Data Imputation Based on Retrieval-Augmented Generation
2025cites this paper
Preoperative Chest CT Myosteatosis Indicates Worse Postoperative Survival in Stage 0-IIB Non-Small Cell Lung Cancer.
2025cites this paper
Predictors of Secukinumab Treatment Response and Continuation in Axial Spondyloarthritis: Results From the EuroSpA Research Collaboration Network
2025cites this paper
Comparison of the Cox proportional hazards model and Random Survival Forest algorithm for predicting patient-specific survival probabilities in clinical trial data
2025cites this paper
Problem of pain in the USA: evaluating the generalisability of high-impact chronic pain models over time using National Health Interview Survey (NHIS) data
2025cites this paper
Real-world analysis of immune checkpoint inhibitor efficacy and response predictors in patients treated at the CCCMunichLMU outpatient clinic
2025cites this paper
Risk prediction in people with acute myocardial infarction in England: a cohort study using data from 1521 general practices
2025cites this paper
An evaluation of synthetic data augmentation for mitigating covariate bias in health data
2024cites this paper
Combining Missing Data Imputation and Internal Validation in Clinical Risk Prediction Models
2024cites this paper
Handling missing data and measurement error for early-onset myopia risk prediction models
2024cites this paper
Developing clinical prediction models: a step-by-step guide
2024cites this paper
Comparing DAPSA, DAPSA28, and DAS28‐CRP in Patients With Psoriatic Arthritis Initiating a First Tumor Necrosis Factor Inhibitor Across Nine European Countries
2024cites this paper
A comparison of regression models for static and dynamic prediction of a prognostic outcome during admission in electronic health care records
2024cites this paper
Development and validation of machine learning models to predict frailty risk for elderly.
2024cites this paper
A prediction model for differentiating recurrent Kawasaki disease from other febrile illnesses.
2024cites this paper
Development of a gastric cancer risk calculator for questionnaire-based surveillance of Iranian dyspeptic patients
2024cites this paper
Maternal Dietary Diversity and Birth Weight in Offspring: Evidence from a Chinese Population-Based Study
2023cites this paper
Selecting, optimizing and externally validating a preexisting machine-learning regression algorithm for estimating waist circumference
2023cites this paper
Clinical Characteristics of Primary Snoring vs Mild Obstructive Sleep Apnea in Children: Analysis of the Pediatric Adenotonsillectomy for Snoring (PATS) Randomized Clinical Trial.
2023cites this paper
Developing and externally validating multinomial prediction models for methotrexate treatment outcomes in patients with rheumatoid arthritis: Results from an international collaboration.
2023cites this paper
Linear classification methods for multivariate repeated measures data — A simulation study
2023cites this paper
Risk Factors for Early Onset Sporadic Colorectal Cancer in Male Veterans.
2023cites this paper
Risk assessment for major adverse cardiovascular events after noncardiac surgery using self-reported functional capacity: international prospective cohort study.
2023cites this paper
Are social determinants of health associated with the development of early complications among young adults with type 2 diabetes? A population based study using linked databases.
2023cites this paper
Improving Cardiovascular Disease Prediction Using Automated Coronary Artery Calcium Scoring from Existing Chest CTs
2022cites this paper
Development and external validation of a new clinical prediction model for early recognition of sepsis in adult patients in primary care: a diagnostic study
2022cites this paper
New clinical prediction model for early recognition of sepsis in adult primary care patients: a prospective diagnostic cohort study of development and external validation
2022cites this paper
Gastric cancer biomarker analysis in patients treated with different adjuvant chemotherapy regimens within SAMIT, a phase III randomized controlled trial
2022cites this paper
Development and validation of models for predicting the overall survival and cancer-specific survival of patients with primary vaginal cancer: A population-based retrospective cohort study
2022cites this paper
Handling Missing Data in Clinical Research.
2022cites this paper
Predicting the Risk of Human Immunodeficiency Virus Type 1 (HIV-1) Acquisition in Rural South Africa Using Geospatial Data
2022cites this paper
Comparing linear discriminant analysis and supervised learning algorithms for binary classification—A method comparison study
2022cites this paper
Exploratory analysis of pre and postoperative risk stratification tools to identify acute kidney and myocardial injury in patients undergoing surgery for chronic subdural haematoma
2021cites this paper
Clinical and Epidemiological Features of Acute Zika Virus Infections in León, Nicaragua.
2021cites this paper
Development and external validation of an admission risk prediction model after treatment from early intervention in psychosis services
2021cites this paper
Prediction of Persistent Pain Severity and Impact 12 Months After Breast Surgery Using Comprehensive Preoperative Assessment of Biopsychosocial Pain Modulators
2021cites this paper
The development and validation of prognostic models for overall survival in the presence of missing data in the training dataset: a strategy with a detailed example
2021cites this paper
Development and internal validation of a clinical prediction model for 90-day mortality after lung resection: the RESECT-90 score.
2021cites this paper
A survey on missing data in machine learning
2021cites this paper
Optimising specialist geriatric medicine services by telehealth
2021cites this paper
Complement C3 identified as a unique risk factor for disease severity among young COVID-19 patients in Wuhan, China
2021cites this paper
Hacia un modelo predictivo de carácter preventivo del riesgo de infección por COVID-19
2021cites this paper
Development and validation of prediction model to estimate 10-year risk of all-cause mortality using modern statistical learning methods: a large population-based cohort study and external validation
2021cites this paper
A clinical prediction model for unsuccessful pulmonary tuberculosis treatment outcomes.
2021cites this paper
Developing more generalizable prediction models from pooled studies and large clustered data sets
2021cites this paper
Calibration of prediction rules for life-time outcomes using prognostic Cox regression survival models and multiple imputations to account for missing predictor data with cross-validatory assessment
2021cites this paper
Vaccination coverage estimation in Mexico in children under five years old: Trends and associated factors
2021cites this paper
Biomarker-defined pathways for incident type 2 diabetes and coronary heart disease—a comparison in the MONICA/KORA study
2020cites this paper
Student Academic Performance Prediction with Recurrent Neural Network
2020cites this paper
Adaptive sample size determination for the development of clinical prediction models
2020cites this paper
Complement C3 identified as a unique Risk Factor for Disease Severity among Young COVID-19 Patients in Wuhan
2020cites this paper
A Panel of 6 Biomarkers Significantly Improves the Prediction of Type 2 Diabetes in the MONICA/KORA Study Population
2020cites this paper
Handling missing predictor values when validating and applying a prediction model to new patients
2020cites this paper
Deep Learning for Improved Risk Prediction in Surgical Outcomes
2020cites this paper
Development and evaluation of an osteoarthritis risk model for integration into primary care health information technology
2020cites this paper
Editorial: Caring for Those Who Are Neglected and Forgotten: Psychiatry in Prison Environments
2020cites this paper
Construction and assessment of prediction rules for binary outcome in the presence of missing predictor data using multiple imputation and cross‐validation: Methodological approach and data‐based evaluation
2020cites this paper
Methodological considerations when analysing and interpreting real-world data.
2020cites this paper
Development and Validation of a Nomogram for Predicting the Disease Progression of Nonsevere Coronavirus Disease 2019
2020cites this paper
Guidelines for the emergency department management of traumatic brain injury: an impact assessment and development of a prognostic model to inform hospital admission decisions
2019cites this paper
Comparison between EM Algorithm and Multiple Imputation on Predicting Children’s Weight at School Entry
2019cites this paper
Risk prediction of cervical abnormalities: The value of sociodemographic and lifestyle factors in addition to HPV status.
2019cites this paper
Development of a Clinical Decision Rule for the Early Safe Discharge of Patients with Mild Traumatic Brain Injury and Findings on Computed Tomography Brain Scan: A Retrospective Cohort Study
2019cites this paper
Comparison of Machine Learning Techniques for Prediction of Hospitalization in Heart Failure Patients
2019cites this paper
Nonparametric Regression Estimates Based on Imputation Techniques for Right-Censored Data
2019cites this paper
Identifying Violent Behavior Using the Oxford Mental Illness and Violence Tool in a Psychiatric Ward of a German Prison Hospital
2019cites this paper
Methodological considerations when analysing and interpreting real-world data
2019cites this paper
Incretin-Based Therapies and Diabetic Retinopathy: Real-World Evidence in Older U.S. Adults
2018cites this paper
Construction and assessment of prediction rules for binary outcome in the presence of missing predictor data using multiple imputation: theoretical perspective and data-based evaluation
2018cites this paper
Derivation and external validation of a clinical version of the German Diabetes Risk Score (GDRS) including measures of HbA1c
2018influential citation
Predicting the risk of apparent treatment-resistant hypertension: a longitudinal, cohort study in an urban hypertension referral clinic.
2018cites this paper
Circulating Levels of Interleukin 1-Receptor Antagonist and Risk of Cardiovascular Disease: Meta-Analysis of Six Population-Based Cohorts
2017cites this paper
Technical review: performance of existing imputation methods for missing data in SVM ensemble creation
2017cites this paper
Ultra-sensitive troponin I is an independent predictor of incident coronary heart disease in the general population
2017cites this paper
SVM ENSEMBLE CREATION
2017cites this paper
Improving cardiovascular risk prediction through more accurate and alternative methods of blood pressure measurement
2017cites this paper
Erratum to: Assessment of predictive performance in incomplete data by combining internal validation and multiple imputation
2016cites this paper
Clinical and Epidemiological Features of Acute Zika Virus Infections in León, Nicaragua
year unknowncites this paper