Develop and Validate A Fair Machine Learning Model to Identify Patients with High Care-Continuity in Electronic Health Records Data

Y. Lee, Tiange Tang, Yu Huang, J. Bian, Lizheng Shi, Jingchuan Guo

Published 2025 in medRxiv

ABSTRACT

Objectives: Electronic health record (EHR) data often miss care delivered outside a given health system, resulting in data discontinuity. We aimed to (1) quantify misclassification across levels of EHR data discontinuity and identify an optimal continuity threshold, (2) develop a machine learning (ML) model to predict EHR continuity and optimize fairness across racial and ethnic groups, and (3) externally validate the EHR continuity prediction model using an independent dataset.

Materials and Methods: We used linked OneFlorida+ EHR-Medicaid claims data for model development and REACHnet EHR-Louisiana Blue Cross Blue Shield (LABlue) claims data for external validation. A novel Harmonized Encounter Proportion Score (HEPS) was applied to quantify patient-level EHR data continuity and its impact on misclassification of 42 clinical variables. ML models were trained using routinely available demographic, clinical, and healthcare utilization features derived from structured EHR data.

Results: Higher EHR data continuity was associated with lower rates of misclassification. A HEPS threshold of approximately 30% effectively distinguished patients with sufficient data continuity. ML models demonstrated strong performance in predicting high continuity (AUROC=0.77). Fairness assessments showed bias against the Hispanic group, which was substantially improved following bias mitigation procedures. Model performance remained robust and fair in external validation.

Discussion: Our study offers a practical metric for quantifying care continuity in EHR networks. The current ML model, which incorporates routinely collected EHR information, can accurately identify patients with high care continuity.

Conclusions: We developed a generalizable care-continuity classification tool that can be easily applied across EHR systems, strengthening the rigor of EHR-based research.
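To make the thresholding and fairness steps in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes HEPS values are already available as proportions between 0 and 1 (the abstract does not give the HEPS formula), uses the approximately 30% threshold reported above to label high continuity, and compares true positive rates across groups as one possible fairness check. The abstract does not state which fairness metric or mitigation procedure was used, so the equal-opportunity gap here is an assumption; the function names, group labels, and toy data are all hypothetical.

```python
from collections import defaultdict

# Approximate threshold reported in the abstract (HEPS of about 30%).
HEPS_THRESHOLD = 0.30


def label_high_continuity(heps_scores, threshold=HEPS_THRESHOLD):
    """Return 1 for patients whose HEPS meets the threshold, else 0."""
    return [1 if score >= threshold else 0 for score in heps_scores]


def true_positive_rate_by_group(y_true, y_pred, groups):
    """Per-group TPR: share of truly high-continuity patients predicted as such."""
    positives = defaultdict(int)
    hits = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            hits[group] += int(pred == 1)
    return {group: hits[group] / positives[group] for group in positives}


def equal_opportunity_gap(tpr_by_group):
    """Largest difference in TPR between any two groups (0 means parity)."""
    rates = tpr_by_group.values()
    return max(rates) - min(rates)


# Hypothetical toy data: HEPS values, group labels, and model predictions.
heps = [0.10, 0.35, 0.50, 0.29, 0.80, 0.45, 0.31, 0.05]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
y_pred = [0, 1, 1, 0, 1, 0, 0, 0]

y_true = label_high_continuity(heps)
tpr = true_positive_rate_by_group(y_true, y_pred, groups)
gap = equal_opportunity_gap(tpr)
print(tpr, gap)
```

In this toy example the hypothetical classifier recovers every high-continuity patient in group A but only one of three in group B, so the gap is large; a gap of this kind is the sort of disparity a bias mitigation step would aim to reduce.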

PUBLICATION RECORD

Publication year
2025
Venue
medRxiv
Publication date
2025-11-13
Fields of study
Medicine, Computer Science
Identifiers
DOI 10.1101/2025.11.11.25339938 PMID 41292658 PMCID 12642717
Source metadata
Semantic Scholar, PubMed

REFERENCES

The impact of electronic health records (EHR) data continuity on prediction model fairness and racial-ethnic disparities
2023, cited by this paper
Data Quality in Electronic Health Record Research: An Approach for Validation and Quantitative Bias Analysis for Imperfectly Ascertained Health Outcomes Via Diagnostic Codes
2022, cited by this paper
The OneFlorida Data Trust: a centralized, translational research data infrastructure of statewide scope
2021, cited by this paper
External Validation of an Algorithm to Identify Patients with High Data-Completeness in Electronic Health Records for Comparative Effectiveness Research
2020, cited by this paper
Dissecting racial bias in an algorithm used to manage the health of populations
2019, cited by this paper
Identifying Patients With High Data Completeness to Improve Validity of Comparative Effectiveness Research in Electronic Health Records Data
2018, influential reference
Biases in electronic health record data due to processes within the healthcare system: retrospective observational study
2018, cited by this paper
Biases introduced by filtering electronic health records for patients with “complete data”
2017, cited by this paper
Out-of-system Care and Recording of Patient Characteristics Critical for Comparative Effectiveness Research
2017, cited by this paper
A Unified Approach to Interpreting Model Predictions
2017, cited by this paper
Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments
2016, cited by this paper
XGBoost: A Scalable Tree Boosting System
2016, cited by this paper
Equality of Opportunity in Supervised Learning
2016, cited by this paper
Building Data Infrastructure to Evaluate and Improve Quality: PCORnet.
2015, cited by this paper
Building electronic data infrastructure for comparative effectiveness research: accomplishments, lessons learned and future steps.
2014, cited by this paper
Metrics for covariate balance in cohort studies of causal effects
2014, cited by this paper
Launching PCORnet, a national patient-centered clinical research network
2014, cited by this paper
A combined comorbidity score predicted mortality in elderly patients better than existing scores.
2011, cited by this paper
Fairness through awareness
2011, cited by this paper
Chronic conditions
2009, cited by this paper
Building Classifiers with Independency Constraints
2009, cited by this paper
Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples
2009, influential reference
Use of health care databases in pharmacoepidemiology.
2006, cited by this paper
Validating recommendations for coronary angiography following acute myocardial infarction in the elderly: a matched analysis using propensity scores.
2001, influential reference
