Improving drug identification in overdose death surveillance by using clinical natural language processing models.

Arthur Funnell, P. Petousis, Fabrice Harel-Canada, Ruby Romero, Alex A. T. Bui, A. Koncsol, Hritika Chaturvedi, Chelsea L. Shover, David Goodman-Meza

Published 2026 in Journal of Forensic Sciences

ABSTRACT

The rising rate of drug-related deaths in the United States, largely driven by fentanyl, requires timely and accurate surveillance. However, critical overdose data are often buried in free-text coroner reports, leading to delays and information loss when the text is coded into International Classification of Diseases, 10th Revision (ICD-10) codes. Natural language processing (NLP) models may automate and enhance overdose surveillance, but prior applications have been limited. A dataset of 35,433 death records from multiple US jurisdictions in 2020 was used for model training and internal testing. External validation was conducted using a novel, separate dataset of 3,335 records from 2023 to 2024. Multiple NLP approaches were evaluated for classifying specific drug involvement from unstructured death certificate text. These included traditional single- and multi-label classifiers, fine-tuned encoder-only language models such as Bidirectional Encoder Representations from Transformers (BERT) and BioClinicalBERT, and contemporary decoder-only large language models (LLMs) such as Qwen 3 and Llama 3. Model performance was assessed using macro-averaged F1 scores, and 95% confidence intervals were calculated to quantify uncertainty. Fine-tuned BioClinicalBERT models achieved near-perfect performance, with macro F1 scores ≥0.998 on the internal test set. External validation confirmed robustness (macro F1 = 0.966), with these models outperforming conventional machine learning, general-domain BERT models, and various decoder-only LLMs. NLP models, particularly fine-tuned clinical variants like BioClinicalBERT, offer a highly accurate and scalable solution for overdose death classification from free-text reports. These methods can significantly accelerate surveillance workflows, overcoming the limitations of manual ICD-10 coding and supporting near real-time detection of emerging substance use trends.
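
The evaluation metric named in the abstract, a macro-averaged F1 score with a 95% confidence interval, can be illustrated with a minimal sketch. The abstract does not say how the intervals were calculated; the percentile bootstrap used here is an assumption, as are the two label columns (fentanyl, methamphetamine), the toy records, and every function name. This is not the authors' code.

```python
import random

def f1_for_label(y_true, y_pred):
    """F1 for one binary label: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(true_rows, pred_rows):
    """Unweighted mean of the per-label F1 scores (multi-label setting)."""
    n_labels = len(true_rows[0])
    scores = [
        f1_for_label([r[j] for r in true_rows], [r[j] for r in pred_rows])
        for j in range(n_labels)
    ]
    return sum(scores) / n_labels

def bootstrap_ci(true_rows, pred_rows, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (assumed method): resample records
    with replacement and recompute macro F1 on each resample."""
    rng = random.Random(seed)
    n = len(true_rows)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([true_rows[i] for i in idx],
                              [pred_rows[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

# Hypothetical toy records; columns are (fentanyl, methamphetamine).
true_rows = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]]
pred_rows = [[1, 0], [1, 0], [0, 1], [0, 0], [1, 0], [1, 1]]

point = macro_f1(true_rows, pred_rows)
ci_low, ci_high = bootstrap_ci(true_rows, pred_rows)
print(f"macro F1 = {point:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

Macro averaging gives each drug label equal weight regardless of how often it occurs, so a rarely reported substance counts as much toward the score as fentanyl does.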
