A Quantitative and Qualitative Evaluation of Sentence Boundary Detection for the Clinical Domain

Denis R. Griffis, Chaitanya P. Shivade, E. Fosler-Lussier, A. Lai

Published 2016 in Summit on Clinical Research Informatics

ABSTRACT

Sentence boundary detection (SBD) is a critical preprocessing task for many natural language processing (NLP) applications. However, there has been little work on evaluating how well existing methods for SBD perform in the clinical domain. We evaluate five popular off-the-shelf NLP toolkits on the task of SBD in various kinds of text using a diverse set of corpora, including the GENIA corpus of biomedical abstracts, a corpus of clinical notes used in the 2010 i2b2 shared task, and two general-domain corpora (the British National Corpus and Switchboard). We find that, with the exception of the cTAKES system, the toolkits we evaluate perform noticeably worse on clinical text than on general-domain text. We identify and discuss major classes of errors, and suggest directions for future work to improve SBD methods in the clinical domain. We also make the code used for SBD evaluation in this paper available for download at http://github.com/drgriffis/SBD-Evaluation.
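The abstract does not say which metric the evaluation uses. As a minimal sketch, assume a boundary-level comparison: each toolkit's output is reduced to the character offsets where sentences end, and those offsets are scored against gold offsets with precision, recall, and F1. The function names, the naive punctuation splitter, and the sample note text below are hypothetical illustrations and are not taken from the paper or its repository.

```python
def boundary_prf(gold, predicted):
    """Score predicted sentence-end offsets against gold offsets.

    Assumption: a prediction counts as correct only on an exact offset match.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def naive_split_offsets(text):
    """Hypothetical baseline: place a boundary after '.', '?' or '!'
    whenever it is followed by whitespace or the end of the text."""
    offsets = []
    for i, ch in enumerate(text):
        if ch in ".?!" and (i + 1 == len(text) or text[i + 1].isspace()):
            offsets.append(i + 1)
    return offsets


# Made-up clinical-style snippet; the abbreviation "Pt." is not a sentence end.
note = "Pt. denies pain. Continue meds."
gold_offsets = [16, 31]
predicted_offsets = naive_split_offsets(note)
precision, recall, f1 = boundary_prf(gold_offsets, predicted_offsets)
```

In this sketch the naive splitter finds both true boundaries but also splits after the abbreviation "Pt.", so recall stays perfect while precision drops. Abbreviations are a common source of such false boundaries in clinical notes.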

PUBLICATION RECORD

Publication year: 2016
Venue: Summit on Clinical Research Informatics
Publication date: 2016-07-20
Fields of study: Medicine, Computer Science
Identifiers: PMID 27570656, PMCID 5001746
Source metadata: Semantic Scholar, PubMed

REFERENCES

Advances in natural language processing (2015)
Identifying Temporal Information and Tracking Sentiment in Cancer Patients' Interviews (2015)
Detection of sentence boundaries and abbreviations in clinical narratives (2015)
The CoNLL-2015 Shared Task on Shallow Discourse Parsing (2015)
A Compositional Interpretation of Biomedical Event Factuality (2015)
Risk factor detection for heart disease by applying text analytics in electronic medical records (2015)
The Stanford CoreNLP Natural Language Processing Toolkit (2014)
Evaluating temporal relations in clinical text: 2012 i2b2 Challenge (2013)
Normalization and standardization of electronic health records for high-throughput phenotyping: the SHARPn consortium (2013)
Hybrid Text Segmentation for Hungarian Clinical Records (2013, influential reference)
Conversational Speech Transcription Using Context-Dependent Deep Neural Networks (2012)
Biomedical Text Mining: A Survey of Recent Progress (2012)
Mining electronic health records: towards better research applications and clinical care (2012)
*SEM 2012 Shared Task: Resolving the Scope and Focus of Negation (2012)
Sentence Boundary Detection: A Long Solved Problem? (2012, influential reference)
Evaluating the state of the art in coreference resolution for electronic medical records (2012)
Data from clinical notes: a perspective on the tension between structure and flexible documentation (2011)
2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text (2011)
Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications (2010)
Lancet: a high precision medication event extraction system for clinical text (2010)
Sentence Boundary Detection and the Problem with the U.S. (2009)
Speaker segmentation and clustering (2008)
Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research (2008)
LingPipe for 99.99% Recall of Gene Mentions (2007)
BANNER: An Executable Survey of Advances in Biomedical Named Entity Recognition (2007)
Advances on natural language processing (2007)
Unsupervised Multilingual Sentence Boundary Detection (2006)
Automatically Adapting an NLP Core Engine to the Biology Domain (2006)
GENIA corpus - a semantically annotated corpus for bio-textmining (2003)
Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program (2001)
Sentence boundary detection: a comparison of paradigms for improving MT quality (2001)
Tagging Sentence Boundaries (2000)
Automatic detection of sentence boundaries and disfluencies based on recognized words (1998)
A Maximum Entropy Approach to Identifying Sentence Boundaries (1997)
Corpus Annotation: Linguistic Information from Computer Text Corpora (1997)
Adaptive Multilingual Sentence Boundary Disambiguation (1997)
Bracketing Guidelines For Treebank II Style Penn Treebank Project (1995)
Lexical methods for managing variation in biomedical terminologies (1994)
CLAWS4: The Tagging of the British National Corpus (1994)
Building a Large Annotated Corpus of English: The Penn Treebank (1993)
Some Applications of Tree-based Modelling to Speech and Language (1989)

CITED BY

Language Models for Standardising Clinical Notes and Information Extraction in Addiction Psychiatry-An Empirical Study (2025)
Improving Clinical Report Classification with Sentence Boundary Detection (2025)
Multi-task transfer learning for the prediction of entity modifiers in clinical text: application to opioid use disorder case detection (2024)
Automatic sentence segmentation of clinical record narratives in real-world data (2024)
Legal sentence boundary detection using hybrid deep learning and statistical models (2024)
Generalizability of machine learning methods in detecting adverse drug events from clinical narratives in electronic medical records (2023, influential citation)
Combining unsupervised, supervised and rule-based learning: the case of detecting patient allergies in electronic health records (2023)
The evaluation of a semi-automatic authoring tool for knowledge extraction in the AC&NL Tutor (2023)
MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset (2023)
Estimating Sentence-like Structure in Synthetic Languages Using Information Topology (2022)
A survey on syntactic processing techniques (2022)
Exploring optimal granularity for extractive summarization of unstructured health records: Analysis of the largest multi-institutional archive of health records in Japan (2022)
The h-ANN Model: Comprehensive Colonoscopy Concept Compilation using Combined Contextual Embeddings (2022)
TAX-Corpus: Taxonomy based Annotations for Colonoscopy Evaluation (2022)
Hybrid Ensemble-Rule Algorithm for Improved MEDLINE® Sentence Boundary Detection (2021)
Reducing Physicians' Cognitive Load During Chart Review: A Problem-Oriented Summary of the Patient Electronic Record (2021)
Patient Triage by Topic Modeling of Referral Letters: Feasibility Study (2020)
FinSBD-2020: The 2nd Shared Task on Sentence Boundary Detection in Unstructured Text in the Financial Domain (2020)
Automated Coding of Under-Studied Medical Concept Domains: Linking Physical Activity Reports to the International Classification of Functioning, Disability, and Health (2020)
Adverse Drug Events Detection in Clinical Notes by Jointly Modeling Entities and Relations Using Neural Networks (2019)
Cohort Selection for Clinical Trials From Longitudinal Patient Records: Text Mining Approach (2019)
Cohort Selection From Longitudinal Patient Records: Text Mining Approach (2019)
Writing habits and telltale neighbors: analyzing clinical concept usage patterns with sublanguage embeddings (2019)
Recurrent Deep Network Models for Clinical NLP Tasks: Use Case with Sentence Boundary Disambiguation (2019)
Sentence Boundary Detection in Legal Text (2019)
PolyU_CBS-CFA at the FinSBD Task: Sentence Boundary Detection of Financial Data with Domain Knowledge Enhancement and Bilingual Training (2019)
AIG Investments.AI at the FinSBD Task: Sentence Boundary Detection through Sequence Labelling and BERT Fine-tuning (2019)
The FinSBD-2019 Shared Task: Sentence Boundary Detection in PDF Noisy Text in the Financial Domain (2019)
Deep Neural Architectures for Discourse Segmentation in E-Mail Based Behavioral Interventions (2019)
Application of Machine Learning Techniques in Clinical Information Extraction (2019)
Détection automatique de phrases en domaine de spécialité en français (Sentence boundary detection for specialized domains in French) (2018)
Rule-Based Method for Automatic Medical Concept Extraction from Unstructured Clinical Text (2018)
A Corpus of Corporate Annual and Social Responsibility Reports: 280 Million Tokens of Balanced Organizational Writing (2018)
CLAMP – a toolkit for efficiently building customized clinical natural language processing pipelines (2017)
Unsupervised Abbreviation Detection in Clinical Narratives (2016)