More accurate tests for the statistical significance of result differences

Published 2000 in International Conference on Computational Linguistics

ABSTRACT

Statistical significance testing of differences in values of metrics like recall, precision and balanced F-score is a necessary part of empirical natural language processing. Unfortunately, we find in a set of experiments that many commonly used tests often underestimate the significance and so are less likely to detect differences that exist between different techniques. This underestimation comes from an independence assumption that is often violated. We point out some useful tests that do not make this assumption, including computationally intensive randomization tests.
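
The kind of randomization test the abstract mentions can be illustrated with a minimal sketch. It shows one common form, a paired approximate randomization test on the difference in balanced F-score between two systems run on the same test items, and is not necessarily the exact procedure used in the paper. The function names, the per-item (tp, fp, fn) count layout, and the trial count are all assumptions made for illustration. Because each trial swaps the two systems' outputs item by item, the test does not assume that the two result sets are independent.

```python
import random


def f_score(tp, fp, fn):
    """Balanced F-score from raw counts; 0.0 when it is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def corpus_f(counts):
    """F-score over a whole test set given per-item (tp, fp, fn) tuples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f_score(tp, fp, fn)


def randomization_test(counts_a, counts_b, trials=10000, seed=0):
    """Two-sided paired approximate randomization test (hypothetical helper).

    counts_a[i] and counts_b[i] are the (tp, fp, fn) counts of systems
    A and B on the same test item i. Returns an estimated p-value for the
    null hypothesis that the two systems are interchangeable.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both systems must be scored on the same items")
    rng = random.Random(seed)
    observed = abs(corpus_f(counts_a) - corpus_f(counts_b))
    at_least = 0
    for _ in range(trials):
        shuf_a, shuf_b = [], []
        for a, b in zip(counts_a, counts_b):
            # Under the null hypothesis the labels "A" and "B" are
            # exchangeable, so swap each pair with probability 1/2.
            if rng.random() < 0.5:
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        if abs(corpus_f(shuf_a) - corpus_f(shuf_b)) >= observed:
            at_least += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (at_least + 1) / (trials + 1)
```

A call such as randomization_test(counts_a, counts_b) returns a value near 1 when the two systems behave alike and a small value when the observed F-score gap is rarely matched under random swapping. Recomputing the F-score from pooled counts on every trial is what makes the test usable for a metric that is not a simple average of per-item scores.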

PUBLICATION RECORD

Publication year
2000
Venue
International Conference on Computational Linguistics
Publication date
2000-07-31
Fields of study
Computer Science
Identifiers
DOI 10.3115/992730.992783 arXiv cs/0008005
Source metadata
Semantic Scholar

REFERENCES

Empirical Methods for Artificial Intelligence
1995, cited by this paper
Evaluating Message Understanding Systems: An Analysis of the Third Message Understanding Conference (MUC-3)
1993, cited by this paper
Introduction to the Special Issue on Computational Linguistics Using Large Corpora
1993, cited by this paper
Computer Intensive Methods for Testing Hypotheses: An Introduction
1990, cited by this paper
An introduction to mathematical statistics and its applications / Richard J. Larsen, Morris L. Marx
1986, cited by this paper
Statistics for experimenters
1978, cited by this paper
Computer methods for mathematical computations
1977, cited by this paper
STATISTICAL METHODS
1967, cited by this paper
Statistical Methods
1959, cited by this paper

CITED BY

Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models
2026, cites this paper
Robust Arabic tweet NER via label-aware data augmentation and AraBERTv2
2026, cites this paper
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
2025, cites this paper
Annotation and linguistic analysis of claim types for fact-checking
2025, cites this paper
Integrating Semantic Representations in a Cross-Modal Approach to Fact-Checking
2025, cites this paper
Predicting autism from written narratives using deep neural networks
2025, cites this paper
Low-Resource Speech Recognition by Fine-Tuning Whisper with Optuna-LoRA
2025, cites this paper
CourseTimeQA: A Lecture-Video Benchmark and a Latency-Constrained Cross-Modal Fusion Method for Timestamped QA
2025, cites this paper
Extracting Symptoms of Complex Conditions From Online Discourse (Subreddit to Symptomatology): Lexicon-Based Approach
2025, cites this paper
Clinical information extraction for Low-resource languages with Few-shot learning using Pre-trained language models and Prompting
2024, cites this paper
Modeling the Interplay Between Cohesion Dimensions: A Challenge for Group Affective Emergent States
2024, cites this paper
HLC: hierarchically-aware label correlation for hierarchical text classification
2024, cites this paper
ProtEx: A Retrieval-Augmented Approach for Protein Function Prediction
2024, cites this paper
Cross-lingual dependency parsing for a language with a unique script
2024, cites this paper
Findings of the Quality Estimation Shared Task at WMT 2024: Are LLMs Closing the Gap in QE?
2024, cites this paper
Exploiting Distant Supervision to Learn Semantic Descriptions of Tables with Overlapping Data
2024, cites this paper
Examining Sentiment Analysis for Low-Resource Languages with Data Augmentation Techniques
2024, cites this paper
Transferring Sentiment Cross-Lingually within and across Same-Family Languages
2024, cites this paper
New insights into the effects of type and timing of childhood maltreatment on brain morphometry
2024, cites this paper
Pulmonary nodule detection in low dose computed tomography using a medical-to-medical transfer learning approach
2024, influential citation
Assessing Large Language Models for Oncology Data Inference From Radiology Reports
2024, cites this paper
Cross-functional Analysis of Generalization in Behavioral Learning
2023, influential citation
LLMs Accelerate Annotation for Medical Information Extraction
2023, cites this paper
A Survey of MWE Identification Experiments: The Devil is in the Details
2023, cites this paper
Functionality learning through specification instructions
2023, cites this paper
Findings of the WMT 2023 Shared Task on Quality Estimation
2023, cites this paper
End-to-End Learning on Multimodal Knowledge Graphs
2023, cites this paper
Interrelated feature selection from health surveys using domain knowledge graph
2023, cites this paper
Few Labels are Enough! Semi-supervised Graph Learning for Social Interaction
2023, cites this paper
Supervised Text Classification System Detects Fontan Patients in Electronic Records With Higher Accuracy Than ICD Codes
2023, cites this paper
Analyzing the Generalizability of Deep Contextualized Language Representations For Text Classification
2023, cites this paper
Deep-Fuzz: A synergistic integration of deep learning and fuzzy water flows for fine-grained nuclei segmentation in digital pathology
2023, cites this paper
Findings of the WMT 2022 Shared Task on Quality Estimation
2022, influential citation
Training Computational Models of Group Processes without Groundtruth: the Self- vs External Assessment’s Dilemma
2022, cites this paper
Exact Paired Permutation Testing Algorithms for NLP Systems
2022, cites this paper
Comparing neural models for nested and overlapping biomedical event detection
2022, influential citation
Traffic sign extraction using deep hierarchical feature learning and mobile light detection and ranging (LiDAR) data on rural highways
2022, cites this paper
Combined Use of Glucose-Specific Model Identification and Alarm Strategy Based on Prediction-Funnel to Improve Online Forecasting of Hypoglycemic Events
2022, cites this paper
The Web We Weave: Untangling the Social Graph of the IETF
2022, cites this paper
A matrix factorization model with local and global consistency for flow prediction in bike-sharing systems
2022, cites this paper
Identifying Hate Speech Using Neural Networks and Discourse Analysis Techniques
2022, cites this paper
Double Retrieval and Ranking for Accurate Question Answering
2022, cites this paper
Beyond belief: a cross-genre study on perception and validation of health information online
2022, cites this paper
Checking HateCheck: a cross-functional analysis of behaviour-aware learning for hate speech detection
2022, cites this paper
Conventional clustering-based method for event detection on social networks
2022, cites this paper
Exact Paired-Permutation Testing for Structured Test Statistics
2022, cites this paper
Data Centric Domain Adaptation for Historical Text with OCR Errors
2021, cites this paper
Étude de l'influence des représentations textuelles sur la détection d'évènements non supervisée dans des flux de données
2021, cites this paper
Joint Models for Answer Verification in Question Answering Systems
2021, cites this paper
Exploiting the Interplay between Social and Task Dimensions of Cohesion to Predict its Dynamics Leveraging Social Sciences
2021, influential citation
KnowMAN: Weakly Supervised Multinomial Adversarial Networks
2021, cites this paper
A hierarchical and parallel framework for End-to-End Aspect-based Sentiment Analysis
2021, cites this paper
To Scale or Not to Scale: Comparing Popular Sentiment Analysis Dictionaries on Educational Twitter Data
2021, cites this paper
Findings of the WMT 2021 Shared Task on Quality Estimation
2021, cites this paper
Hierarchical BERT with an adaptive fine-tuning strategy for document classification
2021, cites this paper
Task Agnostic and Task Specific Self-Supervised Learning from Speech with LeBenchmark
2021, influential citation
Research and Applications The 2019 n2c2/UMass Lowell shared task on clinical concept normalization
2021, cites this paper
Automatic extraction of 12 cardiovascular concepts from German discharge letters using pre-trained language models
2021, cites this paper
Narrative Text Generation from Abductive Interpretations Using Axiom-Specific Templates
2021, cites this paper
The Impact of Word Embeddings on Neural Dependency Parsing
2021, cites this paper
An Exploratory Computational Study on the Effect of Emergent Leadership on Social and Task Cohesion
2021, cites this paper
Enhancing Siamese Neural Networks Through Expert Knowledge for Predictive Maintenance
2020, cites this paper
Emote-Controlled
2020, cites this paper
A neural network-based joint learning approach for biomedical entity and relation extraction from biomedical literature
2020, cites this paper
How Should Markup Tags Be Translated?
2020, cites this paper
Training for Gibbs Sampling on Conditional Random Fields with Neural Scoring Factors
2020, cites this paper
Two-Level Transformer and Auxiliary Coherence Modeling for Improved Text Segmentation
2020, influential citation
The Gap on GAP: Tackling the Problem of Differing Data Distributions in Bias-Measuring Datasets
2020, cites this paper
Cluster-based mention typing for named entity disambiguation
2020, cites this paper
Comparing Rule-based, Feature-based and Deep Neural Methods for De-identification of Dutch Medical Records
2020, cites this paper
Event Extraction as Machine Reading Comprehension
2020, influential citation
A Systematic Comparison of Architectures for Document-Level Sentiment Classification
2020, cites this paper
Analysing terminology translation errors in statistical and neural machine translation
2020, cites this paper
Bayesian Analysis of Three Methods for Diagnosis of Cystic Echinococcosis in Sheep
2020, cites this paper
Syntax-Informed Interactive Neural Machine Translation
2020, cites this paper
Investigating Query Expansion and Coreference Resolution in Question Answering on BERT
2020, cites this paper
Match²: A Matching over Matching Model for Similar Question Identification
2020, cites this paper
Analyzing ELMo and DistilBERT on Socio-political News Classification
2020, cites this paper
A graph-based method for reconstructing entities from coordination ellipsis in medical text
2020, cites this paper
Classification-Based Self-Learning for Weakly Supervised Bilingual Lexicon Induction
2020, influential citation
Contextual Modulation for Relation-Level Metaphor Identification
2020, influential citation
Non-Linear Instance-Based Cross-Lingual Mapping for Non-Isomorphic Embedding Spaces
2020, cites this paper
Statistical Significance Testing for Natural Language Processing
2020, cites this paper
Obtaining Faithful Interpretations from Compositional Neural Networks
2020, cites this paper
Natural Language Processing and Information Systems: 25th International Conference on Applications of Natural Language to Information Systems, NLDB 2020, Saarbrücken, Germany, June 24–26, 2020, Proceedings
2020, cites this paper
The Effect of Sociocultural Variables on Sarcasm Communication Online
2020, cites this paper
Integrating lexical and prosodic features for automatic paragraph segmentation
2020, cites this paper
Information Search, Integration, and Personalization: 13th International Workshop, ISIP 2019, Heraklion, Greece, May 9–10, 2019, Revised Selected Papers
2020, cites this paper
A Better Set of Object-Oriented Design Metrics for Within-Project Defect Prediction
2020, cites this paper
Modelling Source- and Target- Language Syntactic Information as Conditional Context in Interactive Neural Machine Translation
2020, cites this paper
The RELX Dataset and Matching the Multilingual Blanks for Cross-lingual Relation Classification
2020, cites this paper
How Many Times Should a Pedagogical Agent Simulation Model Be Run?
2019, cites this paper
Modelling disease risk for amyloid A (AA) amyloidosis in non-human primates using machine learning
2019, cites this paper
Research and Applications 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records
2019, cites this paper
Embedding Projection for Targeted Cross-Lingual Sentiment: Model Comparisons and a Real-World Study
2019, influential citation
Combining Discourse Markers and Cross-lingual Embeddings for Synonym–Antonym Classification
2019, influential citation
Automatic Structured Text Summarization with Concept Maps
2019, cites this paper
Cross-lingual Transfer Learning for Japanese Named Entity Recognition
2019, influential citation
Multi-Task Learning for Coherence Modeling
2019, cites this paper
EEG-Based Decoding of Auditory Attention to a Target Instrument in Polyphonic Music
2019, cites this paper