There’s No Comparison: Reference-less Evaluation Metrics in Grammatical Error Correction

Courtney Napoles, Keisuke Sakaguchi, Joel R. Tetreault

Published 2016 in Conference on Empirical Methods in Natural Language Processing

ABSTRACT

Current methods for automatically evaluating grammatical error correction (GEC) systems rely on gold-standard references. However, these methods suffer from penalizing grammatical edits that are correct but not in the gold standard. We show that reference-less grammaticality metrics correlate very strongly with human judgments and are competitive with the leading reference-based evaluation metrics. By interpolating both methods, we achieve state-of-the-art correlation with human judgments. Finally, we show that GEC metrics are much more reliable when they are calculated at the sentence level instead of the corpus level. We have set up a CodaLab site for benchmarking GEC output using a common dataset and different evaluation metrics.
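The interpolation and sentence-level scoring mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the linear form, the weight `lam`, and the function names `interpolate` and `system_score` are all assumptions made for illustration; the sketch also assumes both metrics have already been scaled to a common range such as [0, 1].

```python
from statistics import mean


def interpolate(ref_based: float, ref_less: float, lam: float) -> float:
    """Linearly combine two sentence-level scores (assumed form).

    lam = 0 uses only the reference-based score;
    lam = 1 uses only the reference-less grammaticality score.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * ref_based + lam * ref_less


def system_score(ref_based_scores, ref_less_scores, lam):
    """Score a system at the sentence level.

    Each sentence is scored on its own, and the system score is the
    mean of those sentence scores, rather than a single statistic
    computed once over the whole corpus.
    """
    if len(ref_based_scores) != len(ref_less_scores):
        raise ValueError("score lists must have the same length")
    return mean(
        interpolate(b, g, lam)
        for b, g in zip(ref_based_scores, ref_less_scores)
    )


# Hypothetical scores for two sentences, both metrics on a 0-1 scale.
example = system_score([0.4, 0.6], [0.8, 1.0], lam=0.5)
```

In practice the weight would be tuned against human judgments; the value 0.5 here is only a placeholder.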

PUBLICATION RECORD

Publication year
2016
Venue
Conference on Empirical Methods in Natural Language Processing
Publication date
2016-10-07
Fields of study
Linguistics, Computer Science
Identifiers
DOI: 10.18653/v1/D16-1228; arXiv: 1610.02124
Source metadata
Semantic Scholar
