ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation

Published 2004 in International Conference on Computational Linguistics

ABSTRACT

Automatic evaluation metrics for machine translation are usually compared at the corpus level using correlation statistics, such as Pearson's product-moment correlation coefficient or Spearman's rank-order correlation coefficient, computed between human scores and automatic scores. However, such comparisons rely on human judgments of translation qualities such as adequacy and fluency. Unfortunately, these judgments are often inconsistent and very expensive to acquire. In this paper, we introduce ORANGE, a new method for evaluating automatic machine translation evaluation metrics automatically, with no human involvement beyond a set of reference translations. We also show the results of using ORANGE to compare several existing automatic metrics and three new automatic metrics.
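
As an illustration of the conventional corpus-level comparison the abstract describes (not of ORANGE itself), the sketch below computes Pearson's and Spearman's coefficients between human scores and automatic metric scores. The function names and the score values are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank-order correlation: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical corpus-level scores for five MT systems (illustrative only).
human_scores = [3.2, 2.1, 4.0, 2.8, 3.6]        # e.g. averaged adequacy judgments
metric_scores = [0.31, 0.18, 0.42, 0.30, 0.33]  # e.g. an automatic metric

pearson_r = pearson(human_scores, metric_scores)
spearman_rho = spearman(human_scores, metric_scores)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```

In this made-up example the metric orders the five systems exactly as the human scores do, so Spearman's coefficient is 1 while Pearson's is slightly lower. Both statistics need the human scores as input, which is the dependency the abstract identifies as inconsistent and expensive.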

PUBLICATION RECORD

Publication year: 2004
Venue: International Conference on Computational Linguistics
Publication date: 2004-08-23
Fields of study: Computer Science
Identifiers: DOI 10.3115/1220355.1220427
