Abstract
Human evaluation of machine translation is in an arms race with translation model quality: as our models get better, our evaluation methods need to be improved to ensure that quality gains are not lost in evaluation noise. To this end, we experiment with a two-stage version of the current state-of-the-art translation evaluation paradigm (MQM), which we call MQM re-annotation. In this setup, an MQM annotator reviews and edits a set of pre-existing MQM annotations that may have come from themselves, another human annotator, or an automatic MQM annotation system. We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass.
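To make the two-stage setup concrete, the sketch below models MQM error spans and a re-annotation pass in Python. It is an illustrative assumption, not the paper's actual annotation tooling: the MQMError fields, the severity weights (a common MQM convention of major = 5, minor = 1), and the re_annotate helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical, simplified record for a single MQM error annotation.
@dataclass
class MQMError:
    span: str       # offending text span in the translation
    category: str   # e.g. "Accuracy/Mistranslation"
    severity: str   # "major" or "minor"

# Severity weights follow a common MQM convention (major = 5, minor = 1);
# the paper's exact weighting scheme is not reproduced here.
SEVERITY_WEIGHTS = {"major": 5.0, "minor": 1.0}

def mqm_score(errors: List[MQMError]) -> float:
    """Total error penalty for one segment (lower is better)."""
    return sum(SEVERITY_WEIGHTS[e.severity] for e in errors)

def re_annotate(first_pass: List[MQMError],
                added: List[MQMError],
                removed: List[MQMError]) -> List[MQMError]:
    """Second stage: the re-annotator keeps first-pass errors they agree
    with, drops the ones they judge spurious, and adds errors that were
    missed in the first pass."""
    kept = [e for e in first_pass if e not in removed]
    return kept + added

# Toy example: the re-annotator confirms the existing major error and
# adds one minor error that the first pass missed.
first_pass = [MQMError("gone to", "Accuracy/Mistranslation", "major")]
second_pass = re_annotate(
    first_pass,
    added=[MQMError("the the", "Fluency/Grammar", "minor")],
    removed=[],
)
print(mqm_score(first_pass), mqm_score(second_pass))  # 5.0 6.0
```

In this toy case the second pass increases the segment's total penalty because a missed error is recovered, which mirrors the paper's finding that re-annotation improves quality mainly by surfacing errors overlooked in the first pass.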
Authors (6)
Parker Riley
Daniel Deutsch
Mara Finkelstein
Colten DiIanni
Juraj Juraska
Markus Freitag
Submitted
October 28, 2025
Key Contributions
This paper proposes and evaluates MQM re-annotation, a two-stage human evaluation process for machine translation in which annotators review and edit existing MQM annotations. The study demonstrates that re-annotation yields higher-quality annotations by catching errors missed in the initial pass, reducing evaluation noise and providing a more reliable measure of translation quality.
Business Value
More reliable and accurate evaluation of machine translation systems is crucial for businesses that rely on MT for localization, content creation, and customer support. It ensures that quality improvements are measured accurately and that deployed systems meet business needs.