Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation

Abstract

Large Language Models (LLMs) have advanced machine translation (MT) but remain vulnerable to hallucinations. Unfortunately, existing MT benchmarks fail to expose these failures in multilingual LLMs. To reveal hallucination in multilingual LLMs, we introduce a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark spanning 11 English-to-X translation directions. We employ four frontier LLMs to generate candidate translations, then scrutinize these candidates with an ensemble of LLM judges and expert validation, curating 5,435 high-quality instances. Evaluating 17 LLMs on HalloMTBench reveals distinct "hallucination triggers": unique failure patterns reflecting model scale, sensitivity to source length, linguistic biases, and Reinforcement-Learning (RL)-amplified language mixing. HalloMTBench offers a forward-looking testbed for diagnosing LLM translation failures, and is available at https://huggingface.co/collections/AIDC-AI/marco-mt.
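
The abstract describes a generate-then-filter curation pipeline: four frontier LLMs produce candidate translations, an ensemble of LLM judges flags suspected hallucinations against the taxonomy, and flagged items go to expert validation. The sketch below illustrates that flow under stated assumptions; all model names, label strings, and the `translate`/`judge` functions are placeholders, not the paper's actual implementation.

```python
# Hypothetical sketch of the curation pipeline: candidates from four
# generator LLMs are screened by a majority vote of LLM judges, and
# flagged items would then go to human experts for final validation.
from collections import Counter

GENERATORS = ["model-a", "model-b", "model-c", "model-d"]  # placeholder frontier LLMs
JUDGES = ["judge-1", "judge-2", "judge-3"]                 # placeholder judge ensemble

def translate(model: str, source: str, target_lang: str) -> str:
    """Placeholder: call `model` to translate `source` into `target_lang`."""
    raise NotImplementedError

def judge(model: str, source: str, candidate: str) -> str:
    """Placeholder: label a candidate per the paper's taxonomy, e.g.
    'instruction_detachment', 'source_detachment', or 'ok'."""
    raise NotImplementedError

def curate(source: str, target_lang: str) -> list[dict]:
    """Collect candidates that a majority of judges flag as hallucinated."""
    flagged = []
    for gen in GENERATORS:
        candidate = translate(gen, source, target_lang)
        # Tally the judges' labels and keep the candidate only if a
        # strict majority agrees on a non-'ok' (hallucination) label.
        votes = Counter(judge(j, source, candidate) for j in JUDGES)
        label, count = votes.most_common(1)[0]
        if label != "ok" and count > len(JUDGES) // 2:
            flagged.append({"source": source, "candidate": candidate,
                            "label": label, "generator": gen})
    return flagged
```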
Authors (8)
Xinwei Wu
Heng Liu
Jiang Zhou
Xiaohu Zhao
Linlong Xu
Longyue Wang
+2 more
Submitted
October 28, 2025
arXiv Category
cs.CL

Key Contributions

This paper introduces HalloMTBench, a new benchmark and diagnostic framework for evaluating hallucinations in multilingual LLMs for machine translation. It provides a taxonomy of hallucination types (Instruction Detachment vs. Source Detachment) and identifies key "hallucination triggers" related to model scale, source length, linguistic biases, and RL-amplified language mixing. A minimal loading sketch follows below.
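
The benchmark is published on Hugging Face, so it should be loadable with the `datasets` library. The collection URL above does not pin a specific dataset ID, so the repository name below ("AIDC-AI/HalloMTBench") is an assumption; check the collection page for the actual identifier.

```python
# Minimal sketch of loading HalloMTBench, assuming a hypothetical
# dataset repo ID under the AIDC-AI collection linked above.
from datasets import load_dataset

ds = load_dataset("AIDC-AI/HalloMTBench")  # hypothetical repo ID
print(ds)  # inspect available splits and fields
```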

Business Value

Enables more reliable development and deployment of multilingual translation systems by providing a rigorous method to identify and mitigate critical failure modes like hallucinations.