Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation

Abstract

Large Language Models (LLMs) have advanced machine translation (MT) but remain vulnerable to hallucinations. Unfortunately, existing MT benchmarks fail to expose these failures in multilingual LLMs. To reveal hallucination in multilingual LLMs, we introduce a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark spanning 11 English-to-X translation directions. We employ four frontier LLMs to generate candidate translations, then scrutinize these candidates with an ensemble of LLM judges and expert validation, curating 5,435 high-quality instances. Evaluating 17 LLMs on HalloMTBench reveals distinct "hallucination triggers": unique failure patterns reflecting model scale, sensitivity to source length, linguistic biases, and Reinforcement-Learning (RL)-amplified language mixing. HalloMTBench offers a forward-looking testbed for diagnosing LLM translation failures, and is available at https://huggingface.co/collections/AIDC-AI/marco-mt.
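
The abstract describes a generate-then-filter curation pipeline: four frontier LLMs produce candidate translations, an ensemble of LLM judges flags suspected hallucinations against the taxonomy, and flagged items go to expert validation. The sketch below illustrates that flow under stated assumptions; all model names, label strings, and the `translate`/`judge` functions are placeholders, not the paper's actual implementation.

```python
# Hypothetical sketch of the curation pipeline: candidates from four
# generator LLMs are screened by a majority vote of LLM judges, and
# flagged items would then go to human experts for final validation.
from collections import Counter

GENERATORS = ["model-a", "model-b", "model-c", "model-d"]  # placeholder frontier LLMs
JUDGES = ["judge-1", "judge-2", "judge-3"]                 # placeholder judge ensemble

def translate(model: str, source: str, target_lang: str) -> str:
    """Placeholder: call `model` to translate `source` into `target_lang`."""
    raise NotImplementedError

def judge(model: str, source: str, candidate: str) -> str:
    """Placeholder: label a candidate per the paper's taxonomy, e.g.
    'instruction_detachment', 'source_detachment', or 'ok'."""
    raise NotImplementedError

def curate(source: str, target_lang: str) -> list[dict]:
    """Collect candidates that a majority of judges flag as hallucinated."""
    flagged = []
    for gen in GENERATORS:
        candidate = translate(gen, source, target_lang)
        # Tally the judges' labels and keep the candidate only if a
        # strict majority agrees on a non-'ok' (hallucination) label.
        votes = Counter(judge(j, source, candidate) for j in JUDGES)
        label, count = votes.most_common(1)[0]
        if label != "ok" and count > len(JUDGES) // 2:
            flagged.append({"source": source, "candidate": candidate,
                            "label": label, "generator": gen})
    return flagged
```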
Authors (8)
Xinwei Wu
Heng Liu
Jiang Zhou
Xiaohu Zhao
Linlong Xu
Longyue Wang
+2 more
Submitted
October 28, 2025
arXiv Category
cs.CL

Key Contributions

This paper introduces HalloMTBench, a new benchmark and diagnostic framework for evaluating hallucinations in multilingual LLMs for machine translation. It provides a taxonomy of hallucination types (Instruction Detachment vs. Source Detachment) and identifies key "hallucination triggers" related to model scale, source length, linguistic biases, and RL-amplified language mixing. A minimal loading sketch follows below.
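
The benchmark is published on Hugging Face, so it should be loadable with the `datasets` library. The collection URL above does not pin a specific dataset ID, so the repository name below ("AIDC-AI/HalloMTBench") is an assumption; check the collection page for the actual identifier.

```python
# Minimal sketch of loading HalloMTBench, assuming a hypothetical
# dataset repo ID under the AIDC-AI collection linked above.
from datasets import load_dataset

ds = load_dataset("AIDC-AI/HalloMTBench")  # hypothetical repo ID
print(ds)  # inspect available splits and fields
```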

Business Value

Enables more reliable development and deployment of multilingual translation systems by providing a rigorous method to identify and mitigate critical failure modes like hallucinations.