📄 Abstract
Large Language Models (LLMs) have advanced machine translation but remain vulnerable to hallucinations. Unfortunately, existing MT benchmarks are not capable of exposing such failures in multilingual LLMs. To expose hallucinations in multilingual LLMs, we introduce a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark spanning 11 English-to-X directions. We employ four frontier LLMs to generate translation candidates and scrutinize them with an ensemble of LLM judges and expert validation, curating 5,435 high-quality instances. Evaluating 17 LLMs on HalloMTBench reveals distinct "hallucination triggers" -- unique failure patterns reflecting model scale, source-length sensitivity, linguistic biases, and Reinforcement-Learning (RL)-amplified language mixing. HalloMTBench offers a forward-looking testbed for diagnosing LLM translation failures and is available at https://huggingface.co/collections/AIDC-AI/marco-mt.
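The curation pipeline described above (candidate generation by frontier LLMs, an LLM-judge ensemble, then expert validation) can be pictured roughly as in the sketch below. This is a minimal illustrative sketch only: the function names, the majority-vote rule, and the callable interfaces for judges and reviewers are assumptions for exposition, not the authors' released code.

```python
# Illustrative sketch of the HalloMTBench-style curation loop:
# generator LLMs produce candidates, an ensemble of LLM judges flags
# suspected hallucinations, and flagged items go to expert review.
from dataclasses import dataclass

# Hallucination taxonomy from the paper: detachment from the instruction
# vs. detachment from the source; everything else is treated as faithful.
INSTRUCTION_DETACHMENT = "instruction_detachment"
SOURCE_DETACHMENT = "source_detachment"
FAITHFUL = "faithful"

@dataclass
class Candidate:
    source: str        # English source sentence
    target_lang: str   # one of the 11 English-to-X directions
    translation: str   # candidate produced by a frontier LLM
    producer: str      # which generator model produced it

def judge_ensemble(cand: Candidate, judges) -> str:
    """Collect a label from each LLM judge and keep the majority vote (assumed rule)."""
    votes = [judge(cand) for judge in judges]   # each judge returns one taxonomy label
    return max(set(votes), key=votes.count)

def curate(candidates, judges, expert_review):
    """Keep only hallucinated instances whose label survives human verification."""
    benchmark = []
    for cand in candidates:
        label = judge_ensemble(cand, judges)
        if label == FAITHFUL:
            continue                            # only failure cases enter the benchmark
        if expert_review(cand, label):          # expert validation step
            benchmark.append((cand, label))
    return benchmark
```

In this sketch, `judges` and `expert_review` are plain callables so the filtering logic stays independent of any particular model API.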
Authors (8)
Xinwei Wu
Heng Liu
Jiang Zhou
Xiaohu Zhao
Linlong Xu
Longyue Wang
+2 more
Submitted
October 28, 2025
Key Contributions
This paper introduces HalloMTBench, a human-verified benchmark and diagnostic framework for evaluating hallucinations in multilingual LLM machine translation. It provides a taxonomy separating Instruction Detachment from Source Detachment and identifies key hallucination "triggers" related to model scale, source length, linguistic biases, and RL-amplified language mixing.
Business Value
Enables more reliable development and deployment of multilingual translation systems by providing a rigorous method to identify and mitigate critical failure modes like hallucinations.