📄 Abstract
In this paper, we introduce PolyMath, a multilingual mathematical reasoning
benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our
benchmark ensures difficulty comprehensiveness, language diversity, and
high-quality translation, making it a highly discriminative multilingual
mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive
evaluation of advanced LLMs and find that even Qwen-3-235B-A22B-Thinking and
Gemini-2.5-pro achieve only 54.6 and 52.2 benchmark scores, respectively, with
about 40% accuracy at the highest difficulty level. From a language perspective, our benchmark
reveals several key challenges of LLMs in multilingual reasoning: (1) Reasoning
performance varies widely across languages for current LLMs; (2) Input-output
language consistency is low in reasoning LLMs and may be correlated with
performance; (3) The thinking length differs significantly by language for
current LLMs. Additionally, we demonstrate that controlling the output language
in the instructions has the potential to affect reasoning performance,
especially for some low-resource languages, suggesting a promising direction
for improving multilingual capabilities in LLMs.
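
To make the per-language findings concrete, the following is a minimal sketch of how accuracy and input-output language consistency could be tallied per language. It is not PolyMath's evaluation code: the record fields (`input_lang`, `output_lang`, `correct`) and the function name `per_language_stats` are hypothetical, the sample records are made up for illustration, and the benchmark score reported in the abstract may be computed differently.

```python
from collections import defaultdict


def per_language_stats(records):
    """Tally accuracy and input-output language consistency per input language.

    Each record is a dict with hypothetical keys:
      'input_lang'  - language code of the problem statement
      'output_lang' - language code detected in the model's response
      'correct'     - whether the final answer was judged correct
    """
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "consistent": 0})
    for r in records:
        s = totals[r["input_lang"]]
        s["n"] += 1
        s["correct"] += int(r["correct"])
        # A response is "consistent" when it is written in the input language.
        s["consistent"] += int(r["output_lang"] == r["input_lang"])
    return {
        lang: {
            "accuracy": s["correct"] / s["n"],
            "consistency": s["consistent"] / s["n"],
        }
        for lang, s in totals.items()
    }


# Illustrative records only; these are not results from the paper.
records = [
    {"input_lang": "en", "output_lang": "en", "correct": True},
    {"input_lang": "en", "output_lang": "en", "correct": True},
    {"input_lang": "sw", "output_lang": "en", "correct": True},
    {"input_lang": "sw", "output_lang": "sw", "correct": False},
]

stats = per_language_stats(records)
for lang, s in stats.items():
    print(lang, s)
```

Keeping the two rates side by side per language is what would let one check whether consistency and performance move together, as finding (2) suggests they may.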
Authors (16)
Yiming Wang
Pei Zhang
Jialong Tang
Haoran Wei
Baosong Yang
Rui Wang
+10 more
Key Contributions
Introduced PolyMath, a comprehensive multilingual mathematical reasoning benchmark covering 18 languages and 4 difficulty levels. This benchmark is designed to be highly discriminative for evaluating reasoning LLMs, addressing the need for language diversity and difficulty comprehensiveness in current evaluations. The evaluation revealed significant performance variations across languages and highlighted issues with input-output language consistency and thinking length in LLMs.
Business Value
Enables more accurate and fair evaluation of LLMs for global applications, leading to better development of AI systems that can understand and reason mathematically across different languages and cultures.