
PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

Abstract

In this paper, we introduce PolyMath, a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs. We conduct a comprehensive evaluation of advanced LLMs and find that even Qwen-3-235B-A22B-Thinking and Gemini-2.5-pro achieve only 54.6 and 52.2 benchmark scores, with about 40% accuracy at the highest level. From a language perspective, our benchmark reveals several key challenges for LLMs in multilingual reasoning: (1) reasoning performance varies widely across languages for current LLMs; (2) input-output language consistency is low in reasoning LLMs and may be correlated with performance; (3) thinking length differs significantly by language for current LLMs. Additionally, we demonstrate that controlling the output language in the instructions has the potential to affect reasoning performance, especially for some low-resource languages, suggesting a promising direction for improving multilingual capabilities in LLMs.
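
To make the headline numbers concrete, here is a minimal sketch of how a difficulty-weighted benchmark score over (language, level) accuracy cells could be computed. The level names, the 1/2/4/8 weights, and the `weighted_benchmark_score` helper are illustrative assumptions, not the paper's published scoring scheme; the idea is simply that harder levels contribute more, so a model cannot inflate its score on easy items alone.

```python
# Hypothetical weighting: harder levels count more. These names and
# weights are assumptions for illustration, not PolyMath's actual scheme.
LEVEL_WEIGHTS = {"low": 1, "medium": 2, "high": 4, "top": 8}

def weighted_benchmark_score(accuracy):
    """accuracy: dict mapping (language, level) -> accuracy in [0, 1].

    Returns (overall, per_language): a difficulty-weighted score in
    [0, 100] averaged over languages, plus the per-language breakdown.
    """
    languages = {lang for lang, _ in accuracy}
    total_weight = sum(LEVEL_WEIGHTS.values())
    per_language = {}
    for lang in sorted(languages):
        score = sum(
            LEVEL_WEIGHTS[level] * accuracy[(lang, level)]
            for level in LEVEL_WEIGHTS
        ) / total_weight
        per_language[lang] = 100 * score
    overall = sum(per_language.values()) / len(per_language)
    return overall, per_language

# Toy usage: two languages, four levels. Low accuracy at the "top"
# level drags the weighted score well below the simple mean accuracy.
acc = {
    ("en", "low"): 0.95, ("en", "medium"): 0.85,
    ("en", "high"): 0.60, ("en", "top"): 0.40,
    ("sw", "low"): 0.80, ("sw", "medium"): 0.65,
    ("sw", "high"): 0.45, ("sw", "top"): 0.25,
}
overall, by_lang = weighted_benchmark_score(acc)
print(f"overall: {overall:.1f}", by_lang)
```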
Authors (16)
Yiming Wang
Pei Zhang
Jialong Tang
Haoran Wei
Baosong Yang
Rui Wang
+10 more
Submitted
April 25, 2025
arXiv Category
cs.CL
arXiv PDF

Key Contributions

Introduced PolyMath, a comprehensive multilingual mathematical reasoning benchmark covering 18 languages and 4 difficulty levels. This benchmark is designed to be highly discriminative for evaluating reasoning LLMs, addressing the need for language diversity and difficulty comprehensiveness in current evaluations. The evaluation revealed significant performance variations across languages and highlighted issues with input-output language consistency and thinking length in LLMs.
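
The input-output language consistency issue noted above can be checked mechanically. Below is a minimal sketch under assumptions: it uses the langdetect library (`pip install langdetect`) and defines consistency as the fraction of responses whose detected language matches the prompt's ISO 639-1 code. The paper's exact metric is not specified here, and the `language_consistency` helper is hypothetical.

```python
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make langdetect's detection deterministic

def language_consistency(prompt_langs, responses):
    """prompt_langs: list of ISO 639-1 codes (e.g. 'en', 'de', 'sw').
    responses: list of model output strings, aligned with prompt_langs.
    Returns the fraction of responses detected as the prompt's language.
    """
    matches = 0
    for lang, text in zip(prompt_langs, responses):
        try:
            matches += detect(text) == lang
        except Exception:  # langdetect raises on empty/ambiguous text
            pass
    return matches / len(responses)

# Toy usage: a German prompt answered in English counts as inconsistent.
print(language_consistency(
    ["en", "de"],
    ["The answer is 42 because the sum telescopes.",
     "The answer is 7."],  # German prompt, English output -> mismatch
))
```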

Business Value

Enables more accurate and fair evaluation of LLMs for global applications, leading to better development of AI systems that can understand and reason mathematically across different languages and cultures.