
MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning

📄 Abstract

Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning. To address this, we introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning. MathMist encompasses over 21K aligned question-answer pairs across seven languages, providing balanced coverage of high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple problem types, and solution-synthesis requirements. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models, under zero-shot, chain-of-thought (CoT), and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs' ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All code and data are available on GitHub: https://github.com/mahbubhimel/MathMist
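The zero-shot and CoT evaluation paradigms mentioned above can be sketched as a simple prompt-and-score loop. This is a minimal illustration, not the paper's actual protocol: the record fields (`question`, `answer`, `lang`), the prompt templates, and the placeholder model are all assumptions for demonstration.

```python
# Sketch of a zero-shot vs. chain-of-thought (CoT) evaluation loop over a
# parallel multilingual math QA set. Field names and templates are
# illustrative assumptions, not the MathMist schema.

def build_prompt(question: str, mode: str = "zero-shot") -> str:
    """Wrap a question in a zero-shot or CoT instruction."""
    if mode == "cot":
        return f"{question}\nLet's think step by step, then state the final answer."
    return f"{question}\nAnswer with the final number only."

def exact_match_accuracy(predictions, references) -> float:
    """Fraction of predictions whose stripped text equals the reference."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references) if references else 0.0

# Toy parallel items (the same problem in two languages) standing in for the
# aligned dataset entries.
items = [
    {"question": "What is 12 + 7?", "answer": "19", "lang": "en"},
    {"question": "12 + 7 koto?", "answer": "19", "lang": "bn"},
]

# Placeholder "model" so the sketch runs end to end; a real harness would
# query an LLM here.
def dummy_model(prompt: str) -> str:
    return "19"

preds = [dummy_model(build_prompt(it["question"], mode="cot")) for it in items]
refs = [it["answer"] for it in items]
print(f"CoT exact-match accuracy: {exact_match_accuracy(preds, refs):.2f}")
```

Because the items are parallel across languages, the same loop run per `lang` value would expose the cross-lingual accuracy gaps the paper reports.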
Authors (5)
Mahbub E Sobhani
Md. Faiyaz Abdullah Sayeedi
Tasnim Mohiuddin
Md Mofijul Islam
Swakkhar Shatabda
Submitted
October 16, 2025
arXiv Category
cs.CL
arXiv PDF

Key Contributions

MathMist introduces a parallel multilingual benchmark dataset for mathematical problem solving and reasoning, comprising over 21K aligned question-answer pairs across seven languages. It addresses the gap in evaluating LLMs' mathematical capabilities across diverse linguistic settings, including low-resource languages.

Business Value

Provides a crucial resource for advancing AI's capabilities in mathematical reasoning, enabling the development of more capable educational tools, research assistants, and problem-solving AI systems across different languages.