Abstract
Existing benchmarks for evaluating mathematical reasoning in large language
models (LLMs) rely primarily on competition problems, formal proofs, or
artificially challenging questions -- failing to capture the nature of
mathematics encountered in actual research environments. We introduce RealMath,
a novel benchmark derived directly from research papers and mathematical forums
that assesses LLMs' abilities on authentic mathematical tasks. Our approach
addresses three critical challenges: sourcing diverse research-level content,
enabling reliable automated evaluation through verifiable statements, and
designing a continually refreshable dataset to mitigate contamination risks.
Experimental results across multiple LLMs reveal surprising capabilities in
handling research mathematics compared to competition problems, suggesting
current models may already serve as valuable assistants for working
mathematicians despite limitations on highly challenging problems. The code and
dataset for RealMath are publicly available.
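The abstract highlights automated evaluation through verifiable statements, i.e., checking a model's final answer against a known ground-truth statement rather than grading free-form proofs. The sketch below is only an illustration of that general idea, not the authors' evaluation pipeline: the function names, the \boxed{} answer convention, and the string-level normalization are all assumptions made here for clarity.

```python
import re

def extract_boxed_answer(model_output: str) -> str | None:
    """Return the contents of the last \\boxed{...} in the model output, if any.

    Assumes answers are reported in a \\boxed{} wrapper; this convention is an
    illustrative assumption, not necessarily the RealMath format.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", model_output)
    return matches[-1].strip() if matches else None

def normalize(expr: str) -> str:
    """Very light normalization: strip whitespace and surrounding dollar signs."""
    return re.sub(r"\s+", "", expr).strip("$")

def is_correct(model_output: str, reference_answer: str) -> bool:
    """Mark a response correct if its final boxed answer matches the reference
    statement after normalization. Real benchmarks typically use more robust
    symbolic or semantic matching; exact string comparison is a simplification."""
    predicted = extract_boxed_answer(model_output)
    return predicted is not None and normalize(predicted) == normalize(reference_answer)

# Example: a question whose answer is a short, verifiable statement.
response = "The series converges, and the limit equals \\boxed{e^{2}}."
print(is_correct(response, "e^{2}"))  # True
```

A comparison at this level of strictness only works when the target answer is a short verifiable expression; that constraint is exactly what makes automated, contamination-resistant refreshes of such a dataset practical.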
Authors (4)
Jie Zhang
Cezara Petrui
Kristina Nikolić
Florian Tramèr
Key Contributions
This paper introduces RealMath, a novel benchmark for evaluating LLMs on research-level mathematics, which differs from existing benchmarks focused on competition problems. RealMath is derived from authentic research papers and forums, uses verifiable statements for automated evaluation, and is designed to be continually refreshed to mitigate contamination risks. Experiments across multiple LLMs show surprisingly strong capabilities on research-level mathematics relative to competition problems.
Business Value
Provides a crucial tool for developers and researchers to accurately assess and improve LLM capabilities in advanced mathematical domains, accelerating progress in AI for scientific discovery.