Abstract
As the mathematical capabilities of large language models (LLMs) improve, it
becomes increasingly important to evaluate their performance on research-level
tasks at the frontier of mathematical knowledge. However, existing benchmarks
are limited, as they focus solely on final-answer questions or high-school
competition problems. To address this gap, we introduce IMProofBench, a private
benchmark consisting of 39 peer-reviewed problems developed by expert
mathematicians. Each problem requires a detailed proof and is paired with
subproblems that have final answers, supporting both an evaluation of
mathematical reasoning capabilities by human experts and a large-scale
quantitative analysis through automated grading. Furthermore, unlike prior
benchmarks, the evaluation setup simulates a realistic research environment:
models operate in an agentic framework with tools like web search for
literature review and mathematical software such as SageMath. Our results show
that current LLMs can succeed at the more accessible research-level questions,
but still encounter significant difficulties on more challenging problems.
Quantitatively, Grok-4 achieves the highest accuracy of 52% on final-answer
subproblems, while GPT-5 obtains the best performance for proof generation,
achieving a fully correct solution for 22% of problems. IMProofBench will
continue to evolve as a dynamic benchmark in collaboration with the
mathematical community, ensuring its relevance for evaluating the next
generation of LLMs.
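To make the described evaluation setup concrete, below is a minimal, hypothetical sketch of an agentic grading harness: a model is queried on a final-answer subproblem with access to a small tool registry (a placeholder literature-search tool and a placeholder computer-algebra tool standing in for SageMath), and its answer is graded automatically. All names, signatures, and the toy model are illustrative assumptions, not the authors' implementation.

```python
"""Hypothetical sketch of an agentic evaluation harness in the spirit of
IMProofBench's setup. Tool names, grading logic, and the toy model are
assumptions for illustration, not the benchmark's actual code."""

from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Subproblem:
    prompt: str            # final-answer question posed to the model
    expected_answer: str   # ground truth used for automated grading


def web_search(query: str) -> str:
    """Placeholder literature-search tool; a real harness would call a search API."""
    return f"[search results for: {query}]"


def math_eval(expression: str) -> str:
    """Placeholder computer-algebra tool; a real harness might dispatch to SageMath."""
    return str(eval(expression, {"__builtins__": {}}, {}))  # plain arithmetic only


TOOLS: Dict[str, Callable[[str], str]] = {"web_search": web_search, "math_eval": math_eval}


def run_agent(model: Callable[[str, Dict[str, Callable[[str], str]]], str],
              subproblem: Subproblem) -> bool:
    """Query the model with tool access and grade its final answer automatically."""
    answer = model(subproblem.prompt, TOOLS)
    return answer.strip() == subproblem.expected_answer.strip()


if __name__ == "__main__":
    # Toy model that uses the math tool; a real evaluation would call an LLM API.
    def toy_model(prompt: str, tools: Dict[str, Callable[[str], str]]) -> str:
        return tools["math_eval"]("2**10")

    sub = Subproblem(prompt="Compute 2^10.", expected_answer="1024")
    print("correct" if run_agent(toy_model, sub) else "incorrect")
```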
Key Contributions
Introduces IMProofBench, a novel benchmark for evaluating LLMs on research-level mathematical proof generation, featuring 39 peer-reviewed problems developed by expert mathematicians and an agentic evaluation framework with tool use (web search, SageMath). This addresses the limitations of existing benchmarks, which focus on final-answer questions or competition-style problems, and provides a more realistic research environment for evaluating AI.
Business Value
Accelerates AI's contribution to scientific discovery by enabling rigorous evaluation of AI systems in complex domains like mathematics, potentially leading to AI-assisted research breakthroughs.