Abstract
As the mathematical capabilities of large language models (LLMs) improve, it
becomes increasingly important to evaluate their performance on research-level
tasks at the frontier of mathematical knowledge. However, existing benchmarks
are limited, as they focus solely on final-answer questions or high-school
competition problems. To address this gap, we introduce IMProofBench, a private
benchmark consisting of 39 peer-reviewed problems developed by expert
mathematicians. Each problem requires a detailed proof and is paired with
subproblems that have final answers, supporting both an evaluation of
mathematical reasoning capabilities by human experts and a large-scale
quantitative analysis through automated grading. Furthermore, unlike prior
benchmarks, the evaluation setup simulates a realistic research environment:
models operate in an agentic framework with tools like web search for
literature review and mathematical software such as SageMath. Our results show
that current LLMs can succeed at the more accessible research-level questions,
but still encounter significant difficulties on more challenging problems.
Quantitatively, Grok-4 achieves the highest accuracy of 52% on final-answer
subproblems, while GPT-5 obtains the best performance for proof generation,
achieving a fully correct solution for 22% of problems. IMProofBench will
continue to evolve as a dynamic benchmark in collaboration with the
mathematical community, ensuring its relevance for evaluating the next
generation of LLMs.
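To make the described evaluation setup concrete, below is a minimal, hypothetical sketch of an agentic grading harness: a model is queried on a final-answer subproblem with access to a small tool registry (a placeholder literature-search tool and a placeholder computer-algebra tool standing in for SageMath), and its answer is graded automatically. All names, signatures, and the toy model are illustrative assumptions, not the authors' implementation.

```python
"""Hypothetical sketch of an agentic evaluation harness in the spirit of
IMProofBench's setup. Tool names, grading logic, and the toy model are
assumptions for illustration, not the benchmark's actual code."""

from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Subproblem:
    prompt: str            # final-answer question posed to the model
    expected_answer: str   # ground truth used for automated grading


def web_search(query: str) -> str:
    """Placeholder literature-search tool; a real harness would call a search API."""
    return f"[search results for: {query}]"


def math_eval(expression: str) -> str:
    """Placeholder computer-algebra tool; a real harness might dispatch to SageMath."""
    return str(eval(expression, {"__builtins__": {}}, {}))  # plain arithmetic only


TOOLS: Dict[str, Callable[[str], str]] = {"web_search": web_search, "math_eval": math_eval}


def run_agent(model: Callable[[str, Dict[str, Callable[[str], str]]], str],
              subproblem: Subproblem) -> bool:
    """Query the model with tool access and grade its final answer automatically."""
    answer = model(subproblem.prompt, TOOLS)
    return answer.strip() == subproblem.expected_answer.strip()


if __name__ == "__main__":
    # Toy model that uses the math tool; a real evaluation would call an LLM API.
    def toy_model(prompt: str, tools: Dict[str, Callable[[str], str]]) -> str:
        return tools["math_eval"]("2**10")

    sub = Subproblem(prompt="Compute 2^10.", expected_answer="1024")
    print("correct" if run_agent(toy_model, sub) else "incorrect")
```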
Key Contributions
Introduces IMProofBench, a novel benchmark for evaluating LLMs on research-level mathematical proof generation, featuring 39 peer-reviewed problems developed by expert mathematicians and an agentic evaluation framework with tool use (web search, SageMath). This addresses the limitations of existing benchmarks, which focus on final-answer questions or competition-style problems, and provides a more realistic research environment for evaluating AI.
Business Value
Accelerates AI's contribution to scientific discovery by enabling rigorous evaluation of AI systems in complex domains like mathematics, potentially leading to AI-assisted research breakthroughs.