Reliable Evaluation and Benchmarks for Statement Autoformalization

Abstract

Evaluating statement autoformalization, translating natural language mathematics into formal languages like Lean 4, remains a significant challenge, with few metrics, datasets, and standards to robustly measure progress. In this work, we present a comprehensive approach combining improved metrics, robust benchmarks, and systematic evaluation, to fill this gap. First, we introduce BEq+, an automated metric that correlates strongly with human judgment, along with ProofNetVerif, a new dataset for assessing the quality of evaluation metrics, containing 3,752 annotated examples. Second, we develop two new autoformalization benchmarks: ProofNet#, a corrected version of ProofNet, and RLM25, with 619 new pairs of research-level mathematics from six formalization projects. Through systematic experimentation across these benchmarks, we find that current techniques can achieve up to 45.1% accuracy on undergraduate mathematics but struggle with research-level content without proper context. Our work establishes a reliable foundation for evaluating and advancing autoformalization systems.
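
To make the task concrete, here is a small illustrative sketch in Lean 4, assuming Mathlib is available. The informal statement and both theorem names are hypothetical examples, not items taken from ProofNet# or RLM25. The two formalizations differ syntactically but state the same fact, which hints at why comparing a candidate formalization against a reference cannot be a plain string match. The proofs are left as `sorry` because statement autoformalization concerns only the statement.

```lean
import Mathlib

-- Informal statement (hypothetical example):
-- "The sum of two even integers is even."

-- One possible formalization, with the hypotheses as named arguments.
theorem sum_of_evens_v1 (a b : ℤ) (ha : Even a) (hb : Even b) : Even (a + b) := by
  sorry

-- A syntactically different but logically equivalent formalization,
-- with the hypotheses written as implications.
theorem sum_of_evens_v2 : ∀ a b : ℤ, Even a → Even b → Even (a + b) := by
  sorry
```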
Authors: Auguste Poiroux, Gail Weiss, Viktor Kunčak, Antoine Bosselut
Submitted: June 11, 2024
arXiv Category: cs.CL

Key Contributions

This paper addresses the challenge of evaluating statement autoformalization. It introduces BEq+, an automated metric that correlates strongly with human judgment, and ProofNetVerif, a dataset of 3,752 annotated examples for assessing evaluation metrics. It also presents two new benchmarks: ProofNet#, a corrected version of ProofNet, and RLM25, which contains 619 pairs of research-level mathematics drawn from six formalization projects.

Business Value

Reliable evaluation supports the development and deployment of AI tools that help mathematicians and software engineers formalize complex mathematical statements, which could improve the rigor and efficiency of mathematical research and software verification.