Research Paper. Audience: AI Researchers, Robotics Engineers, Astrophysicists, Scientific Software Developers

ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?


Abstract

Frontier AI agents show increasing promise as scientific research assistants, and may eventually be useful for extended, open-ended research workflows. However, in order to use agents for novel research, we must first assess the underlying faithfulness and correctness of their work. To evaluate agents as research assistants, we introduce ReplicationBench, an evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature. Astrophysics, where research relies heavily on archival data and computational study while requiring little real-world experimentation, is a particularly useful testbed for AI agents in scientific research. We split each paper into tasks which require agents to replicate the paper's core contributions, including the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the original paper authors and targets a key scientific result, enabling objective evaluation of both faithfulness (adherence to original methods) and correctness (technical accuracy of results). ReplicationBench is extremely challenging for current frontier language models: even the best-performing language models score under 20%. We analyze ReplicationBench trajectories in collaboration with domain experts and find a rich, diverse set of failure modes for agents in scientific research. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks, reveals insights about agent performance generalizable to other domains of data-driven science, and provides a scalable framework for measuring AI agents' reliability in scientific research.

Authors (13)

Christine Ye

Sihan Yuan

Suchetha Cooray

Steven Dillmann

Ian L. V. Roque

Dalya Baron

+7 more

Submitted

October 28, 2025

arXiv Category

cs.CL


Key Contributions

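Introduces ReplicationBench, an evaluation framework that tests whether AI agents can replicate entire astrophysics research papers. The framework splits each paper into tasks covering the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the original paper's authors and targets a key scientific result, which enables objective evaluation of both faithfulness (adherence to the original methods) and correctness (technical accuracy of the results).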
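The task-based evaluation described above can be illustrated with a minimal sketch. This is not the benchmark's actual implementation: the summary does not specify the task schema or grading rule, so the `Task` fields, the relative-tolerance check, and the per-paper score below are all assumptions made for illustration. The sketch covers only the correctness side (does the agent's number match the reference value); faithfulness to the original methods would need a separate review step that is not modeled here.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """One replication task; every field name here is hypothetical."""
    task_id: str
    kind: str        # e.g. "derivation", "data analysis", "codebase"
    expected: float  # reference value taken from the original paper
    rel_tol: float   # tolerance, assumed to be agreed with the paper's authors


def is_correct(task: Task, answer: Optional[float]) -> bool:
    """A task passes only if the agent gave a value within the tolerance."""
    if answer is None or math.isnan(answer):
        return False
    return math.isclose(answer, task.expected, rel_tol=task.rel_tol)


def paper_score(tasks: list, answers: dict) -> float:
    """Fraction of a paper's tasks that the agent reproduced."""
    if not tasks:
        return 0.0
    passed = sum(is_correct(t, answers.get(t.task_id)) for t in tasks)
    return passed / len(tasks)


# Made-up values for illustration only; not taken from any paper.
tasks = [
    Task("t1", "data analysis", expected=0.31, rel_tol=0.05),
    Task("t2", "derivation", expected=67.4, rel_tol=0.01),
    Task("t3", "codebase", expected=1.2e-3, rel_tol=0.10),
]
answers = {"t1": 0.312, "t2": 70.0, "t3": None}  # t3: agent gave no answer
score = paper_score(tasks, answers)
print(f"{score:.2f}")
```

Treating a missing or NaN answer as a failure, rather than skipping the task, keeps the score from rewarding agents that abandon hard tasks.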

Business Value

Offers a way to measure how reliably AI agents perform scientific work before they are integrated into research workflows, supporting faster discovery while protecting research integrity.

Paper Metadata

Innovation Type

Evaluation Framework

Deployment Feasibility

High for the framework itself. AI agents capable of full replication remain nascent: even the best-performing language models score under 20% on the benchmark.

Limitations Addressed

Lack of standardized methods to assess the faithfulness and correctness of AI agents in performing complex scientific research tasks.

Technical Tags

AI agents, scientific research, replication, astrophysics, evaluation framework, computational study, AI assistants, task decomposition, code replication, data analysis

Research Topics

AI Agents, Scientific Discovery, AI for Science, Reproducibility in Science, Large Language Models

Methods & Architectures

evaluation framework design, task decomposition, human-AI collaboration (co-development), frontier AI agents

Applications & Tasks

Scientific Research, Astrophysics, AI Agent Development, Evaluating AI Agent Capabilities, Assessing AI Faithfulness in Research, Reproducibility of Scientific Work, Replicating research papers, Assisting scientific research, Evaluating AI agent performance

Datasets & Benchmarks

Datasets

Astrophysics literature (archival data)

Related Fields

Artificial Intelligence, Robotics, Scientific Computing, Astrophysics, Reproducibility

Keywords

AI agents, scientific research, replication, astrophysics, evaluation, AI assistant, computational study, reproducibility, LLM, task completion, code generation, data analysis

Academic Context

#AI Agents, #Scientific Discovery, #AI for Science, #Reproducibility in Science, #Large Language Models

Commercial Potential

Potential Products

AI-powered research assistants, automated scientific validation tools, platforms for AI-driven scientific discovery

Target Industries

Academia, Research & Development, Scientific Publishing, Technology

Use Case Examples

AI agents automatically verifying experimental results from papers; AI assisting in writing and debugging scientific code; AI performing complex data analysis described in research

Competitive Edge

Provides a novel and rigorous evaluation methodology for AI agents in scientific contexts.

Market Opportunity

Growing market for AI in scientific research.

Revenue Models

SaaS for AI research assistants; licensing of evaluation tools.

Resource Requirements

Compute Needs

High (for running AI agents on complex tasks)

Data Requirements

Access to scientific literature and associated data/code.

Deployment Constraints

Requires sophisticated AI agents capable of complex reasoning, planning, and execution.

Scalability

Scalability depends on the underlying AI agent's capabilities and the complexity of the research tasks.

Production Readiness

Maturity Level

Research Framework

Time to Market

3-5 years (for capable AI agents)

Patent Potential

Low
