📄 Abstract
Reinforcement learning with verifiable rewards (RLVR) has advanced the
reasoning capabilities of large language models. However, existing methods rely
solely on outcome rewards, without explicitly optimizing verification or
leveraging reliable signals from realistic environments, leading to unreliable
self-verification and limited test-time scaling. To address this, we widen the
verification-generation asymmetry by explicitly optimizing self-verification,
making it a reliable driver of deeper test-time scaling. We introduce ReVeal, a
multi-turn reinforcement learning framework that evolves code generation
through self-verification and tool-based evaluation. ReVeal structures
long-horizon reasoning as iterative generation-verification turns and
incorporates TAPO for turn-level credit assignment, fostering the co-evolution
of code and test generation. At inference, this strengthened self-verification
enables the model to use self-constructed tests and tool feedback to
continuously evolve code for 20+ turns on LiveCodeBench despite training on
only three. It also significantly improves Pass@k, indicating stronger
exploration that expands the reasoning boundaries of the base model. These
findings highlight the promise of ReVeal as a scalable paradigm for RL training
and test-time scaling, paving the way for more robust and autonomous AI agents.
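To make the iterative generation-verification loop described above concrete, here is a minimal Python sketch of what such an inference procedure could look like. This is not the authors' implementation: `generate_code`, `generate_tests`, the prompt interface, and the `MAX_TURNS` budget are illustrative assumptions; only the overall pattern (model writes code, writes its own tests, runs them in a tool, and revises on failure) follows the abstract.

```python
# Minimal sketch of ReVeal-style inference, assuming hypothetical model callables
# `generate_code(problem, prior_code, feedback)` and `generate_tests(problem, code)`.
import os
import subprocess
import tempfile

MAX_TURNS = 20  # the abstract reports useful scaling to 20+ turns at inference


def run_with_tool(code: str, tests: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Execute candidate code plus self-generated tests; return (passed, feedback)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=timeout
        )
        return proc.returncode == 0, proc.stderr or proc.stdout
    except subprocess.TimeoutExpired:
        return False, "timeout"
    finally:
        os.unlink(path)


def iterative_generation_verification(problem: str, generate_code, generate_tests) -> str:
    """Alternate generation and self-verification turns, driven by tool feedback."""
    code = generate_code(problem, prior_code="", feedback="")
    for _ in range(MAX_TURNS):
        tests = generate_tests(problem, code)          # self-verification: model writes its own tests
        passed, feedback = run_with_tool(code, tests)  # tool-based evaluation
        if passed:
            break                                      # stop once self-constructed tests pass
        code = generate_code(problem, prior_code=code, feedback=feedback)
    return code
```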
Authors (7)
Yiyang Jin
Kunzhao Xu
Hang Li
Xueting Han
Yanmin Zhou
Cheng Li
+1 more
Key Contributions
Introduces ReVeal, a multi-turn RL framework that enhances LLM code generation by explicitly optimizing self-verification and leveraging tool-based evaluation. This approach fosters the co-evolution of code and test generation, enabling more reliable long-horizon reasoning and deeper test-time scaling.
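The abstract credits TAPO with turn-level credit assignment during training. The paper's exact formulation is not given on this page, so the snippet below is only a generic illustration of the underlying idea: each generation or verification turn receives its own discounted return (centered by a simple mean baseline) rather than a single undifferentiated outcome reward. `Turn`, `turn_reward`, and `gamma` are illustrative assumptions, not ReVeal's actual quantities.

```python
# Generic sketch of turn-level credit assignment; NOT the paper's TAPO algorithm.
from dataclasses import dataclass


@dataclass
class Turn:
    kind: str           # "generation" or "verification"
    turn_reward: float  # e.g. whether self-tests agree with the tool verdict (illustrative)


def turn_level_advantages(turns: list[Turn], outcome_reward: float,
                          gamma: float = 0.95) -> list[float]:
    """Give each turn a discounted return-to-go, then subtract a mean baseline."""
    if not turns:
        return []
    returns, running = [], outcome_reward
    for t in reversed(turns):
        running = t.turn_reward + gamma * running
        returns.append(running)
    returns.reverse()
    baseline = sum(returns) / len(returns)
    return [r - baseline for r in returns]
```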
Business Value
Accelerates software development by automating code generation and testing, improving code quality and reducing developer workload.