📄 Abstract
Reinforcement learning with verifiable rewards (RLVR) has advanced the
reasoning capabilities of large language models. However, existing methods rely
solely on outcome rewards, without explicitly optimizing verification or
leveraging reliable signals from realistic environments, leading to unreliable
self-verification and limited test-time scaling. To address this, we widen the
verification-generation asymmetry by explicitly optimizing self-verification,
making it a reliable driver of deeper test-time scaling. We introduce ReVeal, a
multi-turn reinforcement learning framework that evolves code generation
through self-verification and tool-based evaluation. ReVeal structures
long-horizon reasoning as iterative generation-verification turns and
incorporates TAPO for turn-level credit assignment, fostering the co-evolution
of code and test generation. At inference, this strengthened self-verification
enables the model to use self-constructed tests and tool feedback to
continuously evolve code for 20+ turns on LiveCodeBench despite training on
only three. It also significantly improves Pass@k, indicating stronger
exploration that expands the reasoning boundaries of the base model. These
findings highlight the promise of ReVeal as a scalable paradigm for RL training
and test-time scaling, paving the way for more robust and autonomous AI agents.
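To make the iterative generation-verification loop described above concrete, here is a minimal Python sketch of what such an inference procedure could look like. This is not the authors' implementation: `generate_code`, `generate_tests`, the prompt interface, and the `MAX_TURNS` budget are illustrative assumptions; only the overall pattern (model writes code, writes its own tests, runs them in a tool, and revises on failure) follows the abstract.

```python
# Minimal sketch of ReVeal-style inference, assuming hypothetical model callables
# `generate_code(problem, prior_code, feedback)` and `generate_tests(problem, code)`.
import os
import subprocess
import tempfile

MAX_TURNS = 20  # the abstract reports useful scaling to 20+ turns at inference


def run_with_tool(code: str, tests: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Execute candidate code plus self-generated tests; return (passed, feedback)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=timeout
        )
        return proc.returncode == 0, proc.stderr or proc.stdout
    except subprocess.TimeoutExpired:
        return False, "timeout"
    finally:
        os.unlink(path)


def iterative_generation_verification(problem: str, generate_code, generate_tests) -> str:
    """Alternate generation and self-verification turns, driven by tool feedback."""
    code = generate_code(problem, prior_code="", feedback="")
    for _ in range(MAX_TURNS):
        tests = generate_tests(problem, code)          # self-verification: model writes its own tests
        passed, feedback = run_with_tool(code, tests)  # tool-based evaluation
        if passed:
            break                                      # stop once self-constructed tests pass
        code = generate_code(problem, prior_code=code, feedback=feedback)
    return code
```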
Authors (7)
Yiyang Jin
Kunzhao Xu
Hang Li
Xueting Han
Yanmin Zhou
Cheng Li
+1 more
Key Contributions
Introduces ReVeal, a multi-turn RL framework that enhances LLM code generation by explicitly optimizing self-verification and leveraging tool-based evaluation. This approach fosters the co-evolution of code and test generation, enabling more reliable long-horizon reasoning and deeper test-time scaling.
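The abstract credits TAPO with turn-level credit assignment during training. The paper's exact formulation is not given on this page, so the snippet below is only a generic illustration of the underlying idea: each generation or verification turn receives its own discounted return (centered by a simple mean baseline) rather than a single undifferentiated outcome reward. `Turn`, `turn_reward`, and `gamma` are illustrative assumptions, not ReVeal's actual quantities.

```python
# Generic sketch of turn-level credit assignment; NOT the paper's TAPO algorithm.
from dataclasses import dataclass


@dataclass
class Turn:
    kind: str           # "generation" or "verification"
    turn_reward: float  # e.g. whether self-tests agree with the tool verdict (illustrative)


def turn_level_advantages(turns: list[Turn], outcome_reward: float,
                          gamma: float = 0.95) -> list[float]:
    """Give each turn a discounted return-to-go, then subtract a mean baseline."""
    if not turns:
        return []
    returns, running = [], outcome_reward
    for t in reversed(turns):
        running = t.turn_reward + gamma * running
        returns.append(running)
    returns.reverse()
    baseline = sum(returns) / len(returns)
    return [r - baseline for r in returns]
```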
Business Value
Accelerates software development by automating code generation and testing, improving code quality and reducing developer workload.