Abstract
While Large Language Models (LLMs) excel at code generation by learning from
vast code corpora, a fundamental semantic gap remains between their training on
textual patterns and the goal of functional correctness, which is governed by
formal execution semantics. Reinforcement Learning with Verifiable Rewards
(RLVR) approaches attempt to bridge this gap using outcome rewards from
executing test cases. However, relying solely on binary pass/fail signals is
inefficient for establishing a well-aligned connection between the textual
representation of code and its execution semantics, especially for subtle
logical errors within the code. In this paper, we propose CodeRL+, a novel
approach that integrates execution semantics alignment into the RLVR training
pipeline for code generation. CodeRL+ enables the model to infer variable-level
execution trajectories, providing a direct learning signal of execution
semantics. It constructs execution semantics alignment directly from
existing on-policy rollouts and integrates seamlessly with various RL
algorithms. Extensive experiments demonstrate that CodeRL+ outperforms
post-training baselines (including RLVR and distillation), achieving a 4.6%
average relative improvement in pass@1. CodeRL+ also generalizes effectively to
other coding tasks, yielding 15.5% and 4.4% higher accuracy on code-reasoning
and test-output-generation benchmarks, respectively, and shows strong
applicability across diverse RL algorithms and LLMs. Furthermore, probe
analyses provide compelling evidence that CodeRL+ strengthens the alignment
between code's textual representations and its underlying execution semantics.
Authors (13)
Xue Jiang
Yihong Dong
Mengyang Liu
Hongyi Deng
Tian Wang
Yongding Tao
+7 more
Submitted
October 21, 2025
Key Contributions
CodeRL+ enhances code generation by integrating execution semantics alignment into the RLVR pipeline. It enables LLMs to infer variable-level execution trajectories, providing a direct learning signal that bridges the semantic gap between textual code patterns and functional correctness. This approach is more effective than relying solely on binary test case outcomes for identifying and correcting subtle logical errors.
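The paper's abstract does not specify how a variable-level execution trajectory is represented, but the underlying idea, recording each executed line together with a snapshot of local variable values so a model can be trained to predict them, can be sketched in Python. The helper name `collect_trace` and the trace format below are illustrative assumptions, not the authors' implementation:

```python
import sys

def collect_trace(func, *args):
    """Record an illustrative variable-level execution trajectory: for each
    executed line inside `func`, capture (relative line number, snapshot of
    local variables). Hypothetical format; not the paper's actual method."""
    trajectory = []

    def tracer(frame, event, arg):
        # Only record line events for the target function's frames.
        if event == "line" and frame.f_code is func.__code__:
            rel_line = frame.f_lineno - func.__code__.co_firstlineno
            trajectory.append((rel_line, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, trajectory

def running_max(xs):
    best = xs[0]
    for x in xs[1:]:
        if x > best:
            best = x
    return best

result, trace = collect_trace(running_max, [3, 1, 4, 1, 5])
print(result)  # final return value of the traced function
```

In a setup like this, the `(line, locals)` pairs in `trace` would serve as the supervision target: asking the model to infer intermediate values such as `best` at each step yields a denser learning signal than a single pass/fail outcome, which is the kind of alignment the contribution describes.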
Business Value
Leads to more reliable and functionally correct code generation, reducing debugging time and improving the quality of software produced by AI, which can significantly boost developer productivity.