Abstract
Reinforcement learning with verifiable rewards (RLVR) has delivered
impressive gains in mathematical and multimodal reasoning and has become a
standard post-training paradigm for contemporary language and vision-language
models. However, the RLVR recipe introduces a significant risk of capability
regression, where models forget foundational skills over prolonged training
unless regularization strategies are employed. We empirically confirm this
concern, observing that open-source reasoning models suffer performance
degradation on core capabilities such as perception and faithfulness. While
imposing regularization terms such as a KL divergence penalty can help prevent deviation
from the base model, these terms are computed on the current task and therefore
do not guarantee retention of broader knowledge. Meanwhile, commonly used experience replay
across heterogeneous domains makes it nontrivial to decide how much training
focus each objective should receive. To address this, we propose RECAP, a replay
strategy with dynamic objective reweighting for general knowledge preservation.
Our reweighting mechanism adapts in an online manner using short-horizon
signals of convergence and instability, shifting the post-training focus away
from saturated objectives and toward underperforming or volatile ones. Our
method is end-to-end and readily applicable to existing RLVR pipelines without
training additional models or heavy tuning. Extensive experiments with
Qwen2.5-VL-3B and Qwen2.5-VL-7B across diverse benchmarks demonstrate the effectiveness of our
method, which not only preserves general capabilities but also improves
reasoning by enabling more flexible trade-offs among in-task rewards.
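
To make the reweighting idea in the abstract concrete, the following is a minimal, self-contained sketch and not the paper's actual algorithm: it derives per-objective weights from short-horizon reward statistics (recent mean, variance, and trend over a sliding window), shifting focus away from saturated objectives and toward lagging or volatile ones. The class name ObjectiveReweighter, the window and temperature parameters, the objective names, the assumption that verifiable rewards lie in [0, 1], and the scoring formula are all hypothetical illustrations.

import numpy as np
from collections import deque

class ObjectiveReweighter:
    """Illustrative sketch: online reweighting of heterogeneous replay objectives.

    Assumptions (not from the paper): rewards are mean verifiable rewards in [0, 1],
    and the weight of each objective grows with its remaining headroom, its recent
    volatility, and how fast it is still moving.
    """

    def __init__(self, objectives, window=20, temperature=1.0):
        self.temperature = temperature
        # Short-horizon reward history per objective (sliding window).
        self.history = {name: deque(maxlen=window) for name in objectives}

    def update(self, rewards):
        # Record the latest mean verifiable reward observed for each objective.
        for name, r in rewards.items():
            self.history[name].append(float(r))

    def weights(self):
        # Turn short-horizon signals into sampling / loss weights for replay batches.
        scores = {}
        for name, hist in self.history.items():
            h = np.asarray(hist)
            if h.size < 2:
                scores[name] = 1.0  # not enough signal yet: neutral score
                continue
            headroom = 1.0 - h[-(h.size // 2):].mean()  # low recent reward -> more focus
            instability = h.std()                       # volatile reward -> more focus
            progress = abs(h[-1] - h[0]) / h.size       # high and flat -> saturated, less focus
            scores[name] = headroom + instability + progress
        # Softmax-normalize scores so weights sum to one.
        vals = np.array(list(scores.values())) / self.temperature
        probs = np.exp(vals - vals.max())
        probs /= probs.sum()
        return dict(zip(scores.keys(), probs))

Example usage, mixing replay batches across heterogeneous domains each training step (objective names and reward values are made up):

rw = ObjectiveReweighter(["math_reasoning", "perception", "faithfulness"], window=10)
for step in range(5):
    rw.update({
        "math_reasoning": 0.90,               # saturated: high and flat
        "perception": 0.40 + 0.05 * step,     # lagging but still improving
        "faithfulness": 0.60,                 # mid-level and flat
    })
print(rw.weights())  # "perception" receives the largest share of training focus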
Authors (9)
Hoang Phan
Xianjun Yang
Kevin Yao
Jingyu Zhang
Shengjie Bi
Xiaocheng Tang
+3 more
Submitted
October 24, 2025
Key Contributions
This paper addresses the critical issue of capability regression (forgetting foundational skills) in large reasoning models trained with Reinforcement Learning with Verifiable Rewards (RLVR). It proposes RECAP, a novel replay strategy that aims to prevent performance degradation on core capabilities by intelligently managing training focus across heterogeneous domains, going beyond standard regularization techniques like KL divergence.
Business Value
Ensures that advanced reasoning models retain essential foundational abilities, leading to more reliable and robust AI systems for complex applications. Reduces the need for costly retraining or fine-tuning to recover lost capabilities.