Abstract
Reinforcement learning with verifiable rewards (RLVR) has delivered
impressive gains in mathematical and multimodal reasoning and has become a
standard post-training paradigm for contemporary language and vision-language
models. However, the RLVR recipe introduces a significant risk of capability
regression, where models forget foundational skills over prolonged training
unless regularization strategies are employed. We empirically confirm this
concern, observing that open-source reasoning models suffer performance
degradation on core capabilities such as perception and faithfulness. While
imposing regularization terms such as a KL divergence penalty can help prevent deviation
from the base model, these terms are computed on the current task and therefore
do not guarantee retention of broader knowledge. Meanwhile, commonly used experience replay
across heterogeneous domains makes it nontrivial to decide how much training
focus each objective should receive. To address this, we propose RECAP, a replay
strategy with dynamic objective reweighting for general knowledge preservation.
Our reweighting mechanism adapts in an online manner using short-horizon
signals of convergence and instability, shifting the post-training focus away
from saturated objectives and toward underperforming or volatile ones. Our
method is end-to-end and readily applicable to existing RLVR pipelines without
training additional models or heavy tuning. Extensive experiments with
Qwen2.5-VL-3B and Qwen2.5-VL-7B across diverse benchmarks demonstrate the effectiveness of our
method, which not only preserves general capabilities but also improves
reasoning by enabling more flexible trade-offs among in-task rewards.
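
To make the reweighting idea in the abstract concrete, the following is a minimal, self-contained sketch and not the paper's actual algorithm: it derives per-objective weights from short-horizon reward statistics (recent mean, variance, and trend over a sliding window), shifting focus away from saturated objectives and toward lagging or volatile ones. The class name ObjectiveReweighter, the window and temperature parameters, the objective names, the assumption that verifiable rewards lie in [0, 1], and the scoring formula are all hypothetical illustrations.

import numpy as np
from collections import deque

class ObjectiveReweighter:
    """Illustrative sketch: online reweighting of heterogeneous replay objectives.

    Assumptions (not from the paper): rewards are mean verifiable rewards in [0, 1],
    and the weight of each objective grows with its remaining headroom, its recent
    volatility, and how fast it is still moving.
    """

    def __init__(self, objectives, window=20, temperature=1.0):
        self.temperature = temperature
        # Short-horizon reward history per objective (sliding window).
        self.history = {name: deque(maxlen=window) for name in objectives}

    def update(self, rewards):
        # Record the latest mean verifiable reward observed for each objective.
        for name, r in rewards.items():
            self.history[name].append(float(r))

    def weights(self):
        # Turn short-horizon signals into sampling / loss weights for replay batches.
        scores = {}
        for name, hist in self.history.items():
            h = np.asarray(hist)
            if h.size < 2:
                scores[name] = 1.0  # not enough signal yet: neutral score
                continue
            headroom = 1.0 - h[-(h.size // 2):].mean()  # low recent reward -> more focus
            instability = h.std()                       # volatile reward -> more focus
            progress = abs(h[-1] - h[0]) / h.size       # high and flat -> saturated, less focus
            scores[name] = headroom + instability + progress
        # Softmax-normalize scores so weights sum to one.
        vals = np.array(list(scores.values())) / self.temperature
        probs = np.exp(vals - vals.max())
        probs /= probs.sum()
        return dict(zip(scores.keys(), probs))

Example usage, mixing replay batches across heterogeneous domains each training step (objective names and reward values are made up):

rw = ObjectiveReweighter(["math_reasoning", "perception", "faithfulness"], window=10)
for step in range(5):
    rw.update({
        "math_reasoning": 0.90,               # saturated: high and flat
        "perception": 0.40 + 0.05 * step,     # lagging but still improving
        "faithfulness": 0.60,                 # mid-level and flat
    })
print(rw.weights())  # "perception" receives the largest share of training focus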
Authors (9)
Hoang Phan
Xianjun Yang
Kevin Yao
Jingyu Zhang
Shengjie Bi
Xiaocheng Tang
+3 more
Submitted
October 24, 2025
Key Contributions
This paper addresses the critical issue of capability regression (forgetting foundational skills) in large reasoning models trained with Reinforcement Learning with Verifiable Rewards (RLVR). It proposes RECAP, a novel replay strategy that aims to prevent performance degradation on core capabilities by intelligently managing training focus across heterogeneous domains, going beyond standard regularization techniques like KL divergence.
Business Value
Ensures that advanced reasoning models retain essential foundational abilities, leading to more reliable and robust AI systems for complex applications. Reduces the need for costly retraining or fine-tuning to recover lost capabilities.