
RL Fine-Tuning Heals OOD Forgetting in SFT

📄 Abstract

The two-stage fine-tuning paradigm of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has empirically shown better reasoning performance than one-stage SFT for the post-training of Large Language Models (LLMs). However, the evolution and mechanism behind the synergy of SFT and RL remain under-explored and inconclusive. In our study, we find that the well-known claim "SFT memorizes, RL generalizes" is over-simplified, and discover that: (1) OOD performance peaks at the early stage of SFT and then declines (OOD forgetting); the best SFT checkpoint cannot be identified from training/test loss; (2) the subsequent RL stage does not generate fundamentally better OOD capability; instead, it plays an OOD restoration role, recovering the reasoning ability lost during SFT; (3) this recovery ability has boundaries, i.e., if SFT trains for too short or too long, RL cannot recover the lost OOD ability; (4) to uncover the underlying mechanisms behind the forgetting and restoration process, we employ SVD analysis on parameter matrices, manually edit them, and observe the impact on model performance. Contrary to the common belief that shifts in model capacity mainly result from changes in singular values, we find that singular values are actually quite stable throughout fine-tuning. Instead, OOD behavior strongly correlates with the rotation of singular vectors. Our findings re-identify the roles of SFT and RL in two-stage fine-tuning and identify the rotation of singular vectors as the key mechanism. Code is available at https://github.com/xiaodanguoguo/RL_Heals_SFT
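The SVD analysis the abstract describes can be illustrated with a minimal sketch: compare the same weight matrix at two checkpoints, measure how much the top singular values drift, and measure the rotation of the corresponding singular subspaces via principal angles. This is an illustrative reconstruction, not the paper's exact procedure; the function name and the choice of left singular vectors and top-k subspace are assumptions.

```python
import numpy as np

def svd_rotation_analysis(W_before, W_after, k=8):
    """Compare a weight matrix at two fine-tuning checkpoints.

    Returns (sv_drift, angles):
      sv_drift -- relative change in the top-k singular values
                  (the paper reports these stay nearly constant),
      angles   -- principal angles (radians) between the top-k left
                  singular subspaces (the "rotation" the paper links
                  to OOD behavior).
    Illustrative sketch; the paper's exact measurements may differ.
    """
    U0, S0, _ = np.linalg.svd(W_before, full_matrices=False)
    U1, S1, _ = np.linalg.svd(W_after, full_matrices=False)

    # Singular-value drift, relative to the pre-update spectrum.
    sv_drift = np.abs(S1[:k] - S0[:k]) / S0[:k]

    # Principal angles between the two top-k left singular subspaces:
    # the singular values of U0[:, :k]^T U1[:, :k] are the cosines
    # of those angles (clipped for numerical safety).
    cosines = np.linalg.svd(U0[:, :k].T @ U1[:, :k], compute_uv=False)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return sv_drift, angles
```

As a sanity check, left-multiplying a matrix by a random orthogonal matrix leaves its singular values unchanged but rotates its left singular vectors, so `sv_drift` should be near zero while `angles` is not.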
Authors (7)
Hangzhan Jin
Sitao Luan
Sicheng Lyu
Guillaume Rabusseau
Reihaneh Rabbany
Doina Precup
+1 more
Submitted
September 8, 2025
arXiv Category
cs.LG
arXiv PDF

Key Contributions

Discovers that Supervised Fine-Tuning (SFT) can lead to "OOD forgetting," where out-of-distribution performance degrades as training continues. Reinforcement Learning (RL) fine-tuning acts as an "OOD restoration" mechanism, recovering lost reasoning ability rather than generating fundamentally better OOD capability.

Business Value

Provides crucial insights into optimizing LLM fine-tuning processes, leading to models with better generalization and reasoning abilities, essential for reliable AI assistants and applications.