Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards

📄 Abstract

We present a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning. In this paradigm, the model generates its own responses and is immediately finetuned on this self-generated data. OSFT is a highly efficient training strategy for LLM reasoning, as it is reward-free and uses just one rollout by default. Experimental results show that OSFT achieves downstream performance on challenging mathematical reasoning tasks comparable to strong reinforcement learning with verifiable rewards (RLVR) methods such as GRPO. Our ablation study further demonstrates the efficiency and robustness of OSFT. The major mechanism of OSFT lies in facilitating the model's own existing preferences (latent knowledge) learned from pretraining, which leads to improved reasoning ability. We believe that OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms. Our code is available at https://github.com/ElementQi/OnlineSFT.
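
The training loop is simple enough to sketch from the abstract alone. Below is a minimal, illustrative version assuming a Hugging Face causal-LM setup; the model name, prompts, sampling settings, and learning rate are placeholders rather than the authors' configuration (see the linked repository for the actual implementation).

```python
# Illustrative sketch of the OSFT loop described above (not the authors' code).
# The model samples one rollout per prompt, then is immediately finetuned on that
# self-generated text with a standard cross-entropy SFT loss. No reward model or
# verifier is involved. All names and hyperparameters below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

prompts = ["Solve: what is 17 * 24? Show your reasoning."]  # placeholder math prompts

for prompt in prompts:
    # 1) Single rollout: sample the model's own response to the prompt (reward-free).
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(
            **inputs, max_new_tokens=256, do_sample=True, temperature=1.0
        )

    # 2) Online SFT step: immediately finetune on the self-generated sequence.
    #    The loss is computed only on the generated continuation, not the prompt.
    model.train()
    labels = generated.clone()
    labels[:, : inputs["input_ids"].shape[1]] = -100  # mask prompt tokens
    outputs = model(input_ids=generated, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The key point the abstract emphasizes is step 2: a plain cross-entropy SFT loss is applied to the rollout the model just produced, with no reward signal or verifier anywhere in the loop.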
Authors (5)
Mengqi Li
Lei Zhao
Anthony Man-Cho So
Ruoyu Sun
Xiao Li
Submitted: October 21, 2025
arXiv Category: cs.LG

Key Contributions

Introduces a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning that is reward-free and uses only one rollout by default. OSFT achieves downstream performance on mathematical reasoning tasks comparable to strong RLVR methods like GRPO, demonstrating surprising effectiveness and efficiency by leveraging the model's latent knowledge.

Business Value

OSFT offers a highly efficient and cost-effective way to improve LLM reasoning capabilities. This can accelerate the development and deployment of LLMs for tasks requiring complex reasoning, reducing training costs and time-to-market.
