Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards

📄 Abstract

We present a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning. In this paradigm, the model generates its own responses and is immediately finetuned on this self-generated data. OSFT is a highly efficient training strategy for LLM reasoning, as it is reward-free and uses just one rollout by default. Experimental results show that OSFT achieves downstream performance on challenging mathematical reasoning tasks comparable to strong reinforcement learning with verifiable rewards (RLVR) methods such as GRPO. Our ablation study further demonstrates the efficiency and robustness of OSFT. The major mechanism of OSFT lies in facilitating the model's own existing preferences (latent knowledge) learned from pretraining, which leads to improved reasoning ability. We believe that OSFT offers an efficient and promising alternative to more complex, reward-based training paradigms. Our code is available at https://github.com/ElementQi/OnlineSFT.
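
The training loop is simple enough to sketch from the abstract alone. Below is a minimal, illustrative version assuming a Hugging Face causal-LM setup; the model name, prompts, sampling settings, and learning rate are placeholders rather than the authors' configuration (see the linked repository for the actual implementation).

```python
# Illustrative sketch of the OSFT loop described above (not the authors' code).
# The model samples one rollout per prompt, then is immediately finetuned on that
# self-generated text with a standard cross-entropy SFT loss. No reward model or
# verifier is involved. All names and hyperparameters below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

prompts = ["Solve: what is 17 * 24? Show your reasoning."]  # placeholder math prompts

for prompt in prompts:
    # 1) Single rollout: sample the model's own response to the prompt (reward-free).
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        generated = model.generate(
            **inputs, max_new_tokens=256, do_sample=True, temperature=1.0
        )

    # 2) Online SFT step: immediately finetune on the self-generated sequence.
    #    The loss is computed only on the generated continuation, not the prompt.
    model.train()
    labels = generated.clone()
    labels[:, : inputs["input_ids"].shape[1]] = -100  # mask prompt tokens
    outputs = model(input_ids=generated, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The key point the abstract emphasizes is step 2: a plain cross-entropy SFT loss is applied to the rollout the model just produced, with no reward signal or verifier anywhere in the loop.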
Authors (5)
Mengqi Li
Lei Zhao
Anthony Man-Cho So
Ruoyu Sun
Xiao Li
Submitted: October 21, 2025
arXiv Category: cs.LG

Key Contributions

Introduces a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning that is reward-free and uses only one rollout by default. OSFT achieves downstream performance on mathematical reasoning tasks comparable to strong RLVR methods like GRPO, demonstrating surprising effectiveness and efficiency by leveraging the model's latent knowledge.

Business Value

OSFT offers a highly efficient and cost-effective way to improve LLM reasoning capabilities. This can accelerate the development and deployment of LLMs for tasks requiring complex reasoning, reducing training costs and time-to-market.
