Introduces a simple, self-help online supervised finetuning (OSFT) paradigm for LLM reasoning that is reward-free and, by default, uses only a single rollout per prompt. Despite this simplicity, OSFT matches strong RLVR methods such as GRPO on mathematical reasoning benchmarks, a surprising level of effectiveness and efficiency attributed to eliciting the model's latent knowledge rather than relying on external reward signals.
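To make the paradigm concrete, here is a minimal sketch of one OSFT step as the summary describes it: the model samples a single rollout for a prompt (no reward, no verifier), then is immediately finetuned on that rollout with a standard cross-entropy loss. The model name, hyperparameters, sampling settings, and prompt-masking choice are illustrative assumptions, not details from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model choice for illustration; the paper's setup may differ.
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def osft_step(prompt: str) -> float:
    """One online self-finetuning step: sample one rollout, then SFT on it."""
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]

    # 1) Online rollout: sample a single completion from the current policy.
    model.eval()
    with torch.no_grad():
        rollout = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=True,
            temperature=1.0,
            pad_token_id=tokenizer.eos_token_id,
        )

    # 2) Reward-free SFT: maximize likelihood of the self-generated tokens,
    #    masking the prompt so the loss covers only the completion.
    model.train()
    labels = rollout.clone()
    labels[:, :prompt_len] = -100  # -100 positions are ignored by the HF loss
    out = model(input_ids=rollout, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

loss = osft_step("Solve step by step: what is 17 * 24?")
print(f"OSFT step loss: {loss:.4f}")
```

Note how the loop needs no reward model, verifier, or group of rollouts: each step costs one generation plus one gradient update, which is where the efficiency advantage over methods like GRPO would come from under these assumptions.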
OSFT thus offers a highly efficient, cost-effective way to improve LLM reasoning capabilities, which could accelerate the development and deployment of LLMs for complex reasoning tasks while reducing training costs and time-to-market.