Abstract: Reinforcement learning (RL) has emerged as an effective approach for
enhancing the reasoning capabilities of large language models (LLMs),
especially in scenarios where supervised fine-tuning (SFT) falls short due to
limited chain-of-thought (CoT) data. Among RL-based post-training methods,
group relative advantage estimation, as exemplified by Group Relative Policy
Optimization (GRPO), has attracted considerable attention for eliminating the
dependency on the value model, thereby simplifying training compared to
traditional approaches like Proximal Policy Optimization (PPO). However, we
observe that existing group relative advantage estimation methods still suffer
from training inefficiencies, particularly when the estimated advantage
approaches zero. To address this limitation, we propose Advantage-Augmented
Policy Optimization (AAPO), a novel RL algorithm that optimizes the
cross-entropy (CE) loss using advantages enhanced through a momentum-based
estimation scheme. This approach effectively mitigates the inefficiencies
associated with group relative advantage estimation. Experimental results on
multiple mathematical reasoning benchmarks demonstrate the superior performance
of AAPO.
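
To make the motivation concrete, below is a minimal Python sketch contrasting GRPO-style group relative advantage estimation with a hypothetical momentum-augmented variant. The momentum baseline, the coefficient `beta`, and the function names are illustrative assumptions for exposition only; the abstract does not specify AAPO's exact estimation scheme.

```python
import torch


def group_relative_advantage(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantage: normalize rewards within a sampled response group.

    When every reward in the group is (nearly) identical, the numerator
    collapses toward zero and the policy-gradient signal vanishes -- the
    training inefficiency the abstract points to.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def momentum_augmented_advantage(rewards: torch.Tensor,
                                 prev_group_mean: float,
                                 beta: float = 0.9) -> torch.Tensor:
    """Hypothetical momentum-based augmentation (illustrative, not AAPO's formula).

    Blends the current group mean with an exponential moving average of past
    group means, so the advantage stays informative even when within-group
    reward variance is negligible.
    """
    ema_baseline = beta * prev_group_mean + (1.0 - beta) * rewards.mean()
    return rewards - ema_baseline


# Toy usage: a fully homogeneous reward group yields all-zero GRPO advantages,
# while the momentum-augmented estimate retains a non-degenerate signal.
rewards = torch.tensor([1.0, 1.0, 1.0, 1.0])
print(group_relative_advantage(rewards))                          # tensor of zeros
print(momentum_augmented_advantage(rewards, prev_group_mean=0.5)) # non-zero values
```

The sketch only illustrates the zero-advantage failure mode and one plausible remedy; the paper itself should be consulted for the precise advantage-augmentation and CE-loss formulation used by AAPO.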