📄 Abstract
Recently, Agentic Reinforcement Learning (Agentic RL) has made significant
progress in incentivizing the multi-turn, long-horizon tool-use capabilities of
web agents. While mainstream agentic RL algorithms autonomously explore
high-uncertainty tool-call steps under the guidance of entropy, over-reliance
on entropy signals can impose further constraints and lead to training
collapse. In this paper, we examine the challenges caused by entropy and
propose Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL
algorithm designed to balance entropy in both the rollout and policy-update
phases. AEPO comprises two core components: (1) a dynamic entropy-balanced
rollout mechanism that adaptively allocates the global and branch sampling
budgets through entropy pre-monitoring, while imposing a branch penalty on
consecutive high-entropy tool-call steps to prevent over-branching; and
(2) Entropy-Balanced Policy Optimization, which inserts a stop-gradient
operation into the high-entropy clipping term to preserve and properly rescale
gradients on high-entropy tokens, and incorporates entropy-aware advantage
estimation to prioritize learning on high-uncertainty tokens. Results across 14
challenging datasets show that AEPO consistently outperforms 7 mainstream RL
algorithms. With just 1K RL samples, Qwen3-14B trained with AEPO achieves
47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker at
Pass@1, and 65.0% on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on
WebWalker at Pass@5. Further analysis shows that AEPO improves rollout
sampling diversity while maintaining stable policy entropy, facilitating
scalable web agent training.
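
The policy-update side described above (a stop-gradient in the high-entropy
clipping term plus entropy-aware advantages) can be pictured with a small
PyTorch-style sketch. The snippet below is only an illustration under assumed
forms: the entropy-quantile threshold, the sigmoid advantage reweighting, and
the function name `aepo_style_policy_loss` are hypothetical and are not taken
from the paper.

```python
import torch

def aepo_style_policy_loss(logp_new, logp_old, advantages, token_entropy,
                           clip_eps=0.2, entropy_quantile=0.8, adv_alpha=0.1):
    """Toy entropy-balanced clipped surrogate. All inputs: shape (num_tokens,)."""
    ratio = torch.exp(logp_new - logp_old)

    # Entropy-aware advantage: upweight high-uncertainty tokens (illustrative form).
    adv = advantages * (1.0 + adv_alpha * torch.sigmoid(token_entropy.detach()))

    # Standard clipped term: gradient vanishes wherever the clip is active.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

    # For high-entropy tokens, rebuild the clipped value with a stop-gradient so
    # the forward value is unchanged but gradients still flow through `ratio`,
    # rescaled by the (detached) clip factor.
    high_entropy = token_entropy >= torch.quantile(token_entropy, entropy_quantile)
    sg_clipped = ratio * (clipped / ratio).detach()
    clip_term = torch.where(high_entropy, sg_clipped, clipped)

    surrogate = torch.minimum(ratio * adv, clip_term * adv)
    return -surrogate.mean()
```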
Authors (14)
Guanting Dong
Licheng Bao
Zhongyuan Wang
Kangzhi Zhao
Xiaoxi Li
Jiajie Jin
+8 more
Submitted
October 16, 2025
Key Contributions
Proposes Agentic Entropy-Balanced Policy Optimization (AEPO), a novel agentic RL algorithm designed to address the training collapse caused by excessive reliance on entropy signals. AEPO balances entropy in both the rollout and policy-update phases through a dynamic rollout mechanism with entropy pre-monitoring and a branch penalty, together with an entropy-balanced policy optimization step.
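
As a rough illustration of the rollout side, the sketch below shows how a
pre-monitored entropy signal could split a fixed sampling budget between global
rollouts and branch expansions, and how a penalty on consecutive high-entropy
steps could suppress over-branching. The function names, the exponential budget
mapping, and the threshold-scaling penalty are assumptions for illustration,
not the paper's implementation.

```python
import math

def split_rollout_budget(probe_entropy, total_budget=16, max_branch_frac=0.75):
    """Allocate more of the budget to branching when pre-monitored entropy is high.

    The saturating exponential mapping is an assumed heuristic, not the paper's rule.
    """
    branch_frac = max_branch_frac * (1.0 - math.exp(-probe_entropy))
    branch_budget = int(round(total_budget * branch_frac))
    global_budget = total_budget - branch_budget
    return global_budget, branch_budget

def should_branch(step_entropy, entropy_threshold, consecutive_high, penalty=0.5):
    """Branch at a high-entropy tool-call step, with a penalty that raises the
    branching threshold after consecutive high-entropy steps (anti over-branching)."""
    effective_threshold = entropy_threshold * (1.0 + penalty * consecutive_high)
    return step_entropy > effective_threshold
```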
Business Value
Enables the development of more robust and capable AI agents for complex tasks like web automation, customer service, and potentially robotic control, leading to increased efficiency and new service offerings.