📄 Abstract
Recently, Agentic Reinforcement Learning (Agentic RL) has made significant
progress in incentivizing the multi-turn, long-horizon tool-use capabilities of
web agents. While mainstream agentic RL algorithms autonomously explore
high-uncertainty tool-call steps under the guidance of entropy, over-reliance
on entropy signals can impose further constraints and lead to training
collapse. In this paper, we examine the challenges caused by entropy and
propose Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL
algorithm designed to balance entropy in both the rollout and policy-update
phases. AEPO comprises two core components: (1) a dynamic entropy-balanced
rollout mechanism that adaptively allocates the global and branch sampling
budgets through entropy pre-monitoring, while imposing a branch penalty on
consecutive high-entropy tool-call steps to prevent over-branching; and
(2) Entropy-Balanced Policy Optimization, which inserts a stop-gradient
operation into the high-entropy clipping term to preserve and properly rescale
gradients on high-entropy tokens, and incorporates entropy-aware advantage
estimation to prioritize learning on high-uncertainty tokens. Results across 14
challenging datasets show that AEPO consistently outperforms 7 mainstream RL
algorithms. With just 1K RL samples, Qwen3-14B trained with AEPO achieves
47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker at
Pass@1, and 65.0% on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on
WebWalker at Pass@5. Further analysis shows that AEPO improves rollout
sampling diversity while maintaining stable policy entropy, facilitating
scalable web agent training.
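
The policy-update side described above (a stop-gradient in the high-entropy
clipping term plus entropy-aware advantages) can be pictured with a small
PyTorch-style sketch. The snippet below is only an illustration under assumed
forms: the entropy-quantile threshold, the sigmoid advantage reweighting, and
the function name `aepo_style_policy_loss` are hypothetical and are not taken
from the paper.

```python
import torch

def aepo_style_policy_loss(logp_new, logp_old, advantages, token_entropy,
                           clip_eps=0.2, entropy_quantile=0.8, adv_alpha=0.1):
    """Toy entropy-balanced clipped surrogate. All inputs: shape (num_tokens,)."""
    ratio = torch.exp(logp_new - logp_old)

    # Entropy-aware advantage: upweight high-uncertainty tokens (illustrative form).
    adv = advantages * (1.0 + adv_alpha * torch.sigmoid(token_entropy.detach()))

    # Standard clipped term: gradient vanishes wherever the clip is active.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

    # For high-entropy tokens, rebuild the clipped value with a stop-gradient so
    # the forward value is unchanged but gradients still flow through `ratio`,
    # rescaled by the (detached) clip factor.
    high_entropy = token_entropy >= torch.quantile(token_entropy, entropy_quantile)
    sg_clipped = ratio * (clipped / ratio).detach()
    clip_term = torch.where(high_entropy, sg_clipped, clipped)

    surrogate = torch.minimum(ratio * adv, clip_term * adv)
    return -surrogate.mean()
```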
Authors (14)
Guanting Dong
Licheng Bao
Zhongyuan Wang
Kangzhi Zhao
Xiaoxi Li
Jiajie Jin
+8 more
Submitted
October 16, 2025
Key Contributions
Proposes Agentic Entropy-Balanced Policy Optimization (AEPO), a novel agentic RL algorithm designed to address the training collapse caused by excessive reliance on entropy signals. AEPO balances entropy in both the rollout and policy-update phases through a dynamic rollout mechanism with entropy pre-monitoring and a branch penalty, together with an entropy-balanced policy optimization step.
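
As a rough illustration of the rollout side, the sketch below shows how a
pre-monitored entropy signal could split a fixed sampling budget between global
rollouts and branch expansions, and how a penalty on consecutive high-entropy
steps could suppress over-branching. The function names, the exponential budget
mapping, and the threshold-scaling penalty are assumptions for illustration,
not the paper's implementation.

```python
import math

def split_rollout_budget(probe_entropy, total_budget=16, max_branch_frac=0.75):
    """Allocate more of the budget to branching when pre-monitored entropy is high.

    The saturating exponential mapping is an assumed heuristic, not the paper's rule.
    """
    branch_frac = max_branch_frac * (1.0 - math.exp(-probe_entropy))
    branch_budget = int(round(total_budget * branch_frac))
    global_budget = total_budget - branch_budget
    return global_budget, branch_budget

def should_branch(step_entropy, entropy_threshold, consecutive_high, penalty=0.5):
    """Branch at a high-entropy tool-call step, with a penalty that raises the
    branching threshold after consecutive high-entropy steps (anti over-branching)."""
    effective_threshold = entropy_threshold * (1.0 + penalty * consecutive_high)
    return step_entropy > effective_threshold
```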
Business Value
Enables the development of more robust and capable AI agents for complex tasks like web automation, customer service, and potentially robotic control, leading to increased efficiency and new service offerings.