DeepPlanner: Scaling Planning Capability for Deep Research Agents via Advantage Shaping

📄 Abstract

Large language models (LLMs) augmented with multi-step reasoning and action generation abilities have shown promise in leveraging external tools to tackle complex tasks that require long-horizon planning. However, existing approaches either rely on implicit planning in the reasoning stage or introduce explicit planners without systematically addressing how to optimize the planning stage. As evidence, we observe that under vanilla reinforcement learning (RL), planning tokens exhibit significantly higher entropy than other action tokens, revealing uncertain decision points that remain under-optimized. To address this, we propose DeepPlanner, an end-to-end RL framework that effectively enhances the planning capabilities of deep research agents. Our approach shapes token-level advantage with an entropy-based term to allocate larger updates to high-entropy tokens, and selectively upweights sample-level advantages for planning-intensive rollouts. Extensive experiments across seven deep research benchmarks demonstrate that DeepPlanner improves planning quality and achieves state-of-the-art results under a substantially lower training budget.
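
The entropy observation in the abstract can be illustrated with a minimal sketch. It assumes per-token logits and a boolean mask marking planning tokens are available; the function names `token_entropy` and `mean_entropy_by_group` and the mask format are hypothetical and do not come from the paper.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution for one token."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy_by_group(per_token_logits, is_planning):
    """Average entropy over planning tokens and over all other tokens.

    `is_planning` is a hypothetical per-token boolean mask; how planning
    tokens are identified in practice is not specified in the abstract.
    """
    plan, other = [], []
    for logits, flag in zip(per_token_logits, is_planning):
        (plan if flag else other).append(token_entropy(logits))

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(plan), mean(other)
```

A larger first value than second value would correspond to the pattern the abstract reports, where planning tokens are the more uncertain decision points.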

Authors (9)

Wei Fan

Wenlin Yao

Zheng Li

Feng Yao

Xin Liu

Liang Qiu

+3 more

Submitted

October 14, 2025

arXiv Category

cs.AI

Key Contributions

Proposes DeepPlanner, an end-to-end RL framework that enhances planning capabilities in LLM agents. It uses advantage shaping with an entropy-based term to allocate larger updates to high-entropy planning tokens and selectively upweights sample-level advantages for planning-intensive rollouts.
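
One way the two shaping steps could fit together is sketched below. This is an illustration under stated assumptions, not the paper's formula: the additive, sign-aligned entropy bonus, its clipping, and the parameters `alpha`, `clip_ratio`, `plan_weight`, and `plan_threshold` are all hypothetical, since the abstract does not give the exact form.

```python
def shape_advantages(advantage, entropies, planning_tokens,
                     alpha=0.1, clip_ratio=0.5,
                     plan_weight=1.5, plan_threshold=2):
    """Return one shaped advantage per token for a single rollout.

    advantage:       scalar sample-level advantage of the rollout
    entropies:       per-token policy entropies for the rollout
    planning_tokens: number of planning tokens in the rollout (assumed proxy
                     for "planning-intensive"; the paper's criterion may differ)
    """
    # Sample level: upweight rollouts that count as planning-intensive.
    sample_adv = advantage
    if planning_tokens >= plan_threshold:
        sample_adv = advantage * plan_weight

    # Token level: add an entropy-based term so high-entropy tokens receive
    # larger updates. The bonus is clipped relative to |advantage| and aligned
    # with its sign, so it never flips the direction of the update.
    sign = 1.0 if sample_adv >= 0 else -1.0
    shaped = []
    for h in entropies:
        bonus = min(alpha * h, clip_ratio * abs(sample_adv))
        shaped.append(sample_adv + sign * bonus)
    return shaped
```

For example, `shape_advantages(1.0, [0.0, 2.0], planning_tokens=0)` leaves the zero-entropy token at the base advantage and enlarges the update for the higher-entropy token. The clipping is a design choice made here to keep the entropy term from dominating the reward signal.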

Business Value

Enables the development of more capable AI agents that can autonomously perform complex, multi-step tasks, potentially automating research and development processes.

Paper Metadata

Innovation Type

Algorithmic

Deployment Feasibility

Moderate. Requires integration of LLMs with RL training pipelines. The advantage shaping mechanism needs careful tuning.

Limitations Addressed

Implicit planning in LLM reasoning stages, Lack of systematic optimization for explicit planners, High entropy (uncertainty) in planning tokens under vanilla RL

Performance Gains

Improves planning quality and achieves state-of-the-art results across seven deep research benchmarks under a substantially lower training budget, according to the abstract.

Technical Tags

deep reinforcement learning, long-horizon planning, tool use, LLM agents, advantage shaping, entropy regularization, planning tokens, action generation, RL optimization, research agents

Research Topics

Reinforcement Learning, Long-Horizon Planning, Agent Systems, LLM Integration

Methods & Architectures

Deep Reinforcement Learning (DRL), Advantage Shaping, Entropy-based Regularization, Token-level Advantage Updates, LLM-based Agents

Applications & Tasks

AI Research Agents, Robotics, Complex Task Solving, Long-Horizon Planning Challenges, Under-optimized Planning Stages, High Entropy in Planning Tokens, Enhancing planning capabilities of LLM agents, Tackling complex tasks requiring multi-step reasoning and tool use, Optimizing decision points in planning

Datasets & Benchmarks

Benchmarks

Seven deep research benchmarks (specific names are not given in the abstract)

Performance on complex tasks, Planning efficiency, Task success rate

Related Fields

Reinforcement Learning, Large Language Models, Robotics, AI Agents, Planning

Keywords

reinforcement learning, LLM agents, planning, long-horizon tasks, tool use, advantage shaping, entropy, deep learning, AI agents, robotics, research automation

Academic Context

#Reinforcement Learning, #Long-Horizon Planning, #Agent Systems, #LLM Integration

Commercial Potential

Potential Products

AI research assistants, Autonomous problem-solving agents, Robotic control systems

Target Industries

Technology, Research & Development, Robotics, Pharmaceuticals

Use Case Examples

An AI agent designing experiments, An AI system automating code development, Robots performing complex assembly tasks

Competitive Edge

Addresses a critical gap in LLM agent capabilities by systematically improving their long-horizon planning and decision-making through RL optimization.

Market Opportunity

The market for advanced AI agents and automation tools is rapidly expanding.

Revenue Models

Licensing of AI agent platforms, Development of specialized AI solutions

Resource Requirements

Compute Needs

High, due to the combination of LLMs and deep reinforcement learning training.

Data Requirements

Requires environments or tasks suitable for RL training, potentially involving tool use and complex state spaces.

Deployment Constraints

Training complexity and stability, Computational cost of RL training, Generalization to unseen tasks

Scalability

Scalability depends on the efficiency of the RL algorithm and the LLM architecture.

Production Readiness

Maturity Level

Research

Time to Market

3-5 years for robust deployment in complex applications.

Patent Potential

Moderate, for the DeepPlanner framework and advantage shaping techniques.
