
TPS-Bench: Evaluating AI Agents' Tool Planning & Scheduling Abilities in Compounding Tasks

Abstract

Large language model (LLM) agents have exhibited strong problem-solving competence across domains like research and coding. Yet it remains underexplored whether LLM agents can tackle compounding real-world problems that require a diverse set of tools to complete. Given a broad, heterogeneous tool repository, LLM agents must not only select appropriate tools based on task planning analysis but also strategically schedule the execution order to ensure efficiency. This paper introduces TPS-Bench to benchmark the ability of LLM agents to solve such problems, which demand Tool Planning and Scheduling. TPS-Bench collects 200 compounding tasks at two difficulty levels, built on a tool repository containing hundreds of Model Context Protocol (MCP) tools. Each task is composed of multiple subtasks, such as web search, map navigation, and calendar checking, and each subtask can be completed by a basic tool. Our evaluation emphasizes both task completion rate and efficiency. Empirical studies on popular closed-source and open-source LLMs indicate that most models can perform reasonable tool planning but differ in scheduling. For example, GLM-4.5 achieves a leading task completion rate of 64.72% through extensive sequential tool calls, at the cost of significantly longer execution times. By contrast, GPT-4o prioritizes parallel tool calls but achieves only a 45.08% completion rate. Since reinforcement learning (RL) can be a viable way to improve scheduling efficiency without compromising performance, we perform an initial study on Qwen3-1.7B and observe a 14% reduction in execution time alongside a 6% gain in task completion rate from merely 100 RL training samples. Our code is available at https://github.com/hanwenxu1/mcp-agent.
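To make the sequential-versus-parallel scheduling trade-off concrete, here is a minimal sketch in Python using asyncio. The tool names, arguments, and one-second latencies are hypothetical stand-ins, not the benchmark's actual tools or measurements; the point is only that independent subtasks accumulate latency when called one at a time but overlap when scheduled concurrently.

```python
import asyncio

# Hypothetical async wrappers around three MCP-style tools; the names
# and 1-second latencies are illustrative, not taken from TPS-Bench.
async def web_search(query: str) -> str:
    await asyncio.sleep(1.0)  # stand-in for network latency
    return f"results for {query!r}"

async def check_calendar(date: str) -> str:
    await asyncio.sleep(1.0)
    return f"free slots on {date}"

async def plan_route(origin: str, dest: str) -> str:
    await asyncio.sleep(1.0)
    return f"route from {origin} to {dest}"

async def sequential_schedule() -> list[str]:
    # One tool call at a time: total latency is the sum of all calls
    # (~3 s here), mirroring the sequential style the paper attributes
    # to GLM-4.5.
    return [
        await web_search("museum opening hours"),
        await check_calendar("Saturday"),
        await plan_route("hotel", "museum"),
    ]

async def parallel_schedule() -> list[str]:
    # Independent subtasks run concurrently: total latency approaches
    # the slowest single call (~1 s here), mirroring the parallel style
    # the paper attributes to GPT-4o.
    return await asyncio.gather(
        web_search("museum opening hours"),
        check_calendar("Saturday"),
        plan_route("hotel", "museum"),
    )

if __name__ == "__main__":
    print(asyncio.run(parallel_schedule()))
```

The catch the paper highlights is that parallelism only pays off when the scheduler correctly identifies which subtasks are truly independent; scheduling a dependent call concurrently can lower the completion rate even as it cuts latency.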

Key Contributions

Introduces TPS-Bench, a novel benchmark for evaluating LLM agents' abilities in tool planning and scheduling on compounding real-world problems. The benchmark addresses the underexplored question of whether LLM agents can tackle complex tasks requiring diverse tools by providing 200 tasks built on hundreds of MCP tools.
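As a rough illustration of how a compounding task decomposes into tool-backed subtasks, the sketch below models a task as subtasks with explicit dependencies. The schema and field names are hypothetical, not the benchmark's actual data format; subtasks with no dependencies are the ones a scheduler could safely run in parallel.

```python
from dataclasses import dataclass, field

# Hypothetical schema for a TPS-Bench-style compounding task;
# field names are illustrative, not the benchmark's real format.
@dataclass
class Subtask:
    description: str  # natural-language goal of this step
    tool: str         # name of the basic MCP tool that solves it
    depends_on: list[int] = field(default_factory=list)  # prerequisite subtask indices

@dataclass
class CompoundingTask:
    prompt: str       # the task given to the agent
    difficulty: str   # one of the two difficulty levels
    subtasks: list[Subtask]

task = CompoundingTask(
    prompt="Plan a Saturday museum visit that fits my calendar.",
    difficulty="easy",
    subtasks=[
        Subtask("look up museum opening hours", tool="web_search"),
        Subtask("check my free slots on Saturday", tool="calendar"),
        Subtask("get a route from home to the museum", tool="map_navigation"),
        # The final step depends on all three lookups above, so it must
        # run after them, while subtasks 0-2 can run in parallel.
        Subtask("pick a time and add it to the calendar", tool="calendar",
                depends_on=[0, 1, 2]),
    ],
)
```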

Business Value

Enables more robust and capable AI agents that can automate complex workflows by intelligently selecting and sequencing tools, leading to increased efficiency and new application possibilities in various industries.