Redirecting to original paper in 30 seconds...
Click below to go immediately or wait for automatic redirect
📄 Abstract
Abstract: Large language model (LLM) agents have exhibited strong problem-solving
competence across domains like research and coding. Yet, it remains
underexplored whether LLM agents can tackle compounding real-world problems
that require a diverse set of tools to complete. Given a broad, heterogeneous
tool repository, LLM agents must not only select appropriate tools based on
task planning analysis but also strategically schedule the execution order to
ensure efficiency. This paper introduces TPS-Bench to benchmark the ability of
LLM agents in solving such problems that demand Tool Planning and Scheduling.
TPS-Bench collects 200 compounding tasks of two difficulty levels, based on a
tool repository containing hundreds of model context protocol (MCP) tools. In
particular, each task is composed of multiple subtasks, such as web search, map
navigation, calendar checking, etc., and each subtask can be completed by a
basic tool. Our evaluation emphasizes both task completion rate and efficiency.
The empirical studies on popular closed-source and open-source LLMs indicate
that most models can perform reasonable tool planning, but differ in
scheduling. For example, GLM-4.5 achieves an outperforming task completion rate
of 64.72% with extensive sequential tool calls, hence suffering from
significantly long execution time. By contrast, GPT-4o prioritizes parallel
tool calls but achieves only a 45.08% completion rate. Considering
reinforcement learning (RL) can be a viable way to improve the scheduling
efficiency without compromising performance, we perform an initial study on
Qwen3-1.7B and witness a 14% reduction in execution time alongside a 6% gain in
task completion rate based on rarely 100 RL training samples. Our code is
available https://github.com/hanwenxu1/mcp-agent.
Key Contributions
Introduces TPS-Bench, a novel benchmark for evaluating LLM agents' abilities in tool planning and scheduling for compounding real-world problems. This benchmark addresses the underexplored area of LLM agents tackling complex tasks requiring diverse tools, by providing 200 tasks based on hundreds of MCP tools.
Business Value
Enables more robust and capable AI agents that can automate complex workflows by intelligently selecting and sequencing tools, leading to increased efficiency and new application possibilities in various industries.