
One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning

Abstract

Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique tasks that supports reinforcement learning with verifiable feedback. To evaluate tool-use RMs, we also introduce TRBench_BFCL, a benchmark built on the agentic evaluation suite BFCL. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward judgments. Beyond training objectives, ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling and reducing output token usage by over 66%. We release data and model checkpoints to facilitate future research.
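The Best-of-N sampling mentioned in the abstract follows a standard pattern: draw N candidate outputs from a policy model, score each with the reward model, and keep the highest-scoring one. A minimal sketch, where `generate_candidates` and `rm_score` are hypothetical placeholders (not ToolRM's actual API):

```python
def best_of_n(prompt, generate_candidates, rm_score, n=4):
    """Sample n candidate responses for a prompt and return the one
    the reward model scores highest (Best-of-N sampling)."""
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=lambda c: rm_score(prompt, c))


# Toy demonstration with stand-in generator and scorer (a real setup
# would call an LLM and a trained RM such as ToolRM).
if __name__ == "__main__":
    cands = lambda p, n: ["get_weather(city='Paris')",
                          "get_weather(location='Paris', units='C')",
                          "weather('Paris')"][:n]
    score = lambda p, c: -len(c)  # stand-in: pretend shorter calls score higher
    print(best_of_n("What's the weather in Paris?", cands, score, n=3))
```

This is also how inference-time scaling works here: increasing `n` trades extra compute for better expected reward, with the RM acting as the selector.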
Authors (7)
Renhao Li
Jianhong Tu
Yang Su
Hamid Alinejad-Rokny
Derek F. Wong
Junyang Lin
+1 more
Submitted
October 30, 2025
arXiv Category
cs.AI
arXiv PDF

Key Contributions

Introduces ToolRM, a family of lightweight generative reward models specifically for LLM tool-use (function-calling) tasks. It proposes a novel pipeline to construct pairwise preference data (ToolPref-Pairwise-30K) and evaluates RMs on TRBench_BFCL, showing significant accuracy improvements over frontier models.
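The pipeline's core idea of turning rule-based scores into pairwise preference data can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: `rule_score` stands in for the paper's rule-based scorer, and the margin threshold is an assumed knob for filtering out near-ties.

```python
from itertools import combinations


def build_pairwise_prefs(candidates, rule_score, margin=0.0):
    """Form (chosen, rejected) preference pairs from rule-based scores.

    candidates: candidate model outputs (e.g., function-call strings)
    rule_score: hypothetical stand-in for a rule-based scoring function
    margin:     assumed threshold; pairs whose score gap is <= margin
                are dropped as uninformative near-ties
    """
    scored = [(c, rule_score(c)) for c in candidates]
    pairs = []
    for (a, sa), (b, sb) in combinations(scored, 2):
        if sa - sb > margin:
            pairs.append((a, b))   # a preferred over b
        elif sb - sa > margin:
            pairs.append((b, a))   # b preferred over a
    return pairs
```

Pairs produced this way provide the verifiable feedback signal the abstract refers to: because each label comes from a deterministic rule, correctness of a critique can be checked mechanically during reinforcement learning.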

Business Value

Enables the creation of more reliable and capable AI agents that can interact with external tools and APIs, leading to more powerful applications and automation.