📄 Abstract
Reward models (RMs) play a critical role in aligning large language models
(LLMs) with human preferences. Yet in the domain of tool learning, the lack of
RMs specifically designed for function-calling tasks has limited progress
toward more capable agentic AI. We introduce ToolRM, a family of lightweight
generative RMs tailored for general tool-use scenarios. To build these models,
we propose a novel pipeline that constructs pairwise preference data using
rule-based scoring and multidimensional sampling. This yields
ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique
tasks that supports reinforcement learning with verifiable feedback. To
evaluate tool-use RMs, we also introduce TRBench$_{BFCL}$, a benchmark built on
the agentic evaluation suite BFCL. Trained on our constructed data, models from
the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially
outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward
judgments. Beyond training objectives, ToolRM generalizes to broader critique
tasks, including Best-of-N sampling and self-correction. Experiments on
ACEBench highlight its effectiveness and efficiency, enabling inference-time
scaling and reducing output token usage by over 66%. We release data and model
checkpoints to facilitate future research.
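The abstract mentions applying a pairwise reward model to Best-of-N sampling. The sketch below illustrates one generic way to do that, a sequential knockout over candidates; it is not taken from the paper. The `PairwiseJudge` signature, `best_of_n`, `toy_judge`, and the example tool calls are all hypothetical, and `toy_judge` is a trivial stand-in where a trained RM such as ToolRM would be used in practice.

```python
from typing import Callable, Sequence

# Hypothetical judge interface (an assumption, not ToolRM's actual API):
# returns 0 if the first candidate is preferred, 1 if the second is.
PairwiseJudge = Callable[[str, str, str], int]


def best_of_n(prompt: str, candidates: Sequence[str], judge: PairwiseJudge) -> str:
    """Select one candidate by sequential knockout.

    The current winner is compared against each remaining candidate,
    so N candidates cost N - 1 judge calls.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    winner = candidates[0]
    for challenger in candidates[1:]:
        if judge(prompt, winner, challenger) == 1:
            winner = challenger
    return winner


def toy_judge(prompt: str, a: str, b: str) -> int:
    # Stand-in for a trained reward model: prefers the call that names
    # the (hypothetical) required argument "city".
    def score(call: str) -> int:
        return int('"city"' in call)

    return 1 if score(b) > score(a) else 0


candidates = [
    'get_weather({"location": "Paris"})',
    'get_weather({"city": "Paris"})',
    'get_weather({})',
]
print(best_of_n("What is the weather in Paris?", candidates, toy_judge))
```

A knockout keeps the number of judge calls linear in N; a full round-robin would be more robust to a noisy judge at quadratic cost.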
Authors (7)
Renhao Li
Jianhong Tu
Yang Su
Hamid Alinejad-Rokny
Derek F. Wong
Junyang Lin
+1 more
Submitted
October 30, 2025
Key Contributions
Introduces ToolRM, a family of lightweight generative reward models built specifically for LLM tool-use (function-calling) tasks. Proposes a pipeline for constructing pairwise preference data (ToolPref-Pairwise-30K) and introduces TRBench$_{BFCL}$ for evaluating tool-use RMs, reporting accuracy gains of up to 14.28% and outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward judgments.
Business Value
Enables the creation of more reliable and capable AI agents that can interact with external tools and APIs, leading to more powerful applications and automation.