
One Model to Critique Them All: Rewarding Agentic Tool-Use via Efficient Reasoning

Abstract

Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight generative RMs tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging dataset of critique tasks that supports reinforcement learning with verifiable feedback. To evaluate tool-use RMs, we also introduce TRBench_BFCL, a benchmark built on the agentic evaluation suite BFCL. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 14.28% higher accuracy, substantially outperforming frontier models such as Claude 4 and OpenAI o3 in pairwise reward judgments. Beyond training objectives, ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling and reducing output token usage by over 66%. We release data and model checkpoints to facilitate future research.
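The Best-of-N sampling mentioned in the abstract follows a standard pattern: draw N candidate outputs from a policy model, score each with the reward model, and keep the highest-scoring one. A minimal sketch, where `generate_candidates` and `rm_score` are hypothetical placeholders (not ToolRM's actual API):

```python
def best_of_n(prompt, generate_candidates, rm_score, n=4):
    """Sample n candidate responses for a prompt and return the one
    the reward model scores highest (Best-of-N sampling)."""
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=lambda c: rm_score(prompt, c))


# Toy demonstration with stand-in generator and scorer (a real setup
# would call an LLM and a trained RM such as ToolRM).
if __name__ == "__main__":
    cands = lambda p, n: ["get_weather(city='Paris')",
                          "get_weather(location='Paris', units='C')",
                          "weather('Paris')"][:n]
    score = lambda p, c: -len(c)  # stand-in: pretend shorter calls score higher
    print(best_of_n("What's the weather in Paris?", cands, score, n=3))
```

This is also how inference-time scaling works here: increasing `n` trades extra compute for better expected reward, with the RM acting as the selector.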
Authors (7)
Renhao Li
Jianhong Tu
Yang Su
Hamid Alinejad-Rokny
Derek F. Wong
Junyang Lin
+1 more
Submitted
October 30, 2025
arXiv Category
cs.AI
arXiv PDF

Key Contributions

Introduces ToolRM, a family of lightweight generative reward models specifically for LLM tool-use (function-calling) tasks. It proposes a novel pipeline to construct pairwise preference data (ToolPref-Pairwise-30K) and evaluates RMs on TRBench_BFCL, showing significant accuracy improvements over frontier models.
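The pipeline's core idea of turning rule-based scores into pairwise preference data can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: `rule_score` stands in for the paper's rule-based scorer, and the margin threshold is an assumed knob for filtering out near-ties.

```python
from itertools import combinations


def build_pairwise_prefs(candidates, rule_score, margin=0.0):
    """Form (chosen, rejected) preference pairs from rule-based scores.

    candidates: candidate model outputs (e.g., function-call strings)
    rule_score: hypothetical stand-in for a rule-based scoring function
    margin:     assumed threshold; pairs whose score gap is <= margin
                are dropped as uninformative near-ties
    """
    scored = [(c, rule_score(c)) for c in candidates]
    pairs = []
    for (a, sa), (b, sb) in combinations(scored, 2):
        if sa - sb > margin:
            pairs.append((a, b))   # a preferred over b
        elif sb - sa > margin:
            pairs.append((b, a))   # b preferred over a
    return pairs
```

Pairs produced this way provide the verifiable feedback signal the abstract refers to: because each label comes from a deterministic rule, correctness of a critique can be checked mechanically during reinforcement learning.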

Business Value

Enables the creation of more reliable and capable AI agents that can interact with external tools and APIs, leading to more powerful applications and automation.