
Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design

Abstract

This paper investigates Reinforcement Learning (RL) approaches to enhance the reasoning capabilities of Large Language Model (LLM) agents in long-horizon, multi-turn scenarios. Although RL algorithms such as Group Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO) have been widely applied to train multi-turn LLM agents, they typically rely only on sparse outcome rewards and lack dense intermediate signals across multiple decision steps, limiting their performance on complex reasoning tasks. To bridge this gap, we present the first systematic study of turn-level reward design for multi-turn RL algorithms and agent applications. By integrating turn-level rewards, we extend GRPO and PPO to their respective multi-turn variants, enabling fine-grained credit assignment. We conduct case studies on multi-turn reasoning-augmented search agents, where we carefully design two types of turn-level rewards: verifiable and LLM-as-judge. Our experiments on multi-turn search tasks demonstrate that incorporating well-designed turn-level rewards enables RL algorithms to significantly outperform baseline methods that use only trajectory-level rewards. Both training and validation reward curves show that our method achieves greater stability, faster convergence, and higher accuracy. Numerical results across diverse question-answering datasets further show that our approach consistently delivers the highest answer correctness and 100% format correctness.
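To make the idea of dense intermediate signals concrete, here is a minimal sketch of how per-turn rewards might be blended with a sparse trajectory-level outcome reward. The function name, the additive weighting, and the discount factor are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: combining a sparse outcome reward with dense
# turn-level rewards for one multi-turn trajectory.

def trajectory_return(turn_rewards, outcome_reward, turn_weight=0.5, gamma=1.0):
    """Blend per-turn rewards with the final outcome reward.

    turn_rewards: one float per agent turn (e.g. a verifiable format
        check or an LLM-as-judge score for that turn).
    outcome_reward: sparse trajectory-level signal (e.g. answer correctness).
    turn_weight, gamma: assumed hyperparameters controlling how much
        the dense signal contributes and how it is discounted over turns.
    """
    dense = sum((gamma ** t) * r for t, r in enumerate(turn_rewards))
    return outcome_reward + turn_weight * dense

# Example: three turns, each scored by a binary per-turn check,
# plus a correct final answer.
ret = trajectory_return([1.0, 0.0, 1.0], outcome_reward=1.0, turn_weight=0.5)
print(ret)  # 2.0
```

Because each turn contributes its own reward term, an advantage estimator (as in the paper's multi-turn GRPO/PPO variants) can attribute credit to individual decisions rather than spreading a single outcome reward uniformly across the whole trajectory.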
Authors (11)
Quan Wei
Siliang Zeng
Chenliang Li
William Brown
Oana Frunza
Wei Deng
+5 more
Submitted
May 17, 2025
arXiv Category
cs.LG
arXiv PDF

Key Contributions

This paper systematically studies 'turn-level reward design' for Reinforcement Learning (RL) in multi-turn LLM agents. By integrating turn-level rewards (verifiable and LLM-as-judge), it enhances fine-grained credit assignment, improving LLM reasoning capabilities in complex, long-horizon scenarios.
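The paper's "verifiable" turn-level rewards can be illustrated with a simple programmatic check on a search agent's turn. This is a hedged sketch: the `<think>`/`<search>`/`<answer>` tag names and the binary scoring are assumptions for illustration, not the authors' exact reward function.

```python
import re

def format_reward(turn_text: str) -> float:
    """Return 1.0 if a turn contains a reasoning block followed by a
    well-formed action (a search call or a final answer), else 0.0.

    Tag names are illustrative assumptions for a reasoning-augmented
    search agent; any deterministic check works as a verifiable reward.
    """
    has_think = re.search(r"<think>.*?</think>", turn_text, re.DOTALL)
    has_action = re.search(r"<search>.+?</search>|<answer>.+?</answer>",
                           turn_text, re.DOTALL)
    return 1.0 if (has_think and has_action) else 0.0

good = "<think>need population data</think><search>population of Oslo</search>"
bad = "population of Oslo"
print(format_reward(good), format_reward(bad))  # 1.0 0.0
```

In contrast, an LLM-as-judge turn-level reward would replace this regex check with a scoring prompt to a judge model, trading determinism for the ability to grade open-ended reasoning quality.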

Business Value

Enables the development of more capable and reliable AI agents that can perform complex reasoning tasks over extended interactions, leading to better conversational AI, advanced search tools, and more sophisticated autonomous systems.