
Reinforcement Learning for Long-Horizon Multi-Turn Search Agents


Abstract

Large Language Model (LLM) agents can leverage multiple turns and tools to solve complex tasks, with prompt-based approaches achieving strong performance. This work demonstrates that Reinforcement Learning (RL) can push capabilities significantly further by learning from experience. Through experiments on a legal document search benchmark, we show that our RL-trained 14-billion-parameter model outperforms frontier-class models (85% vs 78% accuracy). In addition, we explore turn-restricted regimes, during training and at test time, which show that these agents achieve better results when allowed to operate over longer multi-turn horizons.
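To make the turn-restricted setting concrete, here is a minimal sketch of a multi-turn search episode with a hard turn budget. It is an illustration under stated assumptions, not the authors' implementation: the names `run_episode`, `toy_policy`, `toy_search`, `reward`, and the tiny `CORPUS` are all hypothetical, the policy is a hand-written stand-in for an LLM, and the exact-match reward is assumed because the summary does not describe the paper's reward or RL algorithm.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical action format: ("search", query) or ("answer", text).
Action = Tuple[str, str]


@dataclass
class Episode:
    question: str
    # Each entry pairs a search action with the observation it produced.
    history: List[Tuple[Action, str]] = field(default_factory=list)
    answer: Optional[str] = None


def run_episode(
    question: str,
    policy: Callable[[Episode], Action],
    search_tool: Callable[[str], str],
    max_turns: int,
) -> Episode:
    """Roll out one episode, stopping at an answer or when the turn budget runs out."""
    ep = Episode(question)
    for _ in range(max_turns):
        action = policy(ep)
        kind, payload = action
        if kind == "answer":
            ep.answer = payload
            break
        # Any non-answer action is treated as a search call in this sketch.
        observation = search_tool(payload)
        ep.history.append((action, observation))
    return ep


def reward(ep: Episode, gold: str) -> float:
    """Assumed reward: 1.0 for an exact-match answer, 0.0 otherwise (including no answer)."""
    return 1.0 if ep.answer == gold else 0.0


# Toy stand-ins so the sketch runs end to end.
CORPUS = {"doc-1": "lease termination clause", "doc-2": "indemnity clause"}


def toy_search(query: str) -> str:
    """Return comma-joined ids of documents whose text contains the query."""
    return ",".join(doc_id for doc_id, text in CORPUS.items() if query in text)


def toy_policy(ep: Episode) -> Action:
    """Search once, then answer with whatever the last search returned."""
    if not ep.history:
        return ("search", "indemnity")
    return ("answer", ep.history[-1][1])
```

In this toy setup, `run_episode(..., max_turns=2)` leaves room for one search followed by an answer, while `max_turns=1` exhausts the budget on the search and ends with no answer, so the assumed reward is zero. An RL trainer would collect many such episodes and use the rewards to update the policy; that step is omitted here.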

Authors (2)

Vivek Kalyan

Martin Andrews

Submitted

October 28, 2025

arXiv Category

cs.CL


Key Contributions

This work shows that Reinforcement Learning can push LLM agents on long-horizon, multi-turn tasks beyond what prompt-based approaches achieve. On a legal document search benchmark, the RL-trained 14-billion-parameter model reached 85% accuracy versus 78% for frontier-class models, and agents performed better when allowed longer multi-turn horizons during training and at test time.

Business Value

Enables more sophisticated and efficient AI agents for complex information retrieval and task completion, particularly in specialized domains like legal research.

Paper Metadata

Innovation Type

Methodology Improvement

Deployment Feasibility

Moderate; requires RL training infrastructure and careful integration with LLMs and tools.

Limitations Addressed

Limitations of prompt-based approaches for complex, multi-turn tasks requiring learning from experience.

Performance Gains

85% accuracy for the RL-trained 14-billion-parameter model vs 78% for frontier-class models on the legal document search benchmark.
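The two reported accuracies can be read in more than one way. The short sketch below computes the absolute gain, the relative gain, and the relative reduction in error rate. The helper name `compare` is hypothetical, and only the 85% and 78% figures come from the summary.

```python
from typing import Tuple


def compare(acc_new: float, acc_base: float) -> Tuple[float, float, float]:
    """Return (absolute gain, relative gain, relative error reduction)."""
    abs_gain = acc_new - acc_base
    rel_gain = abs_gain / acc_base
    # Error rate is 1 - accuracy, so the drop in errors equals the accuracy gain.
    err_reduction = abs_gain / (1.0 - acc_base)
    return abs_gain, rel_gain, err_reduction


abs_gain, rel_gain, err_reduction = compare(0.85, 0.78)
print(f"absolute gain: {abs_gain * 100:.1f} percentage points")
print(f"relative gain: {rel_gain:.1%}")
print(f"relative error reduction: {err_reduction:.1%}")
```

The gap is 7 percentage points, which is roughly a 9% relative gain in accuracy and, because the error rate falls from 22% to 15%, close to a one-third reduction in errors.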

Technical Tags

Reinforcement Learning (RL), Long-horizon tasks, Multi-turn agents, LLM agents, Search agents, Legal document search, Prompt-based learning, Tool use

Research Topics

Reinforcement Learning, Agent-based Systems, Information Retrieval, Large Language Models, Sequential Decision Making

Methods & Architectures

Reinforcement Learning training, Prompt-based learning, Multi-turn interaction, Tool integration, LLM agents, Reinforcement Learning-trained models

Applications & Tasks

Information Retrieval, Legal Tech, AI Agents, Search Engines, Improving LLM agent capabilities for complex tasks, Learning optimal strategies for long-horizon tasks, Enhancing search efficiency and accuracy, Long-horizon multi-turn search, Legal document analysis, Complex task solving with LLM agents

Datasets & Benchmarks

Datasets

Legal document search benchmark

Benchmarks

Legal document search (accuracy): 85% (RL-trained) vs 78% (frontier-class models)

Related Fields

Machine Learning, Artificial Intelligence, Information Retrieval, Legal Informatics, Agent Systems

Keywords

Reinforcement Learning, LLM agents, multi-turn, long-horizon, search, legal tech, prompting, tool use, sequential decision making, AI agents, learning from experience

Academic Context

Reinforcement Learning, Agent-based Systems, Information Retrieval, Large Language Models, Sequential Decision Making

Commercial Potential

Potential Products

Advanced legal research assistants, Intelligent search agents for complex domains, RL-powered task completion systems

Target Industries

Legal Services, Technology, Research, Finance

Use Case Examples

Automating complex legal discovery processes, Building AI agents that can perform multi-step research, Developing more capable virtual assistants for specialized knowledge domains

Competitive Edge

Advances the state-of-the-art in LLM agent capabilities by leveraging RL for long-horizon tasks, surpassing current prompt-based methods.

Market Opportunity

Growing demand for advanced AI agents and automation tools.

Revenue Models

Licensing of AI agent technology, service-based solutions.

Resource Requirements

Compute Needs

High; RL training requires significant compute.

Data Requirements

Access to relevant task-specific datasets (e.g., legal documents).

Deployment Constraints

Integration complexity, potential for unexpected behavior in RL agents.

Scalability

Scalability depends on the RL algorithm and the LLM's capacity.

Regulatory Considerations

Potential ethical considerations for autonomous agents in sensitive domains.

Production Readiness

Maturity Level

Research

Time to Market

2-4 years for robust deployment.
