📄 Abstract
Reinforcement learning (RL) has recently emerged as a compelling approach for
enhancing the reasoning capabilities of large language models (LLMs), where an
LLM generator serves as a policy guided by a verifier (reward model). However,
current RL post-training methods for LLMs typically use verifiers that are
fixed (rule-based or frozen pretrained) or trained discriminatively via
supervised fine-tuning (SFT). Such designs are susceptible to reward hacking
and generalize poorly beyond their training distributions. To overcome these
limitations, we propose Tango, a novel framework that uses RL to concurrently
train both an LLM generator and a verifier in an interleaved manner. A central
innovation of Tango is its generative, process-level LLM verifier, which is
trained via RL and co-evolves with the generator. Importantly, the verifier is
trained solely based on outcome-level verification correctness rewards without
requiring explicit process-level annotations. This generative RL-trained
verifier exhibits improved robustness and superior generalization compared to
deterministic or SFT-trained verifiers, fostering effective mutual
reinforcement with the generator. Extensive experiments demonstrate that both
components of Tango achieve state-of-the-art results among 7B/8B-scale models:
the generator attains best-in-class performance across five competition-level
math benchmarks and four challenging out-of-domain reasoning tasks, while the
verifier leads on the ProcessBench dataset. Remarkably, both components exhibit
particularly substantial improvements on the most difficult mathematical
reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
Authors (6)
Kaiwen Zha
Zhengqi Gao
Maohao Shen
Zhang-Wei Hong
Duane S. Boning
Dina Katabi
Key Contributions
Introduces Tango, a novel RL framework that concurrently trains an LLM generator and a verifier in an interleaved manner. A key innovation is its generative, process-level LLM verifier trained via RL, which co-evolves with the generator. This verifier is trained solely on outcome-level correctness rewards, avoiding explicit process-level annotations and addressing limitations of fixed or discriminatively trained verifiers.
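As a rough illustration of the interleaved scheme described above, the sketch below alternates RL updates between a generator and a process-level verifier. All names here (Generator, Verifier, rl_update, the reward shaping) are hypothetical stand-ins rather than the authors' implementation; in Tango both roles are LLM policies and the updates are actual RL (policy-gradient) steps.

```python
# Illustrative sketch of a Tango-style interleaved generator/verifier training loop.
# Everything below is a hypothetical stub: real systems would use LLM policies and
# an RL optimizer (e.g., a policy-gradient method) in place of these placeholders.
import random

class Generator:
    """Stub LLM policy: produces a multi-step solution for a problem."""
    def sample(self, problem):
        return {"steps": [f"step {i} for {problem}" for i in range(3)],
                "final_answer": random.choice(["A", "B"])}

class Verifier:
    """Stub generative process-level verifier: judges each reasoning step."""
    def score_steps(self, problem, solution):
        # Per-step correctness judgments (the process-level signal fed to the generator).
        return [random.random() for _ in solution["steps"]]

def outcome_reward(solution, gold_answer):
    """Outcome-level reward: 1 if the final answer matches the reference, else 0."""
    return 1.0 if solution["final_answer"] == gold_answer else 0.0

def rl_update(model, rollouts, rewards):
    """Placeholder for an RL policy update; a no-op in this stub."""
    pass

def train_interleaved(problems, gold_answers, rounds=3):
    generator, verifier = Generator(), Verifier()
    for _ in range(rounds):
        # Generator phase: rewarded by outcome correctness plus the co-evolving
        # verifier's process-level scores.
        gen_rollouts, gen_rewards = [], []
        for problem, gold in zip(problems, gold_answers):
            sol = generator.sample(problem)
            process_scores = verifier.score_steps(problem, sol)
            reward = outcome_reward(sol, gold) + sum(process_scores) / len(process_scores)
            gen_rollouts.append(sol)
            gen_rewards.append(reward)
        rl_update(generator, gen_rollouts, gen_rewards)

        # Verifier phase: rewarded only by outcome-level verification correctness,
        # i.e., whether its verdict agrees with the ground-truth outcome.
        ver_rollouts, ver_rewards = [], []
        for problem, gold in zip(problems, gold_answers):
            sol = generator.sample(problem)
            verdict = min(verifier.score_steps(problem, sol))  # low if any step is flagged
            predicted_correct = verdict > 0.5
            actually_correct = outcome_reward(sol, gold) == 1.0
            ver_rollouts.append((sol, verdict))
            ver_rewards.append(1.0 if predicted_correct == actually_correct else 0.0)
        rl_update(verifier, ver_rollouts, ver_rewards)
    return generator, verifier

if __name__ == "__main__":
    train_interleaved(["p1", "p2"], ["A", "B"])
```

The detail the sketch tries to capture is that the verifier's reward depends only on whether its overall verdict matches the outcome-level ground truth, never on per-step annotations, even though the verifier itself reasons at the step level.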
Business Value
Leads to more reliable and robust LLMs capable of complex reasoning, reducing the risk of undesirable behaviors such as reward hacking and improving performance in critical applications.