
Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning

Abstract

Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic's helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
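The two-player paradigm described above reduces to a simple generate-critique-refine loop. The sketch below illustrates one such episode; the callables and their signatures (actor_generate, critic_feedback, actor_refine) are assumed placeholders for model calls, not the paper's actual implementation.

```python
from typing import Callable, Tuple

def rollout(
    question: str,
    actor_generate: Callable[[str], str],
    critic_feedback: Callable[[str, str], Tuple[bool, str]],
    actor_refine: Callable[[str, str, str], str],
) -> Tuple[str, bool, str, str]:
    """One actor-critic-actor episode: generate, critique, refine."""
    # The actor produces an initial response to the question.
    response = actor_generate(question)

    # The critic returns a binary quality verdict (its discriminability)
    # and natural-language feedback (its helpfulness).
    verdict, feedback = critic_feedback(question, response)

    # The actor revises its response conditioned on the critic's feedback;
    # in Critique-RL, the quality of this refinement is what supplies the
    # indirect reward signal for the critic.
    refined = actor_refine(question, response, feedback)

    return response, verdict, feedback, refined
```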
Authors (18)
Zhiheng Xi
Jixuan Huang
Xin Guo
Boyang Hong
Dingwen Yang
Xiaoran Fan
+12 more
Submitted: October 28, 2025
arXiv Category: cs.CL

Key Contributions

Critique-RL introduces an online RL approach for training critiquing language models without supervision from a stronger model. Its two-stage optimization targets the failure mode where naive indirect rewards make a critic's feedback more constructive (helpfulness) while leaving its judgment of response quality (discriminability) poor: stage I reinforces discriminability with direct rule-based rewards, and stage II adds indirect rewards from actor refinement while regularization preserves the discriminability gains, as sketched below.
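As a concrete illustration of the two stages, the hedged sketch below assigns rewards to the critic. The helper names, the binary correctness checks, and the weighting term alpha are assumptions for exposition; the paper's actual reward formulas and regularization may differ.

```python
def stage1_reward(verdict: bool, response_is_correct: bool) -> float:
    """Stage I: direct rule-based reward for discriminability.

    The critic earns reward when its quality verdict matches a
    rule-checkable ground truth (e.g., final-answer correctness).
    """
    return 1.0 if verdict == response_is_correct else 0.0

def stage2_reward(
    verdict: bool,
    response_is_correct: bool,
    refined_is_correct: bool,
    alpha: float = 0.5,  # assumed regularization weight, not from the paper
) -> float:
    """Stage II: indirect reward from actor refinement (helpfulness),
    regularized by the stage-I term so discriminability is maintained
    while the critic learns to give useful feedback.
    """
    helpfulness = 1.0 if refined_is_correct else 0.0
    discriminability = stage1_reward(verdict, response_is_correct)
    return helpfulness + alpha * discriminability
```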

Business Value

Enables more reliable and helpful AI systems by letting LLMs improve their own outputs through trained critique models rather than costly annotation from stronger supervisors, yielding higher-quality results on complex reasoning tasks.