Abstract
Training critiquing language models to assess and provide feedback on model
outputs is a promising way to improve LLMs for complex reasoning tasks.
However, existing approaches typically rely on stronger supervisors for
annotating critique data. To address this, we propose Critique-RL, an online RL
approach for developing critiquing language models without stronger
supervision. Our approach operates on a two-player paradigm: the actor
generates a response, the critic provides feedback, and the actor refines the
response accordingly. We first reveal that relying solely on indirect reward
signals from the actor's outputs for RL optimization often leads to
unsatisfactory critics: while their helpfulness (i.e., providing constructive
feedback) improves, the discriminability (i.e., determining whether a response
is high-quality or not) remains poor, resulting in marginal performance gains.
To overcome this, Critique-RL adopts a two-stage optimization strategy. In
stage I, it reinforces the discriminability of the critic with direct
rule-based reward signals; in stage II, it introduces indirect rewards based on
actor refinement to improve the critic's helpfulness, while maintaining its
discriminability via appropriate regularization. Extensive experiments across
various tasks and models show that Critique-RL delivers substantial performance
improvements. For example, it achieves a 9.02% gain on in-domain tasks and a
5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
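The abstract outlines the two-stage reward design only at a high level, so the following is a minimal Python sketch of how such rewards could be wired up; the function names, the binary verdict/correctness encoding, and the alpha weighting are illustrative assumptions, not the paper's implementation.

# Hypothetical sketch of the two-stage reward scheme described above.
# All names and weights are assumptions for illustration only.

def stage1_reward(critic_verdict: bool, response_is_correct: bool) -> float:
    """Stage I: direct rule-based reward for discriminability.

    The critic is rewarded when its judgment of the actor's response
    (correct / incorrect) matches a rule-based check of that response.
    """
    return 1.0 if critic_verdict == response_is_correct else 0.0


def stage2_reward(
    critic_verdict: bool,
    response_is_correct: bool,
    refined_is_correct: bool,
    alpha: float = 0.5,  # assumed weight on the discriminability term
) -> float:
    """Stage II: indirect reward from actor refinement plus regularization.

    The helpfulness term rewards the critic when the actor's refined
    response is correct; the regularization term keeps the stage-I
    discriminability signal in play so it does not degrade.
    """
    helpfulness = 1.0 if refined_is_correct else 0.0
    discriminability = stage1_reward(critic_verdict, response_is_correct)
    return helpfulness + alpha * discriminability


# Example: the critic correctly flags a wrong response, and the actor's
# refinement then fixes it.
r1 = stage1_reward(critic_verdict=False, response_is_correct=False)  # 1.0
r2 = stage2_reward(critic_verdict=False, response_is_correct=False,
                   refined_is_correct=True)                          # 1.5
print(r1, r2)

The separation mirrors the failure mode the abstract describes: optimizing only the stage-II helpfulness term leaves discriminability unconstrained, so the regularization term re-injects the stage-I signal.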
Authors (18)
Zhiheng Xi
Jixuan Huang
Xin Guo
Boyang Hong
Dingwen Yang
Xiaoran Fan
+12 more
Submitted
October 28, 2025
Key Contributions
Critique-RL introduces an online RL approach for training critiquing language models without stronger supervision. Its two-stage optimization strategy first reinforces the critic's discriminability with direct rule-based rewards, then improves its helpfulness via indirect rewards from actor refinement, addressing the common failure mode where critics become more helpful but remain poor discriminators and therefore yield only marginal performance gains.
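The two-player paradigm behind Critique-RL is a generate-critique-refine loop. The sketch below is a schematic of that loop under assumed interfaces; Actor, Critic, and their generate/critique/refine methods are hypothetical placeholders, not the authors' API.

# Schematic of the actor-critic interaction loop described above.
# The Actor and Critic interfaces are assumed for illustration.

from dataclasses import dataclass


@dataclass
class Actor:
    name: str

    def generate(self, problem: str) -> str:
        # Placeholder: in practice this would call an LLM.
        return f"draft answer to: {problem}"

    def refine(self, problem: str, response: str, critique: str) -> str:
        return f"revised answer to: {problem} (addressing: {critique})"


@dataclass
class Critic:
    name: str

    def critique(self, problem: str, response: str) -> str:
        # Placeholder: the critic judges the response and suggests fixes.
        return "the derivation skips a step; recheck the final computation"


def interaction_round(actor: Actor, critic: Critic, problem: str) -> str:
    response = actor.generate(problem)             # actor produces a response
    feedback = critic.critique(problem, response)  # critic provides feedback
    refined = actor.refine(problem, response, feedback)  # actor refines
    return refined


print(interaction_round(Actor("actor"), Critic("critic"), "2 + 2 = ?"))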
Business Value
Enables the development of more reliable and helpful AI systems by allowing LLMs to self-improve through better feedback mechanisms, leading to higher quality outputs in complex tasks.