📄 Abstract
Aligning language models using LLM judge feedback offers a scalable
alternative to human annotation, yet it is plagued by judgment inconsistencies
that destabilize reinforcement learning. While prior work has focused on judge
accuracy, the critical issue of logical coherence, particularly preference
cycles, has been largely unaddressed. To close this gap, this work introduces
an end-to-end framework that systematically detects and resolves these
inconsistencies within the reinforcement learning training loop. Our framework
features two core contributions: the Conflict Detection Rate (CDR), a novel
metric to quantify judgment conflicts, and Deconflicted Graph Rewards (DGR), a
signal-purification framework that eliminates cycles before policy
optimization. DGR constructs preference graphs from raw judgments, transforms
them into conflict-free Directed Acyclic Graphs (DAGs), and generates a
logically coherent reward signal compatible with any policy optimizer.
Experiments confirm that our framework significantly improves training
stability and model performance over strong baselines, establishing logical
consistency as a crucial and now-addressable dimension of AI feedback. The code
for our method is available at https://github.com/modelscope/RM-Gallery.
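To make the graph-based deconfliction idea concrete, below is a minimal Python sketch, not the paper's exact DGR algorithm. It assumes pairwise (winner, loser) judgments, a simple greedy cycle-breaking heuristic, and a reward defined as the number of responses a candidate transitively beats in the resulting DAG; the edge-selection and reward rules in DGR may differ, and all function names here are illustrative.

```python
# Hypothetical sketch of a DGR-style reward purification step (not the paper's exact algorithm).
# Assumes: pairwise judgments over candidate responses, a greedy cycle-breaking heuristic,
# and a reward equal to the number of responses each candidate transitively beats in the DAG.
import networkx as nx

def build_preference_graph(judgments):
    """judgments: iterable of (winner, loser) pairs produced by an LLM judge."""
    g = nx.DiGraph()
    for winner, loser in judgments:
        g.add_edge(winner, loser)
    return g

def deconflict_to_dag(g):
    """Greedily remove one edge per detected cycle until the graph is acyclic."""
    g = g.copy()
    while not nx.is_directed_acyclic_graph(g):
        cycle = nx.find_cycle(g)          # list of (u, v) edges forming a cycle
        g.remove_edge(*cycle[-1][:2])     # drop one edge; the real DGR may choose differently
    return g

def dag_rewards(dag):
    """Reward each response by how many other responses it (transitively) dominates."""
    return {node: len(nx.descendants(dag, node)) for node in dag.nodes}

# Example: A > B, B > C, C > A is a preference cycle; after deconflicting,
# the remaining DAG yields a logically coherent reward ordering.
judgments = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]
dag = deconflict_to_dag(build_preference_graph(judgments))
print(dag_rewards(dag))
```

The rewards produced this way can then be fed to any policy optimizer, which is the compatibility property the abstract attributes to DGR.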
Authors (11)
Boyin Liu
Zhuo Zhang
Sen Huang
Lipeng Xie
Qingxu Fu
Haoran Chen
+5 more
Submitted
October 17, 2025
Key Contributions
This work introduces a framework to detect and resolve logical inconsistencies, specifically preference cycles, in LLM judge feedback, enabling stable reinforcement learning. It proposes the Conflict Detection Rate (CDR), a metric that quantifies judgment conflicts, and Deconflicted Graph Rewards (DGR), a signal-purification framework that transforms preference graphs into conflict-free DAGs to generate a coherent reward signal, addressing a gap left by prior work that focused on judge accuracy.
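As an illustration of how a conflict-rate metric in the spirit of CDR could be computed, here is a hedged Python sketch that measures the fraction of fully judged response triads forming a preference cycle (e.g., A > B, B > C, C > A). The paper's actual CDR definition may count or aggregate conflicts differently; all names below are hypothetical.

```python
# Illustrative sketch of a conflict-rate metric (the paper's exact CDR formula may differ).
# Here: the fraction of fully judged triads whose pairwise judgments form a cycle.
from itertools import combinations

def conflict_detection_rate(prefs):
    """prefs: dict mapping an ordered pair (a, b) -> True if the judge preferred a over b."""
    def beats(a, b):
        if (a, b) in prefs:
            return prefs[(a, b)]
        if (b, a) in prefs:
            return not prefs[(b, a)]
        return None  # pair not judged

    items = {x for pair in prefs for x in pair}
    judged_triads = cyclic_triads = 0
    for a, b, c in combinations(sorted(items), 3):
        outcomes = (beats(a, b), beats(b, c), beats(c, a))
        if None in outcomes:
            continue  # skip triads with missing judgments
        judged_triads += 1
        # A cycle occurs when the rotation is all wins (a>b, b>c, c>a)
        # or all losses (b>a, c>b, a>c); any mixed outcome is transitive.
        if all(outcomes) or not any(outcomes):
            cyclic_triads += 1
    return cyclic_triads / judged_triads if judged_triads else 0.0

# Example: a single cyclic triad yields a conflict rate of 1.0.
print(conflict_detection_rate({("A", "B"): True, ("B", "C"): True, ("C", "A"): True}))
```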
Business Value
Improves the reliability and stability of training large language models with AI feedback, which can reduce training costs and yield better-performing models for applications that require aligned AI behavior.