📄 Abstract
Aligning language models using LLM judge feedback offers a scalable
alternative to human annotation, yet it is plagued by judgment inconsistencies
that destabilize reinforcement learning. While prior work has focused on judge
accuracy, the critical issue of logical coherence, particularly preference
cycles, has been largely unaddressed. To close this gap, this work introduces
an end-to-end framework that systematically detects and resolves these
inconsistencies within the reinforcement learning training loop. Our framework
features two core contributions: the Conflict Detection Rate (CDR), a novel
metric to quantify judgment conflicts, and Deconflicted Graph Rewards (DGR), a
signal-purification framework that eliminates cycles before policy
optimization. DGR constructs preference graphs from raw judgments, transforms
them into conflict-free Directed Acyclic Graphs (DAGs), and generates a
logically coherent reward signal compatible with any policy optimizer.
Experiments confirm that our framework significantly improves training
stability and model performance over strong baselines, establishing logical
consistency as a crucial and now-addressable dimension of AI feedback. The code
for our method is available at https://github.com/modelscope/RM-Gallery.
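To make the graph-based deconfliction idea concrete, below is a minimal Python sketch, not the paper's exact DGR algorithm. It assumes pairwise (winner, loser) judgments, a simple greedy cycle-breaking heuristic, and a reward defined as the number of responses a candidate transitively beats in the resulting DAG; the edge-selection and reward rules in DGR may differ, and all function names here are illustrative.

```python
# Hypothetical sketch of a DGR-style reward purification step (not the paper's exact algorithm).
# Assumes: pairwise judgments over candidate responses, a greedy cycle-breaking heuristic,
# and a reward equal to the number of responses each candidate transitively beats in the DAG.
import networkx as nx

def build_preference_graph(judgments):
    """judgments: iterable of (winner, loser) pairs produced by an LLM judge."""
    g = nx.DiGraph()
    for winner, loser in judgments:
        g.add_edge(winner, loser)
    return g

def deconflict_to_dag(g):
    """Greedily remove one edge per detected cycle until the graph is acyclic."""
    g = g.copy()
    while not nx.is_directed_acyclic_graph(g):
        cycle = nx.find_cycle(g)          # list of (u, v) edges forming a cycle
        g.remove_edge(*cycle[-1][:2])     # drop one edge; the real DGR may choose differently
    return g

def dag_rewards(dag):
    """Reward each response by how many other responses it (transitively) dominates."""
    return {node: len(nx.descendants(dag, node)) for node in dag.nodes}

# Example: A > B, B > C, C > A is a preference cycle; after deconflicting,
# the remaining DAG yields a logically coherent reward ordering.
judgments = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]
dag = deconflict_to_dag(build_preference_graph(judgments))
print(dag_rewards(dag))
```

The rewards produced this way can then be fed to any policy optimizer, which is the compatibility property the abstract attributes to DGR.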
Authors (11)
Boyin Liu
Zhuo Zhang
Sen Huang
Lipeng Xie
Qingxu Fu
Haoran Chen
+5 more
Submitted
October 17, 2025
Key Contributions
This work introduces a framework to detect and resolve logical inconsistencies, specifically preference cycles, in LLM judge feedback, enabling stable reinforcement learning. It proposes the Conflict Detection Rate (CDR), a metric that quantifies judgment conflicts, and Deconflicted Graph Rewards (DGR), a signal-purification framework that transforms preference graphs into conflict-free DAGs to generate a coherent reward signal, addressing a gap left by prior work that focused on judge accuracy.
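As an illustration of how a conflict-rate metric in the spirit of CDR could be computed, here is a hedged Python sketch that measures the fraction of fully judged response triads forming a preference cycle (e.g., A > B, B > C, C > A). The paper's actual CDR definition may count or aggregate conflicts differently; all names below are hypothetical.

```python
# Illustrative sketch of a conflict-rate metric (the paper's exact CDR formula may differ).
# Here: the fraction of fully judged triads whose pairwise judgments form a cycle.
from itertools import combinations

def conflict_detection_rate(prefs):
    """prefs: dict mapping an ordered pair (a, b) -> True if the judge preferred a over b."""
    def beats(a, b):
        if (a, b) in prefs:
            return prefs[(a, b)]
        if (b, a) in prefs:
            return not prefs[(b, a)]
        return None  # pair not judged

    items = {x for pair in prefs for x in pair}
    judged_triads = cyclic_triads = 0
    for a, b, c in combinations(sorted(items), 3):
        outcomes = (beats(a, b), beats(b, c), beats(c, a))
        if None in outcomes:
            continue  # skip triads with missing judgments
        judged_triads += 1
        # A cycle occurs when the rotation is all wins (a>b, b>c, c>a)
        # or all losses (b>a, c>b, a>c); any mixed outcome is transitive.
        if all(outcomes) or not any(outcomes):
            cyclic_triads += 1
    return cyclic_triads / judged_triads if judged_triads else 0.0

# Example: a single cyclic triad yields a conflict rate of 1.0.
print(conflict_detection_rate({("A", "B"): True, ("B", "C"): True, ("C", "A"): True}))
```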
Business Value
Improves the reliability and stability of training large language models with AI feedback, which can reduce training costs and yield better-performing models for applications that require aligned AI behavior.