ADPO: Anchored Direct Preference Optimization

📄 Abstract

Anchored Direct Preference Optimization (ADPO) is a unified framework that generalizes Direct Preference Optimization (DPO) with soft preferences, reference-policy anchoring, and groupwise extensions. While standard DPO assumes hard binary labels and pairwise comparisons, ADPO introduces: (i) soft preference probabilities that encode uncertainty and mitigate gradient drift; (ii) arbitrary reference-policy anchors that stabilize training via groupwise shift invariance and implicit KL regularization; and (iii) listwise preference modeling through Plackett-Luce distributions. We prove that DPO, Bradley-Terry objectives, and Top-1-vs-Rest formulations emerge as special cases. ADPO yields three practical variants: pairwise anchored Soft-DPO, listwise anchored Soft-DPO with raw rewards, and KDE-based listwise smoothing for heavy-tailed noise. In contextual bandits, anchoring improves WinMass by 38-63% over standard DPO, while KDE smoothing achieves 0.68 vs 0.32 under heavy-tailed contamination (112% relative gain). In sequential reinforcement learning (CartPole, LunarLander), anchoring improves noisy-preference performance by 15-29%, confirming transfer from single-step to multi-step settings. Experiments with 10-256 parameter models provide clear guidance: use pairwise anchored Soft-DPO for clean or moderate noise, and KDE-based listwise ADPO for extreme contamination.
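To make the soft-label and anchoring ideas concrete, here is a minimal PyTorch sketch of what a pairwise anchored Soft-DPO loss could look like. The function name anchored_soft_dpo_loss, the argument layout, and the beta default are illustrative assumptions based on the abstract, not the paper's reference implementation.

```python
# Hypothetical sketch of a pairwise anchored Soft-DPO loss (names are illustrative).
import torch
import torch.nn.functional as F

def anchored_soft_dpo_loss(policy_logp_w, policy_logp_l,
                           anchor_logp_w, anchor_logp_l,
                           soft_pref, beta=0.1):
    """Soft-label DPO loss anchored to a reference policy.

    policy_logp_w / policy_logp_l: log-probs of the preferred / dispreferred
        responses under the policy being trained.
    anchor_logp_w / anchor_logp_l: log-probs under the frozen anchor policy.
    soft_pref: probability in [0, 1] that the "winner" is truly preferred
        (1.0 recovers hard-label DPO).
    beta: inverse temperature controlling the implicit KL strength.
    """
    # Reward margin, expressed relative to the anchor policy (shift-invariant).
    margin = beta * ((policy_logp_w - policy_logp_l)
                     - (anchor_logp_w - anchor_logp_l))
    # Binary cross-entropy against the soft preference label.
    loss = -(soft_pref * F.logsigmoid(margin)
             + (1.0 - soft_pref) * F.logsigmoid(-margin))
    return loss.mean()
```

With soft_pref fixed to 1.0 and the anchor set to the usual reference model, this reduces to the standard DPO objective, which is consistent with the special-case claim in the abstract.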
Authors (1): Wang Zixian
Submitted: October 21, 2025
arXiv Category: cs.LG

Key Contributions

ADPO is a unified framework that generalizes DPO with soft preference probabilities and reference-policy anchoring. Soft labels encode uncertainty and mitigate gradient drift, while anchoring to an arbitrary reference policy stabilizes training through groupwise shift invariance and implicit KL regularization. The framework also supports more flexible preference modeling, from pairwise comparisons to listwise rankings via Plackett-Luce distributions, yielding significant gains in contextual bandits and in sequential RL tasks (CartPole, LunarLander); a listwise sketch follows below.
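As a rough illustration of the listwise direction, the sketch below scores a full ranking with a Plackett-Luce negative log-likelihood over anchor-relative scores. The function plackett_luce_nll and its exact normalization are assumptions for exposition; the paper's listwise variants (raw rewards, KDE smoothing) may differ in detail.

```python
# Illustrative Plackett-Luce ranking loss over anchored scores (not the paper's exact form).
import torch

def plackett_luce_nll(policy_logp, anchor_logp, ranking, beta=0.1):
    """Negative log-likelihood of a ranking under a Plackett-Luce model.

    policy_logp / anchor_logp: (K,) log-probs of K candidate responses under
        the trained policy and the frozen anchor policy.
    ranking: length-K index sequence ordered from most to least preferred.
    """
    scores = beta * (policy_logp - anchor_logp)   # anchor-relative scores
    ordered = scores[ranking]                     # best-to-worst order
    # Plackett-Luce: at each step, the chosen item competes with all remaining items.
    nll = 0.0
    for k in range(len(ordered)):
        nll = nll - (ordered[k] - torch.logsumexp(ordered[k:], dim=0))
    return nll
```

Keeping only the first term of this sum corresponds to a Top-1-vs-Rest objective, one of the special cases the paper identifies.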

Business Value

Enables more robust and stable training of AI systems that learn from human preferences, which is crucial for applications such as LLM alignment, personalized recommendation, and autonomous systems.