ADPO: Anchored Direct Preference Optimization

📄 Abstract

Anchored Direct Preference Optimization (ADPO) is a unified framework that generalizes Direct Preference Optimization (DPO) with soft preferences, reference-policy anchoring, and groupwise extensions. While standard DPO assumes hard binary labels and pairwise comparisons, ADPO introduces: (i) soft preference probabilities that encode uncertainty and mitigate gradient drift; (ii) arbitrary reference-policy anchors that stabilize training via groupwise shift invariance and implicit KL regularization; and (iii) listwise preference modeling through Plackett-Luce distributions. We prove that DPO, Bradley-Terry objectives, and Top-1-vs-Rest formulations emerge as special cases. ADPO yields three practical variants: pairwise anchored Soft-DPO, listwise anchored Soft-DPO with raw rewards, and KDE-based listwise smoothing for heavy-tailed noise. In contextual bandits, anchoring improves WinMass by 38-63% over standard DPO, while KDE smoothing achieves 0.68 vs 0.32 under heavy-tailed contamination (112% relative gain). In sequential reinforcement learning (CartPole, LunarLander), anchoring improves noisy-preference performance by 15-29%, confirming transfer from single-step to multi-step settings. Experiments with 10-256 parameter models provide clear guidance: use pairwise anchored Soft-DPO for clean or moderate noise, and KDE-based listwise ADPO for extreme contamination.
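To make the soft-label and anchoring ideas concrete, here is a minimal PyTorch sketch of what a pairwise anchored Soft-DPO loss could look like. The function name anchored_soft_dpo_loss, the argument layout, and the beta default are illustrative assumptions based on the abstract, not the paper's reference implementation.

```python
# Hypothetical sketch of a pairwise anchored Soft-DPO loss (names are illustrative).
import torch
import torch.nn.functional as F

def anchored_soft_dpo_loss(policy_logp_w, policy_logp_l,
                           anchor_logp_w, anchor_logp_l,
                           soft_pref, beta=0.1):
    """Soft-label DPO loss anchored to a reference policy.

    policy_logp_w / policy_logp_l: log-probs of the preferred / dispreferred
        responses under the policy being trained.
    anchor_logp_w / anchor_logp_l: log-probs under the frozen anchor policy.
    soft_pref: probability in [0, 1] that the "winner" is truly preferred
        (1.0 recovers hard-label DPO).
    beta: inverse temperature controlling the implicit KL strength.
    """
    # Reward margin, expressed relative to the anchor policy (shift-invariant).
    margin = beta * ((policy_logp_w - policy_logp_l)
                     - (anchor_logp_w - anchor_logp_l))
    # Binary cross-entropy against the soft preference label.
    loss = -(soft_pref * F.logsigmoid(margin)
             + (1.0 - soft_pref) * F.logsigmoid(-margin))
    return loss.mean()
```

With soft_pref fixed to 1.0 and the anchor set to the usual reference model, this reduces to the standard DPO objective, which is consistent with the special-case claim in the abstract.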
Authors (1): Wang Zixian
Submitted: October 21, 2025
arXiv Category: cs.LG

Key Contributions

ADPO is a unified framework that generalizes DPO with soft preference probabilities and reference-policy anchoring. Soft labels encode uncertainty and mitigate gradient drift, while anchoring to an arbitrary reference policy stabilizes training through groupwise shift invariance and implicit KL regularization. The framework also supports more flexible preference modeling, from pairwise comparisons to listwise rankings via Plackett-Luce distributions, yielding significant gains in contextual bandits and in sequential RL tasks (CartPole, LunarLander); a listwise sketch follows below.
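As a rough illustration of the listwise direction, the sketch below scores a full ranking with a Plackett-Luce negative log-likelihood over anchor-relative scores. The function plackett_luce_nll and its exact normalization are assumptions for exposition; the paper's listwise variants (raw rewards, KDE smoothing) may differ in detail.

```python
# Illustrative Plackett-Luce ranking loss over anchored scores (not the paper's exact form).
import torch

def plackett_luce_nll(policy_logp, anchor_logp, ranking, beta=0.1):
    """Negative log-likelihood of a ranking under a Plackett-Luce model.

    policy_logp / anchor_logp: (K,) log-probs of K candidate responses under
        the trained policy and the frozen anchor policy.
    ranking: length-K index sequence ordered from most to least preferred.
    """
    scores = beta * (policy_logp - anchor_logp)   # anchor-relative scores
    ordered = scores[ranking]                     # best-to-worst order
    # Plackett-Luce: at each step, the chosen item competes with all remaining items.
    nll = 0.0
    for k in range(len(ordered)):
        nll = nll - (ordered[k] - torch.logsumexp(ordered[k:], dim=0))
    return nll
```

Keeping only the first term of this sum corresponds to a Top-1-vs-Rest objective, one of the special cases the paper identifies.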

Business Value

Enables more robust and stable training of AI systems that learn from human preferences, which is crucial for applications such as LLM alignment, personalized recommendation, and autonomous systems.