📄 Abstract
Anchored Direct Preference Optimization (ADPO) is a unified framework that
generalizes Direct Preference Optimization (DPO) with soft preferences,
reference-policy anchoring, and groupwise extensions. While standard DPO
assumes hard binary labels and pairwise comparisons, ADPO introduces: (i) soft
preference probabilities that encode uncertainty and mitigate gradient drift;
(ii) arbitrary reference-policy anchors that stabilize training via groupwise
shift invariance and implicit KL regularization; and (iii) listwise preference
modeling through Plackett-Luce distributions. We prove that DPO, Bradley-Terry
objectives, and Top-1-vs-Rest formulations emerge as special cases. ADPO yields
three practical variants: pairwise anchored Soft-DPO, listwise anchored
Soft-DPO with raw rewards, and KDE-based listwise smoothing for heavy-tailed
noise. In contextual bandits, anchoring improves WinMass by 38-63% over
standard DPO, while KDE smoothing achieves a WinMass of 0.68 vs. 0.32 under heavy-tailed
contamination (112% relative gain). In sequential reinforcement learning
(CartPole, LunarLander), anchoring improves noisy-preference performance by
15-29%, confirming transfer from single-step to multi-step settings.
Experiments with models of 10-256 parameters provide clear guidance: use pairwise
anchored Soft-DPO under clean or moderately noisy preferences, and KDE-based
listwise ADPO under extreme contamination.
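
To make the anchored pairwise objective concrete, here is a minimal sketch of how pairwise anchored Soft-DPO could be implemented, assuming a PyTorch setup; the function name, the log-ratio anchoring, and the soft-label cross-entropy form are inferred from the abstract rather than taken from the paper's code.

```python
# Hypothetical sketch of the pairwise anchored Soft-DPO loss (PyTorch assumed).
# Variable names and the exact soft-label / anchoring forms are inferred from
# the abstract, not taken from the paper's reference implementation.
import torch
import torch.nn.functional as F

def anchored_soft_dpo_loss(
    logp_chosen,       # log pi_theta(y_w | x), shape (B,)
    logp_rejected,     # log pi_theta(y_l | x), shape (B,)
    ref_logp_chosen,   # log pi_ref(y_w | x), shape (B,) -- reference-policy anchor
    ref_logp_rejected, # log pi_ref(y_l | x), shape (B,)
    soft_labels,       # p in [0, 1]: probability that y_w is preferred, shape (B,)
    beta: float = 0.1,
):
    # Anchored implicit rewards: log-ratio of the policy to the anchor.
    # Subtracting the anchor term provides the implicit KL regularization
    # mentioned in the abstract.
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)

    # Bradley-Terry logit on the reward margin.
    logits = reward_chosen - reward_rejected

    # Soft preference target instead of a hard 0/1 label; with soft_labels == 1
    # this reduces to the standard DPO objective -log(sigmoid(logits)).
    return F.binary_cross_entropy_with_logits(logits, soft_labels)
```

With soft_labels fixed at 1 this collapses to standard DPO, matching the special-case claim above; the soft labels are what let noisy or uncertain comparisons contribute a partial rather than all-or-nothing gradient signal.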
Submitted October 21, 2025
Key Contributions
ADPO is a unified framework that generalizes DPO by introducing soft preference probabilities and reference-policy anchoring. This framework stabilizes training, mitigates gradient drift, and allows for more flexible preference modeling (pairwise, listwise) using Plackett-Luce distributions, leading to significant performance improvements in contextual bandits.
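
As a companion sketch, the listwise variant mentioned here can be illustrated under the same assumptions, simplified to the Top-1-vs-Rest stage of the Plackett-Luce model: each candidate is scored by its anchored log-ratio and a softmax is fit to soft target weights.

```python
# Hypothetical sketch of a listwise anchored objective (PyTorch assumed).
# Only the first (Top-1-vs-Rest) stage of a Plackett-Luce model with soft
# targets is shown; this is an illustration inferred from the abstract, not
# the paper's exact formulation.
import torch

def anchored_listwise_loss(logp, ref_logp, target_weights, beta: float = 0.1):
    """
    logp, ref_logp : (B, K) log-probs of K candidates under the policy and the
                     reference (anchor) policy.
    target_weights : (B, K) soft preference weights per candidate; rows sum to 1.
    """
    # Anchored scores. Adding a constant to every candidate in a group cancels
    # inside the softmax below -- the groupwise shift invariance from the abstract.
    scores = beta * (logp - ref_logp)

    # Softmax over the candidate group, trained against soft preference targets.
    log_probs = torch.log_softmax(scores, dim=-1)
    return -(target_weights * log_probs).sum(dim=-1).mean()
```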
Business Value
Enables more robust and stable training of AI systems that learn from human preferences, crucial for applications like LLM alignment, personalized recommendations, and autonomous systems.