Why DPO is a Misspecified Estimator and How to Fix It

Abstract

Direct alignment algorithms such as Direct Preference Optimization (DPO) fine-tune models on preference data using only supervised learning, in place of two-stage reinforcement learning from human feedback (RLHF). We show that DPO encodes a statistical estimation problem over reward functions induced by a parametric policy class. When the true reward function that generates the preferences cannot be realized within this policy class, DPO becomes misspecified, leading to failure modes such as preference-order reversal, degradation of policy reward, and high sensitivity to the distribution of the input preference data. In contrast, we study the local behavior of two-stage RLHF for a parametric class and relate it to a natural gradient step in policy space. This fine-grained geometric characterization allows us to propose AuxDPO, which introduces additional auxiliary variables into the DPO loss function to move towards the RLHF solution in a principled manner and mitigate DPO's misspecification. We empirically demonstrate the superior performance of AuxDPO on didactic bandit settings as well as LLM alignment tasks.
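For context, the estimation problem the abstract refers to is the standard DPO objective under the Bradley–Terry preference model; the restatement below is the usual formulation from the DPO literature, and the paper's exact notation may differ. Misspecification arises when the true reward that generated the preferences lies outside the implied reward class.

```latex
% Standard DPO objective over preference triples (x, y_w, y_l):
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]

% Reward class implicitly induced by the parametric policy class
% (defined up to a prompt-dependent constant):
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```

If the true reward \(r^\*\) that produced the preference data is not of the form \(r_\theta\) for any \(\theta\), the estimator is misspecified in the sense the paper analyzes.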
Authors: Aditya Gopalan, Sayak Ray Chowdhury, Debangshu Banerjee
Submitted: October 23, 2025
arXiv Category: cs.LG

Key Contributions

This paper demonstrates that Direct Preference Optimization (DPO) can be a misspecified estimator when the true reward function is not realizable by the policy class, leading to failure modes such as preference-order reversal and degradation of policy reward. It characterizes the local behavior of two-stage RLHF as a natural gradient step in policy space and proposes AuxDPO, which introduces auxiliary variables into the DPO loss to mitigate the misspecification and guide the model towards the RLHF solution. This offers a principled way to improve direct alignment methods.
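To make the estimation view concrete, here is a minimal sketch of the standard DPO loss computed from per-sequence log-probabilities, assuming PyTorch; it is not the paper's AuxDPO implementation. The comment marks where AuxDPO's auxiliary variables would enter the loss, but their exact form is not specified in this summary, so that placement is only a hypothetical note.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss from summed per-sequence log-probs (each tensor has shape [batch])."""
    # Implicit rewards induced by the policy class: beta * log(pi_theta / pi_ref)
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)

    # Bradley-Terry log-likelihood of observing "chosen preferred over rejected".
    # Per the paper, AuxDPO adds auxiliary variables to the DPO loss; a hypothetical
    # placement would be an extra learned term inside this margin, but the paper's
    # exact formulation is not reproduced here.
    margin = chosen_reward - rejected_reward
    return -F.logsigmoid(margin).mean()
```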

Business Value

Provides a more stable and reliable method for aligning LLMs using preference data, reducing the risk of undesirable model behaviors and improving the quality of aligned models. This can lead to more trustworthy and predictable AI systems.