📄 Abstract
Direct alignment algorithms such as Direct Preference Optimization (DPO)
fine-tune models directly on preference data, using only supervised learning
instead of two-stage reinforcement learning from human feedback (RLHF). We show
that DPO encodes a statistical estimation problem over reward functions induced
by a parametric policy class. When the true reward function that generates
preferences cannot be realized via the policy class, DPO becomes misspecified,
resulting in failure modes such as preference order reversal, worsening of
policy reward, and high sensitivity to the input preference data distribution.
In contrast, we study the local behavior of two-stage RLHF for a
parametric policy class and relate it to a natural gradient step in policy space. Our
fine-grained geometric characterization allows us to propose AuxDPO, which
introduces additional auxiliary variables in the DPO loss function to help move
towards the RLHF solution in a principled manner and mitigate the
misspecification in DPO. We empirically demonstrate the superior performance of
AuxDPO on didactic bandit settings as well as LLM alignment tasks.
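For context, the reward-estimation view the abstract refers to comes from the standard DPO objective; the notation below follows the usual DPO convention rather than this paper's specific formulation:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
$$

so DPO implicitly fits rewards of the form $r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ (up to a prompt-dependent constant). If the true reward generating the preferences lies outside this policy-induced class, the estimator is misspecified, which is the regime the paper analyzes.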
Authors (3)
Aditya Gopalan
Sayak Ray Chowdhury
Debangshu Banerjee
Submitted
October 23, 2025
Key Contributions
This paper demonstrates that Direct Preference Optimization (DPO) is a misspecified estimator whenever the true reward function generating the preferences is not realizable by the policy class, leading to failure modes such as preference order reversal and degraded policy reward. It also gives a fine-grained geometric characterization relating the local behavior of two-stage RLHF to a natural gradient step in policy space, and proposes AuxDPO, which introduces auxiliary variables into the DPO loss to mitigate the misspecification and move the solution towards the RLHF one. This offers a principled way to improve direct alignment methods.
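To make the auxiliary-variable idea concrete, below is a minimal, hypothetical sketch of a DPO-style loss with one learnable slack variable per preference pair. The function name, the additive placement of the slack in the margin, and the quadratic regularizer are illustrative assumptions, not the paper's exact AuxDPO formulation.

```python
import torch
import torch.nn.functional as F

def dpo_style_loss_with_aux(logratio_w, logratio_l, aux, beta=0.1, lam=1.0):
    """DPO-style negative log-likelihood with per-pair auxiliary slack.

    logratio_w, logratio_l: log pi_theta(y|x) - log pi_ref(y|x) for the
        preferred (w) and dispreferred (l) responses of each pair.
    aux: one learnable scalar per pair (hypothetical auxiliary variable).
    lam: strength of the penalty that keeps the slack small.
    """
    # Bradley-Terry margin on implicit reward differences, shifted by the slack.
    margin = beta * (logratio_w - logratio_l) + aux
    nll = -F.logsigmoid(margin).mean()
    # Penalize the slack so it only absorbs what the policy class cannot fit.
    return nll + lam * aux.pow(2).mean()

# Toy usage: random log-ratios for 8 preference pairs.
torch.manual_seed(0)
logratio_w = torch.randn(8)
logratio_l = torch.randn(8)
aux = torch.zeros(8, requires_grad=True)

loss = dpo_style_loss_with_aux(logratio_w, logratio_l, aux)
loss.backward()
print(f"loss={loss.item():.4f}, aux grad norm={aux.grad.norm().item():.4f}")
```

In this toy setup, the auxiliary variables can absorb reward differences that the policy's implicit reward class cannot represent, while the regularizer keeps them from explaining away the preference data entirely.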
Business Value
Provides a more stable and reliable method for aligning LLMs using preference data, reducing the risk of undesirable model behaviors and improving the quality of aligned models. This can lead to more trustworthy and predictable AI systems.