This paper introduces two variants of Direct Preference Optimization (DPO) that explicitly model ties in pairwise comparisons, using extensions of the Bradley-Terry model. Experiments on neural machine translation (NMT) and summarization show that the tie-aware variants outperform standard DPO, including on translation and mathematical reasoning, and that they provide stronger regularization, an effect the paper also explains theoretically.
Enables more robust and accurate alignment of LLMs with human preferences, leading to better-performing and more reliable AI systems across applications.
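As a rough illustration of the idea, the sketch below shows how a tie-aware DPO objective could look under the Rao-Kupper extension of Bradley-Terry, one well-known way to reserve probability mass for ties. This is a minimal sketch under stated assumptions: the function name, the tie parameter `alpha`, and the default hyperparameters are illustrative, and it is not claimed to match the paper's exact formulation.

```python
# Minimal sketch of a tie-aware DPO loss (PyTorch), assuming a
# Rao-Kupper-style extension of Bradley-Terry. Names and the exact
# parameterization are illustrative assumptions, not the paper's method.
import math
import torch
import torch.nn.functional as F


def rao_kupper_dpo_loss(policy_logp_a, policy_logp_b,
                        ref_logp_a, ref_logp_b,
                        is_tie, beta=0.1, alpha=0.5):
    """Loss over response pairs (a, b); `a` is preferred unless `is_tie`.

    Implicit rewards follow standard DPO:
        r = beta * (log pi_theta(y|x) - log pi_ref(y|x))
    Under Rao-Kupper with tie parameter theta = exp(alpha), alpha > 0:
        P(a > b) = sigmoid(r_a - r_b - alpha)
        P(a ~ b) = (theta^2 - 1) * e^(r_a + r_b)
                   / ((e^(r_a) + theta * e^(r_b)) * (theta * e^(r_a) + e^(r_b)))
    """
    r_a = beta * (policy_logp_a - ref_logp_a)
    r_b = beta * (policy_logp_b - ref_logp_b)

    # Clear preference: negative log-likelihood of "a beats b".
    win_loss = -F.logsigmoid(r_a - r_b - alpha)

    # Tie: negative log-likelihood of a tie, computed in log-space
    # with logaddexp for numerical stability.
    log_num = math.log(math.expm1(2.0 * alpha)) + r_a + r_b
    log_den = (torch.logaddexp(r_a, r_b + alpha)
               + torch.logaddexp(r_a + alpha, r_b))
    tie_loss = -(log_num - log_den)

    return torch.where(is_tie, tie_loss, win_loss).mean()


# Toy usage: two clear preferences and one tied pair (dummy log-probs).
policy_a = torch.tensor([-4.2, -3.8, -5.0])
policy_b = torch.tensor([-5.1, -4.0, -5.0])
ref_a = torch.tensor([-4.5, -4.0, -5.2])
ref_b = torch.tensor([-4.9, -4.1, -5.1])
is_tie = torch.tensor([False, False, True])
print(rao_kupper_dpo_loss(policy_a, policy_b, ref_a, ref_b, is_tie))
```

With `alpha = 0`, the tie probability collapses to zero and the preference term reduces to the standard DPO logistic loss, which is one way to see how the tie parameter acts as an additional margin and hence a regularizer.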