Abstract
Alignment of large language models (LLMs) has predominantly relied on
pairwise preference optimization, where annotators select the better of two
responses to a prompt. While simple, this approach overlooks the opportunity to
learn from richer forms of human feedback, such as multiwise comparisons and
top-$k$ rankings. We propose Ranked Choice Preference Optimization (RCPO), a
unified framework that bridges preference optimization with (ranked) choice
modeling via maximum likelihood estimation. The framework is flexible,
supporting both utility-based and rank-based choice models. It subsumes several
existing pairwise methods (e.g., DPO, SimPO), while providing principled
training objectives for richer feedback formats. We instantiate this framework
with two representative ranked choice models (Multinomial Logit and
Mallows-RMJ). Empirical studies on Llama-3-8B-Instruct and Gemma-2-9B-it across
AlpacaEval 2 and Arena-Hard benchmarks show that RCPO consistently outperforms
competitive baselines. RCPO shows how directly leveraging ranked preference
data, combined with the right choice models, yields more effective alignment.
It offers a versatile and extensible foundation for incorporating (ranked)
choice modeling into LLM training.
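For intuition, a minimal sketch of the kind of listwise objective this setup suggests: under the Multinomial Logit (equivalently, Plackett-Luce) choice model, the maximum-likelihood objective for an observed ranking $y_{(1)} \succ y_{(2)} \succ \cdots \succ y_{(K)}$ of $K$ responses to a prompt $x$ factorizes into a sequence of softmax choices over the remaining candidates. The implicit reward $r_\theta$ below uses a DPO-style log-probability ratio against a reference policy; that parameterization is an assumption of this sketch, and the paper's exact instantiation may differ.

\[
\mathcal{L}_{\mathrm{MNL}}(\theta) \;=\; -\sum_{k=1}^{K-1} \log \frac{\exp\big(r_\theta(x, y_{(k)})\big)}{\sum_{j=k}^{K} \exp\big(r_\theta(x, y_{(j)})\big)},
\qquad
r_\theta(x, y) \;=\; \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}.
\]

For a top-$k$ ranking over a larger candidate set, the outer sum runs only over the $k$ ranked positions, while each denominator still includes all not-yet-chosen candidates.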
Submitted
October 24, 2025
Key Contributions
This paper introduces Ranked Choice Preference Optimization (RCPO), a unified framework that extends pairwise preference optimization to richer forms of human feedback such as multiwise comparisons and top-$k$ rankings. RCPO bridges preference optimization with (ranked) choice modeling via maximum likelihood estimation, subsuming existing pairwise methods (e.g., DPO, SimPO) and providing principled training objectives for these richer feedback formats, enabling more nuanced and effective LLM alignment.
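As a hedged illustration of how such a framework subsumes pairwise methods: with $K = 2$ and a single preferred/dispreferred pair $y_w \succ y_l$, the Multinomial Logit likelihood sketched above reduces to the Bradley-Terry model, and with the log-ratio reward it recovers the familiar DPO objective (swapping in a reference-free, length-normalized reward would analogously yield a SimPO-style loss):

\[
\mathcal{L}_{\mathrm{pair}}(\theta) \;=\; -\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right),
\]

where $\sigma$ is the logistic function.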
Business Value
Enables the development of more sophisticated and human-aligned AI assistants and content generation tools by leveraging more informative human feedback, leading to higher user satisfaction and trust.