Abstract
Aligning Large Language Models (LLMs) with human preferences is crucial, but
standard methods like Reinforcement Learning from Human Feedback (RLHF) are
often complex and unstable. In this work, we propose a new, simpler approach
that recasts alignment through the lens of Maximum Marginal Likelihood (MML)
estimation. Our MML-based Preference Optimization (MMPO) maximizes the
marginal log-likelihood of a preferred text output, using the preference pair
as samples for approximation, and forgoes the need for both an explicit reward
model and entropy maximization. We theoretically demonstrate that MMPO
implicitly performs preference optimization, producing a weighted gradient that
naturally up-weights chosen responses over rejected ones. Across models ranging
from 135M to 8B parameters, we empirically show that MMPO: 1) is more stable
with respect to the hyperparameter $\beta$ than alternative baselines,
and 2) achieves competitive or superior preference alignment while better
preserving the base model's general language capabilities. Through a series of
ablation experiments, we show that this improved performance is indeed
attributable to MMPO's implicit preference optimization within the gradient
updates.
Submitted
October 27, 2025
Key Contributions
MMPO (Maximum Marginal Likelihood based Preference Optimization) offers a simpler, more stable alternative to RLHF for aligning LLMs. By recasting alignment as MML estimation, it maximizes the marginal log-likelihood of preferred outputs, using the preference pair as samples for the approximation, and eliminates the need for an explicit reward model and entropy maximization while achieving competitive or superior alignment.
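To make the objective concrete, the sketch below shows one way an MML-style pairwise loss could be written in PyTorch. It is an illustrative guess at the structure described above, not the paper's implementation: the function name `mml_pair_loss`, the use of $\beta$ as a temperature-like scaling factor, and the absence of a reference model are all assumptions.

```python
# Illustrative sketch only -- not the authors' code. Assumes summed token
# log-probabilities for each response are already computed, and treats beta
# as a temperature-like scaling hyperparameter.
import torch


def mml_pair_loss(logp_chosen: torch.Tensor,
                  logp_rejected: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """Pairwise MML-style loss over a (chosen, rejected) preference pair.

    Args:
        logp_chosen: summed token log-probs of chosen responses, shape (batch,).
        logp_rejected: summed token log-probs of rejected responses, shape (batch,).
        beta: assumed scaling hyperparameter.
    """
    # Stack the pair so each row holds the two scaled scores [chosen, rejected].
    scores = beta * torch.stack([logp_chosen, logp_rejected], dim=-1)
    # Normalize over the pair: one reading of "using the preference pair as
    # samples for approximation" is that the pair approximates the marginal,
    # with the chosen response treated as the target.
    log_marginal_chosen = torch.log_softmax(scores, dim=-1)[..., 0]
    # Minimizing the negative log of the chosen response's share yields a
    # gradient that raises the chosen score and lowers the rejected one.
    return -log_marginal_chosen.mean()


# Example usage with dummy log-probabilities for a batch of 4 pairs.
logp_w = torch.tensor([-12.3, -45.0, -8.7, -20.1])
logp_l = torch.tensor([-14.8, -44.2, -9.9, -25.6])
loss = mml_pair_loss(logp_w, logp_l)
```

Under these assumptions, the gradient of this loss up-weights the chosen response and down-weights the rejected one in proportion to the softmax mass currently assigned to the rejected response, which mirrors the weighted-gradient behavior described in the abstract.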
Business Value
Leads to more reliable and user-aligned AI systems, enhancing user satisfaction and trust in applications like chatbots, content generators, and virtual assistants.