
Offline Preference Optimization via Maximum Marginal Likelihood Estimation

📄 Abstract

Aligning Large Language Models (LLMs) with human preferences is crucial, but standard methods like Reinforcement Learning from Human Feedback (RLHF) are often complex and unstable. In this work, we propose a new, simpler approach that recasts alignment through the lens of Maximum Marginal Likelihood (MML) estimation. Our MML-based Preference Optimization (MMPO) maximizes the marginal log-likelihood of a preferred text output, using the preference pair as samples for the approximation, and forgoes the need for both an explicit reward model and entropy maximization. We theoretically demonstrate that MMPO implicitly performs preference optimization, producing a weighted gradient that naturally up-weights chosen responses over rejected ones. Across models ranging from 135M to 8B parameters, we empirically show that MMPO: 1) is more stable with respect to the hyperparameter $\beta$ than alternative baselines, and 2) achieves competitive or superior preference alignment while better preserving the base model's general language capabilities. Through a series of ablation experiments, we show that this improved performance is indeed attributable to MMPO's implicit preference optimization within the gradient updates.
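
For intuition, here is a minimal sketch of the sample-based approximation, under the assumption that the marginal is approximated with the preference pair $\{y_w, y_l\}$ and that $\beta$ scales the sequence log-likelihoods (the paper's exact parameterization may differ):

$$
\mathcal{L}(\theta) \;\approx\; \log\Big(\pi_\theta(y_w \mid x)^{\beta} + \pi_\theta(y_l \mid x)^{\beta}\Big),
\qquad
\nabla_\theta \mathcal{L} \;=\; \beta \sum_{i \in \{w,\,l\}} \frac{\pi_\theta(y_i \mid x)^{\beta}}{\pi_\theta(y_w \mid x)^{\beta} + \pi_\theta(y_l \mid x)^{\beta}} \,\nabla_\theta \log \pi_\theta(y_i \mid x).
$$

Under this reading, the gradient is a softmax-weighted combination of the two responses' log-likelihood gradients over the pair, which is one way the weighted, preference-like update described in the abstract can arise.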
Authors: Saeed Najafi, Alona Fyshe
Submitted: October 27, 2025
arXiv Category: cs.LG

Key Contributions

MMPO (Maximum Marginal Likelihood based Preference Optimization) offers a simpler and more stable alternative to RLHF for aligning LLMs. By recasting alignment as MML estimation, it maximizes the marginal log-likelihood of preferred outputs using preference pairs as samples, eliminating the need for an explicit reward model and entropy maximization while achieving competitive or superior alignment.
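
As a rough illustration only (not the authors' released code: the function name, the use of summed per-token log-probabilities, and the role of `beta` as a scaling factor are assumptions), the sample-approximated marginal above can be computed with a log-sum-exp over the pair's scaled sequence log-likelihoods:

```python
import torch

def mmpo_loss_sketch(logp_chosen: torch.Tensor,
                     logp_rejected: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Sample-approximated negative marginal log-likelihood over a preference pair.

    logp_chosen / logp_rejected: summed token log-probabilities of the chosen
    and rejected responses under the current policy, shape (batch,).
    beta scales the sequence log-likelihoods before marginalization
    (an assumed role for the paper's beta hyperparameter).
    """
    # Treat the two responses as the samples that approximate the marginal.
    scaled = torch.stack([beta * logp_chosen, beta * logp_rejected], dim=-1)
    # Negative log of the approximate marginal; its gradient weights each
    # response by a softmax over the pair's scaled log-likelihoods.
    return -torch.logsumexp(scaled, dim=-1).mean()
```

For a single pair, `mmpo_loss_sketch(torch.tensor([-12.0]), torch.tensor([-15.0]))` returns a scalar loss that can be backpropagated through the policy's log-probabilities.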

Business Value

Leads to more reliable and user-aligned AI systems, enhancing user satisfaction and trust in applications like chatbots, content generators, and virtual assistants.