
Leveraging Robust Optimization for LLM Alignment under Distribution Shifts

📄 Abstract

Preference alignment methods are increasingly critical for steering large language models (LLMs) to generate outputs consistent with human values. While recent approaches often rely on synthetic data generated by LLMs for scalability and cost efficiency, this reliance can introduce distribution shifts that undermine the nuanced representation of human preferences needed for desirable outputs. In this paper, we propose a novel distribution-aware optimization framework that improves preference alignment despite such shifts. Our approach first leverages well-learned classifiers to assign a calibration value to each training sample, quantifying its alignment with the target human-preferred distribution. These values are then incorporated into a robust optimization objective that minimizes the worst-case loss over regions of the data space most relevant to human preferences. By explicitly focusing optimization on the target distribution, our approach mitigates the impact of distributional mismatch and improves the generation of responses that better reflect intended values.
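The two-step framework described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function names, the sigmoid classifier head, and the softmax-style soft worst-case surrogate are all assumptions made for the sketch.

```python
import numpy as np

def calibration_values(classifier_logits):
    """Step 1 (illustrative): a trained classifier scores each training
    sample; the sigmoid output estimates how well the sample matches the
    target human-preferred distribution (1.0 = well aligned)."""
    logits = np.asarray(classifier_logits, dtype=float)
    return 1.0 / (1.0 + np.exp(-logits))

def robust_weighted_loss(per_sample_losses, calib, temperature=1.0):
    """Step 2 (illustrative): a soft worst-case objective. Weights grow
    with both the per-sample loss (the exponential acts as a softened
    max) and the calibration value, so optimization concentrates on
    high-loss samples in the region most relevant to human preferences."""
    losses = np.asarray(per_sample_losses, dtype=float)
    calib = np.asarray(calib, dtype=float)
    w = calib * np.exp(losses / temperature)  # joint relevance x difficulty weight
    w = w / w.sum()                           # normalize to a distribution
    return float((w * losses).sum())
```

In this sketch, as `temperature` approaches zero the objective approaches the maximum loss over samples with nonzero calibration, while as it grows large the objective approaches a calibration-weighted average loss.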
Authors (5)
Mingye Zhu
Yi Liu
Zheren Fu
Yongdong Zhang
Zhendong Mao
Submitted
April 8, 2025
arXiv Category
cs.CL
arXiv PDF

Key Contributions

Proposes a novel distribution-aware robust optimization framework that improves LLM preference alignment despite distribution shifts. It uses classifier-derived calibration values to quantify each sample's alignment with the target human-preferred distribution, then minimizes the worst-case loss over the most relevant regions of the data space, mitigating the impact of distributional mismatch.

Business Value

Enhances the reliability and trustworthiness of LLMs by ensuring their outputs consistently align with human values even when training data is synthetic, evolving, or imperfect, which is crucial for sensitive applications.