
LLM Safety Alignment is Divergence Estimation in Disguise

Abstract

We present a theoretical framework showing that popular LLM alignment methods, including RLHF and its variants, can be understood as divergence estimators between aligned (safe or preferred) and unaligned (harmful or less preferred) distributions. This perspective explains the emergence of separation in the latent space between safe and harmful prompts after alignment. As an application of our general divergence framework, we propose KLDO, a novel KL divergence-based alignment method, and empirically validate its effectiveness. We further show that using compliance-refusal datasets, rather than standard preference-based datasets, leads to stronger separation and improved safety alignment. Finally, to quantify the separation effect, we propose a distance-based metric in the prompt representation space, which also acts as a statistically significant indicator for model safety.
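The separation metric described above can be illustrated with a simple sketch. The paper does not spell out its exact formula in this listing, so the code below is an assumption: it measures separation as the distance between the centroids of safe and harmful prompt embeddings, normalized by the within-class spread. The helper `embed_prompts` is hypothetical and stands in for extracting the model's prompt representations.

```python
import numpy as np

def separation_score(safe_embs: np.ndarray, harmful_embs: np.ndarray) -> float:
    """Centroid distance between classes, normalized by average within-class spread."""
    mu_safe = safe_embs.mean(axis=0)        # centroid of safe prompt embeddings
    mu_harm = harmful_embs.mean(axis=0)     # centroid of harmful prompt embeddings
    between = np.linalg.norm(mu_safe - mu_harm)
    within = 0.5 * (safe_embs.std(axis=0).mean() + harmful_embs.std(axis=0).mean())
    return float(between / (within + 1e-8))

# Hypothetical usage: `embed_prompts` would return last-layer prompt
# representations from the aligned model, shape [n_prompts, hidden_dim].
# safe_embs = embed_prompts(model, safe_prompts)
# harmful_embs = embed_prompts(model, harmful_prompts)
# print(separation_score(safe_embs, harmful_embs))
```

A larger score indicates stronger latent-space separation between safe and harmful prompts, which is the effect the paper reports increasing after alignment.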
Authors (5)
Rajdeep Haldar
Ziyi Wang
Qifan Song
Guang Lin
Yue Xing
Submitted
February 2, 2025
arXiv Category
cs.LG

Key Contributions

This paper develops a theoretical framework showing that LLM alignment methods such as RLHF act as divergence estimators between aligned and unaligned distributions, which explains the latent-space separation of safe and harmful prompts after alignment. Building on this view, it proposes KLDO, a KL divergence-based alignment method, and demonstrates that compliance-refusal datasets yield stronger separation and better safety alignment than standard preference-based datasets. A distance-based metric in the prompt representation space is introduced to quantify this separation and serves as a statistically significant indicator of model safety.
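The divergence-estimation view can be made concrete with a sample-based KL estimator. The sketch below uses the Donsker-Varadhan lower bound KL(p || q) >= E_p[T] - log E_q[exp(T)]; the critic scores T, and the function name, are illustrative assumptions and not the paper's exact KLDO objective.

```python
import torch

def dv_kl_lower_bound(scores_aligned: torch.Tensor,
                      scores_unaligned: torch.Tensor) -> torch.Tensor:
    """Donsker-Varadhan estimate of KL(aligned || unaligned) from critic scores T(x).

    KL(p || q) >= E_p[T] - log E_q[exp(T)], estimated with empirical means.
    """
    log_n = torch.log(torch.tensor(float(scores_unaligned.numel())))
    # log E_q[exp(T)] is approximated by logsumexp(T_q) - log(n)
    return scores_aligned.mean() - (torch.logsumexp(scores_unaligned, dim=0) - log_n)

# Maximizing this bound with respect to a critic (e.g., a small head on the
# LLM's prompt representation) yields a KL estimate, and the separation it
# induces between aligned and unaligned prompts is what the proposed
# distance-based metric quantifies.
```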

Business Value

Enables the development of safer and more reliable LLMs, which is crucial for widespread adoption in sensitive applications and for reducing the risks associated with harmful AI outputs.