
The Realignment Problem: When Right becomes Wrong in LLMs

Abstract

The alignment of Large Language Models (LLMs) with human values is central to their safe deployment, yet current practice produces static, brittle, and costly-to-maintain models that fail to keep pace with evolving norms and policies. This misalignment, which we term the Alignment-Reality Gap, poses a growing challenge for reliable long-term use. Existing remedies are inadequate: large-scale re-annotation is economically prohibitive, and standard unlearning methods act as blunt instruments that erode utility rather than enable precise policy updates. We introduce TRACE (Triage and Re-align by Alignment Conflict Evaluation), a framework for principled unlearning that reconceives re-alignment as a programmatic policy application problem. TRACE programmatically triages existing preference data against a new policy, identifies high-impact conflicts via an alignment impact score, and applies a hybrid optimization that cleanly inverts, discards, or preserves preferences while safeguarding model performance. Empirical results show that TRACE achieves robust re-alignment across diverse model families (Qwen2.5-7B, Gemma-2-9B, Llama-3.1-8B). On both synthetic benchmarks and the PKU-SafeRLHF dataset under a complex policy shift, TRACE enforces new principles without degrading general capabilities. Our work establishes a scalable, dynamic, and cost-effective paradigm for maintaining LLM alignment, providing a foundation for sustainable and responsible AI deployment.
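To make the triage step concrete, the sketch below illustrates one plausible way to label existing preference pairs against a new policy. It is a minimal illustration, not the paper's implementation: the `policy_score` function, the thresholds, and the use of a simple compliance gap as the "alignment impact score" are all assumptions made here for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List, Literal, Tuple

Action = Literal["invert", "discard", "preserve"]

@dataclass
class PreferencePair:
    prompt: str
    chosen: str     # response preferred under the old policy
    rejected: str   # response dispreferred under the old policy

def triage(pairs: List[PreferencePair],
           policy_score: Callable[[str, str], float],  # hypothetical scorer for the NEW policy
           conflict_threshold: float = 0.5,
           impact_threshold: float = 0.1) -> List[Tuple[PreferencePair, Action]]:
    """Label each existing preference pair as invert / discard / preserve.

    The alignment impact score is approximated here as the gap in new-policy
    compliance between the two responses; the paper's exact definition may differ.
    """
    decisions: List[Tuple[PreferencePair, Action]] = []
    for pair in pairs:
        gap = policy_score(pair.prompt, pair.chosen) - policy_score(pair.prompt, pair.rejected)
        if gap < -conflict_threshold:
            # The old "chosen" response now violates the new policy more than
            # the "rejected" one: a high-impact conflict, so invert the pair.
            decisions.append((pair, "invert"))
        elif abs(gap) < impact_threshold:
            # Neither response is clearly better under the new policy;
            # the pair carries little signal and is discarded.
            decisions.append((pair, "discard"))
        else:
            # The original preference remains consistent with the new policy.
            decisions.append((pair, "preserve"))
    return decisions
```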

Key Contributions

This paper introduces TRACE, a framework for principled unlearning in LLMs that addresses the Alignment-Reality Gap. TRACE reconceives re-alignment as a programmatic policy application problem: it triages existing preference data against a new policy, identifies high-impact conflicts, and applies a hybrid optimization that preserves utility while adapting to evolving norms. This sidesteps the limitations of costly re-annotation and of blunt unlearning methods. A sketch of how triaged pairs might feed the optimization step follows below.
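The abstract does not specify the training objective, so the following is only a hedged sketch of how a triaged pair could enter a hybrid update: a DPO-style preference loss (chosen here as an assumption, not the paper's stated method) applied normally to preserved pairs, with the preference direction flipped for inverted pairs and discarded pairs contributing nothing.

```python
import torch
import torch.nn.functional as F

def hybrid_preference_loss(logp_chosen: torch.Tensor,
                           logp_rejected: torch.Tensor,
                           ref_logp_chosen: torch.Tensor,
                           ref_logp_rejected: torch.Tensor,
                           action: str,
                           beta: float = 0.1) -> torch.Tensor:
    """DPO-style loss for one triaged pair (an illustrative assumption, not TRACE's exact objective).

    Inputs are summed token log-probabilities of the old chosen/rejected responses
    under the policy model and a frozen reference model. `action` comes from triage:
    "preserve" keeps the original preference, "invert" swaps it, "discard" skips the pair.
    """
    if action == "discard":
        return torch.tensor(0.0)
    # Implicit reward margin relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    if action == "invert":
        margin = -margin  # teach the model to prefer the formerly rejected response
    return -F.logsigmoid(margin)
```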

Business Value

Allows organizations to maintain LLM compliance with changing regulations and ethical standards efficiently, reducing operational costs and reputational risks associated with misaligned AI.