Abstract
The alignment of Large Language Models (LLMs) with human values is central to
their safe deployment, yet current practice produces static, brittle, and
costly-to-maintain models that fail to keep pace with evolving norms and
policies. This misalignment, which we term the Alignment-Reality Gap, poses a
growing challenge for reliable long-term use. Existing remedies are inadequate:
large-scale re-annotation is economically prohibitive, and standard unlearning
methods act as blunt instruments that erode utility rather than enable precise
policy updates. We introduce TRACE (Triage and Re-align by Alignment Conflict
Evaluation), a framework for principled unlearning that reconceives
re-alignment as a programmatic policy application problem. TRACE
programmatically triages existing preference data against a new policy,
identifies high-impact conflicts via an alignment impact score, and applies a
hybrid optimization that cleanly inverts, discards, or preserves preferences
while safeguarding model performance. Empirical results show that TRACE
achieves robust re-alignment across diverse model families (Qwen2.5-7B,
Gemma-2-9B, Llama-3.1-8B). On both synthetic benchmarks and the PKU-SafeRLHF
dataset under a complex policy shift, TRACE enforces new principles without
degrading general capabilities. Our work establishes a scalable, dynamic, and
cost-effective paradigm for maintaining LLM alignment, providing a foundation
for sustainable and responsible AI deployment.
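The abstract describes TRACE's triage step only at a high level. As a minimal illustrative sketch (not the paper's implementation), the snippet below routes existing preference pairs into preserve, invert, and discard buckets; the `policy_score` judge and the `impact_threshold` cutoff are assumptions introduced here for illustration, and the paper's actual alignment impact score may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # response preferred under the old policy
    rejected: str    # response dispreferred under the old policy

def triage(
    pairs: List[PreferencePair],
    policy_score: Callable[[str, str], float],  # hypothetical judge: higher = more compliant with the NEW policy
    impact_threshold: float = 0.5,              # illustrative cutoff, not taken from the paper
) -> Tuple[List[PreferencePair], List[PreferencePair], List[PreferencePair]]:
    """Split existing preference data into preserve / invert / discard buckets.

    'Alignment impact' is approximated here as the gap in new-policy
    compliance between the rejected and chosen responses.
    """
    preserve, invert, discard = [], [], []
    for p in pairs:
        gap = policy_score(p.prompt, p.rejected) - policy_score(p.prompt, p.chosen)
        if gap > impact_threshold:
            # High-impact conflict: the new policy prefers the previously
            # rejected response, so swap the labels before re-optimization.
            invert.append(PreferencePair(p.prompt, p.rejected, p.chosen))
        elif gap < -impact_threshold:
            # Pair already agrees with the new policy: keep it to protect utility.
            preserve.append(p)
        else:
            # Ambiguous / low-impact pair: drop it rather than train on a weak signal.
            discard.append(p)
    return preserve, invert, discard
```

A downstream preference-optimization step (for example, DPO-style training on the preserved and inverted buckets) would then consume these splits; the hybrid objective actually used by TRACE is described in the paper itself.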
Key Contributions
This paper introduces TRACE, a novel framework for principled unlearning in LLMs that addresses the 'Alignment-Reality Gap'. TRACE reconceives re-alignment as a programmatic policy application problem: it triages existing preference data against a new policy, identifies conflicts, and applies a hybrid optimization that preserves utility while adapting to evolving norms. This enables precise policy updates and overcomes the limitations of costly re-annotation and blunt unlearning methods.
Business Value
Allows organizations to maintain LLM compliance with changing regulations and ethical standards efficiently, reducing operational costs and reputational risks associated with misaligned AI.