📄 Abstract
Large language models (LLMs) excel in various capabilities but pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment through the lens of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose inference-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects on model safety. Experiments on multiple prevalent LLMs demonstrate that we can consistently identify about $5\%$ of neurons as safety neurons, and that patching their activations alone restores over $90\%$ of safety performance across various red-teaming benchmarks without affecting general ability. The finding of safety neurons also helps explain the "alignment tax" phenomenon by revealing that the key neurons for model safety and helpfulness significantly overlap, yet the same neurons require different activation patterns for the two objectives. Furthermore, we demonstrate an application of our findings in safeguarding LLMs by detecting unsafe outputs before generation. The source code is available at https://github.com/THU-KEG/SafetyNeuron.
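
The sketch below illustrates, in a toy setting, the two ideas the abstract names: scoring neurons by contrasting activations between an aligned and an unaligned model on the same inputs, and then patching the top-scoring neurons' activations to probe their causal effect. It is not the released SafetyNeuron implementation; the two-layer MLP, the hook placement on the hidden (post-ReLU) units, and the 5% selection threshold are illustrative assumptions that stand in for LLM MLP neurons and per-token patching during generation.

```python
# Minimal sketch (assumptions noted above): activation contrasting + activation patching
# on a toy MLP, using standard PyTorch forward hooks. Not the authors' implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)
torch.set_grad_enabled(False)  # inference-time only
HIDDEN = 64

def make_model() -> nn.Sequential:
    return nn.Sequential(nn.Linear(32, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, 8))

aligned, unaligned = make_model(), make_model()  # stand-ins for aligned / base LLMs
inputs = torch.randn(16, 32)                     # stand-in for red-teaming prompts

def capture(model: nn.Sequential, x: torch.Tensor) -> torch.Tensor:
    """Run the model and record the post-ReLU hidden activations via a forward hook."""
    store = {}
    def hook(module, inp, out):
        store["h"] = out.detach()
    handle = model[1].register_forward_hook(hook)
    model(x)
    handle.remove()
    return store["h"]

# (1) Activation contrasting: rank neurons by mean absolute activation difference
#     between the aligned and unaligned model on identical inputs.
diff = (capture(aligned, inputs) - capture(unaligned, inputs)).abs().mean(dim=0)
k = max(1, int(0.05 * HIDDEN))                   # "about 5%" of neurons, per the abstract
safety_neurons = diff.topk(k).indices

# (2) Activation patching: overwrite those neurons in the unaligned forward pass
#     with the aligned model's activations, then inspect the patched outputs.
aligned_acts = capture(aligned, inputs)

def patch_hook(module, inp, out):
    out = out.clone()
    out[:, safety_neurons] = aligned_acts[:, safety_neurons]
    return out  # returning a tensor from a forward hook replaces the module output

handle = unaligned[1].register_forward_hook(patch_hook)
patched_out = unaligned(inputs)
handle.remove()

print(f"Top-{k} candidate safety neurons: {safety_neurons.tolist()}")
```

In the paper's actual setting the contrast scores and patches apply to MLP neurons inside transformer layers and the patching is dynamic (applied during generation); the toy keeps only the mechanics visible: hooks to read activations, a contrast-based ranking, and top-k patching to test causality.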
Authors (6)
Jianhui Chen
Xiaozhi Wang
Zijun Yao
Yushi Bai
Lei Hou
Juanzi Li
Key Contributions
This paper provides a mechanistic perspective on safety alignment in LLMs by identifying and analyzing 'safety neurons' responsible for safe behaviors. Using activation patching, it demonstrates that these neurons causally influence safety: patching them restores over 90% of safety performance across red-teaming benchmarks without significantly impacting general abilities, and the analysis helps explain the 'alignment tax'.
Business Value
Enhances trust and reliability in LLMs by providing methods to understand and verify their safety mechanisms, crucial for widespread adoption in sensitive applications.