Abstract
Large language models (LLMs) often acquire knowledge during pretraining that
is undesirable in downstream deployments, e.g., sensitive information or
copyrighted content. Existing approaches for removing such knowledge rely on
fine-tuning, training low-rank adapters, or fact-level editing, but these are
either too coarse, too shallow, or ineffective. In this work, we propose PISCES
(Precise In-parameter Suppression for Concept EraSure), a novel framework for
precisely erasing entire concepts from model parameters by directly editing
directions that encode them in parameter space. PISCES uses a disentangler
model to decompose MLP vectors into interpretable features, identifies those
associated with a target concept using automated interpretability techniques,
and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1
over various concepts show that PISCES achieves modest gains in efficacy over
leading erasure methods, reducing accuracy on the target concept to as low as
7.7%, while dramatically improving erasure specificity (by up to 31%) and
robustness (by up to 38%). Overall, these results demonstrate that
feature-based in-parameter editing enables a more precise and reliable approach
for removing conceptual knowledge in language models.
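
To make the pipeline concrete, here is a minimal sketch of its first two steps: decomposing MLP value vectors with a disentangler and flagging concept-associated features. Everything in it is an assumption for illustration, not the authors' released implementation: the disentangler is modeled as a sparse-autoencoder-style decoder with unit-norm feature directions, and the automated-interpretability step is stubbed with a cosine-similarity score against a hypothetical concept probe vector.

```python
# Sketch of decomposition + feature identification under stated assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_mlp, n_features = 64, 256, 512

# MLP down-projection: each column is a "value vector" the MLP can write
# into the residual stream.
W_down = torch.randn(d_model, d_mlp)

# Disentangler decoder: one unit-norm interpretable direction per feature
# (an SAE-style decoder is one plausible choice; an assumption here).
decoder = F.normalize(torch.randn(n_features, d_model), dim=-1)

# Decompose every MLP value vector into per-feature coefficients.
coeffs = decoder @ W_down                      # (n_features, d_mlp)

# Hypothetical concept probe; in the paper this role is played by
# automated interpretability over feature descriptions.
concept_probe = F.normalize(torch.randn(d_model), dim=-1)
relevance = decoder @ concept_probe            # per-feature concept score

# Flag the features most associated with the target concept.
concept_feature_ids = torch.topk(relevance, k=3).indices
print("flagged features:", concept_feature_ids.tolist())
print("dominant feature of first value vector:", coeffs[:, 0].abs().argmax().item())
```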
Authors
Yoav Gur-Arieh
Clara Suslik
Yihuai Hong
Fazl Barez
Mor Geva
Key Contributions
PISCES (Precise In-parameter Suppression for Concept EraSure) is a novel framework for precisely erasing entire concepts from LLM parameters by editing the directions that encode them. It uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept, and removes them from the parameters, achieving modest gains in efficacy over leading erasure methods while substantially improving specificity and robustness. A sketch of the removal step follows below.
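
The matching sketch of the final step, the in-parameter edit. The edit rule shown here, projecting the flagged feature directions out of the MLP down-projection columns, is an assumed mechanism consistent with the description above, not the paper's exact update.

```python
# Sketch of the removal step: project the flagged concept-feature
# directions out of the MLP down-projection (assumed edit rule).
import torch

torch.manual_seed(0)
d_model, d_mlp = 64, 256
W_down = torch.randn(d_model, d_mlp)
# Stand-in for the flagged feature directions from the previous sketch.
directions = torch.nn.functional.normalize(torch.randn(3, d_model), dim=-1)

def erase(W, dirs):
    """Remove the component of every value vector (column of W) that
    lies in the span of the concept-feature directions."""
    Q, _ = torch.linalg.qr(dirs.T)   # orthonormal basis of the concept subspace
    return W - Q @ (Q.T @ W)         # project the subspace out of each column

W_edited = erase(W_down, directions)

# Sanity check: edited vectors carry no remaining concept component, while
# everything orthogonal to the flagged directions is untouched -- the
# property behind the specificity gains the abstract reports.
print("residual concept alignment:", (directions @ W_edited).abs().max().item())
```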
Business Value
Enables organizations to deploy LLMs more safely and responsibly by removing sensitive information, copyrighted material, or biased knowledge, thereby mitigating risks and ensuring compliance.