Abstract: To what extent does concept erasure eliminate generative capacity in
diffusion models? While prior evaluations have primarily focused on measuring
concept suppression under specific textual prompts, we explore a complementary
and fundamental question: do current concept erasure techniques genuinely
remove the ability to generate targeted concepts, or do they merely achieve
superficial, prompt-specific suppression? We systematically evaluate the
robustness and reversibility of two representative concept erasure methods,
Unified Concept Editing and Erased Stable Diffusion, by probing their ability
to eliminate targeted generative behaviors in text-to-image models. These
methods attempt to suppress undesired semantic concepts by modifying internal
model parameters, either through targeted attention edits or model-level
fine-tuning strategies. To rigorously assess whether these techniques truly
erase generative capacity, we propose an instance-level evaluation strategy
that employs lightweight fine-tuning to explicitly test the reactivation
potential of erased concepts. Through quantitative metrics and qualitative
analyses, we show that erased concepts often reemerge with substantial visual
fidelity after minimal adaptation, indicating that current methods suppress
latent generative representations without fully eliminating them. Our findings
reveal critical limitations in existing concept erasure approaches and
highlight the need for deeper, representation-level interventions and more
rigorous evaluation standards to ensure genuine, irreversible removal of
concepts from generative models.
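
The distinction between suppression and removal that the reactivation test probes can be illustrated with a deliberately tiny toy model. The sketch below is not the paper's evaluation pipeline and involves no diffusion model: the scalar "latent feature", the "output gain", and the helper names `fine_tune_gain` and `output_error` are all hypothetical, chosen only to show why a few fine-tuning steps can restore a concept whose internal representation survives, and cannot restore one whose representation is actually gone.

```python
# Toy illustration only (all names and values are assumptions, not from the paper).
# The "model" output for a concept is gain * feature:
#   feature = latent representation of the concept (kept frozen here)
#   gain    = the small set of weights that lightweight fine-tuning may adjust


def fine_tune_gain(gain, feature, target, steps=20, lr=0.5):
    """Run a few gradient-descent steps on 0.5 * (gain * feature - target)**2,
    updating only the gain (the stand-in for lightweight fine-tuning)."""
    for _ in range(steps):
        error = gain * feature - target
        gain -= lr * error * feature  # d(loss)/d(gain) = error * feature
    return gain


def output_error(gain, feature, target):
    """Absolute gap between the model output and the original concept output."""
    return abs(gain * feature - target)


TARGET = 1.0  # output of the original, unerased model for the concept

# Case 1: superficial suppression. The gain is zeroed, the latent feature survives.
suppressed_feature = 1.0
suppressed_gain = fine_tune_gain(0.0, suppressed_feature, TARGET)
suppressed_error = output_error(suppressed_gain, suppressed_feature, TARGET)

# Case 2: representation-level removal. The latent feature itself is zero,
# so the gradient with respect to the gain vanishes and nothing can be recovered.
removed_feature = 0.0
removed_gain = fine_tune_gain(0.0, removed_feature, TARGET)
removed_error = output_error(removed_gain, removed_feature, TARGET)

print(f"error after fine-tuning, suppressed concept: {suppressed_error:.2e}")
print(f"error after fine-tuning, removed concept:    {removed_error:.2e}")
```

In this toy setting the suppressed concept is recovered almost exactly, while the removed one stays unreachable through the trainable weights. That is the qualitative behavior an instance-level reactivation test is meant to distinguish, although a real evaluation would fine-tune an erased text-to-image model and score the generated images.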