
Rethinking Robust Adversarial Concept Erasure in Diffusion Models

Abstract

Concept erasure aims to selectively unlearn undesirable content in diffusion models (DMs) to reduce the risk of generating sensitive content. Adversarial training has emerged as a novel paradigm here: most existing methods use it to identify and suppress target concepts, thereby reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, so the samples fit the concept space poorly. This oversight leads to two issues: 1) when adversarial samples are few, they fail to comprehensively cover the target concept; 2) conversely, when they are many, they disrupt other concept spaces. Motivated by these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which leverages semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE improves erasure performance by 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at https://github.com/Qhong-522/S-GRACE.
Authors: Qinghong Yin, Yu Tian, Yue Zhang
Submitted: October 31, 2025
arXiv Category: cs.CV

Key Contributions

This paper re-evaluates robust adversarial concept erasure in diffusion models, showing that existing methods often fail because they neglect conceptual semantics when generating adversarial samples. It quantifies this specificity, revealing issues such as incomplete coverage of the target concept and disruption of other concept spaces, and proposes semantics-guided adversarial training (S-GRACE) to address them.
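
To make the paradigm concrete, below is a minimal, self-contained PyTorch sketch of the adversarial concept-erasure loop the abstract critiques. Everything here is an illustrative assumption rather than the authors' released code: the ToyDenoiser stands in for a diffusion model's noise predictor, and the cosine regularizer in find_adversarial_embedding only gestures at the kind of semantic guidance S-GRACE proposes.

```python
# Hypothetical sketch of adversarial concept erasure; not the S-GRACE code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDenoiser(nn.Module):
    """Stand-in for a diffusion model's conditional noise predictor."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2, 128), nn.ReLU(), nn.Linear(128, dim)
        )

    def forward(self, x_noisy, cond):
        # Condition the prediction on a concept embedding.
        return self.net(torch.cat([x_noisy, cond], dim=-1))

def find_adversarial_embedding(model, target_emb, steps=50, lr=1e-2, sem_w=0.1):
    """Phase 1: search for an embedding that still elicits the target
    concept. The cosine term is a placeholder for semantic guidance,
    keeping the sample inside the target concept's semantic region."""
    adv = (target_emb + 0.1 * torch.randn_like(target_emb)).requires_grad_(True)
    opt = torch.optim.Adam([adv], lr=lr)
    for _ in range(steps):
        x = torch.randn_like(target_emb)  # stand-in for a noised latent
        with torch.no_grad():
            target_pred = model(x, target_emb)
        # Adversarial objective: reproduce the target concept's behaviour.
        loss = F.mse_loss(model(x, adv), target_pred)
        # Hypothetical semantic-guidance regularizer.
        loss = loss + sem_w * (1 - F.cosine_similarity(adv, target_emb, dim=-1)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return adv.detach()

def erasure_step(model, model_opt, adv_emb, neutral_emb):
    """Phase 2: fine-tune the model so the adversarial embedding now
    yields the prediction of a harmless neutral concept."""
    x = torch.randn_like(adv_emb)
    with torch.no_grad():
        neutral_pred = model(x, neutral_emb)
    loss = F.mse_loss(model(x, adv_emb), neutral_pred)
    model_opt.zero_grad()
    loss.backward()
    model_opt.step()
    return loss.item()

dim = 64
model = ToyDenoiser(dim)
target = torch.randn(8, dim)   # embedding of the concept to erase
neutral = torch.randn(8, dim)  # embedding of a safe replacement concept
model_opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for _ in range(10):  # alternate attack and erasure, as in adversarial training
    adv = find_adversarial_embedding(model, target)
    erasure_step(model, model_opt, adv, neutral)
```

The alternation mirrors the general paradigm: phase 1 searches for embeddings that still elicit the target concept, and phase 2 fine-tunes the model to redirect them toward a neutral concept. S-GRACE's distinction, per the abstract, is that phase 1 is steered by conceptual semantics rather than unconstrained adversarial search.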

Business Value

Crucial for developing safer and more controllable generative AI systems, reducing the risk of generating harmful or undesirable content, and building user trust in AI technologies.