Abstract

Recent advances in text-based image editing have enabled fine-grained
manipulation of visual content guided by natural language. However, such
methods are susceptible to adversarial attacks. In this work, we propose a
novel attack that targets the visual component of editing methods. We introduce
Attention Attack, which disrupts the cross-attention between a textual prompt
and the visual representation of the image by using an automatically generated
caption of the source image as a proxy for the edit prompt. This breaks the
alignment between the contents of the image and their textual description,
without requiring knowledge of the editing method or the editing prompt.
Questioning the reliability of existing metrics for immunization success (i.e.,
how effectively an imperceptible perturbation prevents faithful edits), we
propose two new evaluation strategies: Caption Similarity, which quantifies
semantic consistency between original and adversarial edits, and semantic
Intersection over Union (IoU), which measures spatial layout disruption via
segmentation masks. Experiments conducted on the TEDBench++ benchmark
demonstrate that our attack significantly degrades editing performance while
remaining imperceptible.
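
To make the attack mechanism concrete, the sketch below shows one plausible instantiation: a PGD-style loop that perturbs the image within an imperceptible L∞ budget so that its cross-attention map against the caption embedding diverges from the clean image's map. Everything here (`cross_attention`, `encode_image`, the KL objective, all hyperparameters) is an illustrative assumption rather than the paper's implementation; a real attack would hook the editing model's actual cross-attention layers and obtain the proxy prompt from an off-the-shelf captioner.

```python
import torch
import torch.nn.functional as F

def cross_attention(img_tokens, txt_tokens):
    """Softmax cross-attention map: one distribution over caption tokens
    per visual token. img_tokens: (N, d), txt_tokens: (T, d)."""
    scale = img_tokens.shape[-1] ** -0.5
    return (img_tokens @ txt_tokens.T * scale).softmax(dim=-1)

def attention_attack(image, encode_image, caption_emb,
                     eps=8 / 255, alpha=1 / 255, steps=40):
    """PGD-style sketch: perturb the image (L-inf budget eps) so its
    cross-attention map against the caption diverges from the clean map.
    `encode_image` and `caption_emb` stand in for the editor's visual
    encoder and the auto-generated caption's text embedding."""
    with torch.no_grad():
        clean_attn = cross_attention(encode_image(image), caption_emb)
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        attn = cross_attention(encode_image(image + delta), caption_emb)
        # KL(clean || adversarial): larger = more disrupted alignment.
        loss = F.kl_div(attn.log(), clean_attn, reduction="batchmean")
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # gradient *ascent* on the KL
            delta.clamp_(-eps, eps)
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()

# Toy usage: a fixed linear projection stands in for the real visual encoder.
torch.manual_seed(0)
proj = torch.randn(3, 64)                    # hypothetical feature projector
encode = lambda x: x.permute(1, 2, 0).reshape(-1, 3) @ proj  # (H*W, 64) tokens
caption = torch.randn(7, 64)                 # hypothetical caption embedding
adv = attention_attack(torch.rand(3, 32, 32), encode, caption)
```

Because the objective only needs the image, its own caption, and gradients through a cross-attention map, the attack stays agnostic to the editing method and the (unknown) edit prompt, matching the black-box setting described above.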
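The abstract names the two metrics without defining them; the following is one plausible reading, assuming a captioner plus text encoder produces the embeddings for Caption Similarity and a semantic segmentation model produces the label maps for the IoU. The function names and the mean-over-present-classes aggregation are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def caption_similarity(emb_orig_edit, emb_adv_edit):
    """Cosine similarity between text embeddings of captions generated for
    the clean edit and the adversarial edit; low similarity suggests the
    attack changed what the edited image depicts."""
    return F.cosine_similarity(emb_orig_edit, emb_adv_edit, dim=-1).item()

def semantic_iou(mask_orig, mask_adv):
    """Mean IoU over the classes present in either segmentation mask.
    mask_*: (H, W) integer label maps; low IoU suggests the attack
    disrupted the edit's spatial layout."""
    ious = []
    for c in torch.unique(torch.cat([mask_orig.flatten(), mask_adv.flatten()])):
        a, b = mask_orig == c, mask_adv == c
        inter = (a & b).sum().float()
        union = (a | b).sum().float()
        ious.append((inter / union).item())
    return sum(ious) / len(ious)

# Toy usage with random inputs (real masks would come from a segmentation
# model, embeddings from a captioner followed by a text encoder).
torch.manual_seed(0)
print(caption_similarity(torch.randn(512), torch.randn(512)))
print(semantic_iou(torch.randint(0, 5, (64, 64)), torch.randint(0, 5, (64, 64))))
```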