Abstract
Deep neural networks (DNNs) are highly susceptible to adversarial
examples--subtle, imperceptible perturbations that can lead to incorrect
predictions. While detection-based defenses offer a practical alternative to
adversarial training, many existing methods depend on external models, complex
architectures, or adversarial data, limiting their efficiency and
generalizability. We introduce a lightweight, plug-in detection framework that
leverages internal layer-wise inconsistencies within the target model itself,
requiring only benign data for calibration. Our approach is grounded in the 'A
Few Large Shifts' assumption, which posits that adversarial perturbations induce
large, localized violations of layer-wise Lipschitz continuity in a small
subset of layers. Building on this, we propose two complementary
strategies--Recovery Testing (RT) and Logit-layer Testing (LT)--to empirically
measure these violations and expose internal disruptions caused by adversaries.
Evaluated on CIFAR-10, CIFAR-100, and ImageNet under both standard and adaptive
threat models, our method achieves state-of-the-art detection performance with
negligible computational overhead. Furthermore, our system-level analysis
provides a practical method for selecting a detection threshold with a formal
lower-bound guarantee on accuracy. The code is available here:
https://github.com/c0510gy/AFLS-AED.
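To make the core idea concrete, the sketch below shows one way a layer-wise inconsistency detector could look in PyTorch. It is not the authors' exact RT/LT implementation (see the linked repository for that). It assumes hypothetical per-layer linear probes trained on benign data only, scores an input by how strongly any intermediate layer's probe disagrees with the final logits, and calibrates the detection threshold as a quantile of scores on held-out benign data. The names `probes`, `layers`, `benign_loader`, and `alpha` are placeholders, not identifiers from the paper's code.

```python
# Illustrative sketch only; hedged stand-in for a layer-wise inconsistency
# detector, not the paper's RT/LT implementation.
import torch
import torch.nn.functional as F


@torch.no_grad()
def layer_shift_scores(model, probes, layers, x):
    """Return one inconsistency score per probed layer for a batch x.

    `layers`: dict name -> submodule of `model` to hook.
    `probes`: dict name -> linear head (trained on benign data) mapping that
              layer's flattened features to class logits.
    """
    feats = {}
    hooks = [
        layer.register_forward_hook(
            lambda m, inp, out, name=name: feats.__setitem__(name, out)
        )
        for name, layer in layers.items()
    ]
    logits = model(x)                 # final prediction of the target model
    for h in hooks:
        h.remove()

    p_final = F.log_softmax(logits, dim=1)
    scores = []
    for name, probe in probes.items():
        z = feats[name].flatten(1)    # intermediate representation
        p_layer = F.log_softmax(probe(z), dim=1)
        # KL(final || layer): how strongly this layer disagrees with the output
        kl = F.kl_div(p_layer, p_final, log_target=True, reduction="none").sum(1)
        scores.append(kl)
    return torch.stack(scores, dim=1)  # shape: (batch, num_probed_layers)


@torch.no_grad()
def calibrate_threshold(model, probes, layers, benign_loader, alpha=0.05):
    """Benign-only calibration: pick the (1 - alpha) quantile of clean scores,
    so roughly a fraction alpha of benign inputs are flagged."""
    all_scores = []
    for x, _ in benign_loader:
        s = layer_shift_scores(model, probes, layers, x).max(dim=1).values
        all_scores.append(s)
    return torch.quantile(torch.cat(all_scores), 1.0 - alpha)


def is_adversarial(model, probes, layers, x, threshold):
    s = layer_shift_scores(model, probes, layers, x).max(dim=1).values
    return s > threshold              # True = flagged as adversarial
```

Taking the maximum over per-layer scores mirrors the "few large shifts" intuition: if adversarial perturbations disrupt only a small subset of layers, one large disagreement is already enough to flag the input, and benign-only quantile calibration keeps the false-positive rate near the chosen alpha.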
Key Contributions
Introduces a lightweight, plug-in adversarial example detection framework that uses internal layer-wise inconsistencies within the target model, grounded in the 'A Few Large Shifts' assumption. It proposes Recovery Testing (RT) and Logit-layer Testing (LT) to empirically measure these layer-wise violations.
Business Value
Enhances the security and trustworthiness of AI systems by providing a practical and efficient way to detect adversarial attacks, crucial for sensitive applications.