arxiv_ai · Methodology Research · Relevant to: AI Safety Researchers, ML Engineers, Developers of LLMs/LVLMs, Cybersecurity Professionals · 2 weeks ago

Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models

📄 Abstract

Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
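To make the first component concrete, below is a minimal, hypothetical sketch of a concept-activation-vector-style safety probe: a linear direction in an LVLM's hidden-state space that separates benign from harmful inputs, onto which new activations are projected. The feature dimension, the simulated activations, and the use of a logistic-regression probe are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's code): a concept-activation-vector-style
# safety probe. We assume hidden states from an LVLM have already been
# extracted into feature vectors; here they are simulated with random data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 512  # hypothetical hidden-state dimension

# Placeholder features for benign vs. harmful inputs (stand-ins for real LVLM activations).
benign_feats = rng.normal(0.0, 1.0, size=(200, d))
harmful_feats = rng.normal(0.5, 1.0, size=(200, d))

X = np.vstack([benign_feats, harmful_feats])
y = np.concatenate([np.zeros(200), np.ones(200)])  # 1 = harmful

# A linear probe: its (normalized) weight vector plays the role of a safety
# concept activation vector, i.e. a direction in representation space.
probe = LogisticRegression(max_iter=1000).fit(X, y)
safety_cav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# Projecting a new activation onto this direction yields a safety-oriented feature.
new_activation = rng.normal(0.0, 1.0, size=d)
safety_score = float(new_activation @ safety_cav)
print(f"projection onto safety direction: {safety_score:.3f}")
```

In the paper's setting, the inputs would be actual LVLM hidden states rather than random vectors, and the resulting safety-oriented representations would feed the auto-encoder stage sketched further below.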
Authors (5)
Shuang Liang
Zhihao Xu
Jialing Tao
Hui Xue
Xiting Wang
Submitted
October 17, 2025
arXiv Category
cs.CV

Key Contributions

This paper proposes Learning to Detect (LoD), a general framework for accurately detecting unknown jailbreak attacks in Large Vision-Language Models (LVLMs). LoD shifts the focus from attack-specific learning to task-specific learning through two components: a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. The authors report consistently higher detection AUROC on diverse unknown attacks and improved efficiency compared to existing methods.
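The second component can be sketched in a similarly hedged way: an auto-encoder trained only on safety-oriented features of benign inputs, with per-sample reconstruction error used as the attack score and AUROC used for evaluation. The simulated features, network size, and the choice of scikit-learn's MLPRegressor are assumptions for illustration, not the authors' code.

```python
# Minimal sketch (assumptions, not the paper's implementation): an auto-encoder
# fit only on benign safety-oriented features; at test time, reconstruction
# error serves as the attack score and is evaluated with AUROC.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
d = 64  # hypothetical feature dimension

benign_train = rng.normal(0.0, 1.0, size=(500, d))
benign_test = rng.normal(0.0, 1.0, size=(100, d))
attack_test = rng.normal(1.5, 1.0, size=(100, d))  # stand-in for unseen jailbreak features

# Train the auto-encoder to reconstruct benign patterns only (no attack labels).
autoencoder = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
autoencoder.fit(benign_train, benign_train)

def reconstruction_error(model, X):
    """Per-sample squared reconstruction error, used as the anomaly score."""
    return np.mean((model.predict(X) - X) ** 2, axis=1)

scores = np.concatenate([
    reconstruction_error(autoencoder, benign_test),
    reconstruction_error(autoencoder, attack_test),
])
labels = np.concatenate([np.zeros(100), np.ones(100)])  # 1 = attack

print(f"detection AUROC: {roc_auc_score(labels, scores):.3f}")
```

Because the detector is fit without any attack labels, the same reconstruction-error score can be applied to attacks never seen during training, which is the generalization property the paper targets.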

Business Value

Ensuring the safety and robustness of powerful AI models such as LVLMs is critical for their responsible deployment. Reliable detection of previously unseen jailbreak attacks can help prevent misuse, protect users, and build trust in AI systems across a wide range of applications.
