📄 Abstract
Despite extensive alignment efforts, Large Vision-Language Models (LVLMs)
remain vulnerable to jailbreak attacks, posing serious safety risks. To address
this, existing detection methods either learn attack-specific parameters, which
hinders generalization to unseen attacks, or rely on heuristically sound
principles, which limit accuracy and efficiency. To overcome these limitations,
we propose Learning to Detect (LoD), a general framework that accurately
detects unknown jailbreak attacks by shifting the focus from attack-specific
learning to task-specific learning. This framework includes a Multi-modal
Safety Concept Activation Vector module for safety-oriented representation
learning and a Safety Pattern Auto-Encoder module for unsupervised attack
classification. Extensive experiments show that our method achieves
consistently higher detection AUROC on diverse unknown attacks while improving
efficiency. The code is available at
https://anonymous.4open.science/r/Learning-to-Detect-51CB.
Authors (5)
Shuang Liang
Zhihao Xu
Jialing Tao
Hui Xue
Xiting Wang
Submitted
October 17, 2025
Key Contributions
This paper proposes Learning to Detect (LoD), a general framework for accurately detecting unknown jailbreak attacks on Large Vision-Language Models (LVLMs). LoD shifts the focus from attack-specific learning to task-specific learning, pairing a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning with a Safety Pattern Auto-Encoder module for unsupervised attack classification. It achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency over existing methods.
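The page does not spell out how the two modules interact at inference time; the sketch below is a minimal, hypothetical reading of the general recipe (concept-vector projection followed by auto-encoder anomaly scoring), not the authors' released implementation (see the linked repository for that). All names here, such as `SafetyPatternAE`, `safety_pattern`, and `jailbreak_score`, are made up for illustration, and random tensors stand in for real LVLM activations and learned concept vectors.

```python
# Minimal sketch (assumed, not the LoD codebase): score an input by
# (1) projecting layer-wise hidden states onto learned safety concept
# directions, then (2) measuring how poorly an auto-encoder trained only
# on benign safety patterns reconstructs that pattern. A high
# reconstruction error is flagged as a likely jailbreak attempt.
import torch
import torch.nn as nn


class SafetyPatternAE(nn.Module):
    """Small auto-encoder fit on the safety patterns of benign inputs."""

    def __init__(self, dim: int, hidden: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


def safety_pattern(hidden_states: torch.Tensor, safety_cavs: torch.Tensor) -> torch.Tensor:
    """Project each layer's pooled hidden state onto that layer's safety direction.

    hidden_states: (num_layers, d_model)  pooled LVLM activations per layer
    safety_cavs:   (num_layers, d_model)  one learned safety concept vector per layer
    returns:       (num_layers,)          the input's safety pattern
    """
    return (hidden_states * safety_cavs).sum(dim=-1)


@torch.no_grad()
def jailbreak_score(ae: SafetyPatternAE, pattern: torch.Tensor) -> float:
    """Reconstruction error; larger means the pattern deviates from benign data."""
    recon = ae(pattern)
    return torch.mean((recon - pattern) ** 2).item()


if __name__ == "__main__":
    num_layers, d_model = 32, 4096
    safety_cavs = torch.randn(num_layers, d_model)    # stand-in for learned CAVs
    ae = SafetyPatternAE(dim=num_layers)               # stand-in for a trained AE

    hidden_states = torch.randn(num_layers, d_model)   # stand-in LVLM activations
    score = jailbreak_score(ae, safety_pattern(hidden_states, safety_cavs))
    print(f"anomaly score: {score:.4f} (flag if above a validation-set threshold)")
```

In such a setup the detection threshold would typically be chosen on held-out benign data, which is consistent with the paper's framing of unsupervised attack classification; the exact training objectives and pooling choices are specified in the paper and repository, not here.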
Business Value
Ensuring the safety and robustness of powerful AI models like LVLMs is critical for their responsible deployment. This technology can help prevent misuse, protect users, and build trust in AI systems across various applications.