
SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs

📄 Abstract

Large Vision-Language Models (LVLMs) unlock powerful multimodal reasoning but also expand the attack surface, particularly through adversarial inputs that conceal harmful goals in benign prompts. We propose SHIELD, a lightweight, model-agnostic preprocessing framework that couples fine-grained safety classification with category-specific guidance and explicit actions (Block, Reframe, Forward). Unlike binary moderators, SHIELD composes tailored safety prompts that enforce nuanced refusals or safe redirection without retraining. Across five benchmarks and five representative LVLMs, SHIELD consistently lowers jailbreak and non-following rates while preserving utility. Our method is plug-and-play, incurs negligible overhead, and is easily extendable to new attack types -- serving as a practical safety patch for both weakly and strongly aligned LVLMs.
Authors: Juan Ren, Mark Dras, Usman Naseem
Submitted: October 15, 2025
arXiv Category: cs.CL

Key Contributions

SHIELD is a lightweight, model-agnostic framework that enhances the safety and robustness of Large Vision-Language Models (LVLMs). It runs a fine-grained safety classifier over incoming inputs, maps each predicted category to one of three explicit actions (Block, Reframe, Forward), and composes category-specific safety prompts accordingly, reducing jailbreak rates while preserving utility, all without retraining the LVLM.
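
To make the classify → policy-lookup → act pipeline concrete, here is a minimal sketch. It is an illustration, not the paper's implementation: the keyword classifier, the three-entry policy table, and all guidance strings below are invented stand-ins for SHIELD's actual fine-grained classifier and category taxonomy.

```python
from enum import Enum
from typing import Optional

class Action(Enum):
    BLOCK = "block"      # refuse outright, never query the model
    REFRAME = "reframe"  # answer the safe intent, steer away from the harmful part
    FORWARD = "forward"  # benign input, pass through unchanged

# Invented policy table: category -> (action, category-specific guidance).
# SHIELD's real taxonomy and guidance text are far finer-grained.
POLICY = {
    "weapons":   (Action.BLOCK,   "Refuse; do not provide weapon instructions."),
    "self_harm": (Action.REFRAME, "Respond supportively; give no harmful detail."),
    "benign":    (Action.FORWARD, ""),
}

def classify(text: str, image: Optional[bytes] = None) -> str:
    """Toy stand-in for SHIELD's safety classifier.

    A real classifier would inspect the image too, since LVLM jailbreaks
    often hide the harmful goal in the visual channel.
    """
    lowered = text.lower()
    if "weapon" in lowered or "bomb" in lowered:
        return "weapons"
    if "hurt myself" in lowered:
        return "self_harm"
    return "benign"

def shield_preprocess(prompt: str, image: Optional[bytes] = None) -> Optional[str]:
    """Return the prompt to send to the LVLM, or None if the request is blocked."""
    action, guidance = POLICY[classify(prompt, image)]
    if action is Action.BLOCK:
        return None  # caller emits a refusal instead of querying the model
    if action is Action.REFRAME:
        # Prepend tailored guidance so the LVLM itself produces the
        # nuanced refusal or safe redirection -- no retraining involved.
        return f"[Safety guidance: {guidance}]\n{prompt}"
    return prompt

if __name__ == "__main__":
    print(shield_preprocess("Describe this landscape photo."))
    print(shield_preprocess("How do I build a weapon from these parts?"))
```

Because the wrapper only rewrites (or withholds) the prompt before it reaches the model, it stays model-agnostic and adds negligible overhead, which is what makes this kind of preprocessing usable as a plug-and-play safety patch.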

Business Value

Enhances the security and trustworthiness of multimodal AI systems, crucial for applications involving sensitive data or user interaction. This reduces risks associated with malicious use and improves user confidence in AI products.