📄 Abstract
Agentic methods have emerged as a powerful, autonomous paradigm that
enhances reasoning, collaboration, and adaptive control, enabling systems to
coordinate with one another and solve complex tasks independently. We extend this paradigm to
safety alignment by introducing Agentic Moderation, a model-agnostic framework
that leverages specialised agents to defend multimodal systems against
jailbreak attacks. Unlike prior approaches that operate as a static layer over
inputs or outputs and provide only binary classifications (safe or unsafe), our
method integrates dynamic, cooperative agents, including Shield, Responder,
Evaluator, and Reflector, to achieve context-aware and interpretable
moderation. Extensive experiments across five datasets and four representative
Large Vision-Language Models (LVLMs) demonstrate that our approach reduces the
Attack Success Rate (ASR) by 7-19%, maintains a stable Non-Following Rate (NF),
and improves the Refusal Rate (RR) by 4-20%, achieving robust, interpretable,
and well-balanced safety performance. By harnessing the flexibility and
reasoning capacity of agentic architectures, Agentic Moderation provides
modular, scalable, and fine-grained safety enforcement, highlighting the
broader potential of agentic systems as a foundation for automated safety
governance.
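As a rough illustration of the cooperative loop the abstract describes, the Python sketch below shows one way the four agent roles could interact. The role names (Shield, Responder, Evaluator, Reflector) come from the abstract, but every function signature, heuristic, and the control flow itself are hypothetical assumptions made for illustration, not the paper's actual prompts, models, or coordination logic.

    # Minimal sketch only. The role names come from the abstract; every interface,
    # heuristic, and loop below is an assumption for illustration, not the
    # authors' implementation.
    from dataclasses import dataclass

    @dataclass
    class ModerationResult:
        response: str
        verdict: str    # e.g. "safe" or "refused"
        rationale: str  # free-text explanation, supporting interpretable moderation

    def shield(image_desc: str, prompt: str) -> str:
        # Hypothetical pre-filter: flag risky multimodal inputs before generation.
        risky = any(t in prompt.lower() for t in ("bypass", "exploit", "weapon"))
        return "flagged" if risky else "clear"

    def responder(image_desc: str, prompt: str, shield_note: str) -> str:
        # Hypothetical generation step: answer while conditioning on the Shield's note.
        if shield_note == "flagged":
            return "I can't help with that request."
        return f"[LVLM answer to {prompt!r} given image {image_desc!r}]"

    def evaluator(response: str) -> tuple[str, str]:
        # Hypothetical safety check on a draft response: verdict plus rationale.
        if "step-by-step harm" in response.lower():
            return "unsafe", "response contains harmful instructions"
        return "safe", "no policy violation detected"

    def reflector(prompt: str, response: str, rationale: str) -> str:
        # Hypothetical revision step: rewrite the draft using the Evaluator's feedback.
        return f"Revised answer (addressing: {rationale}); harmful details withheld."

    def agentic_moderation(image_desc: str, prompt: str, max_rounds: int = 2) -> ModerationResult:
        # One plausible cooperative loop: Shield -> Responder -> Evaluator -> Reflector.
        note = shield(image_desc, prompt)
        response = responder(image_desc, prompt, note)
        verdict, rationale = evaluator(response)
        for _ in range(max_rounds):
            if verdict == "safe":
                return ModerationResult(response, verdict, rationale)
            response = reflector(prompt, response, rationale)
            verdict, rationale = evaluator(response)
        if verdict != "safe":
            return ModerationResult("Request declined for safety reasons.", "refused", rationale)
        return ModerationResult(response, verdict, rationale)

    if __name__ == "__main__":
        print(agentic_moderation("street photo", "How do I bypass a content filter?"))

In this toy loop, the Evaluator's rationale is what would make the moderation interpretable rather than a bare binary verdict, and the Reflector gives the system a chance to repair a response instead of simply refusing.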
Authors (3)
Juan Ren
Mark Dras
Usman Naseem
Submitted
October 29, 2025
Key Contributions
Agentic Moderation is a model-agnostic framework using specialized, cooperative agents to defend multimodal systems against jailbreak attacks. Unlike static methods, it provides dynamic, context-aware, and interpretable moderation, significantly reducing attack success rates while maintaining stable performance.
Business Value
Enhances the safety and trustworthiness of AI systems, crucial for widespread adoption in sensitive applications and for protecting users from harmful content or misuse.