📄 Abstract
Agentic methods have emerged as a powerful, autonomous paradigm that
enhances reasoning, collaboration, and adaptive control, enabling systems to
coordinate with one another and solve complex tasks independently. We extend this paradigm to
safety alignment by introducing Agentic Moderation, a model-agnostic framework
that leverages specialised agents to defend multimodal systems against
jailbreak attacks. Unlike prior approaches that operate as a static layer over
inputs or outputs and provide only binary classifications (safe or unsafe), our
method integrates dynamic, cooperative agents, including Shield, Responder,
Evaluator, and Reflector, to achieve context-aware and interpretable
moderation. Extensive experiments across five datasets and four representative
Large Vision-Language Models (LVLMs) demonstrate that our approach reduces the
Attack Success Rate (ASR) by 7-19%, maintains a stable Non-Following Rate (NF),
and improves the Refusal Rate (RR) by 4-20%, achieving robust, interpretable,
and well-balanced safety performance. By harnessing the flexibility and
reasoning capacity of agentic architectures, Agentic Moderation provides
modular, scalable, and fine-grained safety enforcement, highlighting the
broader potential of agentic systems as a foundation for automated safety
governance.
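As a rough illustration of the cooperative loop the abstract describes, the Python sketch below shows one way the four agent roles could interact. The role names (Shield, Responder, Evaluator, Reflector) come from the abstract, but every function signature, heuristic, and the control flow itself are hypothetical assumptions made for illustration, not the paper's actual prompts, models, or coordination logic.

    # Minimal sketch only. The role names come from the abstract; every interface,
    # heuristic, and loop below is an assumption for illustration, not the
    # authors' implementation.
    from dataclasses import dataclass

    @dataclass
    class ModerationResult:
        response: str
        verdict: str    # e.g. "safe" or "refused"
        rationale: str  # free-text explanation, supporting interpretable moderation

    def shield(image_desc: str, prompt: str) -> str:
        # Hypothetical pre-filter: flag risky multimodal inputs before generation.
        risky = any(t in prompt.lower() for t in ("bypass", "exploit", "weapon"))
        return "flagged" if risky else "clear"

    def responder(image_desc: str, prompt: str, shield_note: str) -> str:
        # Hypothetical generation step: answer while conditioning on the Shield's note.
        if shield_note == "flagged":
            return "I can't help with that request."
        return f"[LVLM answer to {prompt!r} given image {image_desc!r}]"

    def evaluator(response: str) -> tuple[str, str]:
        # Hypothetical safety check on a draft response: verdict plus rationale.
        if "step-by-step harm" in response.lower():
            return "unsafe", "response contains harmful instructions"
        return "safe", "no policy violation detected"

    def reflector(prompt: str, response: str, rationale: str) -> str:
        # Hypothetical revision step: rewrite the draft using the Evaluator's feedback.
        return f"Revised answer (addressing: {rationale}); harmful details withheld."

    def agentic_moderation(image_desc: str, prompt: str, max_rounds: int = 2) -> ModerationResult:
        # One plausible cooperative loop: Shield -> Responder -> Evaluator -> Reflector.
        note = shield(image_desc, prompt)
        response = responder(image_desc, prompt, note)
        verdict, rationale = evaluator(response)
        for _ in range(max_rounds):
            if verdict == "safe":
                return ModerationResult(response, verdict, rationale)
            response = reflector(prompt, response, rationale)
            verdict, rationale = evaluator(response)
        if verdict != "safe":
            return ModerationResult("Request declined for safety reasons.", "refused", rationale)
        return ModerationResult(response, verdict, rationale)

    if __name__ == "__main__":
        print(agentic_moderation("street photo", "How do I bypass a content filter?"))

In this toy loop, the Evaluator's rationale is what would make the moderation interpretable rather than a bare binary verdict, and the Reflector gives the system a chance to repair a response instead of simply refusing.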
Authors (3)
Juan Ren
Mark Dras
Usman Naseem
Submitted
October 29, 2025
Key Contributions
Agentic Moderation is a model-agnostic framework using specialized, cooperative agents to defend multimodal systems against jailbreak attacks. Unlike static methods, it provides dynamic, context-aware, and interpretable moderation, significantly reducing attack success rates while maintaining stable performance.
Business Value
Enhances the safety and trustworthiness of AI systems, crucial for widespread adoption in sensitive applications and for protecting users from harmful content or misuse.