Abstract
The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red-teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic two-step process that starts from an agent definition and generates diverse seed test cases covering various risk outcomes, tool-use trajectories, and risk sources. It then iteratively constructs and refines model-based adversarial attacks based on the execution trajectories of prior attempts. To reduce red-teaming cost, we present a model distillation approach that leverages structured forms of a teacher model's reasoning to train smaller models that are equally effective. Across diverse agent evaluation settings, our seed test case generation approach yields a 2-2.5x boost in the coverage of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer model improves attack success rate by 100%, surpassing the 671B DeepSeek-R1 model. Our ablations and analyses validate the effectiveness of the iterative framework and structured reasoning, and the generalization of our red-teamer models.
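The iterative refinement loop described above can be pictured with a minimal sketch. Everything here is a hypothetical illustration under assumptions, not SIRAJ's actual API: the names `agent.run`, `red_teamer.generate`, and `judge.is_unsafe` are stand-ins for the black-box agent, the red-teamer model, and a safety judge.

```python
# Minimal sketch of an iterative red-teaming loop in the spirit of the
# abstract. All object interfaces (agent, red_teamer, judge) are
# hypothetical stand-ins, not the paper's implementation.
from dataclasses import dataclass, field


@dataclass
class Attempt:
    prompt: str
    trajectory: list = field(default_factory=list)  # observed tool calls
    risk_triggered: bool = False


def red_team(agent, red_teamer, judge, seed_cases, max_iters=5):
    """For each seed test case, iteratively refine an adversarial prompt
    by conditioning on the execution trajectories of prior attempts."""
    findings = []
    for seed in seed_cases:
        history: list[Attempt] = []
        for _ in range(max_iters):
            # Red-teamer sees the seed case plus all prior trajectories.
            prompt = red_teamer.generate(seed, history)
            trajectory = agent.run(prompt)  # black-box agent execution
            attempt = Attempt(prompt, trajectory,
                              judge.is_unsafe(trajectory))
            history.append(attempt)
            if attempt.risk_triggered:
                findings.append(attempt)  # vulnerability discovered
                break
    return findings
```

The key design point the abstract emphasizes is that each refinement step is grounded in the agent's actual execution trajectory, not just the text of the failed prompt.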
Authors (4)
Kaiwen Zhou
Ahmed Elgohary
A S M Iftekhar
Amin Saied
Submitted
October 30, 2025
Key Contributions
Introduces SIRAJ, a generic red-teaming framework for LLM agents that plan and invoke tools. It employs a dynamic process to generate diverse test cases covering various risks and iteratively refines attacks. A key innovation is model distillation that uses structured reasoning to train smaller yet equally effective red-teamer models, reducing red-teaming cost.
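As a rough sketch of how structured teacher reasoning might be turned into training data for a smaller student, consider the snippet below. The trace schema (analysis/strategy/attack fields) and function name are assumptions for illustration; the paper's exact format is not specified here.

```python
# Hypothetical sketch: serializing a teacher red-teamer's structured
# reasoning into prompt/completion pairs for fine-tuning a student model.
# The field names below are illustrative assumptions, not SIRAJ's schema.
import json


def build_distillation_example(seed_case: str, teacher_trace: dict) -> dict:
    """Pack the teacher's structured reasoning into one training example."""
    completion = json.dumps({
        "analysis": teacher_trace["analysis"],  # risk assessment of the agent
        "strategy": teacher_trace["strategy"],  # chosen attack strategy
        "attack": teacher_trace["attack"],      # final adversarial prompt
    }, indent=2)
    return {
        "prompt": f"Agent definition and seed test case:\n{seed_case}",
        "completion": completion,
    }
```

Training the student to emit the structured fields, rather than free-form chain-of-thought, is what makes the teacher's reasoning compact enough for a small model to imitate.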
Business Value
Significantly enhances the safety and reliability of AI agents by proactively identifying and mitigating potential vulnerabilities before deployment, reducing risks of misuse or unintended consequences.