Redirecting to original paper in 30 seconds...
Click below to go immediately or wait for automatic redirect
📄 Abstract
Abstract: As Large Language Models are rapidly deployed across diverse applications
from healthcare to financial advice, safety evaluation struggles to keep pace.
Current benchmarks focus on single-turn interactions with generic policies,
failing to capture the conversational dynamics of real-world usage and the
application-specific harms that emerge in context. Such potential oversights
can lead to harms that go unnoticed in standard safety benchmarks and other
current evaluation methodologies. To address these needs for robust AI safety
evaluation, we introduce SAGE (Safety AI Generic Evaluation), an automated
modular framework designed for customized and dynamic harm evaluations. SAGE
employs prompted adversarial agents with diverse personalities based on the Big
Five model, enabling system-aware multi-turn conversations that adapt to target
applications and harm policies. We evaluate seven state-of-the-art LLMs across
three applications and harm policies. Multi-turn experiments show that harm
increases with conversation length, model behavior varies significantly when
exposed to different user personalities and scenarios, and some models minimize
harm via high refusal rates that reduce usefulness. We also demonstrate policy
sensitivity within a harm category where tightening a child-focused sexual
policy substantially increases measured defects across applications. These
results motivate adaptive, policy-aware, and context-specific testing for safer
real-world deployment.
Authors (4)
Madhur Jindal
Hari Shrawgi
Parag Agrawal
Sandipan Dandapat
Key Contributions
SAGE is a novel, automated, and modular framework for LLM safety evaluation that addresses the limitations of current benchmarks. It employs prompted adversarial agents with diverse personalities to conduct system-aware, multi-turn conversations, enabling customized and dynamic harm evaluations across various applications.
Business Value
Provides organizations with a more rigorous and adaptable method to ensure the safety and reliability of LLM deployments, mitigating risks and building user trust.