RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

Abstract

Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.
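As a rough sketch of the optimization stage the abstract describes, the PyTorch snippet below assembles a three-term objective over continuous suffix embeddings: a target loss encouraging the restricted response, a refusal-aware regularizer penalizing alignment with a refusal direction, and a coherence term keeping embeddings near real token embeddings. All names, the Hugging Face-style model call, the way the refusal direction is obtained, and the weights lambda_r and lambda_c are assumptions for illustration, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def raid_style_loss(model, prompt_emb, suffix_emb, target_ids, refusal_dir,
                        lambda_r=0.1, lambda_c=0.1):
        # Illustrative joint objective; names and weights are assumptions.
        # suffix_emb: (T, d) continuous suffix embeddings with requires_grad=True
        # refusal_dir: (d,) unit vector treated as a precomputed "refusal direction"
        table = model.get_input_embeddings().weight            # (V, d) token embeddings

        # (i) Target loss: teacher-force the restricted target response and
        # penalize its negative log-likelihood.
        target_emb = model.get_input_embeddings()(target_ids)  # (n, d)
        inputs = torch.cat([prompt_emb, suffix_emb, target_emb]).unsqueeze(0)
        logits = model(inputs_embeds=inputs).logits[0]
        n = target_ids.size(0)
        l_target = F.cross_entropy(logits[-n - 1:-1], target_ids)

        # (ii) Refusal-aware regularizer: discourage suffix embeddings from
        # pointing along the refusal direction.
        l_refusal = F.cosine_similarity(
            suffix_emb, refusal_dir.expand_as(suffix_emb), dim=-1).mean()

        # (iii) Coherence term: keep each continuous embedding close to some
        # real token embedding, so decoding stays semantically plausible.
        l_coherence = torch.cdist(suffix_emb, table).min(dim=-1).values.mean()

        return l_target + lambda_r * l_refusal + lambda_c * l_coherence

Under these assumptions, the suffix embeddings would be updated by gradient descent on this loss before the decoding step maps them back to discrete tokens.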
Authors (5)
Tuan T. Nguyen
John Le
Thai T. Vu
Willy Susilo
Heath Cooper
Submitted
October 14, 2025
arXiv Category
cs.CL

Key Contributions

Introduces RAID (Refusal-Aware and Integrated Decoding), a framework for systematically probing LLM jailbreak vulnerabilities. RAID crafts adversarial suffixes by optimizing continuous embeddings under a joint objective that encourages restricted responses, steers activations away from refusal directions, and maintains semantic coherence, yielding jailbreaks that are both effective and natural in form. A critic-guided decoding step then maps the optimized embeddings back to tokens, as in the sketch below.
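A minimal sketch of that critic-guided decoding step, under the same illustrative assumptions: each optimized embedding is mapped back to a token by blending embedding affinity (cosine similarity to the token embedding table) with the language model's own likelihood of the candidate token. The blend weight alpha and the top-k candidate pool are hypothetical; the paper's actual critic is not reproduced here.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def decode_suffix(model, tokenizer, prompt_ids, suffix_emb, alpha=0.5, top_k=32):
        # Greedy projection of continuous suffix embeddings to discrete tokens;
        # alpha and top_k are illustrative assumptions, not the paper's values.
        table = model.get_input_embeddings().weight            # (V, d)
        ids = prompt_ids.clone()                               # running token sequence
        for emb in suffix_emb:                                 # one suffix position at a time
            # Embedding affinity of this position to every vocabulary embedding.
            affinity = F.cosine_similarity(emb.unsqueeze(0), table, dim=-1)  # (V,)
            # Language-model likelihood of each next token given tokens so far.
            logprobs = F.log_softmax(model(ids.unsqueeze(0)).logits[0, -1], dim=-1)
            # Score the top-k most affine candidates by a weighted combination.
            cand = affinity.topk(top_k).indices
            score = alpha * affinity[cand] + (1 - alpha) * logprobs[cand]
            ids = torch.cat([ids, cand[score.argmax()].view(1)])
        return tokenizer.decode(ids[prompt_ids.size(0):])

Weighing the two scores against each other is what lets the decoded suffix stay close to the optimized embeddings while remaining fluent under the model.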

Business Value

Helps developers understand and mitigate jailbreak vulnerabilities in LLMs, leading to more secure and reliable AI systems, which is crucial for enterprise adoption.