RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

Abstract

Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.
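As a rough sketch of the optimization stage the abstract describes, the PyTorch snippet below assembles a three-term objective over continuous suffix embeddings: a target loss encouraging the restricted response, a refusal-aware regularizer penalizing alignment with a refusal direction, and a coherence term keeping embeddings near real token embeddings. All names, the Hugging Face-style model call, the way the refusal direction is obtained, and the weights lambda_r and lambda_c are assumptions for illustration, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def raid_style_loss(model, prompt_emb, suffix_emb, target_ids, refusal_dir,
                        lambda_r=0.1, lambda_c=0.1):
        # Illustrative joint objective; names and weights are assumptions.
        # suffix_emb: (T, d) continuous suffix embeddings with requires_grad=True
        # refusal_dir: (d,) unit vector treated as a precomputed "refusal direction"
        table = model.get_input_embeddings().weight            # (V, d) token embeddings

        # (i) Target loss: teacher-force the restricted target response and
        # penalize its negative log-likelihood.
        target_emb = model.get_input_embeddings()(target_ids)  # (n, d)
        inputs = torch.cat([prompt_emb, suffix_emb, target_emb]).unsqueeze(0)
        logits = model(inputs_embeds=inputs).logits[0]
        n = target_ids.size(0)
        l_target = F.cross_entropy(logits[-n - 1:-1], target_ids)

        # (ii) Refusal-aware regularizer: discourage suffix embeddings from
        # pointing along the refusal direction.
        l_refusal = F.cosine_similarity(
            suffix_emb, refusal_dir.expand_as(suffix_emb), dim=-1).mean()

        # (iii) Coherence term: keep each continuous embedding close to some
        # real token embedding, so decoding stays semantically plausible.
        l_coherence = torch.cdist(suffix_emb, table).min(dim=-1).values.mean()

        return l_target + lambda_r * l_refusal + lambda_c * l_coherence

Under these assumptions, the suffix embeddings would be updated by gradient descent on this loss before the decoding step maps them back to discrete tokens.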
Authors (5)
Tuan T. Nguyen
John Le
Thai T. Vu
Willy Susilo
Heath Cooper
Submitted
October 14, 2025
arXiv Category
cs.CL

Key Contributions

Introduces RAID (Refusal-Aware and Integrated Decoding), a framework for systematically probing LLM jailbreak vulnerabilities. RAID crafts adversarial suffixes by optimizing continuous embeddings under a joint objective that encourages restricted responses, steers activations away from refusal directions, and maintains semantic coherence, yielding jailbreaks that are both effective and natural in form. A critic-guided decoding step then maps the optimized embeddings back to tokens, as in the sketch below.
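A minimal sketch of that critic-guided decoding step, under the same illustrative assumptions: each optimized embedding is mapped back to a token by blending embedding affinity (cosine similarity to the token embedding table) with the language model's own likelihood of the candidate token. The blend weight alpha and the top-k candidate pool are hypothetical; the paper's actual critic is not reproduced here.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def decode_suffix(model, tokenizer, prompt_ids, suffix_emb, alpha=0.5, top_k=32):
        # Greedy projection of continuous suffix embeddings to discrete tokens;
        # alpha and top_k are illustrative assumptions, not the paper's values.
        table = model.get_input_embeddings().weight            # (V, d)
        ids = prompt_ids.clone()                               # running token sequence
        for emb in suffix_emb:                                 # one suffix position at a time
            # Embedding affinity of this position to every vocabulary embedding.
            affinity = F.cosine_similarity(emb.unsqueeze(0), table, dim=-1)  # (V,)
            # Language-model likelihood of each next token given tokens so far.
            logprobs = F.log_softmax(model(ids.unsqueeze(0)).logits[0, -1], dim=-1)
            # Score the top-k most affine candidates by a weighted combination.
            cand = affinity.topk(top_k).indices
            score = alpha * affinity[cand] + (1 - alpha) * logprobs[cand]
            ids = torch.cat([ids, cand[score.argmax()].view(1)])
        return tokenizer.decode(ids[prompt_ids.size(0):])

Weighing the two scores against each other is what lets the decoded suffix stay close to the optimized embeddings while remaining fluent under the model.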

Business Value

Helps developers understand and mitigate jailbreak vulnerabilities in LLMs, leading to more secure and reliable AI systems, which is crucial for enterprise adoption.