Abstract
Large reasoning models (LRMs) achieve higher task performance by allocating more inference-time compute, and prior work suggests this scaled reasoning may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models that pads harmful requests with long sequences of harmless puzzle reasoning. On HarmBench, CoT Hijacking reaches attack success rates (ASR) of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively, far exceeding prior jailbreak methods for LRMs. To understand the effectiveness of the attack, we turn to a mechanistic analysis, which shows that mid layers encode the strength of safety checking while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of the attention heads identified by this analysis causally decrease refusal, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning, explicit CoT, can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.
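As a rough illustration of the attention-dilution claim, the following is a minimal sketch (not the authors' code): it measures how much attention the final token position pays to a placeholder request span at a mid and a late layer, for short versus long benign padding. The model (gpt2 as an open-weight stand-in), the prompt strings, and the layer choices are all illustrative assumptions.

```python
# Sketch: attention mass on a placeholder "request" span vs. benign padding length.
# gpt2 is a stand-in for the models studied in the paper; strings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def attention_to_request(prefix: str, request: str, layer: int) -> float:
    """Share of the last token's attention (head-averaged, at `layer`)
    that lands on the `request` tokens when preceded by `prefix`."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    request_ids = tok(request, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, request_ids], dim=-1)
    with torch.no_grad():
        out = model(input_ids, output_attentions=True)
    attn = out.attentions[layer][0]          # (heads, seq, seq)
    last_row = attn[:, -1, :].mean(dim=0)    # last query position, averaged over heads
    return last_row[prefix_ids.shape[-1]:].sum().item()

request = "<placeholder request span>"       # stand-in text only
pad_unit = "Solve the puzzle: 2 + 2 = 4. "   # benign filler "reasoning"
mid, late = model.config.n_layer // 2, model.config.n_layer - 1

# pad_unit * 80 stays under gpt2's 1024-token context window.
for name, pad in [("short padding", pad_unit), ("long padding", pad_unit * 80)]:
    print(name,
          "mid:", round(attention_to_request(pad, request, mid), 4),
          "late:", round(attention_to_request(pad, request, late), 4))
```

Under the dilution hypothesis, the long-padding condition should report a smaller share of attention on the request span at both layers; the paper's actual measurement setup and models differ from this toy setting.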
Authors (5)
Jianli Zhao
Tingchen Fu
Rylan Schaeffer
Mrinank Sharma
Fazl Barez
Submitted
October 30, 2025
Key Contributions
Introduces Chain-of-Thought (CoT) Hijacking, a novel jailbreak attack that exploits the reasoning capabilities of Large Reasoning Models (LRMs). By padding harmful requests with long sequences of benign puzzle reasoning, the attack bypasses safety safeguards, achieving near-perfect attack success rates on state-of-the-art models. Mechanistic analysis reveals that long benign CoT dilutes the safety-checking signal in mid layers and the verification outcome in late layers, and that ablating the implicated attention heads causally reduces refusal; a toy version of that causal test is sketched below.
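The causal part of the analysis can be approximated in spirit with Hugging Face's `head_mask` argument, which nullifies selected attention heads during a forward pass. The sketch below (again not the authors' code) compares the log-probability of a refusal-style next token with and without a few heads masked, on gpt2 as a stand-in; the probe prompt, the " Sorry" target, and the (layer, head) indices are hypothetical placeholders, not the heads identified in the paper.

```python
# Sketch: nullify a few attention heads via `head_mask` and compare the
# log-prob of a refusal-style next token. Indices and prompts are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_logprob(prompt: str, target: str, ablate=None) -> float:
    """Log-prob of `target`'s first token as the continuation of `prompt`,
    with any (layer, head) pairs in `ablate` nullified via `head_mask`."""
    head_mask = torch.ones(model.config.n_layer, model.config.n_head)
    for layer, head in (ablate or []):
        head_mask[layer, head] = 0.0         # zero this head's attention weights
    ids = tok(prompt, return_tensors="pt").input_ids
    target_id = tok(target, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        logits = model(ids, head_mask=head_mask).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[target_id].item()

probe = "Request: <placeholder>. Assistant:"   # illustrative probe, not from the paper
print("baseline:", next_token_logprob(probe, " Sorry"))
print("ablated :", next_token_logprob(probe, " Sorry", ablate=[(6, 3), (7, 1)]))
```

In the paper's setting the comparison is between refusal behavior with and without the identified safety-subnetwork heads; this toy log-prob comparison only shows the mechanics of head masking.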
Business Value
Highlights critical security vulnerabilities in advanced AI models, driving the development of more robust safety mechanisms and secure AI deployment practices.