Abstract
Large reasoning models (LRMs) achieve higher task performance by allocating more inference-time compute, and prior work suggests this scaled reasoning may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models that pads harmful requests with long sequences of harmless puzzle reasoning. On HarmBench, CoT Hijacking reaches attack success rates (ASR) of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively, far exceeding prior jailbreak methods for LRMs. To understand the effectiveness of the attack, we turn to a mechanistic analysis, which shows that mid layers encode the strength of safety checking while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of the attention heads identified by this analysis causally decrease refusal, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning, explicit CoT, can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.
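As a rough illustration of the attention-dilution claim, the following is a minimal sketch (not the authors' code): it measures how much attention the final token position pays to a placeholder request span at a mid and a late layer, for short versus long benign padding. The model (gpt2 as an open-weight stand-in), the prompt strings, and the layer choices are all illustrative assumptions.

```python
# Sketch: attention mass on a placeholder "request" span vs. benign padding length.
# gpt2 is a stand-in for the models studied in the paper; strings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def attention_to_request(prefix: str, request: str, layer: int) -> float:
    """Share of the last token's attention (head-averaged, at `layer`)
    that lands on the `request` tokens when preceded by `prefix`."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    request_ids = tok(request, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, request_ids], dim=-1)
    with torch.no_grad():
        out = model(input_ids, output_attentions=True)
    attn = out.attentions[layer][0]          # (heads, seq, seq)
    last_row = attn[:, -1, :].mean(dim=0)    # last query position, averaged over heads
    return last_row[prefix_ids.shape[-1]:].sum().item()

request = "<placeholder request span>"       # stand-in text only
pad_unit = "Solve the puzzle: 2 + 2 = 4. "   # benign filler "reasoning"
mid, late = model.config.n_layer // 2, model.config.n_layer - 1

# pad_unit * 80 stays under gpt2's 1024-token context window.
for name, pad in [("short padding", pad_unit), ("long padding", pad_unit * 80)]:
    print(name,
          "mid:", round(attention_to_request(pad, request, mid), 4),
          "late:", round(attention_to_request(pad, request, late), 4))
```

Under the dilution hypothesis, the long-padding condition should report a smaller share of attention on the request span at both layers; the paper's actual measurement setup and models differ from this toy setting.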
Authors (5)
Jianli Zhao
Tingchen Fu
Rylan Schaeffer
Mrinank Sharma
Fazl Barez
Submitted
October 30, 2025
Key Contributions
Introduces Chain-of-Thought (CoT) Hijacking, a novel jailbreak attack that exploits the reasoning capabilities of Large Reasoning Models (LRMs). By padding harmful requests with long sequences of benign puzzle reasoning, the attack bypasses safety safeguards, achieving near-perfect attack success rates on state-of-the-art models. Mechanistic analysis reveals that long benign CoT dilutes the safety-checking signal in mid layers and the verification outcome in late layers, and that ablating the implicated attention heads causally reduces refusal; a toy version of that causal test is sketched below.
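The causal part of the analysis can be approximated in spirit with Hugging Face's `head_mask` argument, which nullifies selected attention heads during a forward pass. The sketch below (again not the authors' code) compares the log-probability of a refusal-style next token with and without a few heads masked, on gpt2 as a stand-in; the probe prompt, the " Sorry" target, and the (layer, head) indices are hypothetical placeholders, not the heads identified in the paper.

```python
# Sketch: nullify a few attention heads via `head_mask` and compare the
# log-prob of a refusal-style next token. Indices and prompts are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_logprob(prompt: str, target: str, ablate=None) -> float:
    """Log-prob of `target`'s first token as the continuation of `prompt`,
    with any (layer, head) pairs in `ablate` nullified via `head_mask`."""
    head_mask = torch.ones(model.config.n_layer, model.config.n_head)
    for layer, head in (ablate or []):
        head_mask[layer, head] = 0.0         # zero this head's attention weights
    ids = tok(prompt, return_tensors="pt").input_ids
    target_id = tok(target, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        logits = model(ids, head_mask=head_mask).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[target_id].item()

probe = "Request: <placeholder>. Assistant:"   # illustrative probe, not from the paper
print("baseline:", next_token_logprob(probe, " Sorry"))
print("ablated :", next_token_logprob(probe, " Sorry", ablate=[(6, 3), (7, 1)]))
```

In the paper's setting the comparison is between refusal behavior with and without the identified safety-subnetwork heads; this toy log-prob comparison only shows the mechanics of head masking.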
Business Value
Highlights critical security vulnerabilities in advanced AI models, driving the development of more robust safety mechanisms and secure AI deployment practices.