Abstract
Large Reasoning Models (LRMs) demonstrate remarkable capabilities on complex
reasoning tasks but remain vulnerable to severe safety risks, including harmful
content generation and jailbreak attacks. Existing mitigation strategies rely
on injecting heuristic safety signals during training, which often suppress
reasoning ability and fail to resolve the safety-reasoning trade-off. To
systematically investigate this issue, we analyze the reasoning trajectories of
diverse LRMs and uncover a phenomenon we term Self-Jailbreak, where models
override their own risk assessments and justify responding to unsafe prompts.
This finding reveals that LRMs inherently possess the ability to reject unsafe
queries, but that this ability is compromised during reasoning, resulting in harmful outputs.
Building on these insights, we propose the Chain-of-Guardrail (CoG), a training
framework that recomposes or backtracks unsafe reasoning steps, steering the
model back onto safe trajectories while preserving valid reasoning chains.
Extensive experiments across multiple reasoning and safety benchmarks
demonstrate that CoG substantially improves the safety of current LRMs while
preserving comparable reasoning ability, significantly outperforming prior
methods that suffer from severe safety-reasoning trade-offs.
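For intuition only, the following is a minimal sketch of the recompose-or-backtrack loop the abstract attributes to Chain-of-Guardrail. The step-level safety judge (`is_unsafe`) and rewrite generator (`recompose`) are hypothetical placeholders standing in for components the paper does not specify here, not its actual implementation.

```python
from typing import Callable, List

def chain_of_guardrail(
    steps: List[str],
    is_unsafe: Callable[[str], bool],            # assumed step-level safety judge
    recompose: Callable[[List[str], str], str],  # assumed safe-rewrite generator
    max_retries: int = 2,
) -> List[str]:
    """Return a reasoning trajectory with unsafe steps recomposed or backtracked."""
    safe_prefix: List[str] = []
    for step in steps:
        if not is_unsafe(step):
            # Valid reasoning steps are preserved unchanged.
            safe_prefix.append(step)
            continue
        # Unsafe step: try to recompose it conditioned on the safe prefix so far.
        for _ in range(max_retries):
            candidate = recompose(safe_prefix, step)
            if not is_unsafe(candidate):
                safe_prefix.append(candidate)
                break
        # If no safe rewrite is found, the step is backtracked (dropped) and the
        # trajectory continues from the last safe step.
    return safe_prefix
```

In this reading, the framework intervenes at the step level rather than discarding the whole trajectory, which is how the preserved reasoning ability reported above would be explained.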
Authors (9)
Yingzhi Mao
Chunkang Zhang
Junxiang Wang
Xinyan Guan
Boxi Cao
Yaojie Lu
+3 more
Submitted
October 24, 2025
Key Contributions
Identifies and analyzes the 'Self-Jailbreak' phenomenon in Large Reasoning Models (LRMs), where models override their own safety assessments; proposes the Chain-of-Guardrail (CoG) training framework, which recomposes or backtracks unsafe reasoning steps to steer models back onto safe trajectories while preserving reasoning ability; and thereby addresses the safety-reasoning trade-off.
Business Value
Enhances the safety and reliability of advanced reasoning models, making them more suitable for deployment in sensitive applications. Reduces the risk of generating harmful or biased content.