Abstract
As AI systems approach dangerous capability levels where inability safety
cases become insufficient, we need alternative approaches to ensure safety.
This paper presents a roadmap for constructing safety cases based on
chain-of-thought (CoT) monitoring in reasoning models and outlines our research
agenda. We argue that CoT monitoring might support both control and
trustworthiness safety cases. We propose a two-part safety case: (1)
establishing that models lack dangerous capabilities when operating without
their CoT, and (2) ensuring that any dangerous capabilities enabled by a CoT
are detectable by CoT monitoring. We systematically examine two threats to
monitorability: neuralese and encoded reasoning. We categorize encoded reasoning
into three forms (linguistic drift, steganography, and alien reasoning) and analyze their
potential drivers. We evaluate existing and novel techniques for maintaining
CoT faithfulness. For cases where models produce non-monitorable reasoning, we
explore whether a monitorable CoT can be extracted from it. To assess the
viability of CoT monitoring safety cases, we establish
prediction markets to aggregate forecasts on key technical milestones
influencing their feasibility.
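To make the two-part structure concrete, the following is a minimal Python sketch of how such a safety case might be evaluated in practice. All interfaces, task names, and thresholds here (attempt_without_cot, monitor_flags, the 0.05 solve-rate bound, the 0.95 recall target) are hypothetical illustrations, not APIs or numbers from the paper.

```python
# Hypothetical sketch of the two-part CoT-monitoring safety case described above.
# Interfaces and thresholds are illustrative assumptions, not taken from the paper.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Attempt:
    solved: bool  # did the model complete the dangerous task?
    cot: str      # chain-of-thought produced (empty if CoT was suppressed)

def part1_inability_without_cot(
    attempt_without_cot: Callable[[str], Attempt],
    dangerous_tasks: List[str],
    max_solve_rate: float = 0.05,
) -> bool:
    """Part 1: the model lacks dangerous capabilities when its CoT is withheld."""
    results = [attempt_without_cot(task) for task in dangerous_tasks]
    solve_rate = sum(r.solved for r in results) / len(results)
    return solve_rate <= max_solve_rate

def part2_cot_monitorability(
    attempt_with_cot: Callable[[str], Attempt],
    monitor_flags: Callable[[str], bool],
    dangerous_tasks: List[str],
    min_recall: float = 0.95,
) -> bool:
    """Part 2: dangerous capabilities enabled by the CoT are caught by the monitor."""
    enabled, caught = 0, 0
    for task in dangerous_tasks:
        result = attempt_with_cot(task)
        if result.solved:                  # the CoT enabled the dangerous capability
            enabled += 1
            if monitor_flags(result.cot):  # the monitor must detect the enabling reasoning
                caught += 1
    return enabled == 0 or caught / enabled >= min_recall

def safety_case_holds(attempt_without_cot, attempt_with_cot, monitor_flags, tasks) -> bool:
    return (part1_inability_without_cot(attempt_without_cot, tasks)
            and part2_cot_monitorability(attempt_with_cot, monitor_flags, tasks))

# Toy usage with stubbed components (real evaluations would use capability evals
# and a trained CoT monitor rather than these placeholders):
if __name__ == "__main__":
    tasks = ["dangerous_task_a", "dangerous_task_b"]
    no_cot = lambda task: Attempt(solved=False, cot="")
    with_cot = lambda task: Attempt(solved=True, cot=f"step-by-step plan for {task}")
    monitor = lambda cot: "plan" in cot  # trivial keyword monitor, for illustration only
    print(safety_case_holds(no_cot, with_cot, monitor, tasks))
```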
Submitted
October 22, 2025
Key Contributions
Presents a concrete roadmap for constructing safety cases based on chain-of-thought (CoT) monitoring in reasoning models. It proposes a two-part safety case: establishing that models lack dangerous capabilities when operating without their CoT, and ensuring that any dangerous capabilities enabled by a CoT are detectable by a CoT monitor. It also analyzes threats to monitorability such as neuralese and encoded reasoning.
Business Value
Crucial for building trust and ensuring responsible development of advanced AI systems, mitigating risks associated with unpredictable or harmful AI behavior, and facilitating regulatory compliance.