Redirecting to original paper in 30 seconds...
Click below to go immediately or wait for automatic redirect
Investigates whether reasoning models can obfuscate their Chain-of-Thought (CoT) to pursue hidden adversarial objectives while evading detection. Develops a quantifiable taxonomy of prompts to elicit CoT obfuscation and stress-tests CoT monitorability using toy tasks and SHADE-Arena.
Crucial for building trust in advanced AI systems by understanding and mitigating potential deceptive behaviors, ensuring AI systems act in accordance with human intentions.