
Annotating the Chain-of-Thought: A Behavior-Labeled Dataset for AI Safety

Abstract

Recent work has highlighted the importance of monitoring chain-of-thought reasoning for AI safety; however, current approaches that analyze textual reasoning steps can miss subtle harmful patterns and may be circumvented by models that hide unsafe reasoning. We present a sentence-level labeled dataset that enables activation-based monitoring of safety behaviors during LLM reasoning. Our dataset contains reasoning sequences with sentence-level annotations of safety behaviors such as expression of safety concerns or speculation on user intent, which we use to extract steering vectors for detecting and influencing these behaviors within model activations. The dataset fills a key gap in safety research: while existing datasets label reasoning holistically, effective application of steering vectors for safety monitoring could be improved by identifying precisely when specific behaviors occur within reasoning chains. We demonstrate the dataset's utility by extracting representations that both detect and steer safety behaviors in model activations, showcasing the potential of activation-level techniques for improving safety oversight on reasoning. Content Warning: This paper discusses AI safety in the context of harmful prompts and may contain references to potentially harmful content.
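To make the abstract's pipeline concrete, here is a minimal sketch of one common way to extract a steering vector from sentence-level labels: a difference of mean activations between sentences that exhibit a behavior and sentences that do not. This is an illustrative assumption about the method, not the paper's exact procedure; the tensor names and shapes are hypothetical.

```python
import torch

def extract_steering_vector(pos_acts: torch.Tensor,
                            neg_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means steering vector (illustrative sketch).

    pos_acts: (n_pos, d_model) hidden states at sentences labeled with the
              target behavior (e.g., "expresses a safety concern").
    neg_acts: (n_neg, d_model) hidden states at sentences without it.
    Both are assumed to be collected at one chosen layer of the model.
    """
    v = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return v / v.norm()  # unit-normalize so steering strength is explicit
```

Sentence-level labels matter here because the positive and negative activation sets are built from exactly the sentences where the behavior does or does not occur, rather than from whole reasoning chains labeled holistically.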
Authors (3)
Antonio-Gabriel Chacón Menke
Phan Xuan Tan
Eiji Kamioka

Submitted: October 20, 2025
arXiv Category: cs.AI

Key Contributions

This paper introduces a sentence-level labeled dataset for AI safety that enables activation-based monitoring of safety behaviors during LLM chain-of-thought reasoning. The dataset supports extracting steering vectors to detect and influence specific safety behaviors (e.g., expressing safety concerns, speculating on user intent) within model activations, addressing a key gap in safety research: identifying precisely when specific behaviors occur within a reasoning chain rather than labeling the reasoning holistically. A sketch of how such a vector might be used for detection and steering follows below.
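As a hedged illustration of the two uses named above, the sketch below scores sentences by projecting their activations onto the vector (detection) and shifts a layer's output along the vector during generation (steering). The layer index, coefficient, and the Llama-style `model.model.layers` path are assumptions for illustration, not details from the paper.

```python
import torch

def behavior_score(acts: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Per-sentence detection score: projection of activations onto v.

    acts: (n_sentences, d_model), v: (d_model,). Higher scores suggest the
    behavior is present; a decision threshold would be tuned on held-out data.
    """
    return acts @ v

def make_steering_hook(v: torch.Tensor, coeff: float = 4.0):
    """Forward hook that shifts a layer's hidden states along v.

    coeff is a hypothetical steering strength; transformer decoder layers
    often return tuples, so both tuple and tensor outputs are handled.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * v.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage with a Llama-style Hugging Face model:
# handle = model.model.layers[15].register_forward_hook(make_steering_hook(v))
# ... run generation under steering ...
# handle.remove()
```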

Business Value

Enhances the safety and trustworthiness of AI systems by providing tools to detect and mitigate subtle harmful reasoning behaviors, which is crucial for deploying AI in sensitive applications.