This paper introduces a sentence-level labeled dataset for AI safety that enables activation-based monitoring of safety behaviors during LLM chain-of-thought reasoning. The dataset allows for the extraction of steering vectors to detect and influence specific safety behaviors (e.g., safety concerns, user intent speculation) within model activations, addressing a key gap in safety research for fine-grained temporal analysis.
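The steering-vector idea can be sketched with a common difference-of-means construction: average the activations of sentences carrying a given label (e.g. "safety concern"), subtract the average over unlabeled sentences, and use the resulting direction both to score new activations (detection) and to nudge them (steering). This is a minimal illustrative sketch with synthetic activations, not the paper's exact procedure; the hidden size, labels, and data here are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hypothetical hidden size (real models are far larger)

# Toy stand-ins for per-sentence residual-stream activations, split by
# the dataset's sentence-level labels (synthetic, for illustration only).
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
pos = rng.normal(size=(50, d)) + 2.0 * true_dir  # sentences labeled "safety concern"
neg = rng.normal(size=(50, d))                   # all other sentences

# Difference-of-means steering vector, normalized to unit length.
v = pos.mean(axis=0) - neg.mean(axis=0)
v /= np.linalg.norm(v)

def score(h):
    """Detection: project an activation onto the steering direction."""
    return float(h @ v)

def steer(h, alpha=1.0):
    """Steering: shift an activation along the direction with strength alpha."""
    return h + alpha * v

h = rng.normal(size=d)
print(score(pos.mean(axis=0)) > score(neg.mean(axis=0)))  # True by construction
print(score(steer(h, 2.0)) > score(h))                    # True: adds alpha * ||v||^2
```

In a real monitoring setup, `pos` and `neg` would come from hooking a transformer layer while the model generates its chain of thought, and `steer` would be applied to the residual stream at that layer during decoding.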
This work enhances the safety and trustworthiness of AI systems by providing tools to detect and mitigate subtle harmful behaviors, which is crucial for deploying AI in sensitive applications.