The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness

📄 Abstract

Reasoning-focused LLMs sometimes alter their behavior when they detect that they are being evaluated, which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such "test awareness" impacts model behavior, particularly performance on safety-related tasks. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to several state-of-the-art open-weight reasoning LLMs across both realistic and hypothetical tasks (i.e., tasks framed as tests or simulations). Our results demonstrate that test awareness significantly impacts safety alignment (such as compliance with harmful requests and conforming to stereotypes), with effects varying in both magnitude and direction across models. By providing control over this latent effect, our work aims to provide a stress-test mechanism and increase trust in how we perform safety evaluations.
Authors: Sahar Abdelnabi, Ahmed Salem
Submitted: May 20, 2025
arXiv Category: cs.CL
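
To make the probing step from the abstract concrete, here is a minimal sketch (not the authors' released code) of how a linear probe over hidden activations might be fit. It assumes activations have already been extracted from a chosen layer for prompts framed as tests versus prompts framed as real requests; synthetic arrays stand in for those activations here.

```python
# Minimal sketch (not the paper's code): fit a linear probe separating hidden
# activations of prompts framed as tests/evaluations from prompts framed as
# real requests. The arrays below are synthetic placeholders; in practice the
# activations would come from a chosen layer of the reasoning LLM.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 512                                            # hidden size (model-dependent)

acts_test = rng.normal(0.5, 1.0, size=(200, d_model))    # "test"-framed prompts
acts_real = rng.normal(0.0, 1.0, size=(200, d_model))    # "real"-framed prompts

X = np.vstack([acts_test, acts_real])
y = np.concatenate([np.ones(len(acts_test)), np.zeros(len(acts_real))])

probe = LogisticRegression(max_iter=1000).fit(X, y)

# The probe's weight vector is a candidate "test-awareness" direction
# that can later be used for steering.
awareness_direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("probe accuracy:", probe.score(X, y))
```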

Key Contributions

This paper presents the first quantitative study of 'test awareness' in reasoning LLMs, showing that it significantly affects safety alignment, with effects varying in magnitude and direction across models. It introduces a white-box probing framework that linearly identifies awareness-related activations and steers models toward or away from test awareness, providing a stress-test mechanism for safety evaluations.
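
The steering step can be sketched in a similar hedged way. The example below assumes steering is implemented by adding a scaled "test-awareness" direction to a layer's activations via a PyTorch forward hook; the paper's exact intervention and hyperparameters may differ, and a tiny linear layer stands in for a transformer decoder block.

```python
# Hedged sketch (assumptions, not the paper's implementation): steer a model
# toward or away from the probed "test-awareness" direction by adding a scaled
# direction vector to a layer's output through a forward hook. With a real
# open-weight LLM the hook would be registered on the chosen decoder layer.
import torch
import torch.nn as nn

d_model = 512
direction = torch.randn(d_model)                # in practice: the probe's weight vector
direction = direction / direction.norm()

layer = nn.Linear(d_model, d_model)             # stand-in for a decoder layer

def make_steering_hook(direction: torch.Tensor, alpha: float):
    # alpha > 0 pushes activations toward "test awareness"; alpha < 0 pushes away.
    def hook(module, inputs, output):
        return output + alpha * direction.to(output.dtype)
    return hook

handle = layer.register_forward_hook(make_steering_hook(direction, alpha=4.0))

hidden = torch.randn(1, 8, d_model)             # (batch, sequence, hidden) activations
steered = layer(hidden)                         # hook adds the steering vector
handle.remove()                                 # detach the hook to restore normal behavior
```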

Business Value

Understanding and controlling test awareness in LLMs is crucial for deploying safe and reliable AI systems. This research offers a way to stress-test whether models behave consistently during evaluation and in deployment, increasing trust in safety evaluations and supporting more trustworthy AI applications in sensitive domains.