This paper presents the first quantitative study of 'test awareness' in reasoning LLMs: a model's recognition that it is being evaluated, and the effect that recognition has on safety alignment and performance on safety-related tasks. It introduces a novel white-box probing framework that identifies this awareness in the model's internal representations and steers it, offering a method to improve model safety and reliability.
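To make the probing-and-steering idea concrete, here is a minimal Python sketch: a linear probe is trained to separate hidden states from test-framed versus deployment-framed prompts, and the probe's weight vector is then reused as a steering direction. The synthetic activations, the dimensionality, and the steering coefficient `alpha` are illustrative assumptions, not the paper's actual data or method.

```python
# Minimal sketch of linear probing + activation steering, assuming we
# already have hidden-state vectors extracted from a reasoning LLM.
# All data below is synthetic; sizes and coefficients are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d = 256   # hidden-state dimensionality (placeholder)
n = 400   # prompts per condition (placeholder)

# Stand-in activations: test-framed prompts shifted along a latent direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
deploy_acts = rng.normal(size=(n, d))
test_acts = rng.normal(size=(n, d)) + 2.0 * true_dir

X = np.vstack([deploy_acts, test_acts])
y = np.array([0] * n + [1] * n)   # 0 = deployment-framed, 1 = test-framed
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 1) Probe: a logistic-regression classifier over hidden states.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")

# 2) Steer: move activations along the probe's weight direction.
aware_dir = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

def steer(acts: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha * awareness-direction to each activation vector."""
    return acts + alpha * aware_dir

# Pushing test-framed activations against the direction makes them look
# deployment-like to the probe; in a white-box setting the analogous
# vector would be added to the residual stream during inference.
steered = steer(test_acts, alpha=-4.0)
print(f"mean P(test) before: {probe.predict_proba(test_acts)[:, 1].mean():.3f}")
print(f"mean P(test) after:  {probe.predict_proba(steered)[:, 1].mean():.3f}")
```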
Understanding and mitigating test awareness in LLMs is crucial for deploying safe and reliable AI systems. This research helps ensure that models behave consistently whether or not they detect an evaluation setting, leading to more trustworthy AI applications in sensitive domains.