Abstract
Probes trained on model activations can detect undesirable behaviors, such as deception or bias, that are difficult to identify from outputs alone. This makes them useful monitors for flagging misbehavior. They are also valuable training signals, since they can reward not only good outputs but also good internal processes for arriving at those outputs. However, training against interpretability tools raises a fundamental concern: when a monitor becomes a training target, it may cease to be reliable (Goodhart's Law). We propose two methods for training against probes, based on Supervised Fine-tuning and Direct Preference Optimization. We conduct an initial exploration of these methods in a testbed for reducing toxicity and evaluate how much probe accuracy drops when training against them. To retain the accuracy of probe detectors after training, we attempt (1) training against an ensemble of probes, (2) retaining held-out probes that are not used for training, and (3) retraining new probes after training.
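For concreteness, the following is a minimal sketch of the general setup the abstract describes: a linear probe over frozen model activations, and a probe score used to rank sampled completions into preference pairs for a DPO-style trainer. The probe architecture, layer choice, mean-pooling, and the pairing rule are illustrative assumptions, not the paper's exact implementation; `model` is assumed to be a HuggingFace-style causal LM that returns hidden states.

```python
# Illustrative sketch (assumptions: one linear probe on a single hidden layer,
# mean-pooled activations, probe score used to rank preference pairs).
import torch
import torch.nn as nn


class LinearProbe(nn.Module):
    """Logistic-regression-style probe over frozen model activations."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: (batch, seq_len, hidden_size) from a chosen layer
        pooled = activations.mean(dim=1)            # mean-pool over tokens
        return self.classifier(pooled).squeeze(-1)  # logit: higher = more toxic


def probe_score(model, probe, input_ids, attention_mask, layer: int = -1):
    """Score sequences with the probe; no gradients are needed for scoring."""
    with torch.no_grad():
        out = model(input_ids=input_ids,
                    attention_mask=attention_mask,
                    output_hidden_states=True)
        acts = out.hidden_states[layer]             # (batch, seq, hidden)
        return torch.sigmoid(probe(acts))           # probability of "toxic"


def build_preference_pairs(prompts, sampled_pairs, scores_a, scores_b):
    """Label the lower-scoring (less toxic, per the probe) completion as 'chosen'.
    The resulting (prompt, chosen, rejected) triples can be fed to a standard
    DPO trainer; this pairing rule is an assumption for illustration only."""
    pairs = []
    for prompt, (a, b), s_a, s_b in zip(prompts, sampled_pairs, scores_a, scores_b):
        chosen, rejected = (a, b) if s_a < s_b else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```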
We report two main findings. First, probe-based preference optimization unexpectedly preserves probe detectability better than classifier-based methods, suggesting the preference-learning objective incentivizes maintaining rather than obfuscating the relevant representations. Second, probe diversity provides minimal practical benefit: simply retraining probes after optimization recovers high detection accuracy. Our findings suggest probe-based training can be viable for certain alignment methods, though probe ensembles are largely unnecessary when retraining is feasible.
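The retraining result corresponds to a simple protocol: after fine-tuning, fit a fresh probe on activations from the updated model and compare its accuracy with probes that were held out during training. The sketch below assumes labelled activation/toxicity pairs have already been collected; the function name and interface are hypothetical.

```python
# Sketch of the post-training probe check (assumption: activations are
# pre-extracted as feature matrices with binary toxicity labels).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def retrain_and_evaluate(train_acts, train_labels, test_acts, test_labels,
                         held_out_probes=()):
    """Fit a fresh probe on activations from the fine-tuned model and compare
    its accuracy with probes held out from (never trained against during) tuning."""
    fresh = LogisticRegression(max_iter=1000).fit(train_acts, train_labels)
    results = {"retrained": accuracy_score(test_labels, fresh.predict(test_acts))}
    for i, probe in enumerate(held_out_probes):
        results[f"held_out_{i}"] = accuracy_score(test_labels, probe.predict(test_acts))
    return results
```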
Submitted October 24, 2025
Key Contributions
This paper proposes novel methods for training AI models against interpretability probes (e.g., for toxicity detection) using Supervised Fine-tuning and Direct Preference Optimization. It addresses the challenge of Goodhart's Law by employing techniques like ensemble probes and held-out probes to maintain the reliability of these detectors even when they become training targets.
Business Value
Enhances the safety and trustworthiness of AI systems, particularly large language models, by providing mechanisms to reduce harmful outputs like toxicity while maintaining the ability to monitor internal model behavior.