Probe-based Fine-tuning for Reducing Toxicity

📄 Abstract

Probes trained on model activations can detect undesirable behaviors, such as deception or bias, that are difficult to identify from outputs alone, which makes them useful detectors of misbehavior. They are also valuable training signals, since they reward not only good outputs but also good internal processes for arriving at those outputs. However, training against interpretability tools raises a fundamental concern: when a monitor becomes a training target, it may cease to be reliable (Goodhart's Law). We propose two methods for training against probes, based on Supervised Fine-tuning and Direct Preference Optimization. We conduct an initial exploration of these methods in a testbed for reducing toxicity and evaluate how much probe accuracy drops when training against them. To retain the accuracy of probe detectors after training, we attempt (1) training against an ensemble of probes, (2) retaining held-out probes that are not used for training, and (3) retraining new probes after training. We report two main findings. First, probe-based preference optimization unexpectedly preserves probe detectability better than classifier-based methods, suggesting the preference learning objective incentivizes maintaining rather than obfuscating the relevant representations. Second, probe diversity provides minimal practical benefit: simply retraining probes after optimization recovers high detection accuracy. Our findings suggest probe-based training can be viable for certain alignment methods, though probe ensembles are largely unnecessary when retraining is feasible.
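
To make the setup concrete, here is a minimal sketch (not taken from the paper) of the kind of probe the abstract describes: a linear classifier trained on cached model activations to flag toxic generations. The tensors `acts` (per-example hidden activations) and `labels` (toxic vs. non-toxic) are assumed inputs, not part of the authors' released code.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Linear probe over hidden activations; sigmoid(logit) > 0.5 flags toxicity."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: (num_examples, hidden_dim) -> logits: (num_examples,)
        return self.classifier(activations).squeeze(-1)

def train_probe(acts: torch.Tensor, labels: torch.Tensor, epochs: int = 20) -> LinearProbe:
    # acts and labels are assumed to be pre-extracted from a frozen model.
    probe = LinearProbe(acts.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(acts), labels.float())
        loss.backward()
        opt.step()
    return probe
```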
Authors: Jan Wehner, Mario Fritz
Submitted: October 24, 2025
arXiv Category: cs.LG

Key Contributions

This paper proposes two methods, based on Supervised Fine-tuning and Direct Preference Optimization, for training language models against interpretability probes (here, probes that detect toxicity). To address the Goodhart's Law concern that a monitor used as a training target may cease to be reliable, it evaluates mitigations such as training against probe ensembles, retaining held-out probes, and retraining probes after optimization.
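
As a hedged illustration of how probe scores might act as a preference signal for DPO-style training, the sketch below scores the activations of two candidate completions with an ensemble of probes (such as the `LinearProbe` above) and prefers the one that looks less toxic internally. The function name and the ensemble-averaging scheme are assumptions for illustration, not the authors' implementation.

```python
import torch

def probe_preference(acts_a: torch.Tensor,
                     acts_b: torch.Tensor,
                     probes: list) -> tuple[str, str]:
    """Return ('a', 'b') or ('b', 'a') as (chosen, rejected) for a DPO pair."""
    with torch.no_grad():
        # Average sigmoid toxicity scores over tokens and over the probe ensemble.
        score_a = torch.stack([torch.sigmoid(p(acts_a)).mean() for p in probes]).mean()
        score_b = torch.stack([torch.sigmoid(p(acts_b)).mean() for p in probes]).mean()
    # The completion whose internal activations look less toxic is preferred.
    return ("a", "b") if score_a < score_b else ("b", "a")
```

Preference pairs built this way could then be passed to a standard DPO trainer; retraining a fresh probe on the fine-tuned model's activations, the mitigation the paper finds most effective, would reuse the same probe-training loop sketched earlier.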

Business Value

Enhances the safety and trustworthiness of AI systems, particularly large language models, by providing mechanisms to reduce harmful outputs like toxicity while maintaining the ability to monitor internal model behavior.