Proposes modeling human beliefs about AI behavior as key to scalable oversight. The paper introduces formalisms for human belief models and a notion of 'belief model covering' to improve value learning from potentially flawed human feedback, even when evaluators misunderstand the AI's behavior.
Enables the development of more reliable and aligned AI systems by providing a framework for interpreting and leveraging human feedback, even in settings where the AI's behavior is too complex for evaluators to assess directly.
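To make the core idea concrete, here is a minimal sketch under assumptions of our own: linear utilities, scalar feedback instead of pairwise comparisons, and illustrative names (`believe`, `w_true`, etc.) that are not the paper's actual formalism. Feedback is generated from the evaluator's *believed* view of each policy's behavior, and a belief-aware learner recovers utility only along the directions that the belief model preserves:

```python
import numpy as np

rng = np.random.default_rng(0)

# Six candidate AI policies, each summarized by a 3-dim behavior vector.
true_features = rng.normal(size=(6, 3))

# Hypothetical human belief model: the evaluator misperceives behavior,
# here by being entirely blind to the third feature.
def believe(f):
    return f * np.array([1.0, 1.0, 0.0])

# Hidden human utility weights -- what value learning tries to recover.
w_true = np.array([1.0, -0.5, 2.0])

# Feedback is generated from BELIEVED behavior, not true behavior.
human_scores = believe(true_features) @ w_true

# Naive reading: treat the feedback as if it reflected true behavior,
# and pick the policy the evaluator rates highest.
naive_best = int(np.argmax(human_scores))

# Belief-aware reading: fit utility weights against believed features.
X = believe(true_features)
w_hat, *_ = np.linalg.lstsq(X, human_scores, rcond=None)

print("recovered weights:", np.round(w_hat, 3))   # ~[1.0, -0.5, 0.0]
print("evaluator's favorite policy:", naive_best)
print("its true utility:", true_features[naive_best] @ w_true)
print("best achievable true utility:", (true_features @ w_true).max())
```

The least-squares fit leaves the third weight at zero: the feedback carries no information about aspects of behavior the evaluator cannot perceive, which is precisely why value learning benefits from an explicit model of those beliefs rather than taking feedback at face value.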