Research paper · Intended audience: AI safety researchers, ML engineers working on alignment, AI ethicists, and researchers in human-AI interaction

Modeling Human Beliefs about AI Behavior for Scalable Oversight

πŸ“„ Abstract

As AI systems advance beyond human capabilities, scalable oversight becomes critical: how can we supervise AI that exceeds our abilities? A key challenge is that human evaluators may form incorrect beliefs about AI behavior in complex tasks, leading to unreliable feedback and poor value inference. To address this, we propose modeling evaluators' beliefs to interpret their feedback more reliably. We formalize human belief models, analyze their theoretical role in value learning, and characterize when ambiguity remains. To reduce reliance on precise belief models, we introduce "belief model covering" as a relaxation. This motivates our preliminary proposal to use the internal representations of adapted foundation models to mimic human evaluators' beliefs. These representations could be used to learn correct values from human feedback even when evaluators misunderstand the AI's behavior. Our work suggests that modeling human beliefs can improve value learning and outlines practical research directions for implementing this approach to scalable oversight.
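
A rough sketch of this setup (the notation here is ours and illustrative, not necessarily the paper's): the evaluator observes an AI behavior, forms a belief about which outcome it produced, and rates the behavior according to that belief, so feedback reflects believed rather than actual outcomes.

```latex
% Illustrative notation (ours, not necessarily the paper's):
% \tau ranges over AI behaviors in \mathcal{T}, o over outcomes in
% \mathcal{O}, and u is the human's utility function over outcomes.
% The evaluator's belief model maps each behavior to a distribution
% over the outcomes they think it produced:
\[
  B : \mathcal{T} \to \Delta(\mathcal{O})
\]
% Feedback is then generated from believed outcomes, not true ones:
\[
  f(\tau) = \mathbb{E}_{o \sim B(\tau)}\left[ u(o) \right]
\]
% When B is inaccurate, treating f(\tau) as the value of \tau's actual
% outcome distorts the inferred utility; an explicit model of B lets
% the learner invert this map and recover u from f.
```
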
Authors (2)
Leon Lang
Patrick ForrΓ©
Submitted
February 28, 2025
arXiv Category
cs.AI
Published In
Transactions on Machine Learning Research, Aug. 2025. https://openreview.net/forum?id=gSJfsdQnex

Key Contributions

Proposes modeling human evaluators' beliefs about AI behavior as a foundation for scalable oversight. Introduces a formalism for human belief models, analyzes their role in value learning, and defines "belief model covering" as a relaxation that reduces reliance on precisely specified beliefs, so that correct values can be learned from potentially flawed human feedback even when evaluators misunderstand the AI's behavior.
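
As a toy numerical illustration of this idea (entirely our own construction: the three-outcome task, the confusion-matrix belief model, and all names are hypothetical, and the paper's formalism is more general), feedback generated under a mistaken belief distorts naive value estimates, while inverting a model of that belief recovers the underlying values:

```python
# Toy sketch: an evaluator rates behaviors under mistaken beliefs about
# which outcome occurred; we compare naive value inference against
# inference that inverts a model of those beliefs.
import numpy as np

rng = np.random.default_rng(0)

true_values = np.array([1.0, 0.0, -1.0])  # the human's actual values u(o)

# Belief model: row i is the evaluator's belief distribution over
# outcomes when the true outcome is i (adjacent outcomes get confused).
belief = np.array([
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
])

# Observed feedback: the evaluator rates each behavior by the expected
# value under their (possibly wrong) belief, plus small rating noise.
feedback = belief @ true_values + rng.normal(0.0, 0.01, size=3)

# Naive inference takes feedback at face value as the outcome's value.
naive_estimate = feedback

# Belief-aware inference inverts the (here, known) belief model.
corrected_estimate = np.linalg.solve(belief, feedback)

print("true values:       ", true_values)
print("naive estimate:    ", naive_estimate.round(3))
print("corrected estimate:", corrected_estimate.round(3))
```

In the paper's proposal, such a belief model would not be hand-specified as above but approximated, for instance from the internal representations of an adapted foundation model, with "belief model covering" relaxing the requirement that the model be exactly right.
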

Business Value

Enables the development of more reliable, better-aligned AI systems by providing a framework for interpreting human feedback through explicit models of evaluator beliefs, so that feedback remains informative even when tasks are too complex for evaluators to judge directly.