Scaling Laws For Scalable Oversight

Abstract

Scalable oversight, the process by which weaker AI systems supervise stronger ones, has been proposed as a key strategy to control future superintelligent systems. However, it is still unclear how scalable oversight itself scales. To address this gap, we propose a framework that quantifies the probability of successful oversight as a function of the capabilities of the overseer and the system being overseen. Specifically, our framework models oversight as a game between capability-mismatched players; the players have oversight-specific Elo scores that are a piecewise-linear function of their general intelligence, with two plateaus corresponding to task incompetence and task saturation. We validate our framework with a modified version of the game Nim and then apply it to four oversight games: Mafia, Debate, Backdoor Code and Wargames. For each game, we find scaling laws that approximate how domain performance depends on general AI system capability. We then build on our findings in a theoretical study of Nested Scalable Oversight (NSO), a process in which trusted models oversee untrusted stronger models, which then become the trusted models in the next step. We identify conditions under which NSO succeeds and derive numerically (and in some cases analytically) the optimal number of oversight levels to maximize the probability of oversight success. We also apply our theory to our four oversight games, where we find that NSO success rates at a general Elo gap of 400 are 13.5% for Mafia, 51.7% for Debate, 10.0% for Backdoor Code, and 9.4% for Wargames; these rates decline further when overseeing stronger systems.
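The two ingredients named in the abstract, an Elo-based win probability and a piecewise-linear map from general to oversight-specific Elo, can be illustrated with a minimal Python sketch. It assumes the standard Elo expected-score formula with a 400-point scale. The function names and the parameters `low_knee`, `high_knee`, `floor_elo`, and `slope` are hypothetical and are not taken from the paper.

```python
def elo_win_probability(elo_a: float, elo_b: float) -> float:
    """Standard Elo expected score of player A against player B (assumed form)."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


def domain_elo(general_elo: float, low_knee: float, high_knee: float,
               floor_elo: float, slope: float) -> float:
    """Hypothetical piecewise-linear map from general Elo to domain Elo.

    Below low_knee the player is on the "task incompetence" plateau,
    above high_knee on the "task saturation" plateau, and in between
    the domain Elo grows linearly with the given slope.
    """
    clamped = min(max(general_elo, low_knee), high_knee)
    return floor_elo + slope * (clamped - low_knee)


if __name__ == "__main__":
    # Illustrative numbers only; they are not fitted values from the paper.
    guard = domain_elo(1000.0, low_knee=500.0, high_knee=1500.0,
                       floor_elo=200.0, slope=2.0)
    houdini = domain_elo(1400.0, low_knee=500.0, high_knee=1500.0,
                         floor_elo=200.0, slope=2.0)
    print(elo_win_probability(guard, houdini))
```

Clamping the general Elo before applying the slope produces both plateaus with a single linear expression, which keeps the sketch short.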
Authors: Joshua Engels, David D. Baek, Subhash Kantamneni, Max Tegmark
Submitted: April 25, 2025
arXiv category: cs.AI
Venue: NeurIPS 2025 (Spotlight)

Key Contributions

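The paper proposes a framework that quantifies the probability of successful scalable oversight by modeling oversight as a game between capability-mismatched AI systems. Each player has an oversight-specific Elo score that is a piecewise-linear function of its general intelligence, with plateaus at task incompetence and task saturation. The authors validate the framework on a modified version of Nim, fit scaling laws for four oversight games (Mafia, Debate, Backdoor Code, and Wargames), and use the results to study how many levels of Nested Scalable Oversight maximize the probability of oversight success.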
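To make the scaling question concrete, the nested oversight setting described in the abstract can be sketched as a chain of games whose success probabilities multiply. The Python sketch below is an illustration under stated assumptions, not the paper's model: it uses the standard Elo formula, equal-sized steps in general Elo, and purely linear domain Elo curves with no plateaus. The parameters `guard_slope`, `houdini_slope`, and `guard_offset` are hypothetical.

```python
def elo_win_probability(elo_a: float, elo_b: float) -> float:
    """Standard Elo expected score of player A against player B (assumed form)."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


def nso_success(gap: float, n_steps: int, guard_slope: float,
                houdini_slope: float, guard_offset: float) -> float:
    """Probability that all n_steps oversight games are won by the overseer.

    The general Elo gap is split into equal steps. At step k the overseer
    sits at k * step and the overseen system at (k + 1) * step; both are
    mapped to domain Elo by hypothetical linear curves.
    """
    step = gap / n_steps
    total = 1.0
    for k in range(n_steps):
        guard_domain = guard_offset + guard_slope * (k * step)
        houdini_domain = houdini_slope * ((k + 1) * step)
        total *= elo_win_probability(guard_domain, houdini_domain)
    return total


def best_num_steps(gap: float, max_steps: int, **curve_params: float) -> int:
    """Brute-force search for the step count that maximizes overall success."""
    return max(range(1, max_steps + 1),
               key=lambda n: nso_success(gap, n, **curve_params))


if __name__ == "__main__":
    # Illustrative parameters only; they are not fitted values from the paper.
    params = dict(guard_slope=1.0, houdini_slope=1.0, guard_offset=400.0)
    n = best_num_steps(400.0, 10, **params)
    print(n, nso_success(400.0, n, **params))
```

The sketch shows the trade-off behind an optimal number of levels: more steps make each individual game easier for the overseer, but every extra step is another chance for oversight to fail.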

Business Value

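The framework gives a quantitative basis for designing and evaluating AI safety mechanisms: it estimates how likely oversight is to succeed at a given capability gap between overseer and overseen system, which matters for the long-term development and deployment of advanced AI systems.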