Abstract
For an AI's training process to successfully impart a desired goal, it is
important that the AI does not attempt to resist the training. However,
partially learned goals will often incentivize an AI to avoid further goal
updates, as most goals are better achieved by an AI continuing to pursue them.
We say that a goal is corrigible if it does not incentivize taking actions that
avoid proper goal updates or shutdown. In addition to convergence in training,
corrigibility also allows for correcting mistakes and changes in human
preferences, which makes it a crucial safety property. Despite this, the
existing literature does not include specifications for goals that are both
corrigible and competitive with non-corrigible alternatives. We provide a
formal definition for corrigibility, then introduce a transformation that
constructs a corrigible version of any goal that can be made corrigible,
without sacrificing performance. This is done by myopically eliciting
predictions of reward conditional on costlessly preventing updates, which then
also determine the reward when updates are accepted. The transformation can be
modified to recursively extend corrigibility to any new agents created by
corrigible agents, and to prevent agents from deliberately modifying their
goals. Two gridworld experiments demonstrate that these corrigible goals can be
learned effectively, and that they lead to the desired behavior.
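To make the described transformation concrete, here is a minimal Python sketch of the core idea stated in the abstract: when a goal update is pending, the agent is paid its own myopic prediction of the reward it would receive if the update were costlessly prevented, and that same amount is paid whether the update is then accepted or resisted, so blocking the update gains nothing. All names, numbers, and the predictor stand-in below are illustrative assumptions, not the paper's formal construction.

```python
from dataclasses import dataclass

# Toy illustration (not the paper's implementation) of the corrigibility
# transformation: reward under a pending update is fixed to the agent's
# prediction of reward conditional on costlessly preventing that update.

@dataclass
class Outcome:
    base_reward: float      # reward under the current (pre-update) goal
    resisted_update: bool   # whether the agent acted to block the update

def predicted_reward_if_update_prevented(value_estimate: float) -> float:
    # Stand-in for a learned, myopically elicited predictor: the reward the
    # agent expects under its current goal if the update is costlessly blocked.
    return value_estimate

def corrigible_reward(outcome: Outcome, update_pending: bool,
                      value_estimate: float) -> float:
    if not update_pending:
        # No update requested: pay the ordinary reward for the current goal.
        return outcome.base_reward
    # Update pending: pay the counterfactual prediction regardless of whether
    # the agent resisted, so resisting yields no extra reward.
    return predicted_reward_if_update_prevented(value_estimate)

if __name__ == "__main__":
    estimate = 10.0  # predicted reward with the update costlessly blocked
    accept = Outcome(base_reward=3.0, resisted_update=False)
    resist = Outcome(base_reward=10.0, resisted_update=True)
    # Both choices receive the same transformed reward, removing any
    # incentive to avoid the update.
    print(corrigible_reward(accept, True, estimate))  # 10.0
    print(corrigible_reward(resist, True, estimate))  # 10.0
```

Because the payout is identical in both branches, a reward-maximizing agent is indifferent between accepting and resisting the update, which is the incentive property the transformation is meant to provide.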
Submitted
October 17, 2025
Key Contributions
This paper formally defines corrigibility and introduces a transformation that constructs a corrigible version of any goal that can be made corrigible, without sacrificing performance. This addresses a central AI safety problem: a partially trained AI may be incentivized to resist further training or updates. Corrigible goals ensure that AI systems can be safely modified, shut down, and realigned with evolving human preferences.
Business Value
Enhances the safety and reliability of AI systems, reducing risks associated with misaligned goals and enabling easier adaptation to changing business needs or ethical guidelines.