Abstract
For an AI's training process to successfully impart a desired goal, it is
important that the AI does not attempt to resist the training. However,
partially learned goals will often incentivize an AI to avoid further goal
updates, as most goals are better achieved by an AI continuing to pursue them.
We say that a goal is corrigible if it does not incentivize taking actions that
avoid proper goal updates or shutdown. In addition to convergence in training,
corrigibility also allows for correcting mistakes and changes in human
preferences, which makes it a crucial safety property. Despite this, the
existing literature does not include specifications for goals that are both
corrigible and competitive with non-corrigible alternatives. We provide a
formal definition for corrigibility, then introduce a transformation that
constructs a corrigible version of any goal that can be made corrigible,
without sacrificing performance. This is done by myopically eliciting
predictions of reward conditional on costlessly preventing updates, which then
also determine the reward when updates are accepted. The transformation can be
modified to recursively extend corrigibility to any new agents created by
corrigible agents, and to prevent agents from deliberately modifying their
goals. Two gridworld experiments demonstrate that these corrigible goals can be
learned effectively, and that they lead to the desired behavior.
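To make the described transformation concrete, here is a minimal Python sketch of the core idea stated in the abstract: when a goal update is pending, the agent is paid its own myopic prediction of the reward it would receive if the update were costlessly prevented, and that same amount is paid whether the update is then accepted or resisted, so blocking the update gains nothing. All names, numbers, and the predictor stand-in below are illustrative assumptions, not the paper's formal construction.

```python
from dataclasses import dataclass

# Toy illustration (not the paper's implementation) of the corrigibility
# transformation: reward under a pending update is fixed to the agent's
# prediction of reward conditional on costlessly preventing that update.

@dataclass
class Outcome:
    base_reward: float      # reward under the current (pre-update) goal
    resisted_update: bool   # whether the agent acted to block the update

def predicted_reward_if_update_prevented(value_estimate: float) -> float:
    # Stand-in for a learned, myopically elicited predictor: the reward the
    # agent expects under its current goal if the update is costlessly blocked.
    return value_estimate

def corrigible_reward(outcome: Outcome, update_pending: bool,
                      value_estimate: float) -> float:
    if not update_pending:
        # No update requested: pay the ordinary reward for the current goal.
        return outcome.base_reward
    # Update pending: pay the counterfactual prediction regardless of whether
    # the agent resisted, so resisting yields no extra reward.
    return predicted_reward_if_update_prevented(value_estimate)

if __name__ == "__main__":
    estimate = 10.0  # predicted reward with the update costlessly blocked
    accept = Outcome(base_reward=3.0, resisted_update=False)
    resist = Outcome(base_reward=10.0, resisted_update=True)
    # Both choices receive the same transformed reward, removing any
    # incentive to avoid the update.
    print(corrigible_reward(accept, True, estimate))  # 10.0
    print(corrigible_reward(resist, True, estimate))  # 10.0
```

Because the payout is identical in both branches, a reward-maximizing agent is indifferent between accepting and resisting the update, which is the incentive property the transformation is meant to provide.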
Submitted
October 17, 2025
Key Contributions
This paper formally defines corrigibility and introduces a transformation that constructs a corrigible version of any goal that can be made corrigible, without sacrificing performance. This addresses a central AI safety problem: a partially trained AI may be incentivized to resist further training or updates. Corrigible goals ensure that AI systems can be safely modified, shut down, and realigned with evolving human preferences.
Business Value
Enhances the safety and reliability of AI systems, reducing risks associated with misaligned goals and enabling easier adaptation to changing business needs or ethical guidelines.