📄 Abstract
As future superhuman models become increasingly complex, accurately
supervising their behavior may exceed human capabilities. Recent works have
demonstrated that in such scenarios, weak models can effectively supervise
strong models, a phenomenon known as weak-to-strong generalization. However, we
find that naive weak-to-strong generalization fails under distribution shifts,
often leading to worse performance of the strong model than its weak
supervisors. To address this, we propose RAVEN, a robust weak-to-strong
generalization framework that dynamically learns the optimal combinations of
weak models in addition to parameters of the strong model. We demonstrate the
effectiveness of RAVEN on image classification, text classification, and
preference alignment tasks. RAVEN outperforms alternative baselines by over 30%
on out-of-distribution tasks while matching or surpassing existing methods on
in-distribution tasks. Moreover, our results show that RAVEN assigns higher
weights to more accurate weak models, demonstrating its ability to
automatically identify trustworthy supervision.
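The core idea of learning weights over weak supervisors can be illustrated with a minimal sketch. This is not the RAVEN implementation; the function name, the logit values in `alpha`, and the toy soft labels are all hypothetical, and the sketch only shows how softmax-normalized per-model weights would blend the weak models' soft labels into a single supervision target.

```python
import numpy as np

def combine_weak_labels(weak_probs, alpha):
    """Blend soft labels from K weak models into one supervision target.

    weak_probs: array of shape (K, N, C) -- per-model class probabilities
                for N examples over C classes.
    alpha:      array of shape (K,) -- learnable logits; in a framework like
                RAVEN these would be updated jointly with the strong model.
    """
    # Softmax over the logits gives non-negative weights summing to 1.
    w = np.exp(alpha - alpha.max())
    w /= w.sum()
    # Weighted average of the weak models' probability vectors.
    return np.einsum("k,knc->nc", w, weak_probs)

# Toy example (hypothetical values): weak model 2 is more accurate, so a
# learned alpha favoring it pulls the combined target toward its labels.
weak_probs = np.array([
    [[0.6, 0.4], [0.5, 0.5]],   # weak model 1: less confident / less accurate
    [[0.9, 0.1], [0.2, 0.8]],   # weak model 2: more accurate
])
alpha = np.array([0.0, 1.0])    # assumed post-training logits, for illustration
target = combine_weak_labels(weak_probs, alpha)
```

The resulting `target` rows remain valid probability distributions and sit closer to the more heavily weighted model's predictions, mirroring the paper's observation that higher weights go to more trustworthy weak supervisors.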
Authors (4)
Myeongho Jeon
Jan Sobotka
Suhwan Choi
Maria Brbić
Submitted
October 24, 2025
Key Contributions
This paper introduces RAVEN, a robust weak-to-strong generalization framework that addresses the failure of naive methods under distribution shifts. RAVEN dynamically learns optimal combinations of weak models to supervise strong models, significantly outperforming baselines on out-of-distribution tasks across image classification, text classification, and preference alignment.
Business Value
Enables the training of more reliable and robust AI systems that can generalize better to unseen data, crucial for safety-critical applications and reducing costly failures in production.