Abstract
Weak-to-Strong Generalization (Burns et al., 2024) is the phenomenon whereby
a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and
ends up significantly outperforming the teacher. We show that this phenomenon
does not require a strong learner like GPT-4. We consider a student and a teacher
that are random feature models, described by two-layer networks with a random
and fixed bottom layer and a trained top layer. A "weak" teacher, with a small
number of units (i.e. random features), is trained on the population, and a
"strong" student, with a much larger number of units (i.e. random features), is
trained only on labels generated by the weak teacher. We demonstrate, prove,
and understand how the student can outperform the teacher, even though trained
only on data labeled by the teacher. We also explain how such weak-to-strong
generalization is enabled by early stopping. Importantly, we also show the
quantitative limits of weak-to-strong generalization in this model.
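The setup described above is concrete enough to simulate. Below is a minimal NumPy sketch of the pipeline: all sizes, the ReLU nonlinearity, and the linear ground-truth target are illustrative assumptions of ours, not the paper's exact configuration. Training the student's top layer by gradient descent while tracking its error against the true target (not the teacher) makes the early-stopping effect visible: the best intermediate iterate can be closer to the truth than either the teacher or the fully trained student, though the size of the gap depends on the chosen regime.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50                        # input dimension
m_weak, m_strong = 20, 2000   # random-feature counts: weak teacher vs. strong student
n = 500                       # teacher-labeled samples available to the student

def features(X, W):
    """Fixed random bottom layer with ReLU units: phi(x) = relu(W x)."""
    return np.maximum(X @ W.T, 0.0)

# Ground-truth target -- an illustrative choice, not the paper's exact setting.
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
f_star = lambda X: X @ w_star

# Weak teacher: top layer fit on a very large sample, standing in for the population.
W_t = rng.standard_normal((m_weak, d)) / np.sqrt(d)
X_pop = rng.standard_normal((100_000, d))
a_t, *_ = np.linalg.lstsq(features(X_pop, W_t), f_star(X_pop), rcond=None)
teacher = lambda X: features(X, W_t) @ a_t

# Strong student: many more random features, trained ONLY on teacher labels.
W_s = rng.standard_normal((m_strong, d)) / np.sqrt(d)
X_train = rng.standard_normal((n, d))
y_weak = teacher(X_train)               # labels come from the weak teacher
Phi = features(X_train, W_s)

# Held-out evaluation is against the TRUE target, not the teacher.
X_test = rng.standard_normal((5_000, d))
Phi_test = features(X_test, W_s)
y_true = f_star(X_test)
teacher_err = np.mean((teacher(X_test) - y_true) ** 2)

# Gradient descent on the student's top layer; early stopping keeps the best
# intermediate iterate as judged on held-out error against the true target.
lr = 0.5 * n / np.linalg.norm(Phi, 2) ** 2   # stable step size for this quadratic
a_s = np.zeros(m_strong)
best_err, best_step = np.inf, 0
for step in range(1, 2001):
    a_s -= lr * (2.0 / n) * Phi.T @ (Phi @ a_s - y_weak)
    if step % 20 == 0:
        err = np.mean((Phi_test @ a_s - y_true) ** 2)
        if err < best_err:
            best_err, best_step = err, step

final_err = np.mean((Phi_test @ a_s - y_true) ** 2)
print(f"teacher test error:                  {teacher_err:.4f}")
print(f"student, early-stopped (step {best_step}): {best_err:.4f}")
print(f"student, run to convergence:         {final_err:.4f}")
```

Run to convergence, the student simply reproduces the teacher's labels, errors included; stopping early filters out the teacher's idiosyncratic mistakes before they are fit, which is the mechanism the abstract points to.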
Authors (6)
Marko Medvedev
Kaifeng Lyu
Dingli Yu
Sanjeev Arora
Zhiyuan Li
Nathan Srebro
Key Contributions
This paper provides a theoretical explanation and proof of the weak-to-strong generalization phenomenon, demonstrating that it can occur even in simple random feature models. It shows how a student with many more random features, trained only on labels produced by a weaker teacher, can nonetheless outperform that teacher, and it highlights the roles of early stopping and overparameterization in making this possible.
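In symbols (our notation, chosen to mirror the description above rather than the paper's exact conventions), the claim compares population risks against the true target f*:

```latex
% Teacher and student are two-layer networks with random, fixed bottom-layer
% weights w_i, v_j and trained top layers a, b; the student is much wider.
f_T(x) = \sum_{i=1}^{m_w} a_i \,\sigma(\langle w_i, x\rangle), \qquad
f_S(x) = \sum_{j=1}^{m_s} b_j \,\sigma(\langle v_j, x\rangle), \qquad m_s \gg m_w.

% The teacher is fit on the population against the true target f^*; the student
% minimizes empirical risk against teacher labels, with early stopping:
a = \arg\min_a \ \mathbb{E}_x\!\left[(f_T(x) - f^*(x))^2\right], \qquad
b \approx \arg\min_b \ \tfrac{1}{n}\sum_{k=1}^{n}\bigl(f_S(x_k) - f_T(x_k)\bigr)^2.

% Weak-to-strong generalization: the student's risk against the TRUE target
% falls below the teacher's,
\mathbb{E}_x\!\left[(f_S(x) - f^*(x))^2\right] \;<\; \mathbb{E}_x\!\left[(f_T(x) - f^*(x))^2\right].
```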
Business Value
Offers insights into efficient knowledge transfer and model training strategies, potentially enabling strong, highly overparameterized models to be trained effectively from supervision generated by weaker, cheaper models, and clarifying when such weak supervision is (and is not) sufficient.