Weak-to-Strong Generalization Even in Random Feature Networks, Provably

📄 Abstract

Weak-to-Strong Generalization (Burns et al., 2024) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon does not require a strong learner like GPT-4. We consider student and teacher that are random feature models, described by two-layer networks with a random and fixed bottom layer and a trained top layer. A "weak" teacher, with a small number of units (i.e. random features), is trained on the population, and a "strong" student, with a much larger number of units (i.e. random features), is trained only on labels generated by the weak teacher. We demonstrate, prove, and understand how the student can outperform the teacher, even though trained only on data labeled by the teacher. We also explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model.
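To make the setup concrete, the following minimal sketch mirrors the construction described above: a weak teacher random feature model fit on a large sample standing in for the population, and a strong student with many more random features fit only on the teacher's labels. The target function, feature counts, sample sizes, and the small ridge penalty (used here as a crude stand-in for the regularization and early stopping the paper analyzes) are illustrative assumptions, not the paper's construction; whether the student actually beats the teacher depends on these quantities, in the regimes the paper characterizes.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20             # input dimension
n_pop = 20_000     # large sample standing in for "the population" the teacher sees
n_student = 4_000  # inputs that the teacher labels for the student
n_test = 5_000     # held-out sample used to estimate population error

def target(X):
    # illustrative ground-truth function (not from the paper)
    return np.sin(X @ np.ones(d) / np.sqrt(d))

def random_features(X, W):
    # two-layer network: random, fixed ReLU bottom layer; only the top layer is trained
    return np.maximum(X @ W, 0.0)

def fit_top_layer(Phi, y, reg=1e-3):
    # ridge fit of the top layer; the penalty is a crude stand-in for the
    # regularization / early stopping the paper analyzes
    return np.linalg.solve(Phi.T @ Phi + reg * np.eye(Phi.shape[1]), Phi.T @ y)

# "Weak" teacher: few random units, trained on (a proxy for) the population.
X_pop = rng.standard_normal((n_pop, d))
W_teacher = rng.standard_normal((d, 50))
a_teacher = fit_top_layer(random_features(X_pop, W_teacher), target(X_pop))

# "Strong" student: many more random units, trained ONLY on teacher-generated labels.
X_train = rng.standard_normal((n_student, d))
y_teacher = random_features(X_train, W_teacher) @ a_teacher
W_student = rng.standard_normal((d, 1_000))
a_student = fit_top_layer(random_features(X_train, W_student), y_teacher)

# Compare both models against the TRUE target on held-out data.
X_test = rng.standard_normal((n_test, d))
y_true = target(X_test)
err_teacher = np.mean((random_features(X_test, W_teacher) @ a_teacher - y_true) ** 2)
err_student = np.mean((random_features(X_test, W_student) @ a_student - y_true) ** 2)
print(f"teacher population error: {err_teacher:.4f}")
print(f"student population error: {err_student:.4f}")
```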
Authors (6): Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora, Zhiyuan Li, Nathan Srebro
Submitted: March 4, 2025
arXiv Category: cs.LG

Key Contributions

This paper gives a theoretical account, with proofs, of the weak-to-strong generalization phenomenon, demonstrating that it can occur even in simple random feature models: a student with many more random features, trained only on labels produced by a weaker teacher, can nonetheless outperform that teacher. The analysis highlights the role of early stopping and of the student's larger width, and also establishes quantitative limits on how large the weak-to-strong gain can be.
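As a rough illustration of the early-stopping mechanism mentioned above, the sketch below trains the student's top layer by plain gradient descent on teacher labels and records its error against the true target at every step, so the early-stopped iterate can be compared with the fully trained student and with the teacher. The problem sizes, step size, iteration budget, and the choice to fit the teacher on the same inputs (as a stand-in for population training) are placeholder assumptions, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_train, n_test = 20, 2_000, 5_000
X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))

def target(X):
    # illustrative ground-truth function
    return np.sin(X @ np.ones(d) / np.sqrt(d))

def features(X, W):
    # random, fixed ReLU bottom layer
    return np.maximum(X @ W, 0.0)

# Weak teacher (50 units), fit on the training inputs here as a crude stand-in
# for population training; the strong student has 1000 units.
W_t = rng.standard_normal((d, 50))
W_s = rng.standard_normal((d, 1_000))
a_t = np.linalg.lstsq(features(X_train, W_t), target(X_train), rcond=None)[0]
y_teacher = features(X_train, W_t) @ a_t        # the only labels the student sees

Phi_s = features(X_train, W_s)
Phi_s_test = features(X_test, W_s)
y_true = target(X_test)
teacher_err = np.mean((features(X_test, W_t) @ a_t - y_true) ** 2)

# Gradient descent on the student's top layer against teacher labels,
# tracking the error against the TRUE target at every step.
a = np.zeros(W_s.shape[1])
lr = n_train / np.linalg.norm(Phi_s, ord=2) ** 2   # conservative step size
errors = []
for _ in range(500):
    a -= lr * Phi_s.T @ (Phi_s @ a - y_teacher) / n_train
    errors.append(np.mean((Phi_s_test @ a - y_true) ** 2))

best = int(np.argmin(errors))
print(f"teacher error:                        {teacher_err:.4f}")
print(f"student, early-stopped (step {best}): {errors[best]:.4f}")
print(f"student, after all steps:             {errors[-1]:.4f}")
```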

Business Value

Offers insights into knowledge transfer and training strategies in which a large, capable model must learn from supervision produced by a smaller or weaker one, clarifying when such weakly supervised students can surpass their teachers and where the quantitative limits of that improvement lie.