arxiv_ml · Theoretical Research Paper · Machine Learning Theorists, Deep Learning Researchers, Students of ML Theory

Emergence and scaling laws in SGD learning of shallow neural networks


Abstract

We study the complexity of online stochastic gradient descent (SGD) for learning a two-layer neural network with $P$ neurons on isotropic Gaussian data: $f_*(\boldsymbol{x}) = \sum_{p=1}^P a_p\cdot \sigma(\langle\boldsymbol{x},\boldsymbol{v}_p^*\rangle)$, $\boldsymbol{x} \sim \mathcal{N}(0,\boldsymbol{I}_d)$, where the activation $\sigma:\mathbb{R}\to\mathbb{R}$ is an even function with information exponent $k_*>2$ (defined as the lowest degree in the Hermite expansion), $\{\boldsymbol{v}^*_p\}_{p\in[P]}\subset \mathbb{R}^d$ are orthonormal signal directions, and the non-negative second-layer coefficients satisfy $\sum_{p} a_p^2=1$. We focus on the challenging "extensive-width" regime $P\gg 1$ and permit a diverging condition number in the second layer, covering as a special case the power-law scaling $a_p\asymp p^{-\beta}$ where $\beta\in\mathbb{R}_{\ge 0}$. We provide a precise analysis of SGD dynamics for the training of a student two-layer network to minimize the mean squared error (MSE) objective, and explicitly identify sharp transition times to recover each signal direction. In the power-law setting, we characterize scaling law exponents for the MSE loss with respect to the number of training samples and SGD steps, as well as the number of parameters in the student neural network. Our analysis entails that while the learning of individual teacher neurons exhibits abrupt transitions, the juxtaposition of $P\gg 1$ emergent learning curves at different timescales leads to a smooth scaling law in the cumulative objective.
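The information exponent mentioned in the abstract can be checked numerically for a concrete activation. The sketch below is illustrative only and is not taken from the paper: it assumes the probabilists' Hermite basis $\mathrm{He}_k$, skips the constant term $k=0$, and uses $\mathrm{He}_4$ as a stand-in even activation. The helper names (`hermite_he`, `hermite_coefficient`, `information_exponent`) and the integration grid are hypothetical choices.

```python
import math

def hermite_he(k, x):
    """Probabilists' Hermite polynomial He_k(x), via He_{n+1} = x*He_n - n*He_{n-1}."""
    prev, curr = 1.0, x
    if k == 0:
        return prev
    for n in range(1, k):
        prev, curr = curr, x * curr - n * prev
    return curr

def hermite_coefficient(sigma, k, half_width=12.0, step=0.01):
    """Approximate c_k = E[sigma(z) He_k(z)] / k! for z ~ N(0, 1) with the trapezoid rule."""
    n = int(round(2 * half_width / step))
    total = 0.0
    for i in range(n + 1):
        z = -half_width + i * step
        weight = 0.5 if i in (0, n) else 1.0  # trapezoid end-point weights
        total += weight * sigma(z) * hermite_he(k, z) * math.exp(-0.5 * z * z)
    return total * step / math.sqrt(2 * math.pi) / math.factorial(k)

def information_exponent(sigma, max_degree=8, tol=1e-6):
    """Lowest degree k >= 1 whose Hermite coefficient exceeds tol in magnitude."""
    for k in range(1, max_degree + 1):
        if abs(hermite_coefficient(sigma, k)) > tol:
            return k
    return None  # nothing found up to max_degree

def he4(z):
    """Assumed example activation: He_4(z) = z^4 - 6 z^2 + 3 (even, so odd coefficients vanish)."""
    return z**4 - 6 * z**2 + 3

print(information_exponent(he4))  # prints 4
```

Here `he4` satisfies the abstract's condition $k_*>2$, whereas an activation such as $z^2-1$ would have exponent 2 and fall outside the setting studied.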

Key Contributions

This paper provides a precise analysis of Stochastic Gradient Descent (SGD) dynamics for learning a two-layer neural network in the extensive-width regime. It characterizes the emergence of solutions and derives scaling laws, offering theoretical insights into the learning process and complexity of shallow networks under specific data distributions and network configurations.
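The claim that many abrupt per-neuron transitions add up to a smooth loss curve can be pictured with a toy calculation. This is a hand-built cartoon under loud assumptions, not the paper's analysis: each direction $p$ is taken to contribute $a_p^2$ to the loss until a step $T_p$ and nothing afterwards, and the transition times `T_p = p ** gamma` are placeholders with a made-up exponent `gamma`. The paper derives the actual transition times and scaling exponents, which are not reproduced here.

```python
import math

def power_law_coefficients(P, beta):
    """Return a_p proportional to p^(-beta), normalised so that sum_p a_p^2 = 1."""
    raw = [p ** (-beta) for p in range(1, P + 1)]
    norm = math.sqrt(sum(r * r for r in raw))
    return [r / norm for r in raw]

def toy_loss(t, coeffs, transition_times):
    """Toy cumulative loss: direction p contributes a_p^2 while t < T_p, then 0."""
    return sum(a * a for a, T in zip(coeffs, transition_times) if t < T)

P, beta, gamma = 200, 1.0, 2.0  # gamma is a hypothetical toy exponent, not from the paper
coeffs = power_law_coefficients(P, beta)
transition_times = [p ** gamma for p in range(1, P + 1)]  # placeholder T_p values

steps = [1, 10, 100, 1000, 10000, 100000]
curve = [toy_loss(t, coeffs, transition_times) for t in steps]
for t, loss in zip(steps, curve):
    print(f"t={t:>6d}  toy loss={loss:.4f}")
```

Each individual term is a step function, yet the sum decays gradually because the steps occur at many different timescales. The decay rate in this toy depends entirely on the arbitrary choice of `gamma`.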

Business Value

Provides fundamental theoretical understanding that can inform the design and training of more efficient and effective neural network architectures in the future.

Paper Metadata

Innovation Type

Theoretical Analysis

Deployment Feasibility

Not directly applicable, as this is a theoretical study.

Limitations Addressed

Lack of precise theoretical understanding of SGD dynamics in wide networks, Complexity of analyzing learning in overparameterized regimes, Understanding emergence and scaling laws

Technical Tags

SGD learning, shallow neural networks, extensive-width regime, online learning, Hermite expansion, information exponent, isotropic Gaussian data, second-layer coefficients, mean squared error, scaling laws, emergence

Research Topics

Machine Learning Theory, Optimization Theory, Deep Learning Theory, Neural Network Dynamics, Statistical Learning Theory

Methods & Architectures

Stochastic Gradient Descent (SGD), Analysis of SGD dynamics, Hermite Expansion, Mean Squared Error minimization, Two-layer Neural Network, Shallow Neural Network

Applications & Tasks

Machine Learning Theory, Deep Learning Research, Understanding SGD dynamics, Learning complexity of shallow networks, Scaling laws in neural networks, Analyzing the training process of shallow neural networks, Characterizing the emergence of solutions, Deriving scaling laws for learning

Related Fields

Machine Learning Theory, Optimization, Statistical Physics, Information Theory

Keywords

SGD, neural networks, shallow networks, learning theory, optimization, scaling laws, emergence, statistical learning, dynamics, overparameterization, Hermite expansion, information exponent, two-layer network

Academic Context

#Machine Learning Theory, #Optimization Theory, #Deep Learning Theory, #Neural Network Dynamics, #Statistical Learning Theory

Commercial Potential

Competitive Edge

Contributes to the foundational understanding of deep learning, complementing empirical studies with rigorous theoretical analysis.

Market Opportunity

N/A

Revenue Models

N/A

Resource Requirements

Compute Needs

Minimal for theoretical analysis, potentially high for empirical validation if performed.

Data Requirements

Synthetic data (isotropic Gaussian) is used for analysis.
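For readers who want to experiment, the synthetic setting is easy to reproduce. The following is a minimal sketch under stated assumptions rather than the authors' code: the teacher directions are taken to be the first $P$ standard basis vectors (an orthonormal choice that loses no generality for isotropic Gaussian inputs), the activation is $\mathrm{He}_4$ as an example with information exponent above 2, and the function names, dimensions, and seed are hypothetical. Each call to `next` draws a fresh sample, matching the online (one-pass) setting.

```python
import math
import random

def he4(z):
    """Assumed example activation: He_4(z) = z^4 - 6 z^2 + 3 (even, information exponent 4)."""
    return z**4 - 6 * z**2 + 3

def power_law_coefficients(P, beta):
    """Return a_p proportional to p^(-beta), normalised so that sum_p a_p^2 = 1."""
    raw = [p ** (-beta) for p in range(1, P + 1)]
    norm = math.sqrt(sum(r * r for r in raw))
    return [r / norm for r in raw]

def stream_samples(d, coeffs, sigma, seed=0):
    """Yield fresh (x, y) pairs with x ~ N(0, I_d) and y = sum_p a_p * sigma(x[p]).

    Teacher direction p is the p-th standard basis vector (an assumption),
    so the inner product <x, v_p> is simply the coordinate x[p].
    """
    if len(coeffs) > d:
        raise ValueError("need P <= d for P orthonormal directions in R^d")
    rng = random.Random(seed)
    while True:
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = sum(a * sigma(x[p]) for p, a in enumerate(coeffs))
        yield x, y

coeffs = power_law_coefficients(P=8, beta=1.0)
stream = stream_samples(d=32, coeffs=coeffs, sigma=he4)
x, y = next(stream)  # one fresh sample per SGD step in the online setting
print(len(x), y)
```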

Deployment Constraints

N/A

Scalability

Focuses on the theoretical scaling properties of learning.

Regulatory Considerations

None.

Production Readiness

Maturity Level

Theoretical Foundation

Time to Market

N/A

Patent Potential

Low, as it's theoretical research.
