Abstract
We study the complexity of online stochastic gradient descent (SGD) for
learning a two-layer neural network with $P$ neurons on isotropic Gaussian
data: $f_*(\boldsymbol{x}) = \sum_{p=1}^P a_p\cdot
\sigma(\langle\boldsymbol{x},\boldsymbol{v}_p^*\rangle)$, $\boldsymbol{x} \sim
\mathcal{N}(0,\boldsymbol{I}_d)$, where the activation
$\sigma:\mathbb{R}\to\mathbb{R}$ is an even function with information exponent
$k_*>2$ (defined as the lowest degree in the Hermite expansion),
$\{\boldsymbol{v}^*_p\}_{p\in[P]}\subset \mathbb{R}^d$ are orthonormal signal
directions, and the non-negative second-layer coefficients satisfy $\sum_{p}
a_p^2=1$. We focus on the challenging ``extensive-width'' regime $P\gg 1$ and
permit a diverging condition number in the second layer, covering as a special
case the power-law scaling $a_p\asymp p^{-\beta}$ where
$\beta\in\mathbb{R}_{\ge 0}$. We provide a precise analysis of SGD dynamics for
the training of a student two-layer network to minimize the mean squared error
(MSE) objective, and explicitly identify sharp transition times to recover each
signal direction. In the power-law setting, we characterize scaling law
exponents for the MSE loss with respect to the number of training samples and
SGD steps, as well as the number of parameters in the student neural network.
Our analysis entails that while the learning of individual teacher neurons
exhibits abrupt transitions, the juxtaposition of $P\gg 1$ emergent learning
curves at different timescales leads to a smooth scaling law in the cumulative
objective.
Key Contributions
This paper provides a precise analysis of Stochastic Gradient Descent (SGD) dynamics for learning a two-layer neural network in the extensive-width regime. It identifies sharp transition times at which each teacher direction is recovered and derives scaling-law exponents for the MSE loss in terms of the number of training samples, SGD steps, and student parameters, offering theoretical insight into the learning process and complexity of shallow networks on isotropic Gaussian data.
Business Value
Provides fundamental theoretical understanding that can inform the design and training of more efficient and effective neural network architectures.