Abstract
At its core, machine learning seeks to train models that reliably generalize
beyond noisy observations; however, state-of-the-art universal approximation
theorems (UATs) operate in a theoretical vacuum that isolates them from this
goal: they assume noiseless data and allow network parameters to be chosen
freely, without regard to algorithmic realism. This paper bridges that
gap by introducing an architecture-specific randomized training algorithm that
constructs a uniform approximator from $N$ noisy training samples on the
$d$-dimensional cube $[0,1]^d$. Our trained neural networks attain the
minimax-optimal number of \textit{trainable} (non-random) parameters, up
to logarithmic factors that vanish under the idealized noiseless sampling
assumed in classical UATs.
Additionally, our trained models replicate key behaviours of real-world
neural networks, absent in standard UAT constructions, by: (1) exhibiting
sub-linear parametric complexity when fine-tuned on structurally related and
favourable out-of-distribution tasks, (2) exactly interpolating the training
data, and (3) maintaining reasonable Lipschitz regularity (after the initial
clustering attention layer). These properties bring state-of-the-art UATs
closer to practical machine learning, shifting the central open question from
algorithmic implementability with noisy samples to whether stochastic gradient
descent can achieve comparable guarantees.
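
To make the setting concrete, here is a minimal sketch, assuming the standard additive-noise nonparametric regression model the abstract describes: $N$ noisy samples of an unknown target on $[0,1]^d$, with uniform (sup-norm) error as the figure of merit. The target `f`, noise level `sigma`, and the estimator shown (frozen random ReLU features with a ridge-regularised linear readout, so only the readout weights are "trainable") are purely illustrative stand-ins and are not the paper's architecture-specific randomized construction.

```python
import numpy as np

# Assumed data model for illustration: y_i = f(x_i) + eps_i with x_i on [0,1]^d.
rng = np.random.default_rng(0)
N, d, sigma = 2000, 3, 0.1

def f(x):
    # Hypothetical smooth target, chosen only for this sketch.
    return np.sin(2 * np.pi * x[:, 0]) * np.cos(np.pi * x[:, 1]) + x[:, 2] ** 2

X = rng.uniform(0.0, 1.0, size=(N, d))        # N training inputs on the cube [0,1]^d
y = f(X) + sigma * rng.standard_normal(N)     # noisy observations of the target

# Stand-in estimator (NOT the paper's construction): random ReLU features with a
# ridge-regularised linear readout; only `theta` plays the role of trainable parameters.
m = 512                                        # number of frozen random features
W = rng.standard_normal((d, m))                # frozen random weights
b = rng.uniform(-1.0, 1.0, m)                  # frozen random biases
Phi = np.maximum(X @ W + b, 0.0)               # ReLU feature matrix of the training data
lam = 1e-3                                     # ridge regularisation strength
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)

# The abstract's guarantees concern uniform (sup-norm) error; here it is only
# approximated empirically on a random grid of test points.
X_test = rng.uniform(0.0, 1.0, size=(20000, d))
pred = np.maximum(X_test @ W + b, 0.0) @ theta
print("empirical sup-norm error:", np.max(np.abs(pred - f(X_test))))
```

In this toy setup the randomized features are fixed and only the linear readout is fit from the noisy data, which loosely mirrors the abstract's distinction between random and trainable parameters; the paper's actual algorithm, its interpolation property, and its minimax-optimal parameter counts are not reproduced here.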