Abstract: Diffusion probabilistic models have become a cornerstone of modern generative
AI, yet the mechanisms underlying their generalization remain poorly
understood. Indeed, if these models perfectly minimized their training
loss, they would only reproduce samples from their training set, i.e.,
memorize, as is empirically observed in the overparameterized regime. We revisit this
view by showing that, in highly overparameterized diffusion models,
generalization in natural data domains is progressively achieved during
training before the onset of memorization. Our results, spanning image and
language diffusion models, systematically support the empirical law that the
memorization time is proportional to the dataset size. Generalization vs.
memorization is then best understood as a competition between time scales. We
show that this phenomenology is recovered in diffusion models learning a simple
probabilistic context-free grammar with random rules, where generalization
corresponds to the hierarchical acquisition of deeper grammar rules as training
time grows, and the generalization cost of early stopping can be characterized.
We summarize these findings in a phase diagram. Overall, our results support
that a principled early-stopping criterion, scaling with the dataset size, can
effectively optimize generalization while avoiding memorization, with direct
implications for hyperparameter transfer and privacy-sensitive applications.
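
A compact way to read these claims (our notation, not the paper's): writing n for the training-set size, tau_gen for the training time at which generalization is achieved, tau_mem for the onset of memorization, and tau_stop for the chosen early-stopping time,

    \tau_{\mathrm{mem}}(n) \propto n, \qquad \tau_{\mathrm{gen}} \lesssim \tau_{\mathrm{stop}} < \tau_{\mathrm{mem}}(n),

so an early-stopping time selected in this window, and scaled linearly with n, preserves generalization while avoiding memorization.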