Abstract
The transformer architecture has become the foundation of modern Large
Language Models (LLMs), yet its theoretical properties are still not well
understood. As with classical neural networks, a common approach to improving these
models is to increase their size and depth. However, such strategies may be
suboptimal, as several works have shown that adding more layers yields
diminishing returns. More importantly, prior studies have found
that increasing depth may lead to model collapse, i.e., all the tokens converge
to a single cluster, undermining the ability of LLMs to generate diverse
outputs. Building on differential equation models of transformer dynamics,
we prove that all the tokens in a transformer asymptotically converge to a
cluster as depth increases. At the technical level, we leverage tools from
control theory, including consensus dynamics on manifolds and input-to-state
stability (ISS). We then extend our analysis to autoregressive models,
exploiting their structure to further generalize the theoretical guarantees.
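To make the clustering phenomenon concrete, the sketch below simulates a generic continuous-depth model of self-attention, in which token states evolve on the unit sphere and each token drifts toward a softmax-weighted average of the others. This is a toy illustration under our own assumptions (softmax attention with an inverse temperature `beta`, explicit Euler steps, and the helper names `attention_ode_step` and `max_pairwise_distance` are ours), not the paper's exact model or proof technique; it only shows numerically what "all tokens converge to a cluster as depth increases" looks like, with depth playing the role of integration time.

```python
import numpy as np

def project_tangent(x, v):
    """Project v onto the tangent space of the unit sphere at x."""
    return v - np.dot(v, x) * x

def attention_ode_step(X, beta, dt):
    """One explicit-Euler step of a toy self-attention ODE on the sphere.

    X    : (n, d) array of token states, each row on the unit sphere.
    beta : inverse temperature of the softmax attention weights (assumed).
    """
    # Softmax attention weights A[i, j] proportional to exp(beta * <x_i, x_j>).
    logits = beta * (X @ X.T)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)

    # Each token is pulled toward the attention-weighted average of the tokens,
    # projected onto the tangent space and renormalized to stay on the sphere.
    drift = A @ X
    X_new = np.empty_like(X)
    for i in range(X.shape[0]):
        x = X[i] + dt * project_tangent(X[i], drift[i])
        X_new[i] = x / np.linalg.norm(x)
    return X_new

def max_pairwise_distance(X):
    """Diameter of the token cloud; it shrinks toward 0 as tokens cluster."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diffs, axis=-1).max()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 32, 16
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)

    # "Depth" plays the role of time: iterating the ODE step corresponds to
    # stacking more and more layers.
    for depth in range(2001):
        if depth % 400 == 0:
            print(f"depth {depth:5d}: max pairwise distance = "
                  f"{max_pairwise_distance(X):.4f}")
        X = attention_ode_step(X, beta=1.0, dt=0.1)
```

In this toy run the maximum pairwise distance between token states decays toward zero as the number of iterations grows, which is the numerical counterpart of the asymptotic clustering statement; the paper's contribution is to establish such behavior rigorously via consensus dynamics on manifolds and ISS arguments rather than by simulation.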