Abstract
We study the problem of length generalization (LG) in transformers: the
ability of a model trained on shorter sequences to maintain performance when
evaluated on much longer, previously unseen inputs. Prior work by Huang et al.
(2025) established that transformers eventually achieve length generalization
once the training sequence length exceeds some finite threshold, but left open
the question of how large it must be. In this work, we provide the first
quantitative bounds on the required training length for length generalization
to occur. Motivated by previous empirical and theoretical work, we analyze LG
in several distinct problem settings: $\ell_\infty$ error control vs. average
error control over an input distribution, infinite-precision softmax attention
vs. finite-precision attention (which reduces to an argmax) in the transformer,
and one- vs. two-layer transformers. In all scenarios, we prove that LG occurs
when the internal behavior of the transformer on longer sequences can be
"simulated" by its behavior on shorter sequences seen during training. Our
bounds give qualitative estimates for the length of training data required for
a transformer to generalize, and we verify these insights empirically. These
results sharpen our theoretical understanding of the mechanisms underlying
extrapolation in transformers, and formalize the intuition that richer training
data is required for generalization on more complex tasks.
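The abstract contrasts infinite-precision softmax attention with finite-precision attention that collapses to an argmax, and $\ell_\infty$ versus average error control at test lengths longer than those seen in training. The sketch below is a minimal illustration of these ingredients on a synthetic task; the function names, the toy data, and the reference rule are hypothetical and do not reproduce the paper's construction or experiments.

```python
# Illustrative sketch (assumptions labeled): soft vs. hard (argmax) attention,
# evaluated with worst-case (l_inf) and average error at short and long lengths.
import numpy as np


def softmax_attention(q, K, V):
    """Standard softmax attention for a single query q against keys K, values V."""
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V


def argmax_attention(q, K, V):
    """Finite-precision limit: all attention mass on the highest-scoring position."""
    return V[np.argmax(K @ q)]


def eval_errors(attn_fn, seq_lens, d=8, trials=200, seed=0):
    """Return (l_inf error, average error) of attn_fn against a toy reference.

    Toy reference rule (an assumption, not from the paper): the value at the
    position whose key best matches the query, i.e. the argmax rule itself,
    so the gap measures how far soft attention is from its hard limit as
    sequence length grows.
    """
    rng = np.random.default_rng(seed)
    errs = []
    for n in seq_lens:
        for _ in range(trials):
            q = rng.normal(size=d)
            K = rng.normal(size=(n, d))
            V = rng.normal(size=(n, d))
            target = argmax_attention(q, K, V)
            pred = attn_fn(q, K, V)
            errs.append(np.linalg.norm(pred - target, ord=np.inf))
    errs = np.array(errs)
    return errs.max(), errs.mean()


if __name__ == "__main__":
    train_lens, test_lens = [8, 16], [128]  # short "training" vs. longer "test" lengths
    for lens, label in [(train_lens, "short"), (test_lens, "long")]:
        linf, avg = eval_errors(softmax_attention, lens)
        print(f"{label:>5} sequences: l_inf error = {linf:.3f}, avg error = {avg:.3f}")
```

Running the script typically shows both error metrics growing with sequence length, mirroring (in spirit only) why the distinction between worst-case and average error control, and between softmax and argmax attention, matters for length generalization analyses.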
Authors (3)
Zachary Izzo
Eshaan Nichani
Jason D. Lee
Submitted
October 30, 2025
Key Contributions
This paper provides the first quantitative bounds on the training sequence length required for transformers to achieve length generalization. It analyzes this phenomenon across several settings, including the type of error control, the precision of the attention mechanism, and transformer depth, offering theoretical insight into when transformers generalize to sequences longer than those seen during training.
Business Value
Understanding and improving the length generalization of transformers can yield more reliable and efficient NLP models that handle inputs of varying length without significant performance drops, which is crucial for applications where text length varies widely.