Abstract
We study the problem of length generalization (LG) in transformers: the
ability of a model trained on shorter sequences to maintain performance when
evaluated on much longer, previously unseen inputs. Prior work by Huang et al.
(2025) established that transformers eventually achieve length generalization
once the training sequence length exceeds some finite threshold, but left open
the question of how large it must be. In this work, we provide the first
quantitative bounds on the required training length for length generalization
to occur. Motivated by previous empirical and theoretical work, we analyze LG
in several distinct problem settings: $\ell_\infty$ error control vs. average
error control over an input distribution, infinite-precision softmax attention
vs. finite-precision attention (which reduces to an argmax) in the transformer,
and one- vs. two-layer transformers. In all scenarios, we prove that LG occurs
when the internal behavior of the transformer on longer sequences can be
"simulated" by its behavior on shorter sequences seen during training. Our
bounds give qualitative estimates for the length of training data required for
a transformer to generalize, and we verify these insights empirically. These
results sharpen our theoretical understanding of the mechanisms underlying
extrapolation in transformers, and formalize the intuition that richer training
data is required for generalization on more complex tasks.
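The abstract contrasts infinite-precision softmax attention with finite-precision attention that collapses to an argmax, and $\ell_\infty$ versus average error control at test lengths longer than those seen in training. The sketch below is a minimal illustration of these ingredients on a synthetic task; the function names, the toy data, and the reference rule are hypothetical and do not reproduce the paper's construction or experiments.

```python
# Illustrative sketch (assumptions labeled): soft vs. hard (argmax) attention,
# evaluated with worst-case (l_inf) and average error at short and long lengths.
import numpy as np


def softmax_attention(q, K, V):
    """Standard softmax attention for a single query q against keys K, values V."""
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V


def argmax_attention(q, K, V):
    """Finite-precision limit: all attention mass on the highest-scoring position."""
    return V[np.argmax(K @ q)]


def eval_errors(attn_fn, seq_lens, d=8, trials=200, seed=0):
    """Return (l_inf error, average error) of attn_fn against a toy reference.

    Toy reference rule (an assumption, not from the paper): the value at the
    position whose key best matches the query, i.e. the argmax rule itself,
    so the gap measures how far soft attention is from its hard limit as
    sequence length grows.
    """
    rng = np.random.default_rng(seed)
    errs = []
    for n in seq_lens:
        for _ in range(trials):
            q = rng.normal(size=d)
            K = rng.normal(size=(n, d))
            V = rng.normal(size=(n, d))
            target = argmax_attention(q, K, V)
            pred = attn_fn(q, K, V)
            errs.append(np.linalg.norm(pred - target, ord=np.inf))
    errs = np.array(errs)
    return errs.max(), errs.mean()


if __name__ == "__main__":
    train_lens, test_lens = [8, 16], [128]  # short "training" vs. longer "test" lengths
    for lens, label in [(train_lens, "short"), (test_lens, "long")]:
        linf, avg = eval_errors(softmax_attention, lens)
        print(f"{label:>5} sequences: l_inf error = {linf:.3f}, avg error = {avg:.3f}")
```

Running the script typically shows both error metrics growing with sequence length, mirroring (in spirit only) why the distinction between worst-case and average error control, and between softmax and argmax attention, matters for length generalization analyses.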
Authors (3)
Zachary Izzo
Eshaan Nichani
Jason D. Lee
Submitted
October 30, 2025
Key Contributions
This paper provides the first quantitative bounds on the training sequence length required for transformers to achieve length generalization. It analyzes this phenomenon across several settings, including the type of error control, the precision of the attention mechanism, and transformer depth, offering theoretical insight into when transformers generalize to sequences longer than those seen during training.
Business Value
Understanding and improving the length generalization of transformers can yield more reliable and efficient NLP models that handle inputs of varying length without significant performance drops, which is crucial for applications where text length varies widely.