Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models

Abstract

We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, with $M$ the sample size, depending only on the smoothness $\beta$ of the activation and crucially independent of the token count, the ambient dimension, and the rank of the weight matrix. These results highlight a fundamental dimension-free statistical efficiency of attention-style nonlocal models, even when the weight matrix and activation are not separately identifiable, and provide a theoretical understanding of the attention mechanism and its training.
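
To make the setting concrete, here is one plausible instantiation of the model the abstract describes; the output form $f_{A,\sigma}$, the squared-error risk, and the choice of norm are illustrative assumptions rather than the paper's verbatim definitions.

```latex
% Hedged sketch of the setup; the paper's exact model and loss may differ.
% Tokens x_1, ..., x_N in R^d interact through a weight matrix A and an
% activation sigma of (Hölder) smoothness beta:
\[
  f_{A,\sigma}(X)_i \;=\; \sum_{j=1}^{N} \sigma\!\bigl(x_i^{\top} A\, x_j\bigr)\, x_j,
  \qquad i = 1, \dots, N.
\]
% With M i.i.d. samples, the claimed minimax rate is the classical
% one-dimensional nonparametric rate,
\[
  \inf_{\widehat{f}} \, \sup_{(A,\sigma)} \;
  \mathbb{E}\,\bigl\|\widehat{f} - f_{A,\sigma}\bigr\|^{2}
  \;\asymp\; M^{-\frac{2\beta}{2\beta+1}},
\]
% which depends only on beta, and not on the token count N, the ambient
% dimension d, or the rank of A.
```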
Authors: Shai Zucker, Xiong Wang, Fei Lu, Inbar Seroussi
Submitted: October 13, 2025
arXiv Category: stat.ML

Key Contributions

This paper proves a dimension-free minimax rate for learning pairwise interactions in single-layer attention-style models. The rate depends only on the sample size and the smoothness of the activation, not on the token count, the ambient dimension, or the rank of the weight matrix, highlighting a fundamental statistical efficiency of attention mechanisms and offering a theoretical account of why attention models can be learned effectively. A minimal sketch of such a layer appears below.
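
As a concrete illustration, the following NumPy sketch shows a single-layer attention-style pairwise-interaction model consistent with the description above; the function name `pairwise_interaction`, the `tanh` activation, and the specific output form are assumptions for illustration, not the paper's construction.

```python
# Minimal sketch of a single-layer attention-style pairwise-interaction
# model, consistent with the abstract's description. The exact output form
# is an illustrative assumption, not the paper's verbatim definition.
import numpy as np

def pairwise_interaction(X: np.ndarray, A: np.ndarray,
                         sigma=np.tanh) -> np.ndarray:
    """Attention-style layer: tokens interact through a weight matrix A
    and a nonlinear activation sigma.

    X: (N, d) array of N tokens in R^d.
    A: (d, d) weight matrix (possibly low-rank).
    Returns: (N, d) array of transformed tokens.
    """
    scores = sigma(X @ A @ X.T)  # (N, N) pairwise interactions sigma(x_i^T A x_j)
    return scores @ X            # each output token is a sigma-weighted sum of tokens

# Toy usage: the learning problem is recovering this interaction map
# from M input/output samples of the layer.
rng = np.random.default_rng(0)
N, d = 8, 16                                  # token count and ambient dimension
X = rng.standard_normal((N, d))
A = rng.standard_normal((d, d)) / np.sqrt(d)  # illustrative weight matrix
Y = pairwise_interaction(X, A)
print(Y.shape)                                # (8, 16)
```

Under this reading, the dimension-free claim is that the statistical difficulty of recovering the interaction map is governed by the smoothness of sigma alone, not by N, d, or the rank of A.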

Business Value

Provides theoretical justification for the effectiveness of attention mechanisms, which can guide the design and training of more efficient and powerful deep learning models, particularly in NLP and potentially in other domains.