📄 Abstract
Self-attention (SA) has become the cornerstone of modern vision backbones for
its powerful expressivity over traditional Convolutions (Conv). However, its
quadratic complexity remains a critical bottleneck for practical applications.
Given that Conv offers linear complexity and strong visual priors, sustained
efforts have been made to revive Conv-based designs. Yet a persistent
performance gap remains, indicating that these modernizations
have not yet captured the intrinsic expressivity that defines SA. In this
paper, we re-examine the design of CNNs, guided by a key question: what
principles give SA its edge over Conv? Our analysis reveals two fundamental
insights that challenge long-standing design intuitions in prior research
(e.g., receptive field). The two findings are: (1) \textit{Adaptive routing}:
SA dynamically regulates positional information flow according to semantic
content, whereas Conv employs static kernels uniformly across all positions.
(2) \textit{Lateral inhibition}: SA induces score competition among token
weights, effectively suppressing redundancy and sharpening representations,
whereas Conv filters lack such inhibitory dynamics and exhibit considerable
redundancy. Based on these two insights, we propose \textit{Attentive Convolution} (ATConv),
a principled reformulation of the convolutional operator that intrinsically
injects these principles. Interestingly, with only $3\times3$ kernels, ATConv
consistently outperforms various SA mechanisms in fundamental vision tasks.
Building on ATConv, we introduce AttNet, a CNN family that attains
\textbf{84.4\%} ImageNet-1K Top-1 accuracy with only 27M parameters. In
diffusion-based image generation, replacing all SA with the proposed $3\times3$
ATConv in SiT-XL/2 reduces ImageNet FID by 0.15 in 400k steps with faster
sampling. Code is available at: github.com/price112/Attentive-Convolution.
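To make the two principles concrete, here is a minimal PyTorch sketch (ours, not the paper's ATConv; the `AdaptiveRoutedConv` name and its 1×1 `router` head are illustrative assumptions): a depthwise 3×3 convolution whose tap weights are modulated per position from the content (adaptive routing) and normalized with a softmax so that the nine taps compete (lateral inhibition).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveRoutedConv(nn.Module):
    """Illustrative sketch (not the paper's ATConv): a depthwise 3x3 conv
    whose per-position tap weights are predicted from content (adaptive
    routing) and softmax-normalized so the taps compete (lateral inhibition)."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        self.pad = kernel_size // 2
        # Static per-channel kernel, as in an ordinary depthwise convolution.
        self.weight = nn.Parameter(torch.randn(channels, kernel_size ** 2) * 0.02)
        # Lightweight head predicting one routing logit per tap per position.
        self.router = nn.Conv2d(channels, kernel_size ** 2, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # (1) Adaptive routing: content-dependent logits, one per kernel tap.
        logits = self.router(x)                              # (B, k*k, H, W)
        # (2) Lateral inhibition: softmax makes the k*k taps compete.
        routing = logits.softmax(dim=1)                      # (B, k*k, H, W)
        # Unfold local k x k neighborhoods: (B, C, k*k, H*W).
        patches = F.unfold(x, self.k, padding=self.pad)
        patches = patches.view(b, c, self.k ** 2, h * w)
        # Combine the static kernel with the per-position routing weights.
        kernel = (self.weight.view(1, c, self.k ** 2, 1)
                  * routing.view(b, 1, self.k ** 2, h * w))
        out = (patches * kernel).sum(dim=2)                  # (B, C, H*W)
        return out.view(b, c, h, w)
```

For example, `AdaptiveRoutedConv(64)` applied to a `(2, 64, 32, 32)` tensor returns a tensor of the same shape. Unlike SA, the cost stays linear in the number of positions, since each output mixes only its 3×3 neighborhood even though the mixing weights are content-dependent.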
Authors (7)
Hao Yu
Haoyu Chen
Yan Jiang
Wei Peng
Zhaodong Sun
Samuel Kaski
+1 more
Submitted
October 23, 2025
Key Contributions
Re-examines the design of Convolutional Neural Networks (CNNs) by identifying key principles that give Self-Attention (SA) its expressive power. It reveals that SA dynamically routes positional information based on semantic content, unlike static CNN kernels, challenging long-standing CNN design intuitions and paving the way for more powerful hybrid architectures.
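As a rough formalization of this contrast (notation ours, not drawn from the paper): a convolution reuses one learned kernel at every position, whereas self-attention recomputes its mixing weights per position from the content itself, with a softmax that forces candidate positions to compete.

```latex
% Static convolution: the same kernel w is applied at every position p;
% \Omega is the fixed offset set (e.g., the 3x3 neighborhood).
y_p = \sum_{k \in \Omega} w_k \, x_{p+k}

% Self-attention: weights depend on the content at p and j, and the softmax
% over j induces competition (lateral inhibition) among positions.
y_p = \sum_{j} \mathrm{softmax}_j\!\left(\frac{(W_Q x_p)^{\top} W_K x_j}{\sqrt{d}}\right) W_V x_j
```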
Business Value
Leads to more efficient and powerful computer vision models, enabling faster and more accurate AI applications in areas like autonomous systems, medical imaging analysis, and content moderation.