Attentive Convolution: Unifying the Expressivity of Self-Attention with Convolutional Efficiency

📄 Abstract

Self-attention (SA) has become the cornerstone of modern vision backbones thanks to its expressivity advantage over traditional convolution (Conv). However, its quadratic complexity remains a critical bottleneck for practical applications. Given that Conv offers linear complexity and strong visual priors, continued efforts have been made to promote a renaissance of Conv. Yet a persistent performance gap remains, indicating that these modernizations have not captured the intrinsic expressivity that defines SA. In this paper, we re-examine the design of CNNs, guided by a key question: what principles give SA its edge over Conv? We reveal two fundamental insights that challenge long-standing design intuitions in prior research (e.g., receptive field size). The two findings are: (1) Adaptive routing: SA dynamically regulates positional information flow according to semantic content, whereas Conv applies static kernels uniformly across all positions. (2) Lateral inhibition: SA induces score competition among token weights, effectively suppressing redundancy and sharpening representations, whereas Conv filters lack such inhibitory dynamics and exhibit considerable redundancy. Based on these insights, we propose Attentive Convolution (ATConv), a principled reformulation of the convolutional operator that intrinsically injects both principles. Notably, with only 3×3 kernels, ATConv consistently outperforms various SA mechanisms on fundamental vision tasks. Building on ATConv, we introduce AttNet, a CNN family that attains 84.4% ImageNet-1K Top-1 accuracy with only 27M parameters. In diffusion-based image generation, replacing all SA with the proposed 3×3 ATConv in SiT-XL/2 reduces ImageNet FID by 0.15 at 400k steps, with faster sampling. Code is available at: github.com/price112/Attentive-Convolution.
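The two principles lend themselves to a compact sketch. Below is a minimal, self-contained PyTorch illustration; it is not the authors' ATConv: the module name AdaptiveInhibitedConv, the 1×1 kernel generator, and the depthwise-style formulation are all assumptions made for exposition. It predicts a 3×3 kernel at every spatial position from the local content (adaptive routing) and normalizes the nine taps with a softmax so they compete (lateral inhibition), while keeping complexity linear in the number of pixels.

```python
# Illustrative sketch of the abstract's two principles: adaptive routing
# and lateral inhibition. NOT the paper's ATConv; the generator design
# and softmax normalization are assumptions for exposition only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveInhibitedConv(nn.Module):
    """3x3 depthwise-style conv whose kernel is predicted per position."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        # Lightweight generator: maps each position's features to a
        # 3x3 = 9-tap kernel per channel (adaptive routing), instead of
        # one static kernel shared by all positions.
        self.kernel_gen = nn.Conv2d(channels, channels * kernel_size**2, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        k2 = self.k * self.k
        # Predict a kernel at every position, then softmax over the nine
        # taps so the weights compete (lateral inhibition).
        kernels = self.kernel_gen(x).view(B, C, k2, H, W)
        kernels = kernels.softmax(dim=2)
        # Unfold gathers each 3x3 neighborhood; cost stays linear in H*W,
        # unlike self-attention's quadratic token-to-token interactions.
        patches = F.unfold(x, self.k, padding=self.k // 2)  # (B, C*k2, H*W)
        patches = patches.view(B, C, k2, H, W)
        # Weighted sum of each neighborhood with its position-specific kernel.
        return (kernels * patches).sum(dim=2)

if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    y = AdaptiveInhibitedConv(64)(x)
    print(y.shape)  # torch.Size([2, 64, 32, 32])
```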
Authors (7)
Hao Yu
Haoyu Chen
Yan Jiang
Wei Peng
Zhaodong Sun
Samuel Kaski
+1 more
Submitted
October 23, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

Re-examines CNN design by identifying the principles that give Self-Attention (SA) its expressive edge over convolution: SA dynamically routes positional information according to semantic content (adaptive routing) and induces competition among token weights (lateral inhibition), whereas standard convolutions apply static, redundancy-prone kernels at every position. These findings challenge long-standing CNN design intuitions and motivate ATConv, a reformulated convolutional operator that injects both principles while retaining linear complexity.
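As a back-of-envelope illustration of the efficiency claim (generic complexity facts, not figures from the paper): for N = H×W spatial tokens with channel dimension d, global self-attention builds an N×N affinity matrix, while a k×k convolution mixes only k² neighbors per position.

```latex
% Generic complexity comparison, not taken from the paper's tables:
% self-attention is quadratic in token count, convolution is linear.
\mathrm{cost}_{\mathrm{SA}}(N, d) = O(N^2 d),
\qquad
\mathrm{cost}_{\mathrm{Conv}}(N, d, k) = O(N k^2 d).
% E.g., on a 56x56 feature map (N = 3136) with k = 3, the N vs. k^2
% factor differs by 3136 / 9 \approx 348x per token.
```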

Business Value

Leads to more efficient and powerful computer vision models, enabling faster and more accurate AI applications in areas such as autonomous systems, medical imaging, and content moderation.