📄 Abstract
Self-attention (SA) has become the cornerstone of modern vision backbones for
its powerful expressivity over traditional Convolutions (Conv). However, its
quadratic complexity remains a critical bottleneck for practical applications.
Given that Conv offers linear complexity and strong visual priors, sustained
efforts have been made to revive Conv-based designs. Yet a persistent
performance gap remains, indicating that these modernizations
have not yet captured the intrinsic expressivity that defines SA. In this
paper, we re-examine the design of CNNs, guided by a key question: what
principles give SA its edge over Conv? Our analysis reveals two fundamental
insights that challenge long-standing design intuitions in prior research
(e.g., receptive field). The two findings are: (1) \textit{Adaptive routing}:
SA dynamically regulates positional information flow according to semantic
content, whereas Conv employs static kernels uniformly across all positions.
(2) \textit{Lateral inhibition}: SA induces score competition among token
weights, effectively suppressing redundancy and sharpening representations,
whereas Conv filters lack such inhibitory dynamics and exhibit considerable
redundancy. Based on these two insights, we propose \textit{Attentive Convolution} (ATConv),
a principled reformulation of the convolutional operator that intrinsically
injects these principles. Interestingly, with only $3\times3$ kernels, ATConv
consistently outperforms various SA mechanisms in fundamental vision tasks.
Building on ATConv, we introduce AttNet, a CNN family that attains
\textbf{84.4\%} ImageNet-1K Top-1 accuracy with only 27M parameters. In
diffusion-based image generation, replacing all SA with the proposed $3\times3$
ATConv in SiT-XL/2 reduces ImageNet FID by 0.15 in 400k steps with faster
sampling. Code is available at: github.com/price112/Attentive-Convolution.
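To make the two principles concrete, here is a minimal PyTorch sketch (ours, not the paper's ATConv; the `AdaptiveRoutedConv` name and its 1×1 `router` head are illustrative assumptions): a depthwise 3×3 convolution whose tap weights are modulated per position from the content (adaptive routing) and normalized with a softmax so that the nine taps compete (lateral inhibition).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveRoutedConv(nn.Module):
    """Illustrative sketch (not the paper's ATConv): a depthwise 3x3 conv
    whose per-position tap weights are predicted from content (adaptive
    routing) and softmax-normalized so the taps compete (lateral inhibition)."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        self.pad = kernel_size // 2
        # Static per-channel kernel, as in an ordinary depthwise convolution.
        self.weight = nn.Parameter(torch.randn(channels, kernel_size ** 2) * 0.02)
        # Lightweight head predicting one routing logit per tap per position.
        self.router = nn.Conv2d(channels, kernel_size ** 2, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # (1) Adaptive routing: content-dependent logits, one per kernel tap.
        logits = self.router(x)                              # (B, k*k, H, W)
        # (2) Lateral inhibition: softmax makes the k*k taps compete.
        routing = logits.softmax(dim=1)                      # (B, k*k, H, W)
        # Unfold local k x k neighborhoods: (B, C, k*k, H*W).
        patches = F.unfold(x, self.k, padding=self.pad)
        patches = patches.view(b, c, self.k ** 2, h * w)
        # Combine the static kernel with the per-position routing weights.
        kernel = (self.weight.view(1, c, self.k ** 2, 1)
                  * routing.view(b, 1, self.k ** 2, h * w))
        out = (patches * kernel).sum(dim=2)                  # (B, C, H*W)
        return out.view(b, c, h, w)
```

For example, `AdaptiveRoutedConv(64)` applied to a `(2, 64, 32, 32)` tensor returns a tensor of the same shape. Unlike SA, the cost stays linear in the number of positions, since each output mixes only its 3×3 neighborhood even though the mixing weights are content-dependent.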
Authors (7)
Hao Yu
Haoyu Chen
Yan Jiang
Wei Peng
Zhaodong Sun
Samuel Kaski
+1 more
Submitted
October 23, 2025
Key Contributions
Re-examines the design of Convolutional Neural Networks (CNNs) by identifying key principles that give Self-Attention (SA) its expressive power. It reveals that SA dynamically routes positional information based on semantic content, unlike static CNN kernels, challenging long-standing CNN design intuitions and paving the way for more powerful hybrid architectures.
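As a rough formalization of this contrast (notation ours, not drawn from the paper): a convolution reuses one learned kernel at every position, whereas self-attention recomputes its mixing weights per position from the content itself, with a softmax that forces candidate positions to compete.

```latex
% Static convolution: the same kernel w is applied at every position p;
% \Omega is the fixed offset set (e.g., the 3x3 neighborhood).
y_p = \sum_{k \in \Omega} w_k \, x_{p+k}

% Self-attention: weights depend on the content at p and j, and the softmax
% over j induces competition (lateral inhibition) among positions.
y_p = \sum_{j} \mathrm{softmax}_j\!\left(\frac{(W_Q x_p)^{\top} W_K x_j}{\sqrt{d}}\right) W_V x_j
```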
Business Value
Leads to more efficient and powerful computer vision models, enabling faster and more accurate AI applications in areas like autonomous systems, medical imaging analysis, and content moderation.