📄 Abstract
Kolmogorov-Arnold networks (KANs) are a remarkable innovation built on
learnable activation functions, with the potential to capture more complex
relationships from data. Presently, KANs are deployed by replacing multilayer
perceptrons (MLPs) in deep networks, including advanced architectures such as
vision Transformers (ViTs). This work asks whether KANs can learn token
interactions. In this paper, we design the first learnable attention mechanism,
Kolmogorov-Arnold Attention (KArAt), for ViTs; it can operate on any basis,
from Fourier and wavelets to splines and rational functions. However,
learnable activations in attention cause a memory explosion. To remedy
this, we propose a modular version of KArAt that uses a low-rank approximation.
Adopting the Fourier basis, Fourier-KArAt and its variants in some cases
outperform their traditional softmax counterparts, and in others show
comparable performance, on CIFAR-10, CIFAR-100, and ImageNet-1K. We also deploy
Fourier-KArAt in ConViT and Swin Transformer, and use it for detection and
segmentation with ViT-Det. We dissect the performance of these architectures by
analyzing their loss landscapes, weight distributions, optimizer paths,
attention visualizations, and transferability to other datasets. KArAt's
learnable activation yields better attention scores across all ViTs, indicating
improved token-to-token interactions that contribute to better inference.
Still, its generalizability does not scale with larger ViTs. However, many
factors, including the present computing interface, affect the relative
performance of parameter- and memory-heavy KArAts. We note that the goal of
this paper is not to produce efficient attention or to challenge traditional
activations; by designing KArAt, we are the first to show that attention can be
learned, and we encourage researchers to explore KArAt in conjunction with more
advanced architectures.
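
To make the core idea concrete, below is a minimal sketch of what a single attention head with a learnable Fourier-basis activation and a low-rank "modular" projection could look like. This is not the authors' released implementation: the class name FourierKArAtHead, the hyperparameters num_frequencies and rank, and the exact placement of the down/up projection are illustrative assumptions based only on the abstract.

```python
import torch
import torch.nn as nn


class FourierKArAtHead(nn.Module):
    """Single attention head whose softmax is replaced by a learnable
    Fourier-series activation, with a low-rank projection to keep the
    learnable operator small (a sketch of the "modular" idea)."""

    def __init__(self, dim: int, num_tokens: int, num_frequencies: int = 3, rank: int = 16):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.scale = dim ** -0.5
        # Learnable Fourier coefficients a_k, b_k for
        # phi(s) = sum_k a_k * cos(k * s) + b_k * sin(k * s)
        self.a = nn.Parameter(0.01 * torch.randn(num_frequencies))
        self.b = nn.Parameter(0.01 * torch.randn(num_frequencies))
        self.register_buffer("freqs", torch.arange(1, num_frequencies + 1).float())
        # Low-rank "modular" operator: activate in r dimensions, then map back to N.
        self.down = nn.Linear(num_tokens, rank, bias=False)
        self.up = nn.Linear(rank, num_tokens, bias=False)

    def fourier_activation(self, s: torch.Tensor) -> torch.Tensor:
        # Element-wise learnable activation phi applied to every score.
        s = s.unsqueeze(-1)  # add a trailing axis so the frequencies broadcast
        return (self.a * torch.cos(self.freqs * s) + self.b * torch.sin(self.freqs * s)).sum(-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = (q @ k.transpose(-2, -1)) * self.scale   # (batch, N, N) pre-activation scores
        low = self.down(scores)                           # (batch, N, r) low-rank projection
        attn = self.up(self.fourier_activation(low))      # learnable activation, back to (batch, N, N)
        return attn @ v                                    # token mixing with the learned attention


# Tiny smoke test on random tokens (shapes only; no claim about accuracy).
head = FourierKArAtHead(dim=64, num_tokens=197)
out = head(torch.randn(2, 197, 64))
print(out.shape)  # torch.Size([2, 197, 64])
```

In a full ViT, a unit like this would replace the softmax step in every attention head, which is where the memory cost mentioned in the abstract arises: each head carries its own learnable operator over the N x N score matrix unless a low-rank variant is used.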
Authors (4)
Subhajit Maity
Killian Hitsman
Xin Li
Aritra Dutta
Key Contributions
This paper introduces Kolmogorov-Arnold Attention (KArAt), a novel learnable attention mechanism for Vision Transformers (ViTs) that can operate on various bases. To address the resulting memory explosion, a modular version using a low-rank approximation is proposed. The work explores whether learnable activations can improve how ViTs learn token interactions, potentially outperforming traditional softmax attention.
Business Value
Enhanced performance in computer vision tasks could lead to more accurate image recognition systems for applications like autonomous driving, medical imaging analysis, and content moderation.