Accelerating Vision Transformers with Adaptive Patch Sizes

📄 Abstract

Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance, and can be applied to a previously fine-tuned ViT, converging in as little as 1 epoch. It also significantly reduces training and inference time without loss of performance in high-resolution dense visual tasks, achieving up to 30% faster training and inference in visual QA, object detection, and semantic segmentation.
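To make the sequence-length problem concrete, the short sketch below counts the tokens a uniform patch grid produces at a few resolutions. The 16-pixel patch size and the resolutions are illustrative assumptions, not values taken from the paper, and the helper name `uniform_token_count` is hypothetical.

```python
def uniform_token_count(height: int, width: int, patch: int = 16) -> int:
    """Number of patch tokens from a uniform grid (class token not counted).

    Assumes both image sides are exact multiples of the patch size.
    """
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    return (height // patch) * (width // patch)


# Doubling the side length quadruples the token count.
for side in (224, 448, 896):
    print(side, uniform_token_count(side, side))
# 224 196
# 448 784
# 896 3136
```

Because self-attention cost grows with the square of the sequence length, this quadratic growth in tokens is what makes high-resolution inputs expensive for a standard ViT.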

Authors (6)

Rohan Choudhury

JungEun Kim

Jinhyung Park

Eunho Yang

László A. Jeni

Kris M. Kitani

Submitted

October 20, 2025

arXiv Category

cs.CV

Key Contributions

Adaptive Patch Transformers (APT) accelerate Vision Transformers (ViTs) by choosing patch sizes according to image content: larger patches in homogeneous regions and smaller ones in complex areas. This reduces the number of input tokens, raising training and inference throughput by 40% on ViT-L and 50% on ViT-H without compromising downstream task performance.
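The idea of content-aware patch sizes can be sketched with a simple quadtree split. This is a minimal illustration, not the authors' method: the summary does not say how APT measures homogeneity, so the pixel-variance test, the `threshold` value, the 16/64-pixel patch sizes, and the function name `adaptive_patches` are all assumptions. How larger patches are embedded into tokens is also left out.

```python
import numpy as np


def adaptive_patches(image, min_patch=16, max_patch=64, threshold=1e-3):
    """Return (row, col, size) patches covering a 2-D grayscale image.

    Hypothetical rule: keep a block whole if its pixel variance is at most
    `threshold`; otherwise split it into four quadrants, down to `min_patch`.
    Assumes max_patch is min_patch times a power of two.
    """
    height, width = image.shape
    if height % max_patch or width % max_patch:
        raise ValueError("image sides must be divisible by max_patch")
    patches = []

    def split(row, col, size):
        block = image[row:row + size, col:col + size]
        if size == min_patch or block.var() <= threshold:
            patches.append((row, col, size))
            return
        half = size // 2
        for d_row in (0, half):
            for d_col in (0, half):
                split(row + d_row, col + d_col, half)

    for row in range(0, height, max_patch):
        for col in range(0, width, max_patch):
            split(row, col, max_patch)
    return patches


# Toy image: one textured quadrant (pixel checkerboard), three flat quadrants.
image = np.zeros((128, 128))
image[:64, :64] = np.indices((64, 64)).sum(axis=0) % 2

patches = adaptive_patches(image)
print(len(patches))  # 19 patches, versus 64 for a uniform 16-pixel grid
```

In this toy case the textured quadrant is covered by sixteen 16-pixel patches while each flat quadrant becomes a single 64-pixel patch, which is the token reduction the summary describes.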

Business Value

Enables faster and more cost-effective deployment of powerful Vision Transformer models, particularly for high-resolution image analysis tasks, making advanced computer vision applications more practical.

Paper Metadata

Innovation Type

Algorithmic Modification

Deployment Feasibility

High. The method is applied during training/inference and can be retrofitted to existing fine-tuned ViTs with minimal effort.

Limitations Addressed

Long sequence lengths and computational burden of ViTs on high-resolution images; inefficiency of uniform patch sizes; slow training and inference times

Performance Gains

40% throughput increase on ViT-L, 50% on ViT-H, up to 30% faster training/inference in dense visual tasks.
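A throughput gain does not translate one-to-one into time saved. The sketch below converts the reported throughput increases into the corresponding reduction in time per image, assuming throughput means images processed per second on fixed hardware; the helper name `time_reduction` is hypothetical.

```python
def time_reduction(throughput_gain: float) -> float:
    """Fractional drop in time per image for a given fractional throughput gain.

    Time per image is the reciprocal of throughput, so a gain g scales the
    time by 1 / (1 + g).
    """
    return 1.0 - 1.0 / (1.0 + throughput_gain)


for model, gain in (("ViT-L", 0.40), ("ViT-H", 0.50)):
    print(f"{model}: {time_reduction(gain):.1%} less time per image")
# ViT-L: 28.6% less time per image
# ViT-H: 33.3% less time per image
```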

Technical Tags

vision transformers, adaptive patch sizes, image processing, computational efficiency, inference speed, training speed, token reduction, high-resolution images, object detection, semantic segmentation

Research Topics

Computer Vision, Deep Learning, Efficient AI, Image Analysis, Transformer Models

Methods & Architectures

Adaptive Patch Transformers (APT), variable patch sizing, content-aware patching, Vision Transformer (ViT)

Applications & Tasks

Image recognition, computer vision tasks, medical imaging, autonomous driving

Long input sequence lengths in ViTs for high-resolution images, computational inefficiency of uniform patch sizes, slow training and inference times in ViTs

Accelerating ViT training and inference, reducing computational cost of ViTs, improving performance on high-resolution image tasks

Datasets & Benchmarks

Benchmarks

ViT-L throughput increase: 40% • ViT-H throughput increase: 50% • Up to 30% faster training/inference in dense visual tasks

Related Fields

Machine Learning, Deep Learning, Image Processing, Computer Vision

Keywords

vision transformers, adaptive patch size, ViT, image processing, computational efficiency, inference speed, training speed, tokenization, high resolution, object detection, semantic segmentation, transformer architecture

Academic Context

#Computer Vision, #Deep Learning, #Efficient AI, #Image Analysis, #Transformer Models

Commercial Potential

Potential Products

Faster computer vision models for real-time applications, efficient image analysis tools for high-resolution data, optimized ViT libraries

Target Industries

Technology, Healthcare (Medical Imaging), Automotive (Autonomous Driving), Security (Surveillance)

Use Case Examples

Real-time object detection in high-resolution surveillance footage, accelerated medical image analysis for diagnosis, faster visual processing for autonomous vehicles

Competitive Edge

Provides a significant speedup for Vision Transformers by addressing the fundamental inefficiency of uniform patch sizes, offering a practical way to improve performance without architectural overhauls.

Market Opportunity

Large and growing market for efficient computer vision solutions.

Revenue Models

Licensing of the APT technology, integration services, development of optimized vision model libraries.

Resource Requirements

Compute Needs

Reduced compared to standard ViTs due to efficiency gains.

Data Requirements

Standard image datasets used for training and evaluating vision models.

Deployment Constraints

Compatibility with existing ViT implementations, potential need for fine-tuning.

Scalability

Improves scalability by reducing computational requirements for high-resolution images.

Regulatory Considerations

None directly mentioned, but applications in regulated fields (e.g., medical imaging) would require compliance.

Production Readiness

Maturity Level

Research

Time to Market

1-2 years (for integration into existing frameworks and products)

Patent Potential

Moderate (novel adaptive patching mechanism)
