
Elastic ViTs from Pretrained Models without Retraining

Abstract

Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes, forcing sub-optimal deployment choices under real-world constraints. We introduce SnapViT: Single-shot network approximation for pruned Vision Transformers, a new post-pretraining structured pruning method that enables elastic inference across a continuum of compute budgets. Our approach efficiently combines gradient information with cross-network structure correlations, approximated via an evolutionary algorithm; it requires no labeled data, generalizes to models without a classification head, and is retraining-free. Experiments on DINO, SigLIPv2, DeiT, and AugReg models demonstrate superior performance over state-of-the-art methods across various sparsities, requiring less than five minutes on a single A100 GPU to generate elastic models that can be adjusted to any computational budget. Our key contributions include an efficient pruning strategy for pretrained Vision Transformers, a novel evolutionary approximation of Hessian off-diagonal structures, and a self-supervised importance scoring mechanism that maintains strong performance without requiring retraining or labels. Code and pruned models are available at: https://elastic.ashita.nl/
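The gradient-based part of such an importance score can be illustrated with a first-order Taylor saliency, a standard proxy in the structured-pruning literature. This is a minimal sketch under that assumption, not SnapViT's actual algorithm (the paper additionally incorporates evolutionarily approximated Hessian off-diagonal correlations); the function names, head count, and sparsity value are all hypothetical.

```python
import numpy as np

def taylor_importance(weights, grads):
    """First-order Taylor saliency per structure (e.g. attention head):
    score_i = |sum_j w_ij * g_ij|, a common gradient-based proxy for the
    loss change incurred by removing structure i."""
    return np.abs((weights * grads).sum(axis=1))

def prune_mask(scores, sparsity):
    """Keep the top-(1 - sparsity) fraction of structures by score."""
    k = max(1, int(round(len(scores) * (1.0 - sparsity))))
    keep = np.argsort(scores)[::-1][:k]
    mask = np.zeros(len(scores), dtype=bool)
    mask[keep] = True
    return mask

rng = np.random.default_rng(0)
W = rng.normal(size=(12, 64))   # 12 hypothetical attention heads, flattened weights
G = rng.normal(size=(12, 64))   # gradients, e.g. from a self-supervised loss
scores = taylor_importance(W, G)
mask = prune_mask(scores, sparsity=0.5)   # prune half the heads
print(mask.sum())                          # 6 heads kept
```

Because the mask is recomputed from scores alone, the same scores can serve any target sparsity, which is the sense in which a single scoring pass yields an "elastic" family of models.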
Authors (5)
Walter Simoncini
Michael Dorkenwald
Tijmen Blankevoort
Cees G. M. Snoek
Yuki M. Asano
Submitted
October 20, 2025
arXiv Category
cs.CV

Key Contributions

Introduces SnapViT, a retraining-free structured pruning method for Vision Transformers that enables elastic inference across a range of compute budgets. It efficiently combines gradient information with cross-network structure correlations via an evolutionary algorithm, achieving superior performance over state-of-the-art methods with minimal computational cost.
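To build intuition for how an evolutionary algorithm can stand in for intractable cross-structure interactions (the Hessian off-diagonals mentioned above), here is a toy (1+1)-style search over binary keep-masks at a fixed structure budget. The fitness function, the interaction matrix `C`, the budget, and all names are illustrative assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 16                        # hypothetical number of prunable structures
scores = rng.random(n)        # per-structure importance (stand-in values)
C = rng.normal(size=(n, n)) * 0.1
C = (C + C.T) / 2             # symmetric stand-in for cross-structure interactions

def fitness(mask):
    """Importance retained plus pairwise interaction among kept structures."""
    return scores @ mask + mask @ C @ mask

def evolve(budget=8, pop=20, gens=50):
    """(1+1)-style evolutionary search over binary keep-masks at a fixed budget."""
    def random_mask():
        m = np.zeros(n)
        m[rng.choice(n, budget, replace=False)] = 1
        return m

    best = max((random_mask() for _ in range(pop)), key=fitness)
    for _ in range(gens):
        child = best.copy()
        # Mutation: swap one kept structure for one pruned structure,
        # so the compute budget is preserved exactly.
        on, off = np.flatnonzero(child == 1), np.flatnonzero(child == 0)
        child[rng.choice(on)] = 0
        child[rng.choice(off)] = 1
        if fitness(child) >= fitness(best):
            best = child
    return best

mask = evolve()
print(int(mask.sum()))   # 8 structures kept
```

Searching over masks directly lets the fitness account for interactions between structures that per-structure scores alone would miss, which is the role the evolutionary approximation plays in the method described above.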

Business Value

Allows for flexible deployment of powerful vision models on devices with varying computational capabilities, reducing costs and expanding accessibility.