Abstract
Foundation models have advanced computer vision by enabling strong
performance across diverse tasks through large-scale pretraining and supervised
fine-tuning. However, they may underperform in domains with distribution shifts
and scarce labels, where supervised fine-tuning may be infeasible. While
continued self-supervised learning for model adaptation is common for
generative language models, this strategy has not proven effective for
vision-centric encoder models. To address this challenge, we introduce a novel
formulation of self-supervised fine-tuning for vision foundation models, where
the model is adapted to a new domain without requiring annotations, leveraging
only short multi-view object-centric videos. Our method is referred to as
VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual
foundation models. VESSA's training technique is based on a self-distillation
paradigm, where it is critical to carefully tune prediction heads and deploy
parameter-efficient adaptation techniques; otherwise, the model may quickly
forget its pretrained knowledge and reach a degraded state. VESSA benefits
significantly from multi-view object observations sourced from different frames
in an object-centric video, efficiently learning robustness to varied capture
conditions, without the need for annotations. Through comprehensive experiments
with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent
improvements in downstream classification tasks, compared to the base models
and previous adaptation methods. Code is publicly available at
https://github.com/jesimonbarreto/VESSA.
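To make the adaptation recipe described above concrete, the sketch below illustrates one self-distillation step on two frames drawn from the same object-centric video: a student encoder and prediction head are trained to match an EMA-updated teacher that sees a different view of the same object. This is a minimal illustration assuming a PyTorch setup; the module names, head dimensions, temperatures, and momentum value are assumptions for exposition, not the released VESSA implementation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Small MLP prediction head placed on top of the encoder (sizes are illustrative)."""
    def __init__(self, in_dim: int, out_dim: int = 4096, hidden: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, x):
        return self.mlp(x)

def self_distillation_loss(student_logits, teacher_logits, t_s=0.1, t_t=0.04):
    """Cross-entropy between the sharpened teacher distribution (no grad) and the student."""
    teacher_probs = F.softmax(teacher_logits / t_t, dim=-1).detach()
    log_student = F.log_softmax(student_logits / t_s, dim=-1)
    return -(teacher_probs * log_student).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.996):
    """Teacher weights follow an exponential moving average of the student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def adaptation_step(student_enc, teacher_enc, student_head, teacher_head,
                    frame_a, frame_b, optimizer):
    """One self-supervised step on two frames (views) of the same object."""
    with torch.no_grad():
        t_out = teacher_head(teacher_enc(frame_a))   # teacher sees view A
    s_out = student_head(student_enc(frame_b))       # student sees view B
    loss = self_distillation_loss(s_out, t_out)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    ema_update(teacher_enc, student_enc)
    ema_update(teacher_head, student_head)
    return loss.item()

# Usage sketch: the teacher starts as a copy of the student and is never backpropagated.
# student_enc = load_foundation_model()        # hypothetical loader for a pretrained ViT encoder
# teacher_enc = copy.deepcopy(student_enc)
```

Because both frames depict the same object under different capture conditions, matching the teacher's output across views is what drives the robustness the abstract describes, without any labels.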
Authors (4)
Jesimon Barreto
Carlos Caetano
André Araujo
William Robson Schwartz
Submitted
October 23, 2025
Key Contributions
VESSA introduces a novel object-centric self-supervised fine-tuning method for vision foundation models, enabling adaptation to new domains using only short multi-view videos and no annotations. It builds on a self-distillation paradigm and proves effective for vision-centric encoder models, where prior continued self-supervised adaptation strategies have struggled, addressing domain shift with limited data.
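The abstract also stresses that parameter-efficient adaptation is needed so the pretrained backbone does not forget its knowledge during fine-tuning. The summary above does not specify which scheme is used, so the sketch below shows one common option, a LoRA-style low-rank wrapper around frozen linear layers, purely as an illustrative assumption; the attribute names assume a timm-style ViT and may differ from the actual VESSA code.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update (illustrative choice)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False             # pretrained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)      # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def add_lora_to_vit(vit, rank: int = 8):
    """Wrap the attention projections of each block (assumes timm-style `blocks`, `attn.qkv`, `attn.proj`)."""
    for blk in vit.blocks:
        blk.attn.qkv = LoRALinear(blk.attn.qkv, rank)
        blk.attn.proj = LoRALinear(blk.attn.proj, rank)
    return vit
```

With this kind of wrapper, only the low-rank adapters and the prediction head receive gradients during the self-distillation step, which is one way to keep the pretrained encoder largely intact and avoid the degraded state mentioned in the abstract.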
Business Value
Allows for rapid and cost-effective adaptation of powerful vision models to specific industry needs or new environments, reducing the reliance on large, labeled datasets and accelerating deployment.