VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models

πŸ“„ Abstract

Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, we introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations, leveraging only short multi-view object-centric videos. Our method is referred to as VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual foundation models. VESSA's training technique is based on a self-distillation paradigm, where it is critical to carefully tune prediction heads and deploy parameter-efficient adaptation techniques; otherwise, the model may quickly forget its pretrained knowledge and reach a degraded state. VESSA benefits significantly from multi-view object observations sourced from different frames in an object-centric video, efficiently learning robustness to varied capture conditions without the need for annotations. Through comprehensive experiments with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent improvements in downstream classification tasks compared to the base models and previous adaptation methods. Code is publicly available at https://github.com/jesimonbarreto/VESSA.
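To make the self-distillation idea concrete, below is a minimal PyTorch sketch of a DINO-style student/teacher setup where two frames of the same object-centric video serve as positive views. The backbone interface, head shape, temperatures, and EMA momentum are illustrative assumptions, not VESSA's published implementation.

```python
# Hedged sketch: DINO-style self-distillation across two frames of one
# object-centric video. `backbone` is any encoder returning (B, embed_dim)
# features; all hyperparameters here are assumptions for illustration.
import copy
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between sharpened teacher targets and student predictions."""
    teacher_probs = F.softmax(teacher_logits / t_teacher, dim=-1).detach()
    student_logp = F.log_softmax(student_logits / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

class SelfDistillAdapter(torch.nn.Module):
    def __init__(self, backbone, embed_dim=768, out_dim=4096, ema_momentum=0.996):
        super().__init__()
        self.student = backbone
        self.teacher = copy.deepcopy(backbone)
        self.student_head = torch.nn.Linear(embed_dim, out_dim)
        self.teacher_head = copy.deepcopy(self.student_head)
        # The teacher is never updated by gradients, only by EMA.
        for p in list(self.teacher.parameters()) + list(self.teacher_head.parameters()):
            p.requires_grad = False
        self.m = ema_momentum

    def forward(self, frame_a, frame_b):
        # Two frames of the same object act as annotation-free positive views.
        s = self.student_head(self.student(frame_a))
        with torch.no_grad():
            t = self.teacher_head(self.teacher(frame_b))
        return distillation_loss(s, t)

    @torch.no_grad()
    def update_teacher(self):
        # The EMA teacher changes slowly, anchoring training near the
        # pretrained weights, which relates to the forgetting issue the
        # abstract warns about.
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.mul_(self.m).add_(ps, alpha=1 - self.m)
        for ps, pt in zip(self.student_head.parameters(), self.teacher_head.parameters()):
            pt.mul_(self.m).add_(ps, alpha=1 - self.m)
```

A training step would compute the loss on a pair of frames, backpropagate through the student only, and then call `update_teacher()`.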
Authors (4)
Jesimon Barreto
Carlos Caetano
André Araujo
William Robson Schwartz
Submitted
October 23, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

VESSA introduces a novel object-centric self-supervised fine-tuning method for vision foundation models, enabling adaptation to new domains using only short multi-view videos and no annotations. It leverages a self-distillation paradigm and proves effective for vision encoder models, where prior SSL adaptation approaches have struggled, thereby addressing domain-shift challenges with limited data.
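The abstract also stresses parameter-efficient adaptation as a safeguard against forgetting. One common realization is a low-rank adapter such as LoRA; the sketch below is a generic example under assumed hyperparameters (rank, scaling) and may not match the specific technique VESSA uses.

```python
# Hedged sketch of a parameter-efficient adapter (LoRA-style); the base
# weights stay frozen and only the small low-rank update is trained.
import torch

class LoRALinear(torch.nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank residual."""
    def __init__(self, base: torch.nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights are kept fixed
        self.down = torch.nn.Linear(base.in_features, rank, bias=False)
        self.up = torch.nn.Linear(rank, base.out_features, bias=False)
        torch.nn.init.zeros_(self.up.weight)  # start as a no-op on the base model
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# Usage: wrap an attention or MLP projection inside a frozen encoder, e.g.
# layer = LoRALinear(torch.nn.Linear(768, 768), rank=8)
```

Because the low-rank update starts at zero, adaptation begins exactly at the pretrained model and only drifts as far as the small adapter allows.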

Business Value

Enables rapid, cost-effective adaptation of powerful vision models to specific industry needs or new environments, reducing reliance on large labeled datasets and accelerating deployment.