Abstract
Foundation models have advanced computer vision by enabling strong
performance across diverse tasks through large-scale pretraining and supervised
fine-tuning. However, they may underperform in domains with distribution shifts
and scarce labels, where supervised fine-tuning may be infeasible. While
continued self-supervised learning for model adaptation is common for
generative language models, this strategy has not proven effective for
vision-centric encoder models. To address this challenge, we introduce a novel
formulation of self-supervised fine-tuning for vision foundation models, where
the model is adapted to a new domain without requiring annotations, leveraging
only short multi-view object-centric videos. Our method is referred to as
VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual
foundation models. VESSA's training technique is based on a self-distillation
paradigm, where it is critical to carefully tune prediction heads and deploy
parameter-efficient adaptation techniques; otherwise, the model may quickly
forget its pretrained knowledge and reach a degraded state. VESSA benefits
significantly from multi-view object observations sourced from different frames
in an object-centric video, efficiently learning robustness to varied capture
conditions, without the need for annotations. Through comprehensive experiments
with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent
improvements in downstream classification tasks, compared to the base models
and previous adaptation methods. Code is publicly available at
https://github.com/jesimonbarreto/VESSA.
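To make the adaptation recipe described above concrete, the sketch below illustrates one self-distillation step on two frames drawn from the same object-centric video: a student encoder and prediction head are trained to match an EMA-updated teacher that sees a different view of the same object. This is a minimal illustration assuming a PyTorch setup; the module names, head dimensions, temperatures, and momentum value are assumptions for exposition, not the released VESSA implementation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Small MLP prediction head placed on top of the encoder (sizes are illustrative)."""
    def __init__(self, in_dim: int, out_dim: int = 4096, hidden: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, x):
        return self.mlp(x)

def self_distillation_loss(student_logits, teacher_logits, t_s=0.1, t_t=0.04):
    """Cross-entropy between the sharpened teacher distribution (no grad) and the student."""
    teacher_probs = F.softmax(teacher_logits / t_t, dim=-1).detach()
    log_student = F.log_softmax(student_logits / t_s, dim=-1)
    return -(teacher_probs * log_student).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.996):
    """Teacher weights follow an exponential moving average of the student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def adaptation_step(student_enc, teacher_enc, student_head, teacher_head,
                    frame_a, frame_b, optimizer):
    """One self-supervised step on two frames (views) of the same object."""
    with torch.no_grad():
        t_out = teacher_head(teacher_enc(frame_a))   # teacher sees view A
    s_out = student_head(student_enc(frame_b))       # student sees view B
    loss = self_distillation_loss(s_out, t_out)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    ema_update(teacher_enc, student_enc)
    ema_update(teacher_head, student_head)
    return loss.item()

# Usage sketch: the teacher starts as a copy of the student and is never backpropagated.
# student_enc = load_foundation_model()        # hypothetical loader for a pretrained ViT encoder
# teacher_enc = copy.deepcopy(student_enc)
```

Because both frames depict the same object under different capture conditions, matching the teacher's output across views is what drives the robustness the abstract describes, without any labels.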
Authors (4)
Jesimon Barreto
Carlos Caetano
André Araujo
William Robson Schwartz
Submitted
October 23, 2025
Key Contributions
VESSA introduces a novel object-centric self-supervised fine-tuning method for vision foundation models, enabling adaptation to new domains using only short multi-view videos and no annotations. It builds on a self-distillation paradigm and proves effective for vision-centric encoder models, where prior continued self-supervised adaptation strategies have struggled, addressing domain shift with limited data.
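The abstract also stresses that parameter-efficient adaptation is needed so the pretrained backbone does not forget its knowledge during fine-tuning. The summary above does not specify which scheme is used, so the sketch below shows one common option, a LoRA-style low-rank wrapper around frozen linear layers, purely as an illustrative assumption; the attribute names assume a timm-style ViT and may differ from the actual VESSA code.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update (illustrative choice)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False             # pretrained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)      # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def add_lora_to_vit(vit, rank: int = 8):
    """Wrap the attention projections of each block (assumes timm-style `blocks`, `attn.qkv`, `attn.proj`)."""
    for blk in vit.blocks:
        blk.attn.qkv = LoRALinear(blk.attn.qkv, rank)
        blk.attn.proj = LoRALinear(blk.attn.proj, rank)
    return vit
```

With this kind of wrapper, only the low-rank adapters and the prediction head receive gradients during the self-distillation step, which is one way to keep the pretrained encoder largely intact and avoid the degraded state mentioned in the abstract.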
Business Value
Allows for rapid and cost-effective adaptation of powerful vision models to specific industry needs or new environments, reducing the reliance on large, labeled datasets and accelerating deployment.