Abstract: Modern computer vision is converging on a closed loop in which perception,
reasoning and generation mutually reinforce each other. However, this loop
remains incomplete: the top-down influence of high-level reasoning on the
foundational learning of low-level perceptual features is still
underexplored. This paper addresses this gap by proposing a new paradigm for
pretraining foundation models in downstream domains. We introduce Visual
insTruction Pretraining (ViTP), a novel approach that directly leverages
reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT)
backbone within a Vision-Language Model and pretrains it end-to-end using a
rich corpus of visual instruction data curated from target downstream domains.
ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels
the ViT to learn robust and domain-relevant features from a sparse set of
visual tokens. Extensive experiments on 16 challenging remote sensing and
medical imaging benchmarks demonstrate that ViTP establishes new
state-of-the-art performance across a diverse range of downstream tasks. The
code is available at https://github.com/zcablii/ViTP.
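The abstract does not spell out how Visual Robustness Learning selects its sparse set of visual tokens. The sketch below is one plausible reading, not the paper's implementation: it randomly keeps a fraction of the ViT's patch tokens before projecting them into the language model's embedding space. The module name, keep ratio, and projector dimensions are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class VisualRobustnessLearning(nn.Module):
    """Illustrative sketch (not the paper's code): keep a sparse random subset
    of ViT patch tokens and project the survivors into the language model's
    input space, forcing the backbone to encode robust features per token."""

    def __init__(self, vit_dim: int = 768, llm_dim: int = 4096, keep_ratio: float = 0.25):
        super().__init__()
        self.keep_ratio = keep_ratio  # assumed sparsity level, not from the paper
        self.projector = nn.Linear(vit_dim, llm_dim)  # hypothetical vision-to-LLM projector

    def forward(self, vit_tokens: torch.Tensor) -> torch.Tensor:
        # vit_tokens: (batch, num_patches, vit_dim) features from the ViT backbone
        b, n, d = vit_tokens.shape
        k = max(1, int(n * self.keep_ratio))
        # Sample a random sparse subset of visual tokens for each image.
        idx = torch.rand(b, n, device=vit_tokens.device).argsort(dim=1)[:, :k]
        sparse = torch.gather(vit_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))
        # Project the surviving tokens into the language model's embedding space.
        return self.projector(sparse)


if __name__ == "__main__":
    dummy_vit_out = torch.randn(2, 196, 768)  # e.g. 14x14 patches from a ViT-B
    vrl = VisualRobustnessLearning()
    visual_prefix = vrl(dummy_vit_out)
    print(visual_prefix.shape)  # torch.Size([2, 49, 4096])
```

In this reading, the projected sparse tokens would serve as the visual prefix during instruction tuning, so the language modeling loss back-propagates through the ViT via only a few tokens; the authors' actual mechanism should be checked against the released code at https://github.com/zcablii/ViTP.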