📄 Abstract
Vision encoders are indispensable for enabling the impressive performance of Multi-modal Large Language Models (MLLMs) on vision-language tasks such as visual question answering and reasoning. However, existing vision encoders focus on global image representations and overlook fine-grained regional analysis. Their fine-grained perception is limited by the scarcity of fine-grained annotated data and the lack of a fine-grained pre-training paradigm. In this paper, we propose GranViT, a novel Vision Transformer that integrates fine-grained feature extraction with semantic alignment to Large Language Models (LLMs) via region-level autoregressive training. We first construct Gran-29M, a dataset comprising 2 million natural and OCR images paired with over 180 million high-quality region-level annotations, to enable large-scale fine-grained pretraining. We then develop a pretraining-adaptation framework, together with a self-distillation mechanism, to train the fine-grained GranViT on Gran-29M. We fully exploit the fine-grained annotations in Gran-29M: bounding-box-to-caption regression enhances the vision encoder's localized visual representations during pretraining, while caption-to-bounding-box regression improves vision-feature utilization and localization for the LLM during adaptation. We further incorporate a self-distillation mechanism that imposes explicit localization constraints on the vision encoder to strengthen its regional reasoning capability. Extensive experiments show that GranViT surpasses existing vision encoders and transfers well to a variety of LLMs. Remarkably, it achieves state-of-the-art results on fine-grained recognition, multimodal VQA, and OCR understanding.
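The abstract pairs two autoregressive objectives: bounding-box-to-caption regression during pretraining and caption-to-bounding-box regression during adaptation. The sketch below illustrates one plausible shape for these losses. The `ToyDecoder` stub, the prompt layout, and the idea of emitting a box as four quantized coordinate tokens are all assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDecoder(nn.Module):
    """Stand-in for the LLM: consumes a prefix of visual embeddings followed
    by text-token embeddings and predicts the next token at every position."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, prefix_embeds, token_ids):
        x = torch.cat([prefix_embeds, self.embed(token_ids)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.block(x, src_mask=causal))

def box_to_caption_loss(model, region_embeds, caption_ids):
    """Pretraining direction: features pooled from a bounding-box region
    condition the decoder, which is trained to emit the region caption."""
    p = region_embeds.size(1)
    logits = model(region_embeds, caption_ids)
    pred = logits[:, p - 1 : -1]  # positions that predict each caption token
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), caption_ids.reshape(-1))

def caption_to_box_loss(model, image_embeds, caption_ids, box_ids):
    """Adaptation direction: full-image features plus a region caption
    condition the decoder, which must emit the box as a short token sequence
    (here: four hypothetical quantized coordinates); only box tokens are supervised."""
    tokens = torch.cat([caption_ids, box_ids], dim=1)
    logits = model(image_embeds, tokens)
    n = box_ids.size(1)
    pred = logits[:, -n - 1 : -1]  # positions that predict each box token
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), box_ids.reshape(-1))

# Toy usage: 2 samples, 16 visual tokens, 12-token captions, 4 box tokens.
model = ToyDecoder()
vis = torch.randn(2, 16, 64)
cap = torch.randint(0, 1000, (2, 12))
box = torch.randint(0, 1000, (2, 4))
loss = box_to_caption_loss(model, vis, cap) + caption_to_box_loss(model, vis, cap, box)
loss.backward()
```

The two directions share one decoder but supervise different spans: the full caption in the pretraining direction, and only the trailing box tokens in the adaptation direction.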
Authors (11)
Guanghao Zheng
Bowen Shi
Mingxing Xu
Ruoyu Sun
Peisen Zhao
Zhibo Zhang
+5 more
Submitted
October 23, 2025
Key Contributions
GranViT introduces a novel Vision Transformer that integrates fine-grained feature extraction with semantic alignment to LLMs via region-level autoregressive training. It addresses the fine-grained perception limitations of existing vision encoders through a pretraining-adaptation framework with self-distillation, and proposes Gran-29M, a large dataset with extensive region-level annotations, to enable large-scale fine-grained pre-training. A sketch of the self-distillation localization constraint follows.
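The self-distillation mechanism is described only as imposing explicit localization constraints on the vision encoder. Below is a minimal sketch of one plausible form, assuming an EMA teacher encoder and box-pooled feature matching; the function names (`ema_update`, `box_pool`, `localization_distill_loss`), the 14x14 patch grid, and the cosine objective are all hypothetical, not the authors' method.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher tracks an exponential moving average of the student encoder."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def box_pool(patch_tokens, boxes, grid=14):
    """Average-pool the patch tokens that fall inside each normalized box.
    patch_tokens: (B, grid*grid, D); boxes: (B, 4) as x1, y1, x2, y2 in [0, 1]."""
    b_sz, _, dim = patch_tokens.shape
    fmap = patch_tokens.transpose(1, 2).reshape(b_sz, dim, grid, grid)
    pooled = []
    for b in range(b_sz):
        x1, y1, x2, y2 = (boxes[b] * grid).long().clamp(0, grid - 1)
        pooled.append(fmap[b, :, y1 : y2 + 1, x1 : x2 + 1].mean(dim=(1, 2)))
    return torch.stack(pooled)

def localization_distill_loss(student_tokens, teacher_tokens, boxes):
    """Localization constraint: the student's box-pooled embedding should
    match the EMA teacher's embedding of the same region."""
    s = box_pool(student_tokens, boxes)
    t = box_pool(teacher_tokens, boxes).detach()
    return 1.0 - F.cosine_similarity(s, t, dim=-1).mean()
```

In practice such a term would be added to the pretraining objective with a weighting coefficient, with `ema_update` called once per optimizer step.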
Business Value
Enables more sophisticated visual understanding in AI applications, leading to improved performance in areas like image search, content moderation, and visual assistance systems.