
GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLMs

📄 Abstract

Vision encoders are indispensable for enabling the impressive performance of Multimodal Large Language Models (MLLMs) on vision-language tasks such as visual question answering and reasoning. However, existing vision encoders focus on global image representations and overlook fine-grained regional analysis; their fine-grained perception is limited by the scarcity of fine-grained annotated data and the lack of a fine-grained pre-training paradigm. In this paper, we propose GranViT, a novel Vision Transformer that integrates fine-grained feature extraction with semantic alignment to Large Language Models (LLMs) via region-level autoregressive training. We first construct Gran-29M, a dataset comprising 2 million natural and OCR images paired with over 180 million high-quality region-level annotations, to enable large-scale fine-grained pretraining. We then develop a pretraining-adaptation framework, together with a self-distillation mechanism, to train the fine-grained GranViT on Gran-29M. Exploiting Gran-29M's fine-grained annotations, we use bounding-box-to-caption regression during pretraining to enhance the vision encoder's localized visual representations, and caption-to-bounding-box regression during adaptation to improve the LLM's vision-feature utilization and localization. We further incorporate a self-distillation mechanism that imposes explicit localization constraints on the vision encoder to strengthen its regional reasoning capability. Extensive experiments show that GranViT surpasses existing vision encoders and transfers well to varying LLMs. Remarkably, it achieves state-of-the-art results on fine-grained recognition, multimodal VQA, and OCR understanding.
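The two autoregressive objectives described above can be illustrated with a minimal sketch of how one region annotation might be serialized into training pairs. The `<box>` token format, coordinate convention, and prompt wording below are assumptions for illustration, not GranViT's actual scheme:

```python
def box_to_tokens(box):
    """Serialize a bounding box (x1, y1, x2, y2) into a text span the
    language model can emit token-by-token (format is an assumption)."""
    return "<box>{},{},{},{}</box>".format(*(int(v) for v in box))

def make_training_pairs(region):
    """Build both regression directions from one region annotation:
    pretraining predicts the caption given the box (box -> caption);
    adaptation predicts the box given the caption (caption -> box)."""
    box_str = box_to_tokens(region["box"])
    caption = region["caption"]
    return {
        "box_to_caption": (f"Describe region {box_str}:", caption),
        "caption_to_box": (f"Locate: {caption}", box_str),
    }

# Hypothetical annotation, in the spirit of Gran-29M's region-level labels.
region = {"box": (120, 80, 460, 300), "caption": "a red stop sign"}
pairs = make_training_pairs(region)
```

In this framing, both directions reduce to next-token prediction over mixed text and coordinate tokens, which is what lets the same LLM head supervise localization and captioning.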
Authors (11)
Guanghao Zheng
Bowen Shi
Mingxing Xu
Ruoyu Sun
Peisen Zhao
Zhibo Zhang
+5 more
Submitted
October 23, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

GranViT introduces a novel Vision Transformer that integrates fine-grained feature extraction with semantic alignment to LLMs via region-level autoregressive training. It addresses existing vision encoders' weakness in fine-grained regional perception and proposes Gran-29M, a large dataset with extensive region-level annotations, to enable large-scale fine-grained pre-training.

Business Value

Enables more sophisticated visual understanding in AI applications, leading to improved performance in areas like image search, content moderation, and visual assistance systems.