📄 Abstract
Vision encoders are indispensable for enabling the impressive performance of Multi-modal Large Language Models (MLLMs) on vision-language tasks such as visual question answering and reasoning. However, existing vision encoders focus on global image representations and overlook fine-grained regional analysis. Their fine-grained perception is limited by the scarcity of fine-grained annotated data and the lack of a fine-grained pre-training paradigm. In this paper, we propose GranViT, a novel Vision Transformer that integrates fine-grained feature extraction with semantic alignment to Large Language Models (LLMs) via region-level autoregressive training. We first construct Gran-29M, a dataset comprising 2 million natural and OCR images paired with over 180 million high-quality region-level annotations, to enable large-scale fine-grained pretraining. We then develop a pretraining-adaptation framework, together with a self-distillation mechanism, to train the fine-grained GranViT on Gran-29M. We fully exploit the fine-grained annotations in Gran-29M: bounding-box-to-caption regression enhances the vision encoder's localized visual representations during pretraining, while caption-to-bounding-box regression improves vision-feature utilization and localization for the LLM during adaptation. We further incorporate a self-distillation mechanism that imposes explicit localization constraints on the vision encoder to strengthen its regional reasoning capability. Extensive experiments show that GranViT surpasses existing vision encoders and transfers well to a variety of LLMs. Remarkably, it achieves state-of-the-art results on fine-grained recognition, multimodal VQA, and OCR understanding.
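The abstract pairs two autoregressive objectives: bounding-box-to-caption regression during pretraining and caption-to-bounding-box regression during adaptation. The sketch below illustrates one plausible shape for these losses. The `ToyDecoder` stub, the prompt layout, and the idea of emitting a box as four quantized coordinate tokens are all assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDecoder(nn.Module):
    """Stand-in for the LLM: consumes a prefix of visual embeddings followed
    by text-token embeddings and predicts the next token at every position."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, prefix_embeds, token_ids):
        x = torch.cat([prefix_embeds, self.embed(token_ids)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.head(self.block(x, src_mask=causal))

def box_to_caption_loss(model, region_embeds, caption_ids):
    """Pretraining direction: features pooled from a bounding-box region
    condition the decoder, which is trained to emit the region caption."""
    p = region_embeds.size(1)
    logits = model(region_embeds, caption_ids)
    pred = logits[:, p - 1 : -1]  # positions that predict each caption token
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), caption_ids.reshape(-1))

def caption_to_box_loss(model, image_embeds, caption_ids, box_ids):
    """Adaptation direction: full-image features plus a region caption
    condition the decoder, which must emit the box as a short token sequence
    (here: four hypothetical quantized coordinates); only box tokens are supervised."""
    tokens = torch.cat([caption_ids, box_ids], dim=1)
    logits = model(image_embeds, tokens)
    n = box_ids.size(1)
    pred = logits[:, -n - 1 : -1]  # positions that predict each box token
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), box_ids.reshape(-1))

# Toy usage: 2 samples, 16 visual tokens, 12-token captions, 4 box tokens.
model = ToyDecoder()
vis = torch.randn(2, 16, 64)
cap = torch.randint(0, 1000, (2, 12))
box = torch.randint(0, 1000, (2, 4))
loss = box_to_caption_loss(model, vis, cap) + caption_to_box_loss(model, vis, cap, box)
loss.backward()
```

The two directions share one decoder but supervise different spans: the full caption in the pretraining direction, and only the trailing box tokens in the adaptation direction.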
Authors (11)
Guanghao Zheng
Bowen Shi
Mingxing Xu
Ruoyu Sun
Peisen Zhao
Zhibo Zhang
+5 more
Submitted
October 23, 2025
Key Contributions
GranViT introduces a novel Vision Transformer that integrates fine-grained feature extraction with semantic alignment to LLMs via region-level autoregressive training. It addresses the fine-grained perception limitations of existing vision encoders through a pretraining-adaptation framework with self-distillation, and proposes Gran-29M, a large dataset with extensive region-level annotations, to enable large-scale fine-grained pre-training. A sketch of the self-distillation localization constraint follows.
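The self-distillation mechanism is described only as imposing explicit localization constraints on the vision encoder. Below is a minimal sketch of one plausible form, assuming an EMA teacher encoder and box-pooled feature matching; the function names (`ema_update`, `box_pool`, `localization_distill_loss`), the 14x14 patch grid, and the cosine objective are all hypothetical, not the authors' method.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher tracks an exponential moving average of the student encoder."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def box_pool(patch_tokens, boxes, grid=14):
    """Average-pool the patch tokens that fall inside each normalized box.
    patch_tokens: (B, grid*grid, D); boxes: (B, 4) as x1, y1, x2, y2 in [0, 1]."""
    b_sz, _, dim = patch_tokens.shape
    fmap = patch_tokens.transpose(1, 2).reshape(b_sz, dim, grid, grid)
    pooled = []
    for b in range(b_sz):
        x1, y1, x2, y2 = (boxes[b] * grid).long().clamp(0, grid - 1)
        pooled.append(fmap[b, :, y1 : y2 + 1, x1 : x2 + 1].mean(dim=(1, 2)))
    return torch.stack(pooled)

def localization_distill_loss(student_tokens, teacher_tokens, boxes):
    """Localization constraint: the student's box-pooled embedding should
    match the EMA teacher's embedding of the same region."""
    s = box_pool(student_tokens, boxes)
    t = box_pool(teacher_tokens, boxes).detach()
    return 1.0 - F.cosine_similarity(s, t, dim=-1).mean()
```

In practice such a term would be added to the pretraining objective with a weighting coefficient, with `ema_update` called once per optimizer step.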
Business Value
Enables more sophisticated visual understanding in AI applications, leading to improved performance in areas like image search, content moderation, and visual assistance systems.