
Vision-Centric Activation and Coordination for Multimodal Large Language Models

Abstract

Multimodal large language models (MLLMs) integrate image features from visual encoders with LLMs, demonstrating advanced comprehension capabilities. However, mainstream MLLMs are solely supervised by next-token prediction of textual tokens, neglecting critical vision-centric information essential for analytical abilities. To tackle this dilemma, we introduce VaCo, which optimizes MLLM representations through Vision-Centric activation and Coordination from multiple vision foundation models (VFMs). VaCo introduces visual discriminative alignment to integrate task-aware perceptual features extracted from VFMs, thereby unifying the optimization of both textual and visual outputs in MLLMs. Specifically, we incorporate learnable Modular Task Queries (MTQs) and Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals under the supervision of diverse VFMs. To coordinate representation conflicts across VFMs, the crafted Token Gateway Mask (TGM) restricts the information flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo significantly improves the performance of different MLLMs on various benchmarks, showcasing its superior capabilities in visual comprehension.
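The abstract does not spell out how the Token Gateway Mask is constructed, but the stated behavior (restricting information flow among groups of Modular Task Queries while all queries still see the shared context) can be illustrated as a block-structured attention mask. The sketch below is an assumption about that layout, not the paper's implementation: the token ordering, group sizes, and the choice to give each MTQ group access to text/image tokens but not to other groups are all hypothetical.

```python
import torch

def build_token_gateway_mask(num_text: int, num_img: int,
                             mtq_group_sizes: list[int]) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed).

    Assumed token layout: [text tokens | image tokens | MTQ group 0 | MTQ group 1 | ...].
    Text and image tokens attend freely among themselves; each MTQ group attends
    to the shared text/image context and to its own group only, so task-specific
    queries supervised by different VFMs do not exchange information.
    """
    num_mtq = sum(mtq_group_sizes)
    total = num_text + num_img + num_mtq
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Text and image tokens: full attention among themselves.
    base = num_text + num_img
    mask[:base, :base] = True

    # Each MTQ group: attend to the shared context and within its own group.
    start = base
    for size in mtq_group_sizes:
        end = start + size
        mask[start:end, :base] = True        # queries read text/image tokens
        mask[start:end, start:end] = True    # intra-group attention only
        start = end
    return mask

# Example: 16 text tokens, 256 image tokens, three VFM-specific query groups of 8.
mask = build_token_gateway_mask(16, 256, [8, 8, 8])
print(mask.shape)  # torch.Size([296, 296])
```

Under this assumption, the mask is applied inside the LLM's self-attention so that conflicting supervision signals from different VFMs remain isolated in their own query groups.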
Authors (7)
Yunnan Wang
Fan Lu
Kecheng Zheng
Ziyuan Huang
Ziqiang Li
Wenjun Zeng
+1 more
Submitted
October 16, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

VaCo introduces a novel approach to optimize MLLM representations by incorporating vision-centric information from multiple Vision Foundation Models (VFMs). It uses visual discriminative alignment, Modular Task Queries (MTQs), and Visual Alignment Layers (VALs) to unify the optimization of both textual and visual outputs, addressing the neglect of vision-centric details in standard MLLM training.
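The listing does not specify the form of the visual discriminative alignment objective. As a rough illustration of the idea, the sketch below assumes a Visual Alignment Layer is a small projection head mapping MTQ hidden states into a VFM's feature space, trained with a cosine-alignment loss against frozen VFM features; the module name, architecture, and loss are all assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAlignmentLayer(nn.Module):
    """Hypothetical VAL: projects MTQ hidden states from the LLM into the
    feature space of one vision foundation model (VFM)."""

    def __init__(self, llm_dim: int, vfm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, vfm_dim),
            nn.GELU(),
            nn.Linear(vfm_dim, vfm_dim),
        )

    def forward(self, mtq_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(mtq_hidden)

def alignment_loss(pred: torch.Tensor, vfm_feat: torch.Tensor) -> torch.Tensor:
    """Simple cosine-alignment objective between projected MTQ outputs and
    frozen VFM features (the paper's actual loss may differ)."""
    pred = F.normalize(pred, dim=-1)
    vfm_feat = F.normalize(vfm_feat, dim=-1)
    return (1.0 - (pred * vfm_feat).sum(dim=-1)).mean()

# Toy usage: 8 task queries, LLM width 4096, VFM feature width 1024.
val = VisualAlignmentLayer(llm_dim=4096, vfm_dim=1024)
mtq_hidden = torch.randn(2, 8, 4096)   # batch of MTQ hidden states from the LLM
vfm_feat = torch.randn(2, 8, 1024)     # matched features from a frozen VFM
loss = alignment_loss(val(mtq_hidden), vfm_feat)
```

In this reading, one such head per VFM would let the alignment losses be summed alongside the standard next-token prediction loss, which is how the textual and visual objectives could be optimized jointly.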

Business Value

Enhances the capabilities of AI assistants and multimodal search engines, enabling more nuanced understanding and interaction with visual content, leading to richer user experiences.