📄 Abstract
Multimodal large language models (MLLMs) integrate image features from visual
encoders with LLMs, demonstrating advanced comprehension capabilities. However,
mainstream MLLMs are solely supervised by the next-token prediction of textual
tokens, neglecting critical vision-centric information essential for analytical
abilities. To tackle this dilemma, we introduce VaCo, which optimizes MLLM
representations through Vision-Centric activation and Coordination from
multiple vision foundation models (VFMs). VaCo introduces visual discriminative
alignment to integrate task-aware perceptual features extracted from VFMs,
thereby unifying the optimization of both textual and visual outputs in MLLMs.
Specifically, we incorporate the learnable Modular Task Queries (MTQs) and
Visual Alignment Layers (VALs) into MLLMs, activating specific visual signals
under the supervision of diverse VFMs. To coordinate representation conflicts
across VFMs, the crafted Token Gateway Mask (TGM) restricts the information
flow among multiple groups of MTQs. Extensive experiments demonstrate that VaCo
significantly improves the performance of different MLLMs on various
benchmarks, showcasing its superior capabilities in visual comprehension.
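The Token Gateway Mask (TGM) described above restricts information flow among groups of Modular Task Queries so that task-specific signals from different VFMs do not interfere. The paper does not give implementation details here, but a minimal sketch of such a gateway mask, assuming a simple block-diagonal attention pattern in which each MTQ group attends only within itself, could look like this (function and parameter names are hypothetical):

```python
import numpy as np

def token_gateway_mask(num_groups: int, queries_per_group: int) -> np.ndarray:
    """Hypothetical sketch of a TGM-style attention mask.

    Returns a boolean matrix where True marks allowed attention:
    queries attend only within their own VFM-specific group, so
    representations from different VFMs cannot mix at this stage.
    """
    n = num_groups * queries_per_group
    mask = np.zeros((n, n), dtype=bool)
    for g in range(num_groups):
        start = g * queries_per_group
        # Allow attention only inside the block for group g.
        mask[start:start + queries_per_group,
             start:start + queries_per_group] = True
    return mask

# Example: 3 VFM groups, 4 queries each -> 12x12 block-diagonal mask.
mask = token_gateway_mask(num_groups=3, queries_per_group=4)
```

In practice such a mask would be combined with the model's existing attention mask; this sketch only illustrates the cross-group blocking idea, not the paper's actual architecture.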
Authors (7)
Yunnan Wang
Fan Lu
Kecheng Zheng
Ziyuan Huang
Ziqiang Li
Wenjun Zeng
+1 more
Submitted
October 16, 2025
Key Contributions
VaCo introduces a novel approach to optimize MLLM representations by incorporating vision-centric information from multiple Vision Foundation Models (VFMs). It uses visual discriminative alignment, Modular Task Queries (MTQs), and Visual Alignment Layers (VALs) to unify the optimization of both textual and visual outputs, addressing the neglect of vision-centric details in standard MLLM training.
Business Value
Enhances the capabilities of AI assistants and multimodal search engines, enabling more nuanced understanding and interaction with visual content, leading to richer user experiences.