📄 Abstract
In recent times, the standard practice for developing MLLMs has been to feed features from one or more vision encoders into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension while neglecting the rich visual perception signals present in the data, which are critical for spatial reasoning tasks in embodied AI and robotics. Is it possible to optimize both at the same time? In this work, we propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the hidden representations of an MLLM's LLM. We start by investigating MLLMs trained solely with natural language supervision and identify a positive correlation between the quality of visual representations within these models and their downstream performance. Given this insight, we formulate the pretraining objective of MLLMs as a coupled optimization of predictive visual embedding and next (text) token prediction. Moreover, through extensive probing, we observe improved visual representation quality due to the embedding optimization, underscoring the effectiveness of our approach. We demonstrate that VisPer-LM outperforms single- and multi-encoder baselines, showing the advantage of our approach over explicitly feeding the corresponding features to the LLM. In particular, VisPer-LM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
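The coupled pretraining objective described above can be summarized as a weighted sum of the two terms; the notation below (a trade-off weight \lambda and an embedding-prediction loss \mathcal{L}_{\text{embed}}) is illustrative shorthand rather than the paper's exact formulation:

\mathcal{L}_{\text{pretrain}} = \mathcal{L}_{\text{NTP}} + \lambda \, \mathcal{L}_{\text{embed}}

where \mathcal{L}_{\text{NTP}} is the standard next-token cross-entropy and \mathcal{L}_{\text{embed}} measures the distance between visual embeddings predicted from the LLM's hidden states and target embeddings produced by the expert vision encoders.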
Authors (5)
Jitesh Jain
Zhengyuan Yang
Humphrey Shi
Jianfeng Gao
Jianwei Yang
Submitted
December 12, 2024
Key Contributions
This paper proposes VisPer-LM, the first approach to infuse visual perception knowledge from expert vision encoders into an LLM's hidden representations for MLLMs. To address the common issue where MLLMs trained with natural language supervision neglect rich visual signals, which are critical for tasks requiring spatial reasoning, it formulates a coupled optimization objective during pretraining, combining predictive visual embedding with next-token prediction.
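A minimal sketch of how such a coupled objective could be computed, assuming a PyTorch setup; the function name coupled_loss, the tensor shapes, the cosine-distance choice for the embedding term, and the weight lam are illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch only: couples next-token prediction with an
# embedding-prediction term against a frozen expert vision encoder.
import torch
import torch.nn.functional as F

def coupled_loss(text_logits, text_targets, predicted_vis_emb, expert_vis_emb, lam=0.5):
    """Weighted sum of language-modeling loss and visual-embedding loss.

    text_logits:       (B, T, V) logits from the MLLM's LLM
    text_targets:      (B, T)    ground-truth token ids
    predicted_vis_emb: (B, D)    embeddings predicted from LLM hidden states
    expert_vis_emb:    (B, D)    targets from a frozen expert vision encoder
    lam:               assumed trade-off weight between the two terms
    """
    # Standard next-token cross-entropy over the vocabulary.
    ntp = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )
    # Embedding-prediction term; cosine distance is one plausible choice.
    emb = 1.0 - F.cosine_similarity(predicted_vis_emb, expert_vis_emb, dim=-1).mean()
    return ntp + lam * emb

# Example usage with random tensors (batch of 2, sequence of 8, vocab 100, dim 16).
if __name__ == "__main__":
    logits = torch.randn(2, 8, 100)
    targets = torch.randint(0, 100, (2, 8))
    pred_emb = torch.randn(2, 16)
    tgt_emb = torch.randn(2, 16)
    print(coupled_loss(logits, targets, pred_emb, tgt_emb).item())
```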
Business Value
Enhancing MLLMs with stronger visual perception can lead to more capable AI agents in robotics and embodied AI, enabling them to better understand and interact with the physical world, potentially improving automation in logistics, manufacturing, and autonomous systems.