📄 Abstract
Large Vision-Language Models (LVLMs) have achieved significant success in multimodal tasks, and multimodal chain-of-thought (MCoT) further enhances their performance and interpretability. Recent MCoT methods fall into two categories: (i) Textual-MCoT (T-MCoT), which takes multimodal input and produces textual output; and (ii) Interleaved-MCoT (I-MCoT), which generates interleaved image-text outputs. Despite advances in both approaches, the mechanisms driving these improvements are not fully understood. To fill this gap, we first reveal that MCoT boosts LVLMs by incorporating visual thoughts, which convey image information into the reasoning process regardless of the MCoT format; their benefit depends only on how clearly and concisely they are expressed. To explore visual thoughts systematically, we then define four distinct forms of visual thought expression and analyze them comprehensively. Our findings demonstrate that these forms differ in clarity and conciseness, yielding varying degrees of MCoT improvement. Additionally, we examine the internal nature of visual thoughts and find that they act as intermediaries between the input image and the reasoning carried out in deeper transformer layers, enabling more advanced transmission of visual information. We hope that visual thoughts can inspire further breakthroughs in future MCoT research.
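To make the distinction between the two MCoT formats concrete, here is a minimal sketch that is not from the paper: all class and field names (TextualMCoT, InterleavedMCoT, ImageRef, reasoning_steps) are hypothetical illustrations. It contrasts a T-MCoT trace, whose reasoning steps are text only, with an I-MCoT trace, which interleaves text with generated or edited images; in both cases the "visual thought" is the step that carries information from the input image into the reasoning chain.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class ImageRef:
    """Placeholder for an image referenced, edited, or generated during reasoning."""
    uri: str


@dataclass
class TextualMCoT:
    """T-MCoT: multimodal input, purely textual reasoning chain and answer."""
    input_image: ImageRef
    question: str
    reasoning_steps: List[str] = field(default_factory=list)  # text-only thoughts
    answer: str = ""


@dataclass
class InterleavedMCoT:
    """I-MCoT: reasoning chain interleaves text segments with generated/edited images."""
    input_image: ImageRef
    question: str
    reasoning_steps: List[Union[str, ImageRef]] = field(default_factory=list)
    answer: str = ""


# Hypothetical traces for the same question; the first reasoning step in each
# is the "visual thought" that transfers image information into the chain.
t_mcot = TextualMCoT(
    input_image=ImageRef("scene.png"),
    question="How many red cars are visible?",
    reasoning_steps=["The image shows a parking lot with three red cars near the entrance."],
    answer="3",
)
i_mcot = InterleavedMCoT(
    input_image=ImageRef("scene.png"),
    question="How many red cars are visible?",
    reasoning_steps=[
        ImageRef("scene_red_cars_highlighted.png"),  # an edited/generated visual thought
        "The highlighted regions mark three red cars.",
    ],
    answer="3",
)
```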
Authors (11)
Zihui Cheng
Qiguang Chen
Xiao Xu
Jiaqi Wang
Weiyun Wang
Hao Fei
+5 more
Key Contributions
Provides a unified perspective on Multimodal Chain-of-Thought (MCoT), revealing that MCoT enhances LVLMs by incorporating 'visual thoughts' regardless of format, and defines and analyzes four forms of visual thought expression in terms of their clarity and conciseness.
Business Value
Improves the interpretability and effectiveness of multimodal AI systems, leading to more reliable and understandable AI applications in areas like image analysis and content generation.