📄 Abstract
Multimodal reasoning requires iterative coordination between language and
vision, yet it remains unclear what constitutes a meaningful interleaved chain
of thought. We posit that text and image thoughts should function as
complementary, rather than isomorphic, modalities that mutually advance
reasoning. Guided by this principle, we build ThinkMorph, a unified model
fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with
varying visual engagement. ThinkMorph learns to generate progressive text-image
reasoning steps that concretely manipulate visual content while maintaining
coherent verbal logic. It delivers large gains on vision-centric benchmarks
(averaging 34.7% over the base model) and generalizes to out-of-domain tasks,
matching or surpassing larger and proprietary VLMs. Beyond performance,
ThinkMorph exhibits emergent multimodal intelligence, including unseen visual
manipulation skills, adaptive switching between reasoning modes, and better
test-time scaling through diversified multimodal thoughts. These findings
suggest promising directions for characterizing the emergent capabilities of
unified models for multimodal reasoning.
Authors (8)
Jiawei Gu
Yunzhuo Hao
Huichen Will Wang
Linjie Li
Michael Qizhe Shieh
Yejin Choi
Submitted
October 30, 2025
Key Contributions
ThinkMorph introduces a novel approach to multimodal reasoning by positing that text and image thoughts should be complementary rather than isomorphic. It learns to generate progressive, interleaved text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic, exhibiting emergent multimodal intelligence and delivering large gains on vision-centric benchmarks (averaging 34.7% over the base model).
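The abstract does not specify a data format for the 24K interleaved reasoning traces, but a minimal sketch can make the idea of "interleaved text-image thoughts" concrete. The schema below is hypothetical (the class names ThoughtStep and ReasoningTrace, the fields, and the example file names are illustrative assumptions, not part of the ThinkMorph release); it simply models a trace as an alternating sequence of verbal steps and rendered visual steps.

```python
from dataclasses import dataclass, field
from typing import List, Literal, Optional

@dataclass
class ThoughtStep:
    """One step in an interleaved chain of thought (hypothetical schema)."""
    modality: Literal["text", "image"]
    text: Optional[str] = None        # verbal reasoning, used when modality == "text"
    image_path: Optional[str] = None  # rendered visual thought, used when modality == "image"

@dataclass
class ReasoningTrace:
    """A full interleaved text-image reasoning trace for one training example."""
    question: str
    input_image: str
    steps: List[ThoughtStep] = field(default_factory=list)
    answer: str = ""

# Illustrative example: a maze-style task where the visual step marks progress on the image.
trace = ReasoningTrace(
    question="Which exit does the path reach?",
    input_image="maze.png",
    steps=[
        ThoughtStep("text", text="Start at the entrance and follow the open corridor north."),
        ThoughtStep("image", image_path="maze_step1.png"),  # image with the partial path drawn in
        ThoughtStep("text", text="The corridor turns east and ends at exit B."),
    ],
    answer="Exit B",
)
```

The alternating structure mirrors the paper's framing: image steps concretely manipulate the visual content rather than restating the text, while text steps carry the verbal logic forward.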
Business Value
Enables more sophisticated AI assistants and tools that reason over visual information step by step, manipulating images alongside text much as a human would, supporting more intuitive and capable applications for vision-centric tasks.