Abstract
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify
visual comprehension and generation. However, these two capabilities remain
largely independent, as if they are two separate functions encapsulated within
the same model. Consequently, visual comprehension does not enhance visual
generation, and the reasoning mechanisms of LLMs have not been fully integrated
to revolutionize image generation. In this paper, we propose to enable the
collaborative co-evolution of visual comprehension and generation, advancing
image generation into an iterative introspective process. We introduce a
two-stage training approach: supervised fine-tuning equips the MLLM with the
foundational ability to generate a genuine chain-of-thought (CoT) for visual
generation, while
reinforcement learning activates its full potential via an
exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in
visual generation, advancing MLLMs from text-to-image tasks to unified image
generation. Extensive experiments demonstrate that our model not only excels in
text-to-image generation and image editing, but also functions as a superior
image semantic evaluator with enhanced visual comprehension capabilities.
Project Page: https://janus-pro-r1.github.io.
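
To make the two-stage recipe concrete, here is a minimal illustrative sketch of the training loop the abstract describes. It is a mock-up under assumptions, not the authors' implementation: ToyMLLM, reward_fn, and the temperature-annealing schedule are hypothetical stand-ins for the actual model, reward signal, and exploration-exploitation mechanism.

    import random
    from dataclasses import dataclass

    @dataclass
    class Sample:
        prompt: str
        gold_cot: str      # reference chain-of-thought for the target image
        gold_image: bytes  # reference image tokens (placeholder type)

    class ToyMLLM:
        """Mock stand-in for the unified comprehension+generation MLLM."""

        def sft_step(self, sample: Sample) -> float:
            # Stage 1: supervised fine-tuning on (prompt, CoT, image) triples
            # teaches the foundational ability to emit a CoT before image tokens.
            return random.random()  # mock loss

        def sample_rollout(self, prompt: str, temperature: float) -> tuple[str, bytes]:
            # A real policy would sample more diverse CoT/image trajectories
            # at higher temperature; this mock ignores it.
            return f"cot for: {prompt}", b"image-tokens"

        def rl_step(self, rollout: tuple[str, bytes], reward: float) -> None:
            # Stage 2: a policy-gradient-style update that reinforces rollouts
            # whose images score well (mock: no-op).
            pass

    def reward_fn(cot: str, image: bytes, prompt: str) -> float:
        # Hypothetical reward: a semantic scorer (e.g. the model's own
        # comprehension branch) rates image-prompt consistency.
        return random.random()

    def train(model: ToyMLLM, sft_data: list[Sample], rl_prompts: list[str]) -> None:
        # Stage 1: SFT establishes genuine CoT generation.
        for sample in sft_data:
            model.sft_step(sample)
        # Stage 2: RL with an exploration-exploitation trade-off, caricatured
        # here as a sampling temperature annealed from high (explore) to low
        # (exploit) over training.
        for step, prompt in enumerate(rl_prompts):
            temperature = max(0.2, 1.0 - step / max(1, len(rl_prompts)))
            rollout = model.sample_rollout(prompt, temperature)
            reward = reward_fn(*rollout, prompt)
            model.rl_step(rollout, reward)

The split mirrors the abstract: SFT supplies the CoT-then-image behavior, and RL then optimizes it against a semantic reward rather than a token-level loss.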
Authors (12)
Kaihang Pan
Yang Wu
Wendong Bu
Kai Shen
Juncheng Li
Yingting Wang
+6 more
Key Contributions
Introduces a framework for collaborative co-evolution of visual comprehension and generation in MLLMs, advancing image generation into an iterative introspective process. Utilizes a two-stage training approach (SFT + RL) to enable genuine CoT for visual generation and unlock the 'Aha moment' in image generation.
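
The "iterative introspective process" can be pictured as a generate-critique-revise loop in which the comprehension branch evaluates the generation branch's output. The sketch below is hypothetical: draft, critique, and revise are illustrative names for this loop, not the paper's API.

    class ToyIntrospectiveModel:
        """Mock stand-in; all methods are hypothetical, not the paper's API."""

        def draft(self, prompt: str) -> tuple[str, str]:
            return f"cot for: {prompt}", "image-v0"

        def critique(self, prompt: str, image: str) -> tuple[bool, str]:
            # Comprehension branch acts as evaluator; this mock accepts
            # any revised image.
            return image.endswith("v1"), "fix prompt-image mismatch"

        def revise(self, prompt: str, image: str, feedback: str) -> tuple[str, str]:
            return f"cot addressing: {feedback}", "image-v1"

    def introspective_generate(model: ToyIntrospectiveModel, prompt: str,
                               max_rounds: int = 3) -> tuple[str, str]:
        cot, image = model.draft(prompt)          # initial CoT + image
        for _ in range(max_rounds):
            ok, feedback = model.critique(prompt, image)
            if ok:                                # model judges its own output acceptable
                break
            cot, image = model.revise(prompt, image, feedback)
        return cot, image

    print(introspective_generate(ToyIntrospectiveModel(), "a red cube on a table"))

Because the critic and the generator share one model, improvements in comprehension can directly feed back into generation, which is the co-evolution the contribution highlights.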
Business Value
Enables more intelligent and creative AI systems that generate images with deeper understanding and reasoning, opening novel applications in art, design, and content creation.