Abstract
Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify
visual comprehension and generation. However, these two capabilities remain
largely independent, as if they are two separate functions encapsulated within
the same model. Consequently, visual comprehension does not enhance visual
generation, and the reasoning mechanisms of LLMs have not been fully integrated
to revolutionize image generation. In this paper, we propose to enable the
collaborative co-evolution of visual comprehension and generation, advancing
image generation into an iterative introspective process. We introduce a
two-stage training approach: supervised fine-tuning equips the MLLM with the
foundational ability to generate a genuine chain-of-thought (CoT) for visual
generation, while
reinforcement learning activates its full potential via an
exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in
visual generation, advancing MLLMs from text-to-image tasks to unified image
generation. Extensive experiments demonstrate that our model not only excels in
text-to-image generation and image editing, but also functions as a superior
image semantic evaluator with enhanced visual comprehension capabilities.
Project Page: https://janus-pro-r1.github.io.
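
To make the two-stage recipe concrete, here is a minimal illustrative sketch of the training loop the abstract describes. It is a mock-up under assumptions, not the authors' implementation: ToyMLLM, reward_fn, and the temperature-annealing schedule are hypothetical stand-ins for the actual model, reward signal, and exploration-exploitation mechanism.

    import random
    from dataclasses import dataclass

    @dataclass
    class Sample:
        prompt: str
        gold_cot: str      # reference chain-of-thought for the target image
        gold_image: bytes  # reference image tokens (placeholder type)

    class ToyMLLM:
        """Mock stand-in for the unified comprehension+generation MLLM."""

        def sft_step(self, sample: Sample) -> float:
            # Stage 1: supervised fine-tuning on (prompt, CoT, image) triples
            # teaches the foundational ability to emit a CoT before image tokens.
            return random.random()  # mock loss

        def sample_rollout(self, prompt: str, temperature: float) -> tuple[str, bytes]:
            # A real policy would sample more diverse CoT/image trajectories
            # at higher temperature; this mock ignores it.
            return f"cot for: {prompt}", b"image-tokens"

        def rl_step(self, rollout: tuple[str, bytes], reward: float) -> None:
            # Stage 2: a policy-gradient-style update that reinforces rollouts
            # whose images score well (mock: no-op).
            pass

    def reward_fn(cot: str, image: bytes, prompt: str) -> float:
        # Hypothetical reward: a semantic scorer (e.g. the model's own
        # comprehension branch) rates image-prompt consistency.
        return random.random()

    def train(model: ToyMLLM, sft_data: list[Sample], rl_prompts: list[str]) -> None:
        # Stage 1: SFT establishes genuine CoT generation.
        for sample in sft_data:
            model.sft_step(sample)
        # Stage 2: RL with an exploration-exploitation trade-off, caricatured
        # here as a sampling temperature annealed from high (explore) to low
        # (exploit) over training.
        for step, prompt in enumerate(rl_prompts):
            temperature = max(0.2, 1.0 - step / max(1, len(rl_prompts)))
            rollout = model.sample_rollout(prompt, temperature)
            reward = reward_fn(*rollout, prompt)
            model.rl_step(rollout, reward)

The split mirrors the abstract: SFT supplies the CoT-then-image behavior, and RL then optimizes it against a semantic reward rather than a token-level loss.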
Authors (12)
Kaihang Pan
Yang Wu
Wendong Bu
Kai Shen
Juncheng Li
Yingting Wang
+6 more
Key Contributions
Introduces a framework for collaborative co-evolution of visual comprehension and generation in MLLMs, advancing image generation into an iterative introspective process. Utilizes a two-stage training approach (SFT + RL) to enable genuine CoT for visual generation and unlock the 'Aha moment' in image generation.
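
The "iterative introspective process" can be pictured as a generate-critique-revise loop in which the comprehension branch evaluates the generation branch's output. The sketch below is hypothetical: draft, critique, and revise are illustrative names for this loop, not the paper's API.

    class ToyIntrospectiveModel:
        """Mock stand-in; all methods are hypothetical, not the paper's API."""

        def draft(self, prompt: str) -> tuple[str, str]:
            return f"cot for: {prompt}", "image-v0"

        def critique(self, prompt: str, image: str) -> tuple[bool, str]:
            # Comprehension branch acts as evaluator; this mock accepts
            # any revised image.
            return image.endswith("v1"), "fix prompt-image mismatch"

        def revise(self, prompt: str, image: str, feedback: str) -> tuple[str, str]:
            return f"cot addressing: {feedback}", "image-v1"

    def introspective_generate(model: ToyIntrospectiveModel, prompt: str,
                               max_rounds: int = 3) -> tuple[str, str]:
        cot, image = model.draft(prompt)          # initial CoT + image
        for _ in range(max_rounds):
            ok, feedback = model.critique(prompt, image)
            if ok:                                # model judges its own output acceptable
                break
            cot, image = model.revise(prompt, image, feedback)
        return cot, image

    print(introspective_generate(ToyIntrospectiveModel(), "a red cube on a table"))

Because the critic and the generator share one model, improvements in comprehension can directly feed back into generation, which is the co-evolution the contribution highlights.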
Business Value
Enables more intelligent and creative AI systems that generate images with deeper understanding and reasoning, opening novel applications in art, design, and content creation.