
More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models

📄 Abstract

Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degradation in the image generation capability of the pre-trained model. We introduce MERGE, a unified model for image generation and depth estimation, starting from a fixed pre-trained text-to-image model. MERGE demonstrates that the pre-trained text-to-image model can do more than image generation: it can also be extended to depth estimation effortlessly. Specifically, MERGE introduces a plug-and-play framework that enables seamless switching between image generation and depth estimation modes through simple and pluggable converters. Meanwhile, we propose a Group Reuse Mechanism to encourage parameter reuse and improve the utilization of the additional learnable parameters. MERGE unleashes the powerful depth estimation capability of the pre-trained text-to-image model while preserving its original image generation ability. Compared to other unified models for image generation and depth estimation, MERGE achieves state-of-the-art performance across multiple depth estimation benchmarks. The code will be made available at https://github.com/H-EmbodVis/MERGE
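
The converter-based mode switching described above can be pictured with a short sketch. The code below is a minimal PyTorch illustration, not MERGE's actual implementation: the class names, the residual adapter design, and the `mode` argument are all hypothetical; the only point carried over from the abstract is that the pre-trained backbone stays frozen and a small pluggable converter is inserted only when the model runs in depth mode.

```python
# Minimal sketch (assumed design, not MERGE's API): a frozen text-to-image
# backbone plus a small pluggable converter that switches the model between
# image generation and depth estimation modes.
import torch
import torch.nn as nn


class PluggableConverter(nn.Module):
    """Small trainable adapter applied on top of frozen backbone features."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Residual update keeps the frozen backbone features intact.
        return h + self.proj(h)


class UnifiedModel(nn.Module):
    def __init__(self, backbone: nn.Module, dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # pre-trained T2I weights stay fixed
        self.depth_converter = PluggableConverter(dim)

    def forward(self, x: torch.Tensor, mode: str = "generation") -> torch.Tensor:
        h = self.backbone(x)
        if mode == "depth":
            h = self.depth_converter(h)  # plug in the converter for depth mode
        return h  # "generation" mode returns the untouched backbone output
```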
Authors (5)
Hongkai Lin
Dingkang Liang
Mingyang Du
Xin Zhou
Xiang Bai
Submitted
October 27, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

MERGE unifies image generation and depth estimation using pre-trained text-to-image diffusion models without catastrophic degradation. It introduces a plug-and-play framework for mode switching and a Group Reuse Mechanism for efficient parameter utilization, demonstrating that diffusion models can extend beyond generation to tasks like depth estimation.
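
The Group Reuse Mechanism is only named, not detailed, in this summary. The sketch below shows one plausible reading, under the assumption that "reuse" means several backbone blocks share a single converter's weights; `build_grouped_converters`, the group size, and the dimensions are hypothetical illustrations rather than the paper's actual scheme.

```python
# Hedged sketch of a "group reuse" idea: share one converter across a group of
# backbone blocks so the added parameters are reused rather than duplicated.
import torch.nn as nn


def build_grouped_converters(num_blocks: int, group_size: int, dim: int) -> nn.ModuleList:
    """Return one converter per block, but only ceil(num_blocks / group_size) unique modules."""
    shared = [nn.Linear(dim, dim) for _ in range(-(-num_blocks // group_size))]
    # Blocks in the same group point to the same module, so its weights are reused.
    return nn.ModuleList([shared[i // group_size] for i in range(num_blocks)])


converters = build_grouped_converters(num_blocks=12, group_size=4, dim=320)
unique_modules = len({id(m) for m in converters})  # 3 shared converters serve 12 blocks
```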

Business Value

Enables more versatile and efficient use of powerful pre-trained generative models for tasks beyond simple image creation, potentially reducing development costs for applications requiring both image synthesis and spatial understanding.