
More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models

📄 Abstract

Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degradation in the image generation capability of the pre-trained model. We introduce MERGE, a unified model for image generation and depth estimation, starting from a fixed pre-trained text-to-image model. MERGE demonstrates that the pre-trained text-to-image model can do more than image generation: it can also be extended to depth estimation effortlessly. Specifically, MERGE introduces a plug-and-play framework that enables seamless switching between image generation and depth estimation modes through simple and pluggable converters. Meanwhile, we propose a Group Reuse Mechanism to encourage parameter reuse and improve the utilization of the additional learnable parameters. MERGE unleashes the powerful depth estimation capability of the pre-trained text-to-image model while preserving its original image generation ability. Compared to other unified models for image generation and depth estimation, MERGE achieves state-of-the-art performance across multiple depth estimation benchmarks. The code will be made available at https://github.com/H-EmbodVis/MERGE
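
The converter-based mode switching described above can be pictured with a short sketch. The code below is a minimal PyTorch illustration, not MERGE's actual implementation: the class names, the residual adapter design, and the `mode` argument are all hypothetical; the only point carried over from the abstract is that the pre-trained backbone stays frozen and a small pluggable converter is inserted only when the model runs in depth mode.

```python
# Minimal sketch (assumed design, not MERGE's API): a frozen text-to-image
# backbone plus a small pluggable converter that switches the model between
# image generation and depth estimation modes.
import torch
import torch.nn as nn


class PluggableConverter(nn.Module):
    """Small trainable adapter applied on top of frozen backbone features."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Residual update keeps the frozen backbone features intact.
        return h + self.proj(h)


class UnifiedModel(nn.Module):
    def __init__(self, backbone: nn.Module, dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # pre-trained T2I weights stay fixed
        self.depth_converter = PluggableConverter(dim)

    def forward(self, x: torch.Tensor, mode: str = "generation") -> torch.Tensor:
        h = self.backbone(x)
        if mode == "depth":
            h = self.depth_converter(h)  # plug in the converter for depth mode
        return h  # "generation" mode returns the untouched backbone output
```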
Authors (5)
Hongkai Lin
Dingkang Liang
Mingyang Du
Xin Zhou
Xiang Bai
Submitted
October 27, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

MERGE unifies image generation and depth estimation using pre-trained text-to-image diffusion models without catastrophic degradation. It introduces a plug-and-play framework for mode switching and a Group Reuse Mechanism for efficient parameter utilization, demonstrating that diffusion models can extend beyond generation to tasks like depth estimation.
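
The Group Reuse Mechanism is only named, not detailed, in this summary. The sketch below shows one plausible reading, under the assumption that "reuse" means several backbone blocks share a single converter's weights; `build_grouped_converters`, the group size, and the dimensions are hypothetical illustrations rather than the paper's actual scheme.

```python
# Hedged sketch of a "group reuse" idea: share one converter across a group of
# backbone blocks so the added parameters are reused rather than duplicated.
import torch.nn as nn


def build_grouped_converters(num_blocks: int, group_size: int, dim: int) -> nn.ModuleList:
    """Return one converter per block, but only ceil(num_blocks / group_size) unique modules."""
    shared = [nn.Linear(dim, dim) for _ in range(-(-num_blocks // group_size))]
    # Blocks in the same group point to the same module, so its weights are reused.
    return nn.ModuleList([shared[i // group_size] for i in range(num_blocks)])


converters = build_grouped_converters(num_blocks=12, group_size=4, dim=320)
unique_modules = len({id(m) for m in converters})  # 3 shared converters serve 12 blocks
```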

Business Value

Enables more versatile and efficient use of powerful pre-trained generative models for tasks beyond simple image creation, potentially reducing development costs for applications requiring both image synthesis and spatial understanding.