📄 Abstract
We introduce a novel framework for metric depth estimation that enhances
pretrained diffusion-based monocular depth estimation (DB-MDE) models with
stereo vision guidance. While existing DB-MDE methods excel at predicting
relative depth, estimating absolute metric depth remains challenging due to
scale ambiguities in single-image scenarios. To address this, we reframe depth
estimation as an inverse problem, leveraging pretrained latent diffusion models
(LDMs) conditioned on RGB images, combined with stereo-based geometric
constraints, to learn scale and shift for accurate depth recovery. Our
training-free solution seamlessly integrates into existing DB-MDE frameworks
and generalizes across indoor, outdoor, and complex environments. Extensive
experiments demonstrate that our approach matches or surpasses state-of-the-art
methods, particularly in challenging scenarios involving translucent and
specular surfaces, all without requiring retraining.
Authors (4)
Tuan Pham
Thanh-Tung Le
Xiaohui Xie
Stephan Mandt
Submitted
October 21, 2025
Key Contributions
Introduces GeoDiff, a framework that enhances diffusion-based monocular depth estimation (DB-MDE) models with stereo vision guidance for metric depth estimation. It reframes depth estimation as an inverse problem, leveraging LDMs and stereo constraints in a training-free manner.
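As a rough illustration of what recovering scale and shift from stereo constraints can mean, the sketch below fits the two parameters in closed form by least squares. This is a simplified stand-in, not the paper's procedure, which applies stereo guidance inside the diffusion sampling process. The function names, the focal length and baseline values, and the assumption that the monocular prediction is affine in depth (rather than in inverse depth) are all hypothetical choices made for this example.

```python
import numpy as np


def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert stereo disparity (pixels) to metric depth (meters): z = f * B / d.

    Pixels with non-positive disparity are marked invalid (NaN).
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.nan)
    valid = disparity > eps
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth


def fit_scale_shift(relative_depth, metric_depth):
    """Least-squares fit of (scale, shift) so that scale * relative + shift ~ metric.

    Only pixels where both inputs are finite contribute to the fit.
    """
    rel = np.asarray(relative_depth, dtype=np.float64).ravel()
    met = np.asarray(metric_depth, dtype=np.float64).ravel()
    valid = np.isfinite(rel) & np.isfinite(met)
    # Design matrix [relative, 1] for the affine model.
    A = np.stack([rel[valid], np.ones(valid.sum())], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, met[valid], rcond=None)
    return float(scale), float(shift)


# Synthetic check: build a relative depth map that is an affine
# transform of a known metric depth map, then recover the transform.
rng = np.random.default_rng(0)
true_depth = rng.uniform(1.0, 10.0, size=(4, 5))   # meters
relative = (true_depth - 0.5) / 2.0                # true = 2.0 * relative + 0.5

focal_px, baseline_m = 500.0, 0.1                  # hypothetical camera rig
disparity = focal_px * baseline_m / true_depth
disparity[0, 0] = 0.0                              # simulate a failed stereo match

stereo_depth = disparity_to_depth(disparity, focal_px, baseline_m)
scale, shift = fit_scale_shift(relative, stereo_depth)
metric = scale * relative + shift                  # metric depth for every pixel
```

Here the fit uses only pixels with a valid stereo match, while the recovered scale and shift are then applied to the full monocular prediction, including the pixel where stereo matching failed.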
Business Value
Enables more accurate 3D perception for robots and autonomous systems, improving navigation, object interaction, and scene understanding, especially in complex visual conditions such as translucent and specular surfaces. Because the approach is training-free, it can be added to existing DB-MDE pipelines without retraining.