GeoDiff: Geometry-Guided Diffusion for Metric Depth Estimation

Abstract

We introduce a novel framework for metric depth estimation that enhances pretrained diffusion-based monocular depth estimation (DB-MDE) models with stereo vision guidance. While existing DB-MDE methods excel at predicting relative depth, estimating absolute metric depth remains challenging due to scale ambiguities in single-image scenarios. To address this, we reframe depth estimation as an inverse problem, leveraging pretrained latent diffusion models (LDMs) conditioned on RGB images, combined with stereo-based geometric constraints, to learn scale and shift for accurate depth recovery. Our training-free solution seamlessly integrates into existing DB-MDE frameworks and generalizes across indoor, outdoor, and complex environments. Extensive experiments demonstrate that our approach matches or surpasses state-of-the-art methods, particularly in challenging scenarios involving translucent and specular surfaces, all without requiring retraining.

Authors (4)

Tuan Pham

Thanh-Tung Le

Xiaohui Xie

Stephan Mandt

Submitted

October 21, 2025

arXiv Category

cs.CV

Key Contributions

Introduces GeoDiff, a framework that enhances diffusion-based monocular depth estimation (DB-MDE) models with stereo vision guidance for metric depth estimation. It reframes depth estimation as an inverse problem, leveraging LDMs and stereo constraints in a training-free manner.
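To make the "scale and shift" idea concrete, here is a minimal sketch of aligning a relative depth map to metric depth obtained from stereo disparity. It is not the paper's guidance procedure, which operates inside the diffusion sampling loop; it is a plain closed-form least-squares fit, shown only to illustrate the affine relationship. The function names, the focal length and baseline values, and the choice to fit in depth space rather than disparity space are all assumptions made for illustration.

```python
import numpy as np


def disparity_to_depth(disparity, focal_px, baseline_m, min_disp=1e-3):
    """Convert stereo disparity (pixels) to metric depth (metres) via Z = f * B / d.

    Pixels with disparity at or below min_disp are marked invalid.
    """
    valid = disparity > min_disp
    depth = np.zeros_like(disparity, dtype=np.float64)
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth, valid


def fit_scale_shift(rel_depth, metric_depth, valid):
    """Least-squares fit of (s, t) so that s * rel_depth + t matches metric_depth.

    Only pixels where valid is True contribute to the fit.
    """
    x = rel_depth[valid]
    y = metric_depth[valid]
    A = np.stack([x, np.ones_like(x)], axis=1)  # design matrix [x, 1]
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(s), float(t)


if __name__ == "__main__":
    # Synthetic example: a relative depth map and a known affine ground truth.
    rng = np.random.default_rng(0)
    rel = rng.random((8, 8))
    true_metric = 4.0 * rel + 0.5

    # Hypothetical camera parameters, chosen only for the demo.
    focal_px, baseline_m = 500.0, 0.1
    disparity = focal_px * baseline_m / true_metric

    stereo_depth, valid = disparity_to_depth(disparity, focal_px, baseline_m)
    s, t = fit_scale_shift(rel, stereo_depth, valid)
    print(f"recovered scale={s:.3f}, shift={t:.3f}")
```

In practice stereo depth is noisy and sparse, so a robust fit (for example with outlier rejection) would likely be preferable to ordinary least squares.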

Business Value

Enables more accurate 3D perception for robots and autonomous systems, improving navigation, object interaction, and scene understanding, especially in complex visual conditions. Reduces reliance on stereo cameras in some applications.

Paper Metadata

Innovation Type

Framework Enhancement

Deployment Feasibility

Moderate. Requires integration with existing DB-MDE frameworks and potentially stereo data for guidance.

Limitations Addressed

The challenge of estimating absolute metric depth from single images due to scale ambiguities, and the difficulty of handling translucent and specular surfaces with existing DB-MDE methods.
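The scale ambiguity mentioned above follows from standard pinhole projection, and a short worked equation may help. The symbols here ($f$ for focal length, $B$ for stereo baseline, $\delta$ for disparity, $\hat{d}$ for the relative depth prediction, $s$ and $t$ for scale and shift) are notation assumed for this illustration, not taken from the paper.

```latex
% A 3D point and any uniformly scaled copy project to the same pixel:
\[
  (X, Y, Z) \mapsto \left( \frac{fX}{Z}, \frac{fY}{Z} \right),
  \qquad
  (kX, kY, kZ) \mapsto \left( \frac{f\,kX}{kZ}, \frac{f\,kY}{kZ} \right)
  = \left( \frac{fX}{Z}, \frac{fY}{Z} \right), \quad k > 0 .
\]
% A single image therefore cannot fix k. A stereo pair with known baseline can:
\[
  Z = \frac{f\,B}{\delta},
  \qquad
  D_{\text{metric}} = s\,\hat{d} + t .
\]
```

The first line shows why a single image leaves the global scale undetermined. The second shows how a known baseline ties disparity to metric depth, which is the kind of geometric constraint that can pin down the scale $s$ and shift $t$ applied to a relative depth map.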

Technical Tags

metric depth estimation, diffusion models, monocular depth estimation (MDE), stereo vision guidance, scale ambiguity, inverse problem, latent diffusion models (LDMs), training-free, translucent surfaces, specular surfaces

Research Topics

3D Computer Vision, Generative Models, Depth Estimation, Geometric Computer Vision

Methods & Architectures

diffusion model enhancement, stereo guidance, inverse problem formulation, training-free integration, Diffusion-based Monocular Depth Estimation (DB-MDE), Latent Diffusion Models (LDMs)

Applications & Tasks

Robotics, Autonomous Driving, Augmented Reality, 3D Reconstruction, Metric Depth Estimation, Scale Ambiguity Resolution, Monocular Vision Challenges, Estimating absolute metric depth from single images, Improving depth estimation accuracy in challenging scenarios

Datasets & Benchmarks

Benchmarks

No specific benchmarks are named; the abstract reports that the approach matches or surpasses state-of-the-art methods.

Related Fields

Computer Vision, Machine Learning, Generative AI, Robotics

Keywords

depth estimation, metric depth, diffusion models, monocular depth, stereo vision, scale ambiguity, LDM, training-free, GeoDiff, computer vision, 3D perception

Academic Context

#3D Computer Vision, #Generative Models, #Depth Estimation, #Geometric Computer Vision

Commercial Potential

Potential Products

Depth estimation libraries for robotics, enhanced perception modules for AR/VR

Target Industries

Automotive, Robotics, Gaming, Manufacturing, Logistics

Use Case Examples

Enabling robots to accurately measure distances to objects for manipulation; improving AR experiences by precisely overlaying virtual objects onto the real world.

Competitive Edge

Offers a training-free method to enhance existing diffusion-based monocular depth estimators with metric accuracy using stereo guidance.

Market Opportunity

Large market for accurate 3D perception technologies.

Revenue Models

Licensing of algorithms, integration services.

Resource Requirements

Compute Needs

Moderate for inference. The method itself is training-free, so the potentially high cost of diffusion model training applies only to the pretrained model it builds on.

Data Requirements

RGB images, potentially paired with stereo data for guidance.

Deployment Constraints

Integration complexity, computational resources for diffusion models.

Scalability

Scalability depends on the underlying diffusion model and stereo guidance integration.

Production Readiness

Maturity Level

Research

Time to Market

2-4 years for integration

Licensing

TBD.

Patent Potential

Moderate for the GeoDiff framework and stereo guidance integration.
