Research Paper (arXiv, computer-vision › 3d-vision) · Relevant for: Robotics Engineers, Computer Vision Engineers, AR/VR Developers, Edge AI Practitioners

Online Video Depth Anything: Temporally-Consistent Depth Prediction with Low Memory Consumption

📄 Abstract

Depth estimation from monocular video has become a key component of many real-world computer vision systems. Recently, Video Depth Anything (VDA) has demonstrated strong performance on long video sequences. However, it relies on batch processing, which prohibits its use in an online setting. In this work, we overcome this limitation and introduce online VDA (oVDA). The key innovation is to employ techniques from Large Language Models (LLMs), namely, caching latent features during inference and masking frames at training. Our oVDA method outperforms all competing online video depth estimation methods in both accuracy and VRAM usage. Low VRAM usage is particularly important for deployment on edge devices. We demonstrate that oVDA runs at 42 FPS on an NVIDIA A100 and at 20 FPS on an NVIDIA Jetson edge device. We will release both code and compilation scripts, making oVDA easy to deploy on low-power hardware.
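The abstract describes the core inference-time idea at a high level: instead of processing a whole clip as a batch, latent features of past frames are cached so each new frame can be processed as it arrives, analogous to KV caching in LLMs. The sketch below is not the paper's implementation; it is a minimal, hypothetical illustration of that pattern, where `OnlineDepthEstimator`, its layers, the cache size, and the `step()` API are all assumptions made for the example.

```python
import collections
import torch
import torch.nn as nn


class OnlineDepthEstimator(nn.Module):
    """Hypothetical sketch of cache-based online video depth estimation.

    A per-frame encoder produces latent tokens; a bounded FIFO cache keeps
    the tokens of the most recent frames so a temporal head can attend to
    them without re-processing past frames (analogous to an LLM KV cache).
    """

    def __init__(self, latent_dim: int = 256, cache_size: int = 8):
        super().__init__()
        # Stand-in for the per-frame backbone (e.g., a ViT encoder in practice).
        self.encoder = nn.Conv2d(3, latent_dim, kernel_size=8, stride=8)
        self.temporal_head = nn.MultiheadAttention(latent_dim, num_heads=4, batch_first=True)
        self.depth_head = nn.Conv2d(latent_dim, 1, kernel_size=1)
        # Bounded cache keeps VRAM usage constant regardless of video length.
        self.cache = collections.deque(maxlen=cache_size)

    @torch.no_grad()
    def step(self, frame: torch.Tensor) -> torch.Tensor:
        """Process one incoming frame (1, 3, H, W) and return its depth map."""
        latent = self.encoder(frame)                   # (1, C, h, w)
        b, c, h, w = latent.shape
        tokens = latent.flatten(2).transpose(1, 2)     # (1, h*w, C)
        self.cache.append(tokens)
        context = torch.cat(list(self.cache), dim=1)   # tokens of cached frames
        fused, _ = self.temporal_head(tokens, context, context)
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        return self.depth_head(fused)                  # (1, 1, h, w)


# Streaming usage: frames arrive one at a time, e.g., from a camera.
model = OnlineDepthEstimator().eval()
for _ in range(5):
    frame = torch.rand(1, 3, 224, 224)
    depth = model.step(frame)
    print(depth.shape)  # torch.Size([1, 1, 28, 28])
```

The key property this illustrates is that per-frame cost and memory stay bounded by the cache size, which is what makes online operation on edge devices feasible.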

Key Contributions

This paper presents online Video Depth Anything (oVDA), which enables temporally consistent depth prediction from monocular video in an online setting with low memory consumption. It adapts techniques from LLMs, namely latent-feature caching during inference and frame masking during training (see the sketch below), to overcome the batch-processing limitation of previous methods, making it suitable for edge devices.
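The second ingredient mentioned here is frame masking at training time. The details are not given in this summary, so the following is only a minimal sketch of one plausible reading: randomly dropping whole frames from a training clip so the model learns to predict depth with incomplete temporal context, mimicking the limited cache available online. The function name, mask ratio, and masking-by-zeroing strategy are assumptions made for illustration.

```python
import torch


def mask_frames(clip: torch.Tensor, mask_ratio: float = 0.3):
    """Hypothetical sketch: randomly zero out whole frames of a training clip.

    clip: (T, C, H, W) video clip. Returns the masked clip and a boolean
    mask of shape (T,) marking which frames were dropped, mimicking frames
    that are not (yet) available in the online, cached setting.
    """
    t = clip.shape[0]
    num_masked = int(round(t * mask_ratio))
    masked_idx = torch.randperm(t)[:num_masked]
    mask = torch.zeros(t, dtype=torch.bool)
    mask[masked_idx] = True
    masked_clip = clip.clone()
    masked_clip[mask] = 0.0       # drop the selected frames entirely
    return masked_clip, mask


clip = torch.rand(16, 3, 224, 224)          # 16-frame training clip
masked_clip, mask = mask_frames(clip)
print(mask.sum().item(), "of", clip.shape[0], "frames masked")
```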

Business Value

Enables real-time, accurate depth perception for applications on resource-constrained devices like drones, mobile robots, and AR/VR headsets, significantly expanding the possibilities for mobile AI applications.
