Research paper, arXiv (computer vision). Audience: computer vision researchers, robotics engineers, VR/AR developers, HCI researchers.

Cascaded Diffusion Framework for Probabilistic Coarse-to-Fine Hand Pose Estimation

Abstract

Deterministic models for 3D hand pose reconstruction, whether single-stage or cascaded, struggle with pose ambiguities caused by self-occlusions and complex hand articulations. Existing cascaded approaches refine predictions in a coarse-to-fine manner but remain deterministic and cannot capture pose uncertainties. Recent probabilistic methods model pose distributions yet are restricted to single-stage estimation, which often fails to produce accurate 3D reconstructions without refinement. To address these limitations, we propose a coarse-to-fine cascaded diffusion framework that combines probabilistic modeling with cascaded refinement. The first stage is a joint diffusion model that samples diverse 3D joint hypotheses, and the second stage is a Mesh Latent Diffusion Model (Mesh LDM) that reconstructs a 3D hand mesh conditioned on a joint sample. By training Mesh LDM with diverse joint hypotheses in a learned latent space, our framework learns distribution-aware joint-mesh relationships and robust hand priors. Furthermore, the cascaded design mitigates the difficulty of directly mapping 2D images to dense 3D poses, enhancing accuracy through sequential refinement. Experiments on FreiHAND and HO3Dv2 demonstrate that our method achieves state-of-the-art performance while effectively modeling pose distributions.
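
The two-stage sampling flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the noise predictor `toy_eps_model` and the decoder `toy_decode` are untrained, hypothetical stand-ins for learned networks, and the 21-joint skeleton, latent size, tiny vertex count, step count, and linear noise schedule are all assumptions chosen for illustration. Only the overall structure (sample joint hypotheses first, then sample a mesh latent conditioned on each hypothesis and decode it) follows the abstract. The sampler itself is generic DDPM ancestral sampling.

```python
import math
import random

NUM_JOINTS = 21   # assumption: common 21-joint hand skeleton
LATENT_DIM = 16   # assumption: illustrative latent size
NUM_VERTS = 8     # assumption: tiny stand-in mesh, real hand meshes are far denser
STEPS = 50        # assumption: number of diffusion steps


def make_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; returns betas and cumulative products of (1 - beta)."""
    betas = [beta_start + (beta_end - beta_start) * t / (steps - 1) for t in range(steps)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return betas, alpha_bars


def toy_eps_model(x, t, cond):
    """Hypothetical, untrained noise predictor; a real one is a neural network."""
    # Use the mean of the conditioning vector as a crude target value.
    target = sum(cond) / len(cond) if cond else 0.0
    return [xi - target for xi in x]


def ddpm_sample(dim, eps_model, cond, rng, steps=STEPS):
    """Generic DDPM ancestral sampling over a flat vector of length dim."""
    betas, alpha_bars = make_schedule(steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    for t in reversed(range(steps)):
        eps = eps_model(x, t, cond)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        x = [scale * (xi - coef * ei) for xi, ei in zip(x, eps)]  # posterior mean
        if t > 0:  # no noise is added at the final step
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


def toy_decode(latent, num_verts=NUM_VERTS):
    """Hypothetical stand-in for a learned mesh decoder (latent -> vertices)."""
    return [[latent[(3 * v + c) % len(latent)] for c in range(3)] for v in range(num_verts)]


def cascaded_sample(image_feat, num_hypotheses=4, seed=0):
    """Stage 1 samples joints; stage 2 samples a mesh latent conditioned on them."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_hypotheses):
        # Stage 1: one 3D joint hypothesis, conditioned on image features.
        flat = ddpm_sample(NUM_JOINTS * 3, toy_eps_model, image_feat, rng)
        joints = [flat[i:i + 3] for i in range(0, len(flat), 3)]
        # Stage 2: mesh latent conditioned on image features and the joint sample.
        latent = ddpm_sample(LATENT_DIM, toy_eps_model, image_feat + flat, rng)
        results.append((joints, toy_decode(latent)))
    return results


hypotheses = cascaded_sample([0.1, -0.2, 0.3], num_hypotheses=3)
print(len(hypotheses), len(hypotheses[0][0]), len(hypotheses[0][1]))
```

Each call returns several (joints, mesh) pairs for the same input, which is the sense in which a cascaded probabilistic model yields a distribution over poses rather than a single estimate. With trained networks in place of the stand-ins, the spread across hypotheses would reflect ambiguity in the image.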

Key Contributions

The paper proposes a coarse-to-fine cascaded diffusion framework for probabilistic 3D hand pose estimation, combining probabilistic modeling with cascaded refinement. A joint diffusion model samples diverse 3D joint hypotheses, and a Mesh LDM reconstructs a 3D hand mesh conditioned on a sampled hypothesis, which addresses pose ambiguities from self-occlusion and complex articulation and captures pose uncertainty.

Business Value

Enables more realistic and robust hand interaction in virtual and augmented reality, improves robotic manipulation through better hand tracking, and aids clinical analysis of hand movements.