CUPID: Pose-Grounded Generative 3D Reconstruction from a Single Image

Abstract

This work proposes a new generation-based 3D reconstruction method, named Cupid, that accurately infers the camera pose, 3D shape, and texture of an object from a single 2D image. Cupid casts 3D reconstruction as a conditional sampling process from a learned distribution of 3D objects, and it jointly generates voxels and pixel-voxel correspondences, enabling robust pose and shape estimation under a unified generative framework. By representing both input camera poses and 3D shape as a distribution in a shared 3D latent space, Cupid adopts a two-stage flow matching pipeline: (1) a coarse stage that produces initial 3D geometry with associated 2D projections for pose recovery; and (2) a refinement stage that integrates pose-aligned image features to enhance structural fidelity and appearance details. Extensive experiments demonstrate Cupid outperforms leading 3D reconstruction methods with an over 3 dB PSNR gain and an over 10% Chamfer Distance reduction, while matching monocular estimators on pose accuracy and delivering superior visual fidelity over baseline 3D generative models. For an immersive view of the 3D results generated by Cupid, please visit cupid3d.github.io.
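The abstract reports its gains in PSNR (an image-space metric) and Chamfer Distance (a point-set metric). As a rough illustration of what those numbers measure, here is a minimal sketch using the commonly cited definitions. The paper's exact evaluation protocol is not given here, so the function names, the squared-distance Chamfer variant, and the toy inputs are all assumptions.

```python
import math

def psnr(img_a, img_b, max_val=1.0):
    """PSNR in dB between two equally sized flat lists of pixel intensities."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance (assumed variant: mean squared
    nearest-neighbor distance, summed over both directions)."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)

# Toy inputs, purely illustrative.
toy_psnr = psnr([0.0], [0.1])
toy_cd = chamfer_distance([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 1)])
```

Because PSNR is logarithmic in mean squared error, a gain of about 3 dB corresponds to roughly halving the error (10 · log10(2) ≈ 3.01).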
Authors (6)
Binbin Huang
Haobin Duan
Yiqun Zhao
Zibo Zhao
Yi Ma
Shenghua Gao
Submitted: October 23, 2025
arXiv Category: cs.CV

Key Contributions

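CUPID is a generation-based 3D reconstruction method that jointly estimates camera pose, shape, and texture from a single image. It casts reconstruction as conditional sampling from a learned distribution of 3D objects and uses a two-stage flow matching pipeline to generate voxels and pixel-voxel correspondences. The authors report an over 3 dB PSNR gain and an over 10% Chamfer Distance reduction relative to leading 3D reconstruction methods, with pose accuracy matching that of monocular estimators.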
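To make "conditional sampling" with flow matching concrete, the sketch below integrates a velocity field from noise (t = 0) to a sample (t = 1) with fixed Euler steps. It is not the authors' implementation: `euler_sample` and `toy_velocity` are hypothetical names, and the toy field stands in for a trained network by using the known closed-form velocity of a straight-line path toward a target, with the "condition" being that target itself.

```python
import random

def euler_sample(velocity, x0, cond, num_steps=50):
    """Integrate dx/dt = velocity(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1 inside the loop
        v = velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def toy_velocity(x, t, cond):
    # Hypothetical stand-in for a learned network. For the straight-line path
    # x_t = (1 - t) * noise + t * target, the velocity toward a known target
    # is (target - x_t) / (1 - t).
    return [(ci - xi) / (1.0 - t) for xi, ci in zip(x, cond)]

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(3)]  # starting sample at t=0
target = [0.5, -1.0, 2.0]                           # toy "condition"
sample = euler_sample(toy_velocity, noise, target)
```

In a real system the velocity would come from a network conditioned on image features, and a two-stage pipeline would presumably run a sampler of this kind once per stage with different conditioning. The sketch shows only the integration loop.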

Business Value

CUPID could enable the creation of detailed, textured 3D assets from a single 2D image, which is valuable for AR/VR content creation, game development, and product visualization.