CUPID: Pose-Grounded Generative 3D Reconstruction from a Single Image

Abstract

This work proposes a new generation-based 3D reconstruction method, named Cupid, that accurately infers the camera pose, 3D shape, and texture of an object from a single 2D image. Cupid casts 3D reconstruction as a conditional sampling process from a learned distribution of 3D objects, and it jointly generates voxels and pixel-voxel correspondences, enabling robust pose and shape estimation under a unified generative framework. By representing both input camera poses and 3D shape as a distribution in a shared 3D latent space, Cupid adopts a two-stage flow matching pipeline: (1) a coarse stage that produces initial 3D geometry with associated 2D projections for pose recovery; and (2) a refinement stage that integrates pose-aligned image features to enhance structural fidelity and appearance details. Extensive experiments demonstrate that Cupid outperforms leading 3D reconstruction methods, with a PSNR gain of over 3 dB and a Chamfer Distance reduction of over 10%, while matching monocular estimators on pose accuracy and delivering superior visual fidelity over baseline 3D generative models. For an immersive view of the 3D results generated by Cupid, please visit cupid3d.github.io.
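To make "conditional sampling with flow matching" concrete, here is a minimal toy sketch, not Cupid's implementation. It assumes the common straight-line (linear interpolation) path between noise x0 and data x1, a 3-D point instead of Cupid's 3D latent, and a closed-form velocity field (`straight_path_velocity`, a hypothetical name) standing in for the learned, image-conditioned network. Training regresses the network's velocity onto x1 - x0; sampling integrates the ODE dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps. The abstract describes two such stages (coarse and refinement); only one is shown.

```python
import random


def straight_path_velocity(x, t, x1):
    """Velocity of the straight path that ends at x1.

    Along x_t = (1 - t) * x0 + t * x1, the point must still travel x1 - x_t
    in the remaining time 1 - t, so v(x, t) = (x1 - x) / (1 - t).
    Hypothetical stand-in for a learned network v_theta(x, t, image)."""
    return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]


def flow_matching_loss(model, x0, x1, t):
    """Conditional flow-matching loss at time t: regress the model's velocity
    at x_t = (1 - t) * x0 + t * x1 onto the target velocity x1 - x0."""
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u_target = [b - a for a, b in zip(x0, x1)]
    pred = model(xt, t)
    return sum((p - u) ** 2 for p, u in zip(pred, u_target)) / len(u_target)


def euler_sample(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 (noise) to t = 1 (sample)."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(3)]  # x0 ~ N(0, I)
data_point = [0.5, -1.0, 2.0]                       # hypothetical "data" x1
field = lambda x, t: straight_path_velocity(x, t, data_point)

loss = flow_matching_loss(field, noise, data_point, t=0.3)  # ~0: field is the exact target
sample = euler_sample(field, noise, steps=50)               # lands on data_point
```

Because this toy field is the exact regression target, its loss is numerically zero and Euler integration reaches the data point; a trained network only approximates the field, so real samples follow a learned distribution rather than a single point.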

Authors (6)

Binbin Huang

Haobin Duan

Yiqun Zhao

Zibo Zhao

Yi Ma

Shenghua Gao

Submitted

October 23, 2025

arXiv Category

cs.CV

Key Contributions

CUPID proposes a generation-based 3D reconstruction method that unifies camera pose, shape, and texture estimation from a single image. It casts reconstruction as conditional sampling from a learned distribution of 3D objects and uses a two-stage flow matching pipeline to jointly generate voxels and pixel-voxel correspondences, reporting higher reconstruction quality than leading 3D reconstruction methods.

Business Value

Enables creation of detailed 3D assets from 2D images, valuable for AR/VR content creation, game development, and product visualization.

Paper Metadata

Innovation Type

Algorithmic

Deployment Feasibility

Requires significant computational resources for training and inference, but the resulting textured 3D assets can feed downstream applications such as AR/VR, games, and product visualization.

Limitations Addressed

Existing methods often struggle to estimate pose and shape robustly from a single image, particularly within one unified framework. CUPID addresses this by jointly generating voxels and pixel-voxel correspondences within a shared 3D latent space.
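Why do pixel-voxel correspondences help with pose? Once each generated voxel is matched to a pixel, a candidate camera pose can be scored by how well it reprojects the voxels onto their matched pixels. The sketch below is a deliberately simplified illustration, not Cupid's pose solver: it assumes a pinhole camera with known focal length, principal point, and translation, and searches only over yaw (rotation about the y axis) on a 1-degree grid. All function names and values are hypothetical; a full solution would estimate all six pose parameters, for example with a PnP solver such as OpenCV's `solvePnP`.

```python
import math


def rotation_y(theta):
    """3x3 rotation matrix for a rotation of theta radians about the y axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def project(point, R, t, f, cx, cy):
    """Pinhole projection: X = R @ p + t, pixel = f * (X_x, X_y) / X_z + (cx, cy)."""
    X = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    return (f * X[0] / X[2] + cx, f * X[1] / X[2] + cy)


def reprojection_error(points3d, pixels, R, t, f, cx, cy):
    """Mean pixel distance between projected 3D points and their matched 2D pixels."""
    total = 0.0
    for p, (u, v) in zip(points3d, pixels):
        pu, pv = project(p, R, t, f, cx, cy)
        total += math.hypot(pu - u, pv - v)
    return total / len(points3d)


def estimate_yaw(points3d, pixels, t, f, cx, cy, num_candidates=360):
    """Grid-search the yaw angle that minimises reprojection error."""
    best_theta, best_err = 0.0, float("inf")
    for k in range(num_candidates):
        theta = 2.0 * math.pi * k / num_candidates
        err = reprojection_error(points3d, pixels, rotation_y(theta), t, f, cx, cy)
        if err < best_err:
            best_theta, best_err = theta, err
    return best_theta, best_err


# Hypothetical voxel centres (object frame) and a synthetic ground-truth camera.
voxels = [(-0.5, -0.5, -0.5), (0.5, -0.5, -0.5), (0.5, 0.5, -0.5),
          (-0.5, 0.5, 0.5), (0.2, -0.3, 0.4), (0.0, 0.5, 0.1)]
t_cam = [0.0, 0.0, 4.0]            # object 4 units in front of the camera
f, cx, cy = 500.0, 256.0, 256.0    # assumed intrinsics
true_yaw = 2.0 * math.pi * 40 / 360  # 40 degrees, chosen to lie on the search grid
pixels = [project(p, rotation_y(true_yaw), t_cam, f, cx, cy) for p in voxels]

yaw, err = estimate_yaw(voxels, pixels, t_cam, f, cx, cy)  # recovers 40 degrees, error 0
```

The ordering of the correspondences is what disambiguates the pose: pixel i must match voxel i, so even a symmetric shape yields a unique best yaw here.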

Performance Gains

Over 3 dB PSNR gain and over 10% Chamfer Distance reduction compared to leading 3D reconstruction methods, with pose accuracy matching monocular estimators.
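For reading these numbers: PSNR is 10·log10(MAX² / MSE), so a 3 dB gain corresponds to roughly halving the mean squared error (10·log10(2) ≈ 3.01 dB), and Chamfer Distance measures how far each point of one point cloud is from its nearest neighbour in the other. The sketch below uses common textbook definitions; the exact conventions used in the paper (squared vs. unsquared distances, sum vs. mean, point counts, normalisation) are not stated here, so the helper names and choices are assumptions, and values are only comparable across methods evaluated under the same convention.

```python
import math


def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB for two same-size images (flat lists of pixel values)."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0.0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)


def chamfer_distance(pts_a, pts_b):
    """Symmetric Chamfer distance using mean squared nearest-neighbour distances.

    Brute force O(|A| * |B|); large evaluations typically use KD-trees or GPU kernels."""
    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    a_to_b = sum(min(sq_dist(p, q) for q in pts_b) for p in pts_a) / len(pts_a)
    b_to_a = sum(min(sq_dist(q, p) for p in pts_a) for q in pts_b) / len(pts_b)
    return a_to_b + b_to_a


reference = [0.0, 0.0, 0.0, 0.0]
render_1 = [0.1] * 4                    # MSE = 0.01  -> about 20 dB
render_2 = [0.1 / math.sqrt(2.0)] * 4   # MSE = 0.005 -> about 23 dB
gain_db = psnr(reference, render_2) - psnr(reference, render_1)  # 10*log10(2), about 3.01

cd = chamfer_distance([(0, 0, 0), (1, 0, 0)],
                      [(0, 0, 0), (1, 0, 0), (0, 2, 0)])  # 0 + 4/3
```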

Technical Tags

3D reconstruction, generative models, pose estimation, shape estimation, texture generation, latent space, flow matching, voxel generation, pixel-voxel correspondence, conditional sampling

Research Topics

3D Computer Vision, Generative Modeling, Image-to-3D Reconstruction, Deep Learning, Geometric Deep Learning

Methods & Architectures

Conditional sampling, Flow matching, Voxel generation, Pixel-voxel correspondence, Latent space representation, Generative model, Two-stage pipeline

Applications & Tasks

Computer Graphics, Robotics, Augmented Reality, Virtual Reality, 3D Shape and Texture Reconstruction, Pose Estimation, Single-Image 3D Understanding, 3D Reconstruction from Single Image, Camera Pose Estimation, Object Shape and Texture Generation

Related Fields

Computer Vision, Computer Graphics, Machine Learning, Deep Learning, Generative Models

Keywords

3D reconstruction, single image, generative model, pose estimation, shape estimation, texture generation, latent space, flow matching, voxel, correspondence, conditional sampling, computer vision, deep learning

Commercial Potential

Potential Products

3D asset generation tools, AR/VR content creation platforms, Virtual try-on solutions

Target Industries

Gaming, Entertainment, E-commerce, Architecture, Manufacturing

Use Case Examples

Generating 3D models of objects from photos for online catalogs. Creating 3D environments for virtual reality experiences.

Competitive Edge

Reports higher reconstruction accuracy and more robust pose and shape estimation than leading 3D reconstruction methods, recovering camera pose, shape, and texture within one unified generative framework.

Market Opportunity

Growing market for 3D content creation and AR/VR applications.

Revenue Models

Technology licensing, SaaS for 3D generation services

Resource Requirements

Compute Needs

High (for training and inference)

Data Requirements

Large datasets of 2D images with corresponding 3D models and poses.

Deployment Constraints

Computational cost, potential need for specialized hardware.

Scalability

Scalability depends on the generative model's capacity and the efficiency of the flow matching pipeline.

Production Readiness

Maturity Level

Research

Time to Market

1-3 years

Patent Potential

Moderate (novel generative approach)
