📄 Abstract
Despite remarkable progress on 3D human pose and shape
estimation (HPS), current state-of-the-art methods rely heavily on
either confined indoor mocap datasets or datasets generated by a rendering
engine using computer graphics (CG). Both categories of datasets lack
sufficiently diverse human identities and authentic in-the-wild
background scenes, which are crucial for accurately simulating real-world
distributions. In this work, we show that synthetic data created by generative
models is complementary to CG-rendered data for achieving remarkable
generalization performance on diverse real-world scenes. We propose an
effective data generation pipeline based on recent diffusion models, termed
HumanWild, which can effortlessly generate human images and corresponding 3D
mesh annotations. Specifically, we first collect a large-scale human-centric
dataset with comprehensive annotations, e.g., text captions, depth maps, and
surface normal maps. To generate a wide variety of human images with initial
labels, we train a customized, multi-condition ControlNet model. The key to
this process is using a 3D parametric model, e.g., SMPL-X, to create various
condition inputs easily. Our data generation pipeline is both flexible and
customizable, making it adaptable to multiple real-world tasks, such as human
interaction in complex scenes and humans captured by wide-angle lenses. By
relying solely on generative models, we can produce large-scale, in-the-wild
human images with high-quality annotations, significantly reducing the need for
manual image collection and annotation. The generated dataset encompasses a
wide range of viewpoints, environments, and human identities, ensuring its
versatility across different scenarios. We hope that our work will pave the
way for scaling up 3D human recovery to in-the-wild scenes.
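The sketch below illustrates the general idea of conditioning a diffusion model on geometry rendered from a 3D parametric body. It is not the authors' pipeline: HumanWild trains its own customized, multi-condition ControlNet, whereas this example substitutes publicly available depth and surface-normal ControlNets from the diffusers library, and the SMPL-X condition maps are assumed to have been rendered beforehand (the file names are hypothetical).

```python
# Minimal sketch: generate a human image conditioned on depth and normal maps
# rendered from a posed SMPL-X body. Stock pretrained ControlNets stand in for
# the paper's customized multi-condition model (an assumption, not the method).
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnets = [
    ControlNetModel.from_pretrained(
        "lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16
    ),
    ControlNetModel.from_pretrained(
        "lllyasviel/control_v11p_sd15_normalbae", torch_dtype=torch.float16
    ),
]

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnets,
    torch_dtype=torch.float16,
).to("cuda")

# Condition maps rendered from the SMPL-X mesh (hypothetical file paths).
depth_map = Image.open("smplx_depth.png").convert("RGB")
normal_map = Image.open("smplx_normal.png").convert("RGB")

# Text caption plus geometric conditions -> an in-the-wild human image whose
# pose stays consistent with the SMPL-X parameters behind the condition maps,
# so those parameters can serve as the 3D mesh annotation.
image = pipe(
    prompt="a person hiking on a rocky mountain trail, golden hour, photo",
    image=[depth_map, normal_map],
    controlnet_conditioning_scale=[1.0, 0.7],
    num_inference_steps=30,
).images[0]
image.save("humanwild_sample.png")
```

Because the condition maps are rendered from known SMPL-X parameters, each generated image comes paired with its 3D mesh label for free, which is the property the abstract relies on.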
Authors (7)
Yongtao Ge
Wenjia Wang
Yongfan Chen
Fanzhou Wang
Lei Yang
Hao Chen
+1 more
Key Contributions
Demonstrates that diffusion models can efficiently generate high-quality synthetic data (HumanWild) that complements CG-rendered data for human mesh recovery. This synthetic data improves the generalization performance of 3D HPS models on diverse real-world scenes by providing realistic human identities and backgrounds.
Business Value
Enables the development of more robust and accurate 3D human understanding systems for applications like animation, gaming, robotics, and virtual try-on, by providing high-quality, diverse training data.