📄 Abstract
Recently, diffusion models have shown impressive ability in visual
generation tasks. Beyond static images, increasing research attention has
been drawn to the generation of realistic videos. Video generation not only
places higher demands on quality, but also poses the challenge of
ensuring video continuity. Among video generation tasks,
human-centric content, such as dancing, is even more difficult to
generate due to the many degrees of freedom of human motion. In
this paper, we propose a novel framework, named DANCER (Dance ANimation via
Condition Enhancement and Rendering with Diffusion Model), for realistic
single-person dance synthesis based on the most recent stable video diffusion
model. As video generation is generally guided by a reference image and a
video sequence, we introduce two important modules into our framework to fully
benefit from the two inputs. More specifically, we design an Appearance
Enhancement Module (AEM) to focus more on the details of the reference image
during the generation, and extend the motion guidance through a Pose Rendering
Module (PRM) to capture pose conditions from extra domains. To further improve
the generation capability of our model, we also collect a large amount of video
data from the Internet and construct a novel dataset, TikTok-3K, to enhance model
training. The effectiveness of the proposed model has been evaluated through
extensive experiments on real-world datasets, where the performance of our
model is superior to that of state-of-the-art methods. All data and
code will be released upon acceptance.
Authors (3)
Yucheng Xing
Jinxing Yin
Xiaodong Liu
Submitted
October 31, 2025
Key Contributions
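DANCER is a framework for realistic single-person dance synthesis built on a stable video diffusion model. It adds two modules: an Appearance Enhancement Module (AEM), which focuses generation on the details of the reference image, and a Pose Rendering Module (PRM), which extends motion guidance with pose conditions from extra domains. Together they make fuller use of the reference image and the guiding video sequence. The authors also collect a new dataset, TikTok-3K, to strengthen training.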
Business Value
Could make animated content creation more efficient and realistic, reducing the cost and time of producing dance sequences for games, films, and virtual experiences.