📄 Abstract
Reward-based fine-tuning of video diffusion models is an effective approach
to improving the quality of generated videos, as it allows models to be fine-tuned
without real-world video datasets. However, its benefits can be limited to
specific aspects of performance, because conventional reward functions mainly aim
to enhance quality across the whole generated video sequence, such as
aesthetic appeal and overall consistency. Notably, the temporal consistency of
the generated video often suffers when previous approaches are applied to
image-to-video (I2V) generation tasks. To address this limitation, we propose
Video Consistency Distance (VCD), a novel metric designed to enhance temporal
consistency, and use it to fine-tune a model within the reward-based fine-tuning framework.
To achieve coherent temporal consistency relative to a conditioning image, VCD
is defined in the frequency space of video frame features, capturing frame
information effectively through frequency-domain analysis. Experimental results
across multiple I2V datasets demonstrate that fine-tuning a video generation
model with VCD significantly enhances temporal consistency without degrading
other performance metrics, compared with previous methods.
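To make the idea concrete, below is a minimal sketch of a VCD-style reward, assuming (but not reproducing) the paper's setup: per-frame features come from a frozen encoder, the deviation from the conditioning image's features is transformed to the frequency domain along the temporal axis, and temporally inconsistent videos are penalized through their non-zero-frequency spectral energy. The function name, tensor shapes, and normalization choices are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a frequency-space consistency distance (not the paper's code).
import torch


def video_consistency_distance(frame_features: torch.Tensor,
                               cond_feature: torch.Tensor) -> torch.Tensor:
    """Illustrative VCD-style distance.

    frame_features: (T, D) features of the T generated frames.
    cond_feature:   (D,)   feature of the conditioning image.
    Returns a scalar distance; a reward can be defined as its negative.
    """
    # Per-frame deviation from the conditioning image's features.
    deviation = frame_features - cond_feature.unsqueeze(0)      # (T, D)

    # Move to the frequency domain along the temporal axis so that temporal
    # fluctuations of the features show up as spectral energy.
    spectrum = torch.fft.rfft(deviation, dim=0)                  # (T//2 + 1, D)

    # Penalize non-constant frequency components: a temporally consistent video
    # concentrates its energy in the zero-frequency (DC) component.
    return spectrum[1:].abs().pow(2).mean()


# Example usage with random stand-in features (illustrative only).
if __name__ == "__main__":
    feats = torch.randn(16, 512)   # 16 frames, 512-dim features
    cond = torch.randn(512)        # conditioning-image features
    vcd = video_consistency_distance(feats, cond)
    reward = -vcd                  # reward-based fine-tuning would maximize this
    print(float(vcd), float(reward))
```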
Authors (3)
Takehiro Aoshima
Yusuke Shinohara
Byeongseon Park
Submitted
October 22, 2025
Key Contributions
Proposes Video Consistency Distance (VCD), a novel metric for enhancing temporal consistency in image-to-video generation. VCD operates in the frequency space of video frame features to effectively capture frame information, addressing limitations of previous reward functions that focused on overall video quality.
Business Value
Enables the creation of more coherent and realistic videos from static images, which can be valuable for applications in entertainment, advertising, and virtual content creation.