📄 Abstract
Video generation has achieved significant advances through rectified flow
techniques, but issues like unsmooth motion and misalignment between videos and
prompts persist. In this work, we develop a systematic pipeline that harnesses
human feedback to mitigate these problems and refine the video generation
model. Specifically, we begin by constructing a large-scale human preference
dataset focused on modern video generation models, incorporating pairwise
annotations across multiple dimensions. We then introduce VideoReward, a
multi-dimensional video reward model, and examine how annotations and various
design choices impact its rewarding efficacy. From a unified reinforcement
learning perspective aimed at maximizing reward with KL regularization, we
introduce three alignment algorithms for flow-based models. These include two
training-time strategies: direct preference optimization for flow (Flow-DPO)
and reward weighted regression for flow (Flow-RWR), and an inference-time
technique, Flow-NRG, which applies reward guidance directly to noisy videos.
Experimental results indicate that VideoReward significantly outperforms
existing reward models, and Flow-DPO demonstrates superior performance compared
to both Flow-RWR and supervised fine-tuning methods. Additionally, Flow-NRG
lets users assign custom weights to multiple objectives during inference,
meeting personalized video quality needs.
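
The idea of weighting several objectives at inference time can be illustrated with a minimal sketch. This is not the paper's Flow-NRG procedure, which applies guidance from a learned reward model to noisy videos; it is a toy under stated assumptions. The objective names (`motion_quality`, `text_alignment`), the analytic quadratic rewards, the `scale` parameter, and both helper functions are hypothetical and chosen only to show how user-chosen weights could combine per-objective reward gradients before a guided sampling step.

```python
from typing import Callable, Dict, List

Vector = List[float]
GradFn = Callable[[Vector], Vector]


def combine_reward_gradients(
    x: Vector,
    grad_fns: Dict[str, GradFn],
    weights: Dict[str, float],
) -> Vector:
    """Return the weighted sum of per-objective reward gradients at x.

    Objectives missing from `weights` get weight 0.0 (an assumption of
    this sketch, not something stated in the source).
    """
    total = [0.0] * len(x)
    for name, grad_fn in grad_fns.items():
        w = weights.get(name, 0.0)
        g = grad_fn(x)
        total = [t + w * gi for t, gi in zip(total, g)]
    return total


def guided_euler_step(
    x: Vector, velocity: Vector, guidance: Vector, dt: float, scale: float
) -> Vector:
    """One Euler step along the base velocity plus scaled reward guidance."""
    return [xi + dt * (vi + scale * gi) for xi, vi, gi in zip(x, velocity, guidance)]


def quadratic_reward_grad(target: Vector) -> GradFn:
    """Gradient of the toy reward r(x) = -0.5 * ||x - target||^2."""
    return lambda x: [t - xi for t, xi in zip(target, x)]


# Hypothetical objectives standing in for learned reward dimensions.
grad_fns = {
    "motion_quality": quadratic_reward_grad([1.0, 0.0]),
    "text_alignment": quadratic_reward_grad([0.0, 2.0]),
}
# User-chosen weights expressing a personal trade-off between objectives.
weights = {"motion_quality": 0.5, "text_alignment": 0.25}

x = [0.0, 0.0]
guidance = combine_reward_gradients(x, grad_fns, weights)
x_next = guided_euler_step(x, velocity=[0.0, 0.0], guidance=guidance, dt=0.5, scale=1.0)
```

Changing the entries of `weights` shifts the guidance toward one objective or the other without retraining anything, which is the property the abstract attributes to inference-time guidance.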
Authors (17)
Jie Liu
Gongye Liu
Jiajun Liang
Ziyang Yuan
Xiaokun Liu
Mingwu Zheng
+11 more
Submitted
January 23, 2025
Key Contributions
This work develops a systematic pipeline that uses human feedback to improve video generation, targeting motion smoothness and prompt alignment. It introduces a large-scale human preference dataset, the multi-dimensional VideoReward model, and three alignment algorithms derived from a unified RL perspective: two training-time methods (Flow-DPO and Flow-RWR) and one inference-time method (Flow-NRG).
Business Value
Enables higher-quality, more controllable video generation that better matches human preferences, with applications in creative industries and personalized media.