📄 Abstract
We consider the problem of text-to-video generation with precise
control for various applications such as camera movement control and
video-to-video editing. Most methods tackling this problem rely on
user-defined controls, such as binary masks or camera movement embeddings.
We propose OnlyFlow, an approach that leverages the optical flow
extracted from an input video to condition the motion of generated
videos. Given a text prompt and an input video, OnlyFlow lets the user
generate videos that respect both the motion of the input video and the text
prompt. This is implemented through an optical flow estimation model applied to
the input video, whose output is fed to a trainable optical flow encoder. The
resulting feature maps are then injected into the text-to-video backbone model. We
perform quantitative, qualitative and user preference studies showing that
OnlyFlow compares favorably to state-of-the-art methods on a wide range of
tasks, even though it was not specifically trained for them.
OnlyFlow thus constitutes a versatile, lightweight yet efficient method for
motion control in text-to-video generation. Models and code will be made
available on GitHub and HuggingFace.
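The pipeline the abstract describes can be sketched in a few lines. This is a minimal, purely illustrative toy in NumPy, not the authors' implementation: the function names, tensor shapes, frame-difference "flow", scalar "encoder", and the additive injection scheme are all assumptions made for clarity.

```python
import numpy as np

def estimate_optical_flow(video):
    """Stand-in for a pretrained optical flow estimator.

    video: (T, H, W) array of grayscale frames. Returns a (T-1, H, W)
    array of per-pixel temporal differences as a crude motion proxy.
    """
    return video[1:] - video[:-1]

def flow_encoder(flow, weight):
    """Placeholder for the trainable optical flow encoder: a single
    learned scalar gain instead of a convolutional network."""
    return weight * flow

def inject_flow_features(backbone_features, flow_features):
    """Additive injection of flow feature maps into backbone activations
    (one plausible injection scheme; the abstract does not specify one)."""
    return backbone_features[: len(flow_features)] + flow_features

# Toy run: 3 frames of 4x4 video conditioning a zero "backbone" tensor.
video = np.linspace(0.0, 1.0, 3 * 4 * 4).reshape(3, 4, 4)
flow = estimate_optical_flow(video)            # shape (2, 4, 4)
features = flow_encoder(flow, weight=0.5)      # shape (2, 4, 4)
conditioned = inject_flow_features(np.zeros((3, 4, 4)), features)
print(conditioned.shape)                       # (2, 4, 4)
```

In the actual method, the frame-difference stand-in would be a dedicated flow model, the scalar gain a trained encoder network, and the injection would happen inside the text-to-video backbone's layers rather than on raw frames.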
Authors (4)
Mathis Koroglu
Hugo Caselles-Dupré
Guillaume Jeanneret Sanmiguel
Matthieu Cord
Submitted
November 15, 2024
Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR) Workshops, 2025, pp. 6225-6235
Key Contributions
Introduces OnlyFlow, a novel approach for text-to-video generation that uses optical flow extracted from an input video to condition the motion of the generated output. This allows for precise control over motion, respecting both the text prompt and the input video's dynamics.
Business Value
Enables more sophisticated and controllable video generation tools for creative industries, potentially reducing production costs and time for visual effects, animation, and personalized content.