📄 Abstract
We consider the problem of text-to-video generation with precise
control for various applications such as camera movement control and
video-to-video editing. Most methods tackling this problem rely on
user-defined controls, such as binary masks or camera movement embeddings.
We propose OnlyFlow, an approach that leverages the optical flow
extracted from an input video to condition the motion of generated
videos. Given a text prompt and an input video, OnlyFlow lets the user
generate videos that respect both the motion of the input video and the text
prompt. This is implemented through an optical flow estimation model applied to
the input video, whose output is fed to a trainable optical flow encoder. The
resulting feature maps are then injected into the text-to-video backbone model. We
perform quantitative, qualitative and user preference studies showing that
OnlyFlow compares favorably to state-of-the-art methods on a wide range of
tasks, even though it was not specifically trained for them.
OnlyFlow thus constitutes a versatile, lightweight yet efficient method for
motion control in text-to-video generation. Models and code will be made
available on GitHub and HuggingFace.
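The pipeline the abstract describes can be sketched in a few lines. This is a minimal, purely illustrative toy in NumPy, not the authors' implementation: the function names, tensor shapes, frame-difference "flow", scalar "encoder", and the additive injection scheme are all assumptions made for clarity.

```python
import numpy as np

def estimate_optical_flow(video):
    """Stand-in for a pretrained optical flow estimator.

    video: (T, H, W) array of grayscale frames. Returns a (T-1, H, W)
    array of per-pixel temporal differences as a crude motion proxy.
    """
    return video[1:] - video[:-1]

def flow_encoder(flow, weight):
    """Placeholder for the trainable optical flow encoder: a single
    learned scalar gain instead of a convolutional network."""
    return weight * flow

def inject_flow_features(backbone_features, flow_features):
    """Additive injection of flow feature maps into backbone activations
    (one plausible injection scheme; the abstract does not specify one)."""
    return backbone_features[: len(flow_features)] + flow_features

# Toy run: 3 frames of 4x4 video conditioning a zero "backbone" tensor.
video = np.linspace(0.0, 1.0, 3 * 4 * 4).reshape(3, 4, 4)
flow = estimate_optical_flow(video)            # shape (2, 4, 4)
features = flow_encoder(flow, weight=0.5)      # shape (2, 4, 4)
conditioned = inject_flow_features(np.zeros((3, 4, 4)), features)
print(conditioned.shape)                       # (2, 4, 4)
```

In the actual method, the frame-difference stand-in would be a dedicated flow model, the scalar gain a trained encoder network, and the injection would happen inside the text-to-video backbone's layers rather than on raw frames.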
Authors (4)
Mathis Koroglu
Hugo Caselles-Dupré
Guillaume Jeanneret Sanmiguel
Matthieu Cord
Submitted
November 15, 2024
Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR) Workshops, 2025, pp. 6225-6235
Key Contributions
Introduces OnlyFlow, a novel approach for text-to-video generation that uses optical flow extracted from an input video to condition the motion of the generated output. This allows for precise control over motion, respecting both the text prompt and the input video's dynamics.
Business Value
Enables more sophisticated and controllable video generation tools for creative industries, potentially reducing production costs and time for visual effects, animation, and personalized content.