📄 Abstract
Reward-based fine-tuning of video diffusion models is an effective approach
to improving the quality of generated videos, as it allows models to be fine-tuned
without real-world video datasets. However, its benefits can be limited to
specific aspects of performance, because conventional reward functions mainly aim
to enhance quality across the whole generated video sequence, such as
aesthetic appeal and overall consistency. Notably, the temporal consistency of
the generated video often suffers when previous approaches are applied to
image-to-video (I2V) generation tasks. To address this limitation, we propose
Video Consistency Distance (VCD), a novel metric designed to enhance temporal
consistency, and use it to fine-tune a model within the reward-based fine-tuning framework.
To achieve coherent temporal consistency relative to a conditioning image, VCD
is defined in the frequency space of video frame features, capturing frame
information effectively through frequency-domain analysis. Experimental results
across multiple I2V datasets demonstrate that fine-tuning a video generation
model with VCD significantly enhances temporal consistency without degrading
other performance metrics, compared with previous methods.
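To make the idea concrete, below is a minimal sketch of a VCD-style reward, assuming (but not reproducing) the paper's setup: per-frame features come from a frozen encoder, the deviation from the conditioning image's features is transformed to the frequency domain along the temporal axis, and temporally inconsistent videos are penalized through their non-zero-frequency spectral energy. The function name, tensor shapes, and normalization choices are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a frequency-space consistency distance (not the paper's code).
import torch


def video_consistency_distance(frame_features: torch.Tensor,
                               cond_feature: torch.Tensor) -> torch.Tensor:
    """Illustrative VCD-style distance.

    frame_features: (T, D) features of the T generated frames.
    cond_feature:   (D,)   feature of the conditioning image.
    Returns a scalar distance; a reward can be defined as its negative.
    """
    # Per-frame deviation from the conditioning image's features.
    deviation = frame_features - cond_feature.unsqueeze(0)      # (T, D)

    # Move to the frequency domain along the temporal axis so that temporal
    # fluctuations of the features show up as spectral energy.
    spectrum = torch.fft.rfft(deviation, dim=0)                  # (T//2 + 1, D)

    # Penalize non-constant frequency components: a temporally consistent video
    # concentrates its energy in the zero-frequency (DC) component.
    return spectrum[1:].abs().pow(2).mean()


# Example usage with random stand-in features (illustrative only).
if __name__ == "__main__":
    feats = torch.randn(16, 512)   # 16 frames, 512-dim features
    cond = torch.randn(512)        # conditioning-image features
    vcd = video_consistency_distance(feats, cond)
    reward = -vcd                  # reward-based fine-tuning would maximize this
    print(float(vcd), float(reward))
```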
Authors (3)
Takehiro Aoshima
Yusuke Shinohara
Byeongseon Park
Submitted
October 22, 2025
Key Contributions
Proposes Video Consistency Distance (VCD), a novel metric for enhancing temporal consistency in image-to-video generation. VCD operates in the frequency space of video frame features to effectively capture frame information, addressing limitations of previous reward functions that focused on overall video quality.
Business Value
Enables the creation of more coherent and realistic videos from static images, which can be valuable for applications in entertainment, advertising, and virtual content creation.