AMD-Hummingbird: Towards an Efficient Text-to-Video Model

📄 Abstract

Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency and high visual quality, particularly on resource-limited devices, e.g., iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31× speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.
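
To put the parameter counts in the abstract into perspective, the following is a rough back-of-the-envelope calculation of the relative reduction and of the memory needed to store the U-Net weights. The byte sizes per parameter (fp32 and fp16) are assumptions for illustration; the abstract does not state which precision is used.

```python
# Back-of-the-envelope figures for the U-Net sizes quoted in the abstract.
# The bytes-per-parameter values are assumptions about storage precision,
# not numbers reported by the paper.

ORIGINAL_PARAMS = 1.4e9  # U-Net parameters before pruning (from the abstract)
PRUNED_PARAMS = 0.7e9    # U-Net parameters after pruning (from the abstract)

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}  # assumed weight precisions


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


# Fraction of U-Net parameters removed by pruning.
reduction = 1 - PRUNED_PARAMS / ORIGINAL_PARAMS
print(f"Parameter reduction: {reduction:.0%}")

for precision in BYTES_PER_PARAM:
    before = weight_memory_gib(ORIGINAL_PARAMS, precision)
    after = weight_memory_gib(PRUNED_PARAMS, precision)
    print(f"{precision}: {before:.2f} GiB -> {after:.2f} GiB")
```

These figures cover the U-Net weights only. Activations and any other components of the generation pipeline are not included, and the reported speedup cannot be derived from the parameter count alone.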

Authors (6)

Takashi Isobe

He Cui

Dong Zhou

Mengmeng Ge

Dong Li

Emad Barsoum

Submitted

March 24, 2025

arXiv Category

cs.CV

Key Contributions

Proposes Hummingbird, a lightweight text-to-video generation framework that halves the U-Net size (from 1.4B to 0.7B parameters) through pruning and enhances visual quality via visual feedback learning. It also introduces a data processing pipeline that uses LLMs and VQA models to improve the quality of both text prompts and video data.
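
The data processing idea can be illustrated with a minimal sketch: score each clip with a VQA model, drop low-scoring clips, and rewrite the remaining prompts with an LLM. This is not the authors' released pipeline. The names `Sample`, `curate`, `score_video`, `rewrite_prompt`, and the `min_score` threshold are hypothetical, and the two models are replaced by simple stand-in functions so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    video_path: str
    prompt: str


def curate(
    samples: Iterable[Sample],
    score_video: Callable[[str], float],   # hypothetical VQA model wrapper
    rewrite_prompt: Callable[[str], str],  # hypothetical LLM wrapper
    min_score: float = 0.5,                # assumed quality threshold
) -> List[Sample]:
    """Keep videos scoring at least min_score and rewrite their prompts."""
    curated = []
    for sample in samples:
        if score_video(sample.video_path) < min_score:
            continue  # drop low-quality clips
        curated.append(Sample(sample.video_path, rewrite_prompt(sample.prompt)))
    return curated


# Stand-ins so the sketch runs without any model; real code would call
# an actual VQA model and an actual LLM here.
fake_scores = {"a.mp4": 0.9, "b.mp4": 0.2}
demo = curate(
    [Sample("a.mp4", "a dog"), Sample("b.mp4", "a cat")],
    score_video=lambda path: fake_scores[path],
    rewrite_prompt=lambda text: f"{text}, detailed, cinematic lighting",
)
print(demo)
```

Passing the two models in as callables keeps the filtering logic independent of any particular VQA model or LLM, which makes the threshold and the rewriting strategy easy to swap during experimentation.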

Business Value

Democratizes high-quality video generation by making it accessible on consumer hardware, enabling faster content creation workflows for marketing, social media, and personalized experiences.

Paper Metadata

Innovation Type

Model Compression and Training Strategy

Deployment Feasibility

High, specifically designed for deployment on resource-limited devices like iGPUs and mobile phones.

Limitations Addressed

The trade-off between computational efficiency and visual quality in text-to-video models, making them difficult to deploy on resource-constrained devices.

Performance Gains

Achieves high-quality video generation with significantly reduced model size and computational overhead, enabling deployment on devices like iGPUs and mobile phones.

Technical Tags

text-to-video generation, efficient models, computational efficiency, visual quality, lightweight framework, model pruning, visual feedback learning, U-Net, parameter reduction, LLMs, Video Quality Assessment (VQA)

Research Topics

Generative AI, Text-to-Video Synthesis, Diffusion Models, Model Compression, Efficient Deep Learning, Multimodal AI

Methods & Architectures

Model Pruning, Visual Feedback Learning, LLM-enhanced prompt processing, VQA-based quality enhancement, Hummingbird (T2V framework), U-Net

Applications & Tasks

Content Creation, Media and Entertainment, Advertising, Virtual Reality

Balancing computational efficiency and visual quality, Need for smaller, efficient models for resource-limited devices, High computational cost of existing T2V models

Text-to-Video Generation, Efficient video synthesis, High-quality video generation on limited hardware

Related Fields

Generative AI, Diffusion Models, Computer Vision, Natural Language Processing, Model Compression, Video Generation

Keywords

Text-to-Video, Generative AI, Diffusion Models, Efficient AI, Model Pruning, Lightweight Models, Video Generation, LLM, Content Creation, Mobile AI

Academic Context

#Generative AI, #Text-to-Video Synthesis, #Diffusion Models, #Model Compression, #Efficient Deep Learning, #Multimodal AI

Technology Stack

Frameworks & Libraries

PyTorch

Programming Languages

Python

Commercial Potential

Potential Products

Efficient text-to-video generation tools for consumers; APIs for generating short video clips from text prompts; Plugins for video editing software

Target Industries

Media and Entertainment, Advertising, Social Media, Gaming, Marketing

Use Case Examples

Generating short promotional videos for social media campaigns. Creating personalized video content based on user descriptions. Assisting game developers in generating in-game video assets.

Competitive Edge

Offers a compelling balance of efficiency and quality, making advanced T2V generation accessible on consumer hardware, unlike larger, more resource-intensive models.

Market Opportunity

Rapidly growing market for AI-powered content creation tools.

Revenue Models

Software licensing, API access fees, integration into creative suites.

Resource Requirements

Compute Needs

Low to moderate for inference; designed to run on integrated GPUs (iGPUs) and mobile devices. According to the abstract, the entire training process requires only four GPUs.

Data Requirements

Large-scale text-video datasets for training.

Deployment Constraints

Limited by the computational power and memory of the target device; quality may still be lower than state-of-the-art large models.

Scalability

Highly scalable to various consumer devices due to its lightweight design.

Production Readiness

Maturity Level

Research Prototype

Time to Market

1-3 years, for productization and integration.

Patent Potential

Moderate, potential for novel pruning techniques or visual feedback learning strategies.
