
LongCat-Video Technical Report


Abstract

Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models. Key features include: (1) Unified architecture for multiple tasks: built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model. (2) Long video generation: pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence in the generation of minutes-long videos. (3) Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes; Block Sparse Attention further enhances efficiency, particularly at high resolutions. (4) Strong performance with multi-reward RLHF: multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models. Code and model weights are publicly available to accelerate progress in the field.
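The abstract names Block Sparse Attention without describing it. As a rough illustration of the general technique only, the Python sketch below builds a block-level attention mask in which each query block attends to its top-scoring key blocks. The selection rule (dot product of mean-pooled blocks), the function names `block_sparse_mask` and `density`, and the parameters `block_size` and `keep_blocks` are all assumptions made for this sketch, not details taken from the report.

```python
def block_sparse_mask(q, k, block_size, keep_blocks):
    """Return a block-level boolean mask: mask[i][j] is True when query
    block i attends to key block j.

    q and k are lists of equal-length float vectors whose lengths are
    divisible by block_size. The selection rule (top-`keep_blocks` key
    blocks by pooled dot product) is a hypothetical choice for this sketch.
    """

    def pool(x):
        # Mean-pool each consecutive group of block_size vectors.
        dim = len(x[0])
        return [
            [sum(vec[d] for vec in x[s:s + block_size]) / block_size
             for d in range(dim)]
            for s in range(0, len(x), block_size)
        ]

    qb, kb = pool(q), pool(k)
    mask = []
    for qi in qb:
        scores = [sum(a * b for a, b in zip(qi, kj)) for kj in kb]
        # Indices of the highest-scoring key blocks for this query block.
        top = sorted(range(len(kb)), key=lambda j: scores[j], reverse=True)
        top = top[:keep_blocks]
        mask.append([j in top for j in range(len(kb))])
    return mask


def density(mask):
    """Fraction of (query block, key block) pairs that would be computed."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total
```

With `keep_blocks` fixed, the fraction of block pairs kept is `keep_blocks / num_blocks`, so it shrinks as the sequence grows. A real kernel would skip the masked-out blocks entirely instead of materializing a mask; this sketch shows only the sparsity pattern.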

Authors (11)

Meituan LongCat Team

Xunliang Cai

Qilong Huang

Zhuoliang Kang

Hongyu Li

Shijun Liang

+5 more

Submitted

October 25, 2025

arXiv Category

cs.CV


Key Contributions

LongCat-Video is a 13.6B-parameter foundational video generation model built on the Diffusion Transformer (DiT) framework. It supports multiple tasks (Text-to-Video, Image-to-Video, Video-Continuation) with a unified architecture, relies on Video-Continuation pretraining to keep minutes-long videos coherent, and uses a coarse-to-fine generation strategy for efficient inference.
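Minutes-long generation through Video-Continuation can be pictured as a chunked rollout: each new chunk is conditioned on the tail of what has been generated so far. The sketch below shows only that control flow. `generate_chunk` is a hypothetical stand-in for the model call, `dummy_model` is a toy replacement used to make the example runnable, and the chunk and context sizes are illustrative assumptions, not values from the report.

```python
from typing import Callable, List

Frame = int  # placeholder type; a real frame would be an image tensor


def rollout(
    generate_chunk: Callable[[List[Frame], int], List[Frame]],
    first_frames: List[Frame],
    total_frames: int,
    chunk_frames: int = 4,
    context_frames: int = 2,
) -> List[Frame]:
    """Extend `first_frames` to `total_frames` by repeated continuation.

    generate_chunk(context, n) is a hypothetical model call that returns
    n new frames following `context`.
    """
    video = list(first_frames)
    while len(video) < total_frames:
        context = video[-context_frames:]  # condition on the tail only
        n = min(chunk_frames, total_frames - len(video))
        video.extend(generate_chunk(context, n))
    return video


def dummy_model(context: List[Frame], n: int) -> List[Frame]:
    # Toy stand-in: continue counting from the last context frame.
    return [context[-1] + i + 1 for i in range(n)]


print(rollout(dummy_model, first_frames=[0, 1], total_frames=10))
```

Because each step sees only a fixed-size context, the per-chunk cost stays constant as the video grows; how well quality holds up over many chunks depends on the model, which is what the report attributes to Video-Continuation pretraining.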

Business Value

Enables faster and more cost-effective creation of high-quality video content, with potential applications in media production, game development, and virtual world creation.

Paper Metadata

Innovation Type

Model Architecture and Training

Deployment Feasibility

Moderate to High. Although the model is large (13.6B parameters), its focus on efficient inference suggests practical usability. Deployment still requires significant compute resources.

Limitations Addressed

Inefficiency in long video inference; lack of temporal coherence in generated long videos; need for unified models across different video generation tasks

Technical Tags

Video Generation, Long Video Inference, World Models, Diffusion Transformer (DiT), Text-to-Video, Image-to-Video, Video Continuation, Temporal Coherence, Efficient Inference, Coarse-to-fine Strategy

Research Topics

Generative Video Models, Long-form Content Generation, Efficient AI, Foundation Models, Video Synthesis

Methods & Architectures

Diffusion Transformer (DiT), Coarse-to-fine generation, Temporal and spatial axis scaling

Applications & Tasks

Content Creation, Film Production, Gaming, Virtual Reality, Simulation, Generating high-quality long videos, Efficient video inference, Maintaining temporal coherence, Unified architecture for multiple video tasks, Text-to-Video generation, Image-to-Video generation, Video Continuation, Long video generation

Related Fields

Generative AI, Computer Vision, Deep Learning, Natural Language Processing, AI Infrastructure

Keywords

Video Generation, Long Video, Diffusion Models, Diffusion Transformer, DiT, World Models, Text-to-Video, Efficient Inference, Temporal Coherence, Generative AI, Foundation Model, Video Continuation

Academic Context

#Generative Video Models, #Long-form Content Generation, #Efficient AI, #Foundation Models, #Video Synthesis

Technology Stack

Frameworks & Libraries

Diffusion Transformer (DiT)

Commercial Potential

Potential Products

AI-powered video editing tools, Automated content generation platforms, Tools for creating virtual environments

Target Industries

Media and Entertainment, Gaming, Advertising, Virtual Reality, Education

Use Case Examples

Generating short films from text descriptions, Creating dynamic backgrounds for games, Extending existing video clips

Competitive Edge

Combines efficient, high-quality long video generation with a single architecture for multiple tasks, built on the Diffusion Transformer framework. The abstract reports performance on par with the latest closed-source and leading open-source models.

Resource Requirements

Compute Needs

High compute requirements for training (13.6B parameters). Inference requirements are optimized but still substantial for high-resolution, long videos.
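For a rough sense of scale, the memory needed just to hold 13.6B parameters can be estimated from the bytes per parameter. The storage formats below are assumptions for illustration; the report's actual precision is not stated here. The figures cover the weights only and ignore activations, auxiliary components, and any optimizer state.

```python
PARAMS = 13.6e9  # parameter count stated in the report

# Bytes per parameter for common storage formats (assumed, not from the report).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gb(params: float, dtype: str) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(PARAMS, dtype):.1f} GB")
```

Under these assumptions the weights alone take about 54.4 GB in fp32, 27.2 GB in bf16, and 13.6 GB in int8, which is consistent with the note that inference remains substantial even after optimization.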

Data Requirements

Requires large-scale video datasets for pre-training and fine-tuning across various tasks.

Deployment Constraints

High computational cost for generating long videos, even with efficiency optimizations. Requires significant GPU resources.

Scalability

The DiT architecture and coarse-to-fine strategy are designed for scalability in terms of video length and resolution.
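One way to see why a coarse-to-fine schedule helps with scaling: the number of latent tokens grows with width × height × frames, and dense self-attention cost grows roughly with the square of the token count. The sketch below compares a hypothetical coarse stage with the 720p, 30fps target named in the abstract. The coarse resolution and frame rate are assumptions chosen for illustration only, and the constant factors from latent compression and patchification are assumed to cancel in the ratio.

```python
def relative_tokens(width: int, height: int, fps: int, seconds: float) -> float:
    """Quantity proportional to the latent token count of a clip."""
    return width * height * fps * seconds


# Target from the abstract: 720p at 30 fps.
fine = relative_tokens(1280, 720, 30, seconds=1.0)
# Coarse stage: hypothetical values, not taken from the report.
coarse = relative_tokens(832, 480, 15, seconds=1.0)

token_ratio = fine / coarse
attention_ratio = token_ratio ** 2  # dense attention is quadratic in tokens

print(f"tokens: {token_ratio:.2f}x, dense attention cost: {attention_ratio:.1f}x")
```

With these assumed numbers the fine stage has about 4.6 times as many tokens as the coarse stage, and dense attention over it would cost roughly 21 times as much. Doing most of the denoising at the coarse level and only refining at full resolution avoids paying that cost at every step.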
