LongCat-Video Technical Report

Abstract

Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models. Key features include:

Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model.

Long video generation: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence in the generation of minutes-long videos.

Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further enhances efficiency, particularly at high resolutions.

Strong performance with multi-reward RLHF: Multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models.

Code and model weights are publicly available to accelerate progress in the field.
Authors (11)
Meituan LongCat Team
Xunliang Cai
Qilong Huang
Zhuoliang Kang
Hongyu Li
Shijun Liang
+5 more
Submitted
October 25, 2025
arXiv Category
cs.CV

Key Contributions

LongCat-Video is a 13.6B-parameter foundational video generation model built on the Diffusion Transformer (DiT) framework. A single unified architecture supports Text-to-Video, Image-to-Video, and Video-Continuation. Pretraining on Video-Continuation lets the model generate minutes-long videos with high quality and temporal coherence, while a coarse-to-fine generation strategy along the temporal and spatial axes, together with Block Sparse Attention, keeps inference efficient.
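
To make the Video-Continuation idea concrete, here is a minimal sketch of how a long video could be assembled chunk by chunk, with each new chunk conditioned on the last few frames generated so far. It is not the LongCat-Video implementation: `dummy_continue`, `generate_long_video`, and the parameters `context_len` and `chunk_len` are hypothetical names chosen for illustration, and the "model" is a placeholder that only perturbs the last context frame.

```python
import numpy as np

def dummy_continue(context, num_new, rng):
    """Hypothetical stand-in for a video-continuation model.

    context: conditioning frames, shape (T_ctx, H, W, C).
    Returns num_new frames, shape (num_new, H, W, C).
    A real model would run diffusion sampling here; this placeholder
    just adds small noise to the last context frame.
    """
    last = context[-1]
    noise = rng.normal(scale=0.01, size=(num_new, *last.shape))
    return last[None] + noise

def generate_long_video(first_chunk, total_frames, context_len, chunk_len, seed=0):
    """Extend first_chunk to total_frames by repeated continuation."""
    rng = np.random.default_rng(seed)
    video = first_chunk
    while video.shape[0] < total_frames:
        # Condition only on the most recent frames (assumed design choice).
        context = video[-context_len:]
        # Do not overshoot the requested length on the final chunk.
        num_new = min(chunk_len, total_frames - video.shape[0])
        new_frames = dummy_continue(context, num_new, rng)
        video = np.concatenate([video, new_frames], axis=0)
    return video

# Toy sizes for illustration only: 16 starting frames of 8x8 RGB.
first = np.zeros((16, 8, 8, 3))
video = generate_long_video(first, total_frames=100, context_len=8, chunk_len=16)
print(video.shape)  # (100, 8, 8, 3)
```

Because every chunk sees only a fixed-length context, the cost per chunk stays constant as the video grows. Keeping quality and temporal coherence across many such steps is the property the abstract attributes to pretraining on Video-Continuation.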

Business Value

Enables faster and more cost-effective creation of high-quality video content, with potential applications in media production, game development, and virtual world creation.