📄 Abstract
Video generation is a critical pathway toward world models, with efficient
long video inference as a key capability. Toward this end, we introduce
LongCat-Video, a foundational video generation model with 13.6B parameters,
delivering strong performance across multiple video generation tasks. It
particularly excels in efficient and high-quality long video generation,
representing our first step toward world models. Key features include:
Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT)
framework, LongCat-Video supports Text-to-Video, Image-to-Video, and
Video-Continuation tasks with a single model.
Long video generation: Pretraining on Video-Continuation tasks enables
LongCat-Video to maintain high quality and temporal coherence in the
generation of minutes-long videos.
Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes
by employing a coarse-to-fine generation strategy along both the temporal and
spatial axes. Block Sparse Attention further enhances efficiency, particularly
at high resolutions.
Strong performance with multi-reward RLHF: Multi-reward RLHF training enables
LongCat-Video to achieve performance on par with the latest closed-source and
leading open-source models.
Code and model weights are publicly available to accelerate progress in the field.
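To give a rough sense of what block sparse attention means, here is a minimal sketch in plain Python. It is not the LongCat-Video implementation. The function name `block_sparse_attention`, the mean-pooled block scoring, and the top-k selection rule are all illustrative assumptions; the abstract does not say how blocks are chosen. The idea shown is that tokens are grouped into fixed-size blocks, and each query block attends only to the few key blocks that look most relevant, so the cost scales with the number of kept blocks rather than with the full sequence length.

```python
import math


def block_sparse_attention(q, k, v, block_size, keep_blocks):
    """Illustrative block-sparse attention (hypothetical helper, not the paper's code).

    q, k, v are lists of float vectors, one per token. Each query block attends
    only to the `keep_blocks` key blocks whose mean-pooled key has the highest
    dot product with the mean-pooled query of that block.
    """
    n, d, dv = len(q), len(q[0]), len(v[0])
    starts = list(range(0, n, block_size))

    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Coarse summary of every key block.
    k_means = [mean(k[s:s + block_size]) for s in starts]

    out = []
    for s in starts:
        q_mean = mean(q[s:s + block_size])
        # Rank key blocks by coarse similarity and keep only the best ones.
        ranked = sorted(range(len(starts)),
                        key=lambda j: dot(q_mean, k_means[j]),
                        reverse=True)
        kept = sorted(ranked[:keep_blocks])
        idx = [t for j in kept
               for t in range(starts[j], min(starts[j] + block_size, n))]
        # Ordinary softmax attention, restricted to the kept key tokens.
        for qi in q[s:s + block_size]:
            scores = [dot(qi, k[t]) / math.sqrt(d) for t in idx]
            m = max(scores)
            w = [math.exp(x - m) for x in scores]
            z = sum(w)
            out.append([sum(wi * v[t][c] for wi, t in zip(w, idx)) / z
                        for c in range(dv)])
    return out
```

When `keep_blocks` equals the total number of blocks, the function reduces to dense attention; smaller values skip whole blocks of keys, which is where the savings at high resolution would come from.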
Authors (11)
Meituan LongCat Team
Xunliang Cai
Qilong Huang
Zhuoliang Kang
Hongyu Li
Shijun Liang
+5 more
Submitted
October 25, 2025
Key Contributions
LongCat-Video is a 13.6B-parameter foundational video generation model built on the Diffusion Transformer (DiT) framework. A single unified architecture supports Text-to-Video, Image-to-Video, and Video-Continuation. Pretraining on Video-Continuation keeps minutes-long videos coherent and high in quality, while a coarse-to-fine generation strategy along the temporal and spatial axes, together with Block Sparse Attention, keeps 720p, 30fps inference efficient.
Business Value
Could make creating high-quality video content faster and more cost-effective, with potential applications in media production, game development, and virtual world creation.