
HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives


Abstract: State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives that are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state of the art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.
Authors (12)
Yihao Meng
Hao Ouyang
Yue Yu
Qiuyu Wang
Wen Wang
Ka Leong Cheng
+6 more
Submitted
October 23, 2025
arXiv Category
cs.CV

Key Contributions

HoloCine generates entire scenes holistically for coherent, multi-shot video narratives, bridging the 'narrative gap' in current text-to-video models. It uses Window Cross-Attention for prompt localization and Sparse Inter-Shot Self-Attention for efficiency, enabling minute-scale generation with emergent cinematic abilities.
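To make the sparse inter-shot pattern concrete, here is a minimal sketch of a block-structured self-attention mask: tokens attend densely within their own shot, while cross-shot attention is restricted to a few anchor tokens per shot. The `shot_lengths` and `anchor_tokens` parameters and the anchor-token design are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def sparse_inter_shot_mask(shot_lengths, anchor_tokens=1):
    """Build a boolean self-attention mask (True = may attend).

    Dense within each shot; across shots, only the first
    `anchor_tokens` tokens of each shot are visible to everyone.
    NOTE: this is an illustrative sketch, not HoloCine's actual mask.
    """
    total = sum(shot_lengths)
    mask = np.zeros((total, total), dtype=bool)
    starts = np.cumsum([0] + list(shot_lengths[:-1]))
    for start, length in zip(starts, shot_lengths):
        # Dense intra-shot block.
        mask[start:start + length, start:start + length] = True
        # Sparse inter-shot links via this shot's anchor tokens.
        mask[:, start:start + anchor_tokens] = True
    return mask

# Two shots of 4 tokens each; each shot exposes 1 anchor token.
m = sparse_inter_shot_mask([4, 4], anchor_tokens=1)
```

With this mask, a token in shot 2 can see shot 1's anchor token (`m[5, 0]` is True) but not its ordinary tokens (`m[5, 1]` is False), which keeps cross-shot cost low while preserving a channel for global consistency.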

Business Value

Automates aspects of filmmaking and video production, enabling faster creation of compelling visual narratives for various media.
