📄 Abstract
State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives that are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) provides the efficiency required for minute-scale generation. Beyond setting a new state of the art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: https://holo-cine.github.io/.
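
To make the Window Cross-Attention idea concrete, below is a minimal sketch of how per-shot prompt localization could be expressed as a cross-attention mask: video tokens belonging to shot i are only allowed to attend to the text tokens of prompt i. The abstract does not give implementation details, so the function name, the optional shared scene-level prompt, and the flat token layout are all assumptions for illustration, not the paper's actual code.

```python
import torch

def window_cross_attention_mask(shot_lengths, prompt_lengths, global_len=0):
    """Boolean cross-attention mask: video tokens of shot i attend only to
    the text tokens of prompt i (plus an optional scene-level prompt).

    shot_lengths[i]   -- number of video tokens in shot i
    prompt_lengths[i] -- number of text tokens in the prompt for shot i
    global_len        -- tokens of a shared scene prompt (hypothetical),
                         assumed to be prepended to the text sequence
    Returns a (num_video_tokens, num_text_tokens) mask, True = may attend.
    """
    n_vid = sum(shot_lengths)
    n_txt = global_len + sum(prompt_lengths)
    mask = torch.zeros(n_vid, n_txt, dtype=torch.bool)

    # Every video token may attend to the shared scene prompt, if present.
    if global_len > 0:
        mask[:, :global_len] = True

    v0, t0 = 0, global_len
    for sv, st in zip(shot_lengths, prompt_lengths):
        # Shot i's video tokens see only the text window of prompt i.
        mask[v0:v0 + sv, t0:t0 + st] = True
        v0 += sv
        t0 += st
    return mask


# Example: 3 shots of 6, 8, and 6 video tokens, per-shot prompts of
# 4, 5, and 4 text tokens, plus a 3-token scene prompt.
m = window_cross_attention_mask([6, 8, 6], [4, 5, 4], global_len=3)
print(m.shape)  # torch.Size([20, 16])
```

Such a mask can be passed to a standard attention implementation so that the same text encoder output drives all shots while each shot still receives its own localized directorial instruction.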
Authors (12)
Yihao Meng
Hao Ouyang
Yue Yu
Qiuyu Wang
Wen Wang
Ka Leong Cheng
+6 more
Submitted
October 23, 2025
Key Contributions
HoloCine generates entire scenes holistically for coherent, multi-shot video narratives, bridging the 'narrative gap' in current text-to-video models. It uses Window Cross-Attention for prompt localization and Sparse Inter-Shot Self-Attention for efficiency, enabling minute-scale generation with emergent cinematic abilities.
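
The Sparse Inter-Shot Self-Attention pattern described above (dense within shots, sparse between them) can likewise be sketched as an attention mask. The exact cross-shot sparsity scheme is not specified in this summary, so the "anchor token" heuristic below (every token may additionally see the first few tokens of every other shot) is an assumption chosen only to illustrate the dense-within / sparse-between structure.

```python
import torch

def sparse_inter_shot_mask(shot_lengths, anchors_per_shot=1):
    """Self-attention mask that is dense within each shot but sparse across
    shots. Cross-shot links are limited to a few anchor tokens per shot
    (a hypothetical scheme for illustration).

    Returns an (N, N) boolean mask, True = may attend.
    """
    n = sum(shot_lengths)
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Compute the starting offset of each shot in the flat token sequence.
    offsets, start = [], 0
    for length in shot_lengths:
        offsets.append(start)
        start += length

    # Dense attention within each shot.
    for off, length in zip(offsets, shot_lengths):
        mask[off:off + length, off:off + length] = True

    # Sparse attention across shots: all tokens see each shot's anchors.
    for off, length in zip(offsets, shot_lengths):
        k = min(anchors_per_shot, length)
        mask[:, off:off + k] = True
    return mask


# Example: 3 shots of 5, 7, and 5 tokens; most cross-shot pairs are masked out.
m = sparse_inter_shot_mask([5, 7, 5])
print(m.float().mean())  # fraction of allowed pairs, well below 1.0
```

Because the number of allowed attention pairs grows roughly linearly with the number of shots rather than quadratically with total sequence length, this kind of pattern is what makes minute-scale, multi-shot generation computationally tractable.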
Business Value
Automates aspects of filmmaking and video production, enabling faster creation of compelling visual narratives for various media.