Redirecting to original paper in 30 seconds...
Click below to go immediately or wait for automatic redirect
📄 Abstract
Abstract: Recent advances in video generation have made it possible to produce visually
compelling videos, with wide-ranging applications in content creation,
entertainment, and virtual reality. However, most existing diffusion
transformer based video generation models are limited to low-resolution outputs
(<=720P) due to the quadratic computational complexity of the attention
mechanism with respect to the output width and height. This computational
bottleneck makes native high-resolution video generation (1080P/2K/4K)
impractical for both training and inference. To address this challenge, we
present UltraGen, a novel video generation framework that enables i) efficient
and ii) end-to-end native high-resolution video synthesis. Specifically,
UltraGen features a hierarchical dual-branch attention architecture based on
global-local attention decomposition, which decouples full attention into a
local attention branch for high-fidelity regional content and a global
attention branch for overall semantic consistency. We further propose a
spatially compressed global modeling strategy to efficiently learn global
dependencies, and a hierarchical cross-window local attention mechanism to
reduce computational costs while enhancing information flow across different
local windows. Extensive experiments demonstrate that UltraGen can effectively
scale pre-trained low-resolution video models to 1080P and even 4K resolution
for the first time, outperforming existing state-of-the-art methods and
super-resolution based two-stage pipelines in both qualitative and quantitative
evaluations.
Authors (4)
Teng Hu
Jiangning Zhang
Zihan Su
Ran Yi
Submitted
October 21, 2025
Key Contributions
UltraGen presents a novel framework for efficient, end-to-end native high-resolution video generation (1080P/2K/4K). It overcomes the computational bottleneck of standard attention mechanisms in diffusion transformers by employing a hierarchical dual-branch attention architecture, enabling high-fidelity regional content and global coherence.
Business Value
Enables the creation of higher quality and more realistic video content for various industries, potentially reducing production costs and time for visual effects, game development, and virtual experiences.