Abstract
Ultra-high-resolution text-to-image generation demands both fine-grained
texture synthesis and globally coherent structure, yet current diffusion models
remain constrained to sub-$1K \times 1K$ resolutions due to the prohibitive
quadratic complexity of attention and the scarcity of native $4K$ training
data. We present \textbf{Scale-DiT}, a new diffusion framework that introduces
hierarchical local attention with low-resolution global guidance, enabling
efficient, scalable, and semantically coherent image synthesis at ultra-high
resolutions. Specifically, high-resolution latents are divided into fixed-size
local windows to reduce attention complexity from quadratic to near-linear,
while a low-resolution latent equipped with scaled positional anchors injects
global semantics. A lightweight LoRA adaptation bridges global and local
pathways during denoising, ensuring consistency across structure and detail. To
maximize inference efficiency, we reorder the token sequence along a Hilbert
curve and implement a fused kernel that skips masked operations, resulting in
a GPU-friendly design. Extensive experiments demonstrate that Scale-DiT
achieves more than $2\times$ faster inference and lower memory usage compared
to dense attention baselines, while reliably scaling to $4K \times 4K$
resolution without requiring additional high-resolution training data. On both
quantitative benchmarks (FID, IS, CLIP Score) and qualitative comparisons,
Scale-DiT delivers superior global coherence and sharper local detail, matching
or outperforming state-of-the-art methods that rely on native 4K training.
Taken together, these results highlight hierarchical local attention with
guided low-resolution anchors as a promising and effective approach for
advancing ultra-high-resolution image generation.
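For intuition, here is a minimal sketch of the two attention pathways the abstract describes: attention restricted to fixed-size local windows over the high-resolution latent, plus dense attention over a downsampled latent that supplies global guidance. The window size, downsampling factor, single-head attention without projections, and the helper names (window_partition, local_attention, global_guidance) are illustrative assumptions, not the authors' implementation; the Hilbert-curve reordering, scaled positional anchors, LoRA bridge, and fused masked-attention kernel are omitted.

```python
# Sketch of hierarchical local attention + low-resolution global guidance.
# All shapes and hyperparameters below are toy values chosen for illustration.
import torch
import torch.nn.functional as F

def window_partition(x, window):
    """Split a (B, H, W, C) latent into non-overlapping (window x window) token groups."""
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    # -> (B * num_windows, window * window, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)

def local_attention(x, window=8):
    """Attention restricted to each window: cost scales with the number of windows."""
    tokens = window_partition(x, window)                # (B*nW, w*w, C)
    q = k = v = tokens                                  # single head, no projections, for brevity
    return F.scaled_dot_product_attention(q, k, v)      # (B*nW, w*w, C)

def global_guidance(x, scale=4):
    """Low-resolution pathway: downsample the latent and run dense attention on the small grid."""
    B, H, W, C = x.shape
    small = F.avg_pool2d(x.permute(0, 3, 1, 2), scale)  # (B, C, H/scale, W/scale)
    tokens = small.flatten(2).transpose(1, 2)           # (B, (H/scale)*(W/scale), C)
    return F.scaled_dot_product_attention(tokens, tokens, tokens)

if __name__ == "__main__":
    latent = torch.randn(1, 64, 64, 32)                 # toy high-resolution latent
    local_out = local_attention(latent, window=8)       # near-linear in token count
    global_out = global_guidance(latent, scale=4)       # cheap dense attention on a 16x16 grid
    print(local_out.shape, global_out.shape)
```

Because each window attends only to its own handful of tokens, the local pathway's cost grows with the number of windows rather than quadratically with the total token count, while the downsampled pathway stays small enough for dense attention.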
Submitted

October 18, 2025
Key Contributions
Introduces Scale-DiT, a diffusion framework for ultra-high-resolution image generation (beyond 1K×1K) that uses hierarchical local attention with low-resolution global guidance. This approach reduces attention complexity to near-linear, enabling efficient and semantically coherent synthesis at resolutions previously unattainable.
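As a rough back-of-the-envelope illustration of the near-linear claim (the patch size, window size, and token counts are assumptions for this example, not figures reported by the paper): a $4K \times 4K$ image with a $16 \times 16$ patch size yields $N = 256 \times 256 = 65{,}536$ latent tokens, so
\[
\underbrace{N^2}_{\text{dense attention}} \approx 4.3 \times 10^9
\qquad \text{vs.} \qquad
\underbrace{N \cdot w^2}_{\text{windowed attention}} = 65{,}536 \times 64 \approx 4.2 \times 10^6 ,
\]
for $8 \times 8$ windows ($w^2 = 64$), roughly a $1000\times$ reduction in attention pairs per layer.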
Business Value
Opens up new possibilities for creating highly detailed digital assets for gaming, film, virtual reality, and high-fidelity visualization, driving innovation in creative industries.