
Scale-DiT: Ultra-High-Resolution Image Generation with Hierarchical Local Attention

📄 Abstract

Ultra-high-resolution text-to-image generation demands both fine-grained texture synthesis and globally coherent structure, yet current diffusion models remain constrained to sub-1K×1K resolutions due to the prohibitive quadratic complexity of attention and the scarcity of native 4K training data. We present Scale-DiT, a new diffusion framework that introduces hierarchical local attention with low-resolution global guidance, enabling efficient, scalable, and semantically coherent image synthesis at ultra-high resolutions. Specifically, high-resolution latents are divided into fixed-size local windows to reduce attention complexity from quadratic to near-linear, while a low-resolution latent equipped with scaled positional anchors injects global semantics. A lightweight LoRA adaptation bridges the global and local pathways during denoising, ensuring consistency across structure and detail. To maximize inference efficiency, we reorder the token sequence in Hilbert curve order and implement a fused kernel that skips masked operations, resulting in a GPU-friendly design. Extensive experiments demonstrate that Scale-DiT achieves more than 2× faster inference and lower memory usage than dense-attention baselines, while reliably scaling to 4K×4K resolution without requiring additional high-resolution training data. On both quantitative benchmarks (FID, IS, CLIP Score) and qualitative comparisons, Scale-DiT delivers superior global coherence and sharper local detail, matching or outperforming state-of-the-art methods that rely on native 4K training. Taken together, these results highlight hierarchical local attention with guided low-resolution anchors as a promising and effective approach to advancing ultra-high-resolution image generation.
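
The hierarchical local attention described in the abstract (fixed-size local windows plus low-resolution global anchor tokens) can be illustrated with a short sketch. The code below is a minimal, self-contained approximation, not the paper's implementation: the window size, anchor count, and the omission of projection weights, multi-head splitting, the LoRA bridge, and the fused masking kernel are all simplifying assumptions.

```python
# Minimal sketch of windowed (local) attention with low-resolution global anchor
# tokens, illustrating the near-linear scaling idea from the abstract.
# Shapes, window size, and the anchor construction are illustrative assumptions.
import torch
import torch.nn.functional as F


def local_attention_with_anchors(x, anchors, window=8):
    """x: (B, H, W, C) high-res latent tokens; anchors: (B, A, C) low-res global tokens."""
    B, H, W, C = x.shape
    # Partition the latent grid into non-overlapping window x window tiles.
    x = x.view(B, H // window, window, W // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)  # (B*nW, w*w, C)
    n_windows = x.shape[0] // B
    # Broadcast the global anchor tokens into every window as extra keys/values.
    a = anchors.repeat_interleave(n_windows, dim=0)                  # (B*nW, A, C)
    kv = torch.cat([x, a], dim=1)
    # Per-window attention cost is (w*w) x (w*w + A), so the total cost grows
    # roughly linearly with the number of high-resolution tokens.
    out = F.scaled_dot_product_attention(x, kv, kv)
    # Undo the window partition to recover the (B, H, W, C) latent layout.
    out = out.view(B, H // window, W // window, window, window, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)


# Toy usage: a 64x64 latent grid with 16 global anchor tokens.
tokens = torch.randn(1, 64, 64, 128)
anchor = torch.randn(1, 16, 128)
print(local_attention_with_anchors(tokens, anchor).shape)  # torch.Size([1, 64, 64, 128])
```

Because every window sees the same small set of anchor tokens, global structure can propagate into each local window without any window-to-window attention, which is where the memory and speed savings over dense attention come from.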
Authors (2)
Yuyao Zhang
Yu-Wing Tai
Submitted
October 18, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

Introduces Scale-DiT, a diffusion framework for ultra-high-resolution image generation (beyond 1K×1K) that uses hierarchical local attention with low-resolution global guidance. This approach reduces attention complexity to near-linear, enabling efficient and semantically coherent synthesis at resolutions previously unattainable.
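
The abstract also mentions reordering tokens along a Hilbert curve so that spatially neighboring tokens stay adjacent in memory before the GPU-friendly local-attention kernel runs. The sketch below shows the standard Hilbert index-to-coordinate mapping and how it could yield such a token permutation; the grid size and the way the permutation is applied are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch: build a Hilbert-curve permutation for a 2**order x 2**order
# token grid and use it to reorder a flattened token sequence. The grid size and
# usage here are assumptions for illustration, not the paper's implementation.
import numpy as np


def hilbert_d2xy(order: int, d: int) -> tuple[int, int]:
    """Map a 1D Hilbert index d to (x, y) on a 2**order x 2**order grid."""
    x = y = 0
    t = d
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate/reflect the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_permutation(order: int) -> np.ndarray:
    """Row-major token indices listed in Hilbert-curve order."""
    side = 1 << order
    perm = np.empty(side * side, dtype=np.int64)
    for d in range(side * side):
        x, y = hilbert_d2xy(order, d)
        perm[d] = y * side + x           # row-major index of the d-th curve point
    return perm


# Toy usage: reorder a flattened 8x8 grid of token features (64 tokens, dim 4).
order = 3
tokens = np.arange(64 * 4, dtype=np.float32).reshape(64, 4)
perm = hilbert_permutation(order)
hilbert_tokens = tokens[perm]            # tokens now follow the space-filling curve
restored = np.empty_like(hilbert_tokens)
restored[perm] = hilbert_tokens          # inverse permutation restores row-major order
assert np.allclose(restored, tokens)
```

The appeal of a Hilbert ordering is locality: tokens that are close on the 2D latent grid end up close in the 1D sequence, so fixed-size local windows map onto contiguous memory regions that a fused kernel can process efficiently.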

Business Value

Opens up new possibilities for creating highly detailed digital assets for gaming, film, virtual reality, and high-fidelity visualization, driving innovation in creative industries.