Abstract
Ultra-high-resolution text-to-image generation demands both fine-grained
texture synthesis and globally coherent structure, yet current diffusion models
remain constrained to sub-$1K \times 1K$ resolutions due to the prohibitive
quadratic complexity of attention and the scarcity of native $4K$ training
data. We present \textbf{Scale-DiT}, a new diffusion framework that introduces
hierarchical local attention with low-resolution global guidance, enabling
efficient, scalable, and semantically coherent image synthesis at ultra-high
resolutions. Specifically, high-resolution latents are divided into fixed-size
local windows to reduce attention complexity from quadratic to near-linear,
while a low-resolution latent equipped with scaled positional anchors injects
global semantics. A lightweight LoRA adaptation bridges global and local
pathways during denoising, ensuring consistency across structure and detail. To
maximize inference efficiency, we reorder the token sequence along a Hilbert
curve and implement a fused kernel that skips masked operations, resulting in
a GPU-friendly design. Extensive experiments demonstrate that Scale-DiT
achieves more than $2\times$ faster inference and lower memory usage compared
to dense attention baselines, while reliably scaling to $4K \times 4K$
resolution without requiring additional high-resolution training data. On both
quantitative benchmarks (FID, IS, CLIP Score) and qualitative comparisons,
Scale-DiT delivers superior global coherence and sharper local detail, matching
or outperforming state-of-the-art methods that rely on native 4K training.
Taken together, these results highlight hierarchical local attention with
guided low-resolution anchors as a promising and effective approach for
advancing ultra-high-resolution image generation.
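For intuition, here is a minimal sketch of the two attention pathways the abstract describes: attention restricted to fixed-size local windows over the high-resolution latent, plus dense attention over a downsampled latent that supplies global guidance. The window size, downsampling factor, single-head attention without projections, and the helper names (window_partition, local_attention, global_guidance) are illustrative assumptions, not the authors' implementation; the Hilbert-curve reordering, scaled positional anchors, LoRA bridge, and fused masked-attention kernel are omitted.

```python
# Sketch of hierarchical local attention + low-resolution global guidance.
# All shapes and hyperparameters below are toy values chosen for illustration.
import torch
import torch.nn.functional as F

def window_partition(x, window):
    """Split a (B, H, W, C) latent into non-overlapping (window x window) token groups."""
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    # -> (B * num_windows, window * window, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)

def local_attention(x, window=8):
    """Attention restricted to each window: cost scales with the number of windows."""
    tokens = window_partition(x, window)                # (B*nW, w*w, C)
    q = k = v = tokens                                  # single head, no projections, for brevity
    return F.scaled_dot_product_attention(q, k, v)      # (B*nW, w*w, C)

def global_guidance(x, scale=4):
    """Low-resolution pathway: downsample the latent and run dense attention on the small grid."""
    B, H, W, C = x.shape
    small = F.avg_pool2d(x.permute(0, 3, 1, 2), scale)  # (B, C, H/scale, W/scale)
    tokens = small.flatten(2).transpose(1, 2)           # (B, (H/scale)*(W/scale), C)
    return F.scaled_dot_product_attention(tokens, tokens, tokens)

if __name__ == "__main__":
    latent = torch.randn(1, 64, 64, 32)                 # toy high-resolution latent
    local_out = local_attention(latent, window=8)       # near-linear in token count
    global_out = global_guidance(latent, scale=4)       # cheap dense attention on a 16x16 grid
    print(local_out.shape, global_out.shape)
```

Because each window attends only to its own handful of tokens, the local pathway's cost grows with the number of windows rather than quadratically with the total token count, while the downsampled pathway stays small enough for dense attention.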
Submitted

October 18, 2025
Key Contributions
Introduces Scale-DiT, a diffusion framework for ultra-high-resolution image generation (beyond 1K×1K) that uses hierarchical local attention with low-resolution global guidance. This approach reduces attention complexity to near-linear, enabling efficient and semantically coherent synthesis at resolutions previously unattainable.
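As a rough back-of-the-envelope illustration of the near-linear claim (the patch size, window size, and token counts are assumptions for this example, not figures reported by the paper): a $4K \times 4K$ image with a $16 \times 16$ patch size yields $N = 256 \times 256 = 65{,}536$ latent tokens, so
\[
\underbrace{N^2}_{\text{dense attention}} \approx 4.3 \times 10^9
\qquad \text{vs.} \qquad
\underbrace{N \cdot w^2}_{\text{windowed attention}} = 65{,}536 \times 64 \approx 4.2 \times 10^6 ,
\]
for $8 \times 8$ windows ($w^2 = 64$), roughly a $1000\times$ reduction in attention pairs per layer.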
Business Value
Opens up new possibilities for creating highly detailed digital assets for gaming, film, virtual reality, and high-fidelity visualization, driving innovation in creative industries.