ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion

📄 Abstract

Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. In this paper, we propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training. A core component of our framework is Neighborhood Patch Attention (NPA), an efficient mechanism that reduces computational redundancy in the self-attention layer with non-overlapping patches. We integrate NPA into an SDEdit pipeline and introduce Latent Frequency Mixing (LFM) to better generate fine details. Furthermore, we apply Structure Guidance to enhance global structure during the denoising process. Experimental results demonstrate that ScaleDiff achieves state-of-the-art performance among training-free methods in terms of both image quality and inference speed on both U-Net and Diffusion Transformer architectures.
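
To make the idea of patch-restricted self-attention concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract only states that NPA uses non-overlapping patches, so the neighborhood rule below (each query patch attends to its own tokens plus a fixed border of surrounding tokens) is an assumption, as are the function name `neighborhood_patch_attention` and the parameters `patch` and `halo`. Query, key, and value projections and multiple heads are omitted for brevity.

```python
import numpy as np

def neighborhood_patch_attention(x, patch=4, halo=2):
    """Single-head attention over a token grid, restricted to local windows.

    x: float array of shape (H, W, D); H and W must be divisible by `patch`.
    Queries are split into non-overlapping patch x patch blocks. Each block
    attends only to keys/values inside its own block extended by `halo`
    tokens on every side (clipped at the grid border). Both `patch` and
    `halo` are hypothetical parameters chosen for illustration.
    """
    h, w, d = x.shape
    assert h % patch == 0 and w % patch == 0
    out = np.empty_like(x)
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Queries: the tokens of one non-overlapping patch.
            q = x[top:top + patch, left:left + patch].reshape(-1, d)
            # Keys/values: the patch plus its halo, clipped to the grid.
            t0, t1 = max(top - halo, 0), min(top + patch + halo, h)
            l0, l1 = max(left - halo, 0), min(left + patch + halo, w)
            kv = x[t0:t1, l0:l1].reshape(-1, d)
            # Numerically stable softmax over the local window only.
            scores = q @ kv.T / np.sqrt(d)
            scores -= scores.max(axis=1, keepdims=True)
            weights = np.exp(scores)
            weights /= weights.sum(axis=1, keepdims=True)
            out[top:top + patch, left:left + patch] = (
                (weights @ kv).reshape(patch, patch, d)
            )
    return out
```

In this sketch every query is processed exactly once, and each query scores at most `(patch + 2 * halo) ** 2` keys instead of all `H * W` tokens, which is where the saving over full self-attention would come from under these assumptions.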
Authors (5): Sungho Koh, SeungJu Cha, Hyunwoo Oh, Kwanyoung Lee, Dong-Jin Kim
Submitted: October 29, 2025
arXiv Category: cs.LG

Key Contributions

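ScaleDiff is a model-agnostic, training-free framework for extending pretrained diffusion models beyond their training resolution. It introduces Neighborhood Patch Attention (NPA), which reduces computational redundancy in the self-attention layer, integrates NPA into an SDEdit pipeline together with Latent Frequency Mixing (LFM) for finer detail, and applies Structure Guidance to strengthen global structure during denoising. The authors report state-of-the-art image quality and inference speed among training-free methods on both U-Net and Diffusion Transformer architectures.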
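
As a rough illustration of what mixing frequencies in latent space can look like, the sketch below blends the low-frequency band of one latent with the high-frequency band of another using a 2D FFT. This is a generic frequency-blending technique, not the paper's LFM: the summary does not say which latents are combined or how the bands are chosen, so the function name `frequency_mix`, the radial mask, and the `cutoff` parameter are all assumptions.

```python
import numpy as np

def frequency_mix(low_src, high_src, cutoff=0.25):
    """Combine low frequencies of `low_src` with high frequencies of `high_src`.

    Both inputs are real arrays of shape (C, H, W), e.g. two latents of the
    same size. `cutoff` is a hypothetical parameter: the radius of the
    low-pass region in cycles per sample (0.5 is the Nyquist limit).
    """
    assert low_src.shape == high_src.shape
    _, h, w = low_src.shape
    # Frequency coordinates of each FFT bin, in cycles per sample.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    # Radial low-pass mask: 1 inside the cutoff radius, 0 outside.
    low_mask = (np.sqrt(fx ** 2 + fy ** 2) <= cutoff).astype(float)
    spec_low = np.fft.fft2(low_src, axes=(-2, -1))
    spec_high = np.fft.fft2(high_src, axes=(-2, -1))
    # Take the low band from one source and the complementary band from the other.
    mixed = spec_low * low_mask + spec_high * (1.0 - low_mask)
    # The mask is symmetric, so the result is real up to rounding error.
    return np.fft.ifft2(mixed, axes=(-2, -1)).real
```

One plausible use, again as an assumption, would be to pass an upsampled low-resolution latent as `low_src` so that it fixes the coarse layout, while `high_src` supplies fine detail.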

Business Value

ScaleDiff makes it cheaper to generate high-quality, high-resolution images from text prompts using an existing pretrained model, with no retraining. This is relevant to industries such as advertising, gaming, film, and design, which depend heavily on visual content.