Research Paper (arxiv_cv). Audience: Researchers in generative AI, Developers of image/video generation tools, Machine learning engineers

Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

Abstract

Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A huge bottleneck is the attention mechanism where complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches are included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and might even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high sparse attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of the full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like FastDiTAttn, Sparse VideoGen and MInference.

Authors (5)

Ruichen Chen

Keith G. Mills

Liyao Jiang

Chao Gao

Di Niu

Submitted

May 28, 2025

arXiv Category

cs.CV

Key Contributions

Proposes Re-ttention, a method that makes extremely sparse attention usable in Diffusion Transformers (DiT) for visual generation. It exploits the temporal redundancy of diffusion models and reshapes attention scores using the history of prior softmax distributions, which counters the normalization shift that sparsity introduces in the attention mechanism. This preserves the visual quality of full attention at sparsity levels where existing sparse attention techniques degrade.
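The normalization shift mentioned above can be illustrated with a toy, single-query example. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names (attention_weights, kept_mass), the synthetic scores, the fixed 4-of-128 key pattern, and the idea of caching one scalar "kept mass" from an earlier denoising step are all hypothetical choices made for illustration. Dropping keys shrinks the softmax denominator, so the surviving weights are inflated. Rescaling them by the share of probability mass the kept keys held at a previous, nearly identical step brings them back close to the full-attention values.

```python
import math

def attention_weights(scores, keep=None):
    """Softmax over `scores`; if `keep` is given, only those key indices take part."""
    idx = list(range(len(scores))) if keep is None else list(keep)
    peak = max(scores[i] for i in idx)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in idx}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def kept_mass(full_weights, keep):
    """Share of the full softmax probability mass that falls on the kept keys."""
    return sum(full_weights[i] for i in keep)

# Toy setup (hypothetical): 128 keys, 4 of them kept (about 3.1%).
n = 128
keep = [0, 1, 2, 3]

# Scores for one query at two consecutive steps. The small perturbation
# stands in for the assumption that adjacent steps are highly similar.
scores_prev = [math.sin(i) for i in range(n)]
scores_next = [s + 0.01 * math.cos(3 * i) for i, s in enumerate(scores_prev)]

# Statistic cached from the previous step, where full attention was available.
rho_prev = kept_mass(attention_weights(scores_prev), keep)

full = attention_weights(scores_next)            # reference: all keys
sparse = attention_weights(scores_next, keep)    # naive sparse softmax
reshaped = {i: w * rho_prev for i, w in sparse.items()}  # rescaled by cached mass

err_sparse = max(abs(sparse[i] - full[i]) for i in keep)
err_reshaped = max(abs(reshaped[i] - full[i]) for i in keep)
print(f"kept mass at previous step: {rho_prev:.4f}")
print(f"max error, naive sparse:    {err_sparse:.4f}")
print(f"max error, rescaled sparse: {err_reshaped:.6f}")
```

In this toy setting the naive sparse weights sum to 1 over the kept keys, while the rescaled weights sum to the cached kept mass, which is what full attention would assign to those keys if nothing had changed between steps. How Re-ttention actually tracks and applies its softmax statistics is described in the paper itself.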

Business Value

Enables faster and more efficient generation of high-quality images and videos, reducing computational costs and enabling applications with limited resources.

Paper Metadata

Innovation Type

Algorithmic

Deployment Feasibility

High, as it aims to improve the efficiency of existing powerful models like DiTs, making them more practical for deployment.

Limitations Addressed

The quadratic scaling of attention complexity with resolution/video length and the degradation of visual quality at high sparsity levels in existing sparse attention methods.
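To put the quadratic scaling in numbers, the back-of-the-envelope sketch below counts query-key score entries for full attention and for a case where each query attends to only 3.1% of the keys (the token fraction quoted in the abstract). The function name and the token counts are hypothetical, and the count ignores any overhead for selecting tokens or for kernel efficiency, so it is an idealized upper bound on savings and not a measured speedup.

```python
def attention_pair_count(num_tokens, keep_fraction=1.0):
    """Query-key score entries when each query attends to keep_fraction of the keys."""
    keys_per_query = max(1, round(num_tokens * keep_fraction))
    return num_tokens * keys_per_query

# Hypothetical sequence lengths; doubling the length quadruples the full cost.
for n in (1_000, 2_000, 4_000):
    full = attention_pair_count(n)
    sparse = attention_pair_count(n, keep_fraction=0.031)
    print(f"tokens={n:>5}  full={full:>10,}  sparse(3.1%)={sparse:>8,}")
```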

Performance Gains

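Preserves the visual quality of full quadratic attention while using as few as 3.1% of the tokens during inference, as reported in the abstract for T2V/T2I models such as CogVideoX and the PixArt DiTs.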

Technical Tags

diffusion transformers, sparse attention, visual generation, video generation, image generation, attention mechanism, temporal redundancy, softmax distribution, computational efficiency

Research Topics

Generative Models, Diffusion Models, Attention Mechanisms, Efficient Deep Learning, Computer Vision

Methods & Architectures

Sparse attention, Attention statistical reshape, Diffusion Noise Optimization (DNO), Diffusion Transformers (DiT)

Applications & Tasks

Image Generation; Video Generation; Content Creation; Quadratic complexity of attention; Visual quality degradation at high sparsity; Computational overhead of attention; High-quality visual generation (images, videos)

Related Fields

Deep Learning, Generative AI, Computer Vision, Natural Language Processing

Keywords

diffusion models, transformers, sparse attention, visual generation, video generation, image generation, attention, computational efficiency, DiT, temporal redundancy, softmax, high resolution

Academic Context

#Generative Models, #Diffusion Models, #Attention Mechanisms, #Efficient Deep Learning, #Computer Vision

Commercial Potential

Potential Products

More efficient image/video generation software, Tools for creative content generation, Faster AI model development platforms

Target Industries

Media and Entertainment, Advertising, Gaming, Design

Use Case Examples

Generating high-resolution images for marketing campaigns, Creating realistic video clips for films or games, Accelerating the design process for visual assets

Competitive Edge

Offers a more efficient approach to sparse attention for diffusion models, aiming to maintain high visual quality where other methods might fail.

Market Opportunity

Rapidly growing market for generative AI in visual content creation.

Revenue Models

Integration into existing generative AI platforms; licensing of the core technology.

Resource Requirements

Compute Needs

Aims to reduce compute requirements compared to full attention, enabling generation on less powerful hardware or faster generation on high-end hardware.

Data Requirements

Requires large-scale image and video datasets for training diffusion models.

Deployment Constraints

The effectiveness of the sparsity strategy might vary across different types of visual data.

Scalability

Designed to improve scalability of diffusion models by reducing the computational bottleneck of the attention mechanism.

Production Readiness

Maturity Level

Research

Time to Market

Short to medium term, as it's an algorithmic improvement to existing models.
