Abstract
Diffusion Transformers (DiT) have become the de facto model for generating high-quality visual content such as videos and images. A major bottleneck is the attention mechanism, whose complexity scales quadratically with resolution and video length. One natural way to lessen this burden is sparse attention, in which only a subset of tokens or patches is included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and may even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high sparsity attention for visual generation models by leveraging the temporal redundancy of diffusion models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods such as FastDiTAttn, Sparse VideoGen, and MInference.
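
To make the "normalization shift" concrete: when attention is restricted to a subset of keys, the softmax denominator is computed over that subset only, so the output magnitude drifts away from what full attention would produce. The sketch below illustrates one plausible form of the correction described in the abstract, rescaling the sparse output using denominator statistics (log-sum-exp) cached from an earlier, denser step. It is a minimal illustration under our own assumptions, not the paper's implementation; the function and argument names (`sparse_attention_rescaled`, `prev_full_lse`, `keep_idx`) are hypothetical.

```python
import torch


def sparse_attention_rescaled(q, k, v, keep_idx, prev_full_lse=None, scale=None):
    """Sparse attention over a subset of keys, with the output rescaled toward
    the full-attention normalization using a cached log-sum-exp.

    Illustrative sketch only (not the paper's exact formulation).
    q: (B, H, Lq, D), k/v: (B, H, Lk, D)
    keep_idx: LongTensor of retained key/value indices
    prev_full_lse: (B, H, Lq) log-sum-exp of full attention logits cached
                   from a previous (dense) step, or None
    """
    if scale is None:
        scale = q.shape[-1] ** -0.5

    k_s = k[:, :, keep_idx, :]  # retained keys
    v_s = v[:, :, keep_idx, :]  # retained values

    # Attention logits over the retained subset only.
    logits = torch.einsum("bhqd,bhkd->bhqk", q, k_s) * scale
    sparse_lse = torch.logsumexp(logits, dim=-1)  # (B, H, Lq)

    probs = torch.softmax(logits, dim=-1)
    out = torch.einsum("bhqk,bhkd->bhqd", probs, v_s)

    if prev_full_lse is not None:
        # Z_sparse / Z_full: shrink the sparse output so its normalization
        # approximates that of full attention (assumes the dropped tokens
        # contribute little to the numerator but inflate the sparse weights).
        ratio = torch.exp(sparse_lse - prev_full_lse)
        out = out * ratio.unsqueeze(-1)

    return out, sparse_lse
```

The key design point this sketch tries to capture is that the correction costs only an elementwise rescaling per query, so it adds negligible overhead on top of the sparse attention itself.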
Authors (5)
Ruichen Chen
Keith G. Mills
Liyao Jiang
Chao Gao
Di Niu
Key Contributions
Proposes Re-ttention, a method for achieving very high sparsity attention in Diffusion Transformers for visual generation. It leverages the temporal redundancy of diffusion models and reshapes attention scores based on the prior softmax distribution history, preserving visual quality at extreme sparsity levels and overcoming the limitations of existing sparse attention techniques.
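
The temporal-redundancy aspect can be pictured as a denoising loop that occasionally runs a dense attention step to refresh the cached normalization statistics and reuses them at the intervening sparse steps. The loop below is a hypothetical usage example built on the `sparse_attention_rescaled` sketch above; `q_fn`, `k_fn`, `v_fn`, and `refresh_every` are illustrative names, not part of the paper.

```python
import torch


def denoise_with_sparse_attention(q_fn, k_fn, v_fn, num_steps, keep_idx, refresh_every=10):
    """Hypothetical loop: refresh full-attention statistics every few steps,
    reuse them to correct normalization at the sparse steps in between."""
    prev_full_lse = None
    outputs = []
    for t in range(num_steps):
        q, k, v = q_fn(t), k_fn(t), v_fn(t)
        if prev_full_lse is None or t % refresh_every == 0:
            # Dense step: attend to all keys and cache the full log-sum-exp.
            full_idx = torch.arange(k.shape[2], device=k.device)
            out, prev_full_lse = sparse_attention_rescaled(q, k, v, full_idx)
        else:
            # Sparse step: reuse cached statistics from the last dense step.
            out, _ = sparse_attention_rescaled(q, k, v, keep_idx, prev_full_lse)
        outputs.append(out)
    return outputs
```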
Business Value
Enables faster and more efficient generation of high-quality images and videos, reducing computational costs and enabling applications in resource-constrained settings.