Abstract: Diffusion Transformers (DiT) are renowned for their impressive generative
performance; however, their computational cost is substantial due to the
quadratic complexity of self-attention and the extensive sampling steps
required. While advances have been made in
expediting the sampling process, the underlying architectural inefficiencies
within DiT remain underexplored. We introduce SparseDiT, a novel framework that
implements token sparsification across spatial and temporal dimensions to
enhance computational efficiency while preserving generative quality.
Spatially, SparseDiT employs a tri-segment architecture that allocates token
density based on feature requirements at each layer: Poolingformer in the
bottom layers for efficient global feature extraction, Sparse-Dense Token
Modules (SDTM) in the middle layers to balance global context with local
detail, and dense tokens in the top layers to refine high-frequency details.
Temporally, SparseDiT dynamically modulates token density across denoising
stages, progressively increasing token count as finer details emerge in later
timesteps. This synergy between SparseDiT's spatially adaptive architecture and
its temporal pruning strategy enables a unified framework that balances
efficiency and fidelity throughout the generation process. Our experiments
demonstrate SparseDiT's effectiveness: a 55% reduction in FLOPs and a 175%
improvement in inference speed on DiT-XL with a comparable FID score on
512×512 ImageNet, a 56% reduction in FLOPs across video generation datasets,
and a 69% improvement in inference speed on PixArt-$\alpha$ for text-to-image
generation with a 0.24 decrease in FID score. SparseDiT provides a scalable
solution for
high-quality diffusion-based generation compatible with sampling optimization
techniques.
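
To make the two sparsification axes concrete, the sketch below shows one plausible reading of the abstract's description: a tri-segment layer layout (pooled tokens in the bottom layers, a sparse-dense mix in the middle, fully dense tokens at the top) and a temporal schedule that keeps few tokens at early, noisy denoising steps and ramps the count up as fine details emerge. This is a minimal illustration, not the paper's implementation; the function names (`segment_for_layer`, `token_keep_ratio`, `pool_tokens`), the even split into thirds, the linear ramp, and the pooling stride are all assumptions.

```python
# Hypothetical sketch of SparseDiT-style token sparsification (illustrative
# only; names, schedules, and pool sizes are assumptions, not the paper's code).
import torch
import torch.nn.functional as F

def segment_for_layer(layer_idx: int, depth: int) -> str:
    """Tri-segment layout over depth: pooled tokens in the bottom third
    (cheap global context), a sparse-dense mix in the middle third, and
    fully dense tokens in the top third (high-frequency detail)."""
    if layer_idx < depth // 3:
        return "pooled"
    if layer_idx < 2 * depth // 3:
        return "sparse-dense"
    return "dense"

def token_keep_ratio(step: int, num_steps: int,
                     lo: float = 0.25, hi: float = 1.0) -> float:
    """Temporal schedule: keep few tokens at early (noisy) denoising steps
    and progressively more at later steps (a linear ramp is assumed)."""
    frac = step / max(num_steps - 1, 1)
    return lo + (hi - lo) * frac

def pool_tokens(x: torch.Tensor, hw: int, stride: int = 2) -> torch.Tensor:
    """Average-pool a (B, N, C) token sequence on its hw x hw grid,
    shrinking N by stride**2 (a Poolingformer-style reduction)."""
    b, n, c = x.shape
    grid = x.transpose(1, 2).reshape(b, c, hw, hw)
    pooled = F.avg_pool2d(grid, kernel_size=stride, stride=stride)
    return pooled.flatten(2).transpose(1, 2)

if __name__ == "__main__":
    depth, num_steps, hw = 12, 50, 16
    x = torch.randn(1, hw * hw, 64)  # (B, N, C) dense tokens
    for step in (0, 25, 49):
        n_keep = int(token_keep_ratio(step, num_steps) * x.shape[1])
        print(f"step {step:2d}: keep {n_keep}/{x.shape[1]} tokens")
    for layer in (0, 6, 11):
        seg = segment_for_layer(layer, depth)
        n = pool_tokens(x, hw).shape[1] if seg == "pooled" else x.shape[1]
        print(f"layer {layer:2d}: segment={seg}, tokens={n}")
```

The intuition the sketch captures is that self-attention's quadratic cost makes token count the dominant knob: pooling where only coarse global context is needed, and spending dense tokens only where and when high-frequency detail is actually resolved.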