Redirecting to original paper in 30 seconds...
Click below to go immediately or wait for automatic redirect
📄 Abstract
Abstract: The escalating adoption of diffusion models for applications such as image
generation demands efficient parallel inference techniques to manage their
substantial computational cost. However, existing diffusion parallelism
inference schemes often underutilize resources in heterogeneous multi-GPU
environments, where varying hardware capabilities or background tasks cause
workload imbalance. This paper introduces Spatio-Temporal Adaptive Diffusion
Inference (STADI), a novel framework to accelerate diffusion model inference in
such settings. At its core is a hybrid scheduler that orchestrates fine-grained
parallelism across both temporal and spatial dimensions. Temporally, STADI
introduces a novel computation-aware step allocator applied after warmup
phases, using a least-common-multiple-minimizing quantization technique to
reduce denoising steps on slower GPUs and execution synchronization. To further
minimize GPU idle periods, STADI executes an elastic patch parallelism
mechanism that allocates variably sized image patches to GPUs according to
their computational capability, ensuring balanced workload distribution through
a complementary spatial mechanism. Extensive experiments on both
load-imbalanced and heterogeneous multi-GPU clusters validate STADI's efficacy,
demonstrating improved load balancing and mitigation of performance
bottlenecks. Compared to patch parallelism, a state-of-the-art diffusion
inference framework, our method significantly reduces end-to-end inference
latency by up to 45% and significantly improves resource utilization on
heterogeneous GPUs.