Abstract
As large language models (LLMs) continue to scale, their workloads increasingly rely on distributed execution across multiple GPUs. However, the conventional bulk synchronous parallel (BSP) model used in such settings introduces significant performance inefficiencies. To characterize these bottlenecks, we introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. By exploiting libraries such as Iris for Triton, we gain access to in-kernel communication primitives that enable the design of novel fine-grained programming patterns, offering greater flexibility and performance than traditional BSP-based approaches. These patterns systematically eliminate the three taxes by creating direct, tile-level producer-consumer pipelines and replacing global barriers with fine-grained dataflow synchronization. Applying this methodology to critical kernels, from the foundational All-Gather + general matrix multiplication (GEMM) operation to the complex Flash Decode algorithm, we observe a 10-20% improvement in end-to-end latency over BSP-based approaches, establishing a more programmable and efficient paradigm for distributed LLM workloads.
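The All-Gather + GEMM kernel named in the abstract is the simplest place to see what replacing the bulk-synchronous step buys. Below is a minimal host-side sketch in PyTorch contrasting the BSP pattern (gather every weight shard, wait at a global synchronization point, then compute) with a coarse producer-consumer pipeline that consumes each shard as soon as its transfer completes. The function names, the row-sharded weight layout, and the per-shard broadcast scheme are illustrative assumptions; the paper's actual approach operates at tile granularity inside a fused kernel using Iris's in-kernel communication primitives for Triton, which this host-level sketch only approximates.

```python
# Sketch only: a host-side approximation of BSP vs. fine-grained AG+GEMM.
# Assumes y = x @ W^T with W row-sharded across ranks; not the paper's
# Iris/Triton in-kernel implementation.

import torch
import torch.distributed as dist


def ag_gemm_bsp(x, w_shard, group=None):
    """BSP baseline: gather all shards, hit a global barrier, then compute."""
    world = dist.get_world_size(group)
    w_full = torch.empty(world * w_shard.shape[0], w_shard.shape[1],
                         dtype=w_shard.dtype, device=w_shard.device)
    # Bulk-synchronous step: no compute can begin until the collective finishes.
    dist.all_gather_into_tensor(w_full, w_shard, group=group)
    return x @ w_full.T


def ag_gemm_pipelined(x, w_shard, group=None):
    """Producer-consumer variant: each shard is consumed as soon as it arrives,
    overlapping the remaining communication with partial GEMMs."""
    world = dist.get_world_size(group)
    rank = dist.get_rank(group)
    n = w_shard.shape[0]
    shards = [torch.empty_like(w_shard) for _ in range(world)]
    shards[rank].copy_(w_shard)
    # Post all broadcasts asynchronously: rank r is the producer of shard r.
    works = [dist.broadcast(shards[r], src=r, group=group, async_op=True)
             for r in range(world)]
    y = x.new_zeros(x.shape[0], world * n)
    for r in range(world):
        works[r].wait()                            # per-shard readiness, not a global barrier
        y[:, r * n:(r + 1) * n] = x @ shards[r].T  # consumer: partial GEMM on shard r
    return y
```

Even at this coarse, shard-level granularity the overlap conveys the idea; the paper pushes the same producer-consumer dependency down to individual output tiles inside a single fused kernel, which also removes the extra kernel launches and preserves inter-kernel data locality, i.e., the other two taxes.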
Key Contributions
This paper proposes a systems approach to eliminating performance inefficiencies in distributed LLM execution across multiple GPUs, moving beyond the traditional Bulk Synchronous Parallel (BSP) model. It introduces the "Three Taxes" framework and uses in-kernel communication primitives to enable fine-grained dataflow synchronization, improving end-to-end latency by 10-20% over BSP-based approaches.
Business Value
Enables more efficient and cost-effective training and deployment of increasingly large language models, reducing the computational resources and time required for AI development.