Abstract
Graph neural networks (GNNs) leverage the connectivity and structure of
real-world graphs to learn intricate properties and relationships between
nodes. Many real-world graphs exceed the memory capacity of a GPU due to their
sheer size, and training GNNs on such graphs requires techniques such as
mini-batch sampling to scale. The alternative approach of distributed
full-graph training suffers from high communication overheads and load
imbalance due to the irregular structure of graphs. We propose a
three-dimensional (3D) parallel approach for full-graph training that tackles
these issues and scales to billion-edge graphs. In addition, we introduce
optimizations such as a double permutation scheme for load balancing, and a
performance model to predict the optimal 3D configuration of our parallel
implementation -- Plexus. We evaluate Plexus on six different graph datasets
and show scaling results on up to 2048 GPUs of Perlmutter, and 1024 GPUs of
Frontier. Plexus achieves unprecedented speedups of 2.3-12.5x over prior state
of the art, and a reduction in time-to-solution by 5.2-8.7x on Perlmutter and
7.0-54.2x on Frontier.
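To make the 3D idea concrete, the sketch below simulates, in a single process, how the aggregation step H' = A @ H of full-graph training could be decomposed across a conceptual X x Y x Z grid: the adjacency matrix is blocked along two dimensions and the feature dimension is sliced along the third, with partial sums combined across one grid dimension. The grid shape and exact blocking are illustrative assumptions, not the precise scheme used by Plexus.

```python
# Minimal single-process sketch of a 3D decomposition of GNN aggregation.
# The (X, Y, Z) grid and blocking are assumptions for illustration only.
import numpy as np

def aggregate_3d(A, H, X, Y, Z):
    """Simulate H' = A @ H on a conceptual X x Y x Z process grid.

    A: (n, n) adjacency matrix, split into X x Y blocks (rows x cols).
    H: (n, f) node features; rows align with A's column blocks and the
       feature dimension is split into Z slices.
    Each "process" (i, j, k) multiplies its local block A[i, j] with the
    matching feature slice of H; partial results along j are summed
    (an all-reduce over one grid dimension in a real distributed run).
    """
    n, f = H.shape
    row_blocks = np.array_split(np.arange(n), X)
    col_blocks = np.array_split(np.arange(n), Y)
    feat_slices = np.array_split(np.arange(f), Z)

    out = np.zeros((n, f))
    for i, rows in enumerate(row_blocks):
        for k, feats in enumerate(feat_slices):
            # Accumulate partial sums across the column-block dimension.
            acc = np.zeros((len(rows), len(feats)))
            for j, cols in enumerate(col_blocks):
                acc += A[np.ix_(rows, cols)] @ H[np.ix_(cols, feats)]
            out[np.ix_(rows, feats)] = acc
    return out

# Tiny usage example on a random graph: the blocked result matches A @ H.
rng = np.random.default_rng(0)
n, f = 12, 8
A = (rng.random((n, n)) < 0.3).astype(float)
H = rng.standard_normal((n, f))
assert np.allclose(aggregate_3d(A, H, X=2, Y=3, Z=2), A @ H)
```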
Authors (4)
Aditya K. Ranjan
Siddharth Singh
Cunyang Wei
Abhinav Bhatele
Key Contributions
Plexus introduces a novel 3D parallel approach for full-graph GNN training, enabling scalability to billion-edge graphs. It addresses key challenges such as high communication overhead and load imbalance through a double permutation scheme for load balancing and a performance model that predicts the optimal 3D configuration, making large-scale graph analysis more feasible. The sketch below illustrates the load-balancing idea.
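The following is a hedged sketch of the intuition behind a "double permutation" for load balancing: randomly relabeling nodes along both the row and column dimensions of the adjacency matrix before blocking it spreads edges more evenly across the 2D blocks. Whether Plexus applies independent row and column permutations or a single node permutation, and how it blocks the matrix, are assumptions here; the point is the effect on per-block edge counts.

```python
# Illustrative only: how shuffling row/column orderings evens out the
# number of edges (nonzeros) per block of a skewed adjacency matrix.
import numpy as np

def block_edge_counts(A, X, Y):
    """Edge count in each of the X x Y blocks of adjacency matrix A."""
    rows = np.array_split(np.arange(A.shape[0]), X)
    cols = np.array_split(np.arange(A.shape[1]), Y)
    return np.array([[int(A[np.ix_(r, c)].sum()) for c in cols] for r in rows])

rng = np.random.default_rng(0)
n = 1000
# A skewed synthetic graph: low-ID nodes are much denser than high-ID
# nodes, mimicking the degree skew of real-world graphs.
p = np.linspace(0.2, 0.001, n)
A = (rng.random((n, n)) < np.outer(p, p) * 25).astype(np.int8)

before = block_edge_counts(A, X=4, Y=4)

# "Double permutation": shuffle row and column orderings, then re-block.
row_perm = rng.permutation(n)
col_perm = rng.permutation(n)
after = block_edge_counts(A[row_perm][:, col_perm], X=4, Y=4)

# The max/mean ratio (a simple load-imbalance measure) drops toward 1.
print("max/mean block load before:", before.max() / before.mean())
print("max/mean block load after: ", after.max() / after.mean())
```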
Business Value
Enables the training of GNNs on massive real-world graphs, unlocking new possibilities for insights in areas like social networks, recommendation systems, and scientific simulations that were previously computationally infeasible.