Abstract
Graph neural networks (GNNs) leverage the connectivity and structure of
real-world graphs to learn intricate properties and relationships between
nodes. Many real-world graphs exceed the memory capacity of a GPU due to their
sheer size, and training GNNs on such graphs requires techniques such as
mini-batch sampling to scale. The alternative approach of distributed
full-graph training suffers from high communication overheads and load
imbalance due to the irregular structure of graphs. We propose a
three-dimensional (3D) parallel approach for full-graph training that tackles
these issues and scales to billion-edge graphs. In addition, we introduce
optimizations such as a double permutation scheme for load balancing, and a
performance model to predict the optimal 3D configuration of our parallel
implementation -- Plexus. We evaluate Plexus on six different graph datasets
and show scaling results on up to 2048 GPUs of Perlmutter, and 1024 GPUs of
Frontier. Plexus achieves unprecedented speedups of 2.3-12.5x over prior state
of the art, and a reduction in time-to-solution by 5.2-8.7x on Perlmutter and
7.0-54.2x on Frontier.
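To make the 3D idea concrete, the sketch below simulates, in a single process, how the aggregation step H' = A @ H of full-graph training could be decomposed across a conceptual X x Y x Z grid: the adjacency matrix is blocked along two dimensions and the feature dimension is sliced along the third, with partial sums combined across one grid dimension. The grid shape and exact blocking are illustrative assumptions, not the precise scheme used by Plexus.

```python
# Minimal single-process sketch of a 3D decomposition of GNN aggregation.
# The (X, Y, Z) grid and blocking are assumptions for illustration only.
import numpy as np

def aggregate_3d(A, H, X, Y, Z):
    """Simulate H' = A @ H on a conceptual X x Y x Z process grid.

    A: (n, n) adjacency matrix, split into X x Y blocks (rows x cols).
    H: (n, f) node features; rows align with A's column blocks and the
       feature dimension is split into Z slices.
    Each "process" (i, j, k) multiplies its local block A[i, j] with the
    matching feature slice of H; partial results along j are summed
    (an all-reduce over one grid dimension in a real distributed run).
    """
    n, f = H.shape
    row_blocks = np.array_split(np.arange(n), X)
    col_blocks = np.array_split(np.arange(n), Y)
    feat_slices = np.array_split(np.arange(f), Z)

    out = np.zeros((n, f))
    for i, rows in enumerate(row_blocks):
        for k, feats in enumerate(feat_slices):
            # Accumulate partial sums across the column-block dimension.
            acc = np.zeros((len(rows), len(feats)))
            for j, cols in enumerate(col_blocks):
                acc += A[np.ix_(rows, cols)] @ H[np.ix_(cols, feats)]
            out[np.ix_(rows, feats)] = acc
    return out

# Tiny usage example on a random graph: the blocked result matches A @ H.
rng = np.random.default_rng(0)
n, f = 12, 8
A = (rng.random((n, n)) < 0.3).astype(float)
H = rng.standard_normal((n, f))
assert np.allclose(aggregate_3d(A, H, X=2, Y=3, Z=2), A @ H)
```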
Authors (4)
Aditya K. Ranjan
Siddharth Singh
Cunyang Wei
Abhinav Bhatele
Key Contributions
Plexus introduces a novel 3D parallel approach for full-graph GNN training, enabling scalability to billion-edge graphs. It addresses key challenges such as high communication overhead and load imbalance through a double permutation scheme for load balancing and a performance model that predicts the optimal 3D configuration, making large-scale graph analysis more feasible. The sketch below illustrates the load-balancing idea.
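The following is a hedged sketch of the intuition behind a "double permutation" for load balancing: randomly relabeling nodes along both the row and column dimensions of the adjacency matrix before blocking it spreads edges more evenly across the 2D blocks. Whether Plexus applies independent row and column permutations or a single node permutation, and how it blocks the matrix, are assumptions here; the point is the effect on per-block edge counts.

```python
# Illustrative only: how shuffling row/column orderings evens out the
# number of edges (nonzeros) per block of a skewed adjacency matrix.
import numpy as np

def block_edge_counts(A, X, Y):
    """Edge count in each of the X x Y blocks of adjacency matrix A."""
    rows = np.array_split(np.arange(A.shape[0]), X)
    cols = np.array_split(np.arange(A.shape[1]), Y)
    return np.array([[int(A[np.ix_(r, c)].sum()) for c in cols] for r in rows])

rng = np.random.default_rng(0)
n = 1000
# A skewed synthetic graph: low-ID nodes are much denser than high-ID
# nodes, mimicking the degree skew of real-world graphs.
p = np.linspace(0.2, 0.001, n)
A = (rng.random((n, n)) < np.outer(p, p) * 25).astype(np.int8)

before = block_edge_counts(A, X=4, Y=4)

# "Double permutation": shuffle row and column orderings, then re-block.
row_perm = rng.permutation(n)
col_perm = rng.permutation(n)
after = block_edge_counts(A[row_perm][:, col_perm], X=4, Y=4)

# The max/mean ratio (a simple load-imbalance measure) drops toward 1.
print("max/mean block load before:", before.max() / before.mean())
print("max/mean block load after: ", after.max() / after.mean())
```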
Business Value
Enables the training of GNNs on massive real-world graphs, unlocking new possibilities for insights in areas like social networks, recommendation systems, and scientific simulations that were previously computationally infeasible.