On Efficient Scaling of GNNs via IO-Aware Layers Implementations

Graph neural networks (GNNs) are a poor fit for GPUs: neighbor aggregation, the core operation in GNNs, reads from scattered, irregular locations in GPU memory and then reduces the results. Modern hardware is optimized primarily for regular, contiguous reads, as in matrix multiplication or attention. In this paper (an ICML spotlight), the authors do for GNNs roughly what FlashAttention did for transformers: they rewrite the main layers to reduce data movement between memory and compute units. Attention layers get an IO-aware implementation in the spirit of FlashAttention, and aggregation layers get additional parallelization for nodes with a large number of neighbors. The authors also show that for some convolutional layers, NVIDIA's current solutions are already faster than most specialized implementations. The result is up to 8.5× speedup and up to 76× lower memory use in certain scenarios. All implementations are available as drop-in replacements for popular GNN frameworks. Incidentally, the paper is essentially a project by instructors and students from SHAD.
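
To make the memory-access problem concrete, here is a minimal NumPy sketch of sum aggregation over neighbors. It is an illustration only, not the authors' code: the function name `aggregate_sum`, the edge-list layout (`src`, `dst`), and the toy graph are all assumptions made for this example.

```python
import numpy as np

def aggregate_sum(x, src, dst, num_nodes):
    """Sum neighbor features: out[v] = sum of x[u] over edges u -> v.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Gather: rows of x are read in edge order, i.e. from scattered locations.
    messages = x[src]  # shape (num_edges, num_features)
    out = np.zeros((num_nodes, x.shape[1]), dtype=x.dtype)
    # Scatter-add: unbuffered accumulation, so repeated dst indices add up.
    np.add.at(out, dst, messages)
    return out

# Toy graph with 4 nodes; edge i points from src[i] to dst[i].
x = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [4.0, 1.0]])
src = np.array([2, 0, 3, 1, 0])
dst = np.array([0, 1, 0, 0, 3])
out = aggregate_sum(x, src, dst, num_nodes=4)
print(out)
```

Two things stand out even in this toy version. The gather `x[src]` follows the graph structure rather than memory order, and it materializes a per-edge intermediate array. Node 0 also receives three of the five messages, which hints at why nodes with many neighbors may need extra parallelization to keep the work balanced.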