Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods (2110.14446v1)

Published 27 Oct 2021 in cs.LG, cs.SI, and stat.ML

Abstract: Many widely used datasets for graph machine learning tasks have generally been homophilous, where nodes with similar labels connect to each other. Recently, new Graph Neural Networks (GNNs) have been developed that move beyond the homophily regime; however, their evaluation has often been conducted on small graphs with limited application domains. We collect and introduce diverse non-homophilous datasets from a variety of application areas that have up to 384x more nodes and 1398x more edges than prior datasets. We further show that existing scalable graph learning and graph minibatching techniques lead to performance degradation on these non-homophilous datasets, thus highlighting the need for further work on scalable non-homophilous methods. To address these concerns, we introduce LINKX -- a strong simple method that admits straightforward minibatch training and inference. Extensive experimental results with representative simple methods and GNNs across our proposed datasets show that LINKX achieves state-of-the-art performance for learning on non-homophilous graphs. Our codes and data are available at https://github.com/CUAI/Non-Homophily-Large-Scale.

Summary

The paper introduces LINKX, a simple yet robust model that decouples node features and adjacency to handle non-homophilous graphs effectively.
It presents large-scale, diverse datasets from domains like social and citation networks to address dataset scarcity in non-homophilous graph research.
The study demonstrates that traditional GNN minibatching techniques degrade on non-homophilous graphs, emphasizing the need for scalable learning methods.

Exploring Non-Homophilous Graph Learning: Benchmarks and Methods

The paper presents a comprehensive study of machine learning on non-homophilous graphs, a significant departure from the predominantly homophilous datasets typically used in graph neural network (GNN) research. It identifies three core issues: the scarcity of large, diverse non-homophilous datasets, the inadequacy of current graph minibatching and scalable learning techniques on such datasets, and the poor scalability of existing non-homophilous methods.

Contributions and Findings

New Datasets

The authors address the lack of suitable datasets by introducing a suite of large-scale, non-homophilous graph datasets drawn from several domains, some with millions of nodes and edges. They span social networks (e.g., Pokec, genius), citation networks (e.g., arXiv-year, snap-patents), and other sources such as the wiki dataset, which has over 1.9 million nodes. According to the abstract, the new datasets have up to 384x more nodes and 1398x more edges than prior non-homophilous datasets, which makes them a more demanding test bed for learning methods.
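To make "non-homophilous" concrete, one widely used measure is the edge homophily ratio: the fraction of edges whose endpoints share a label. The sketch below computes it on two toy graphs. The graphs, labels, and the function name `edge_homophily` are illustrative assumptions, not data or code from the paper, and this summary does not say which homophily measures the paper itself uses.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    if not edges:
        raise ValueError("graph has no edges")
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)


# Toy graphs (hypothetical, not taken from the paper's datasets).
labels = [0, 0, 1, 1]
homophilous_edges = [(0, 1), (2, 3), (1, 2)]             # two of three edges stay within a class
heterophilous_edges = [(0, 2), (0, 3), (1, 2), (1, 3)]   # every edge crosses classes

h_high = edge_homophily(homophilous_edges, labels)
h_low = edge_homophily(heterophilous_edges, labels)
print(h_high, h_low)
```

A ratio near 1 indicates a homophilous graph. Low values indicate that neighbors tend to have different labels, which is the regime the new datasets target.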

Evaluation of Scalability and Minibatching

The paper critiques existing scalable graph learning methods such as SGC and C&S (Correct and Smooth): they rely on homophily assumptions, which leads to performance degradation in non-homophilous settings. Empirical evidence shows that common graph minibatching techniques, such as GraphSAINT, also lose significant performance on these graphs.
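A small sketch can show why feature smoothing leans on homophily. SGC precomputes features as S^K X, where S is the symmetrically normalized adjacency matrix with self-loops, and then fits a linear classifier. The toy graph below is a hypothetical complete bipartite graph in which every edge crosses classes; it is an illustration only, not a reproduction of the paper's experiments.

```python
import numpy as np


def sgc_features(adj, x, k=2):
    """Return S^k X, with S = D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + np.eye(adj.shape[0])           # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    s = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
    out = x.astype(float)
    for _ in range(k):
        out = s @ out                            # one smoothing step
    return out


# Hypothetical toy graph: nodes 0,1 are class 0, nodes 2,3 are class 1,
# and every edge connects the two classes.
adj = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [1, 1, 0, 0],
                [1, 1, 0, 0]], dtype=float)
x = np.array([[1.0], [1.0], [-1.0], [-1.0]])     # feature perfectly separates classes

smoothed = sgc_features(adj, x, k=2)
gap_before = abs(x[0, 0] - x[2, 0])
gap_after = abs(smoothed[0, 0] - smoothed[2, 0])
print(gap_before, gap_after)
```

Each propagation step averages a node's feature with those of its opposite-class neighbors, so the gap between the two classes shrinks. On a homophilous graph the same averaging would reinforce the class signal instead.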

Introduction of LINKX

In response to the highlighted issues, the authors propose LINKX, a new model designed to function well with non-homophilous graphs. LINKX stands out due to its straightforward design, which separates the embedding of adjacency matrices and node features, allowing it to efficiently leverage node feature information and graph topology independently. It facilitates simple row-wise minibatching, which maintains performance without the overhead of complex graph-specific techniques like neighbor sampling or subgraph generation.

Methodological Insights

LINKX's architecture, utilizing multi-layer perceptrons (MLPs) for embedding, provides a scalable solution that effectively combines the strengths of traditional MLPs and the LINK method, which utilizes adjacency information more robustly in non-homophilous settings. The architecture involves processing adjacency and feature matrices separately, combining them through simple linear transformations before generating predictions. This design allows LINKX to avoid the substantial scalability issues faced by more complex GNNs without sacrificing the ability to handle large-scale non-homophilous datasets.
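The architecture described above can be sketched as a forward pass in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the layer sizes, random initialization, ReLU nonlinearity, and the residual-style sum of the two embeddings are choices made here for the example, and the names `init_mlp`, `mlp`, and `linkx_forward` are hypothetical. The sketch also checks the row-wise minibatching property: because each node's output depends only on its own adjacency row and feature row, a batch computed from `adj[batch]` and `x[batch]` matches the corresponding rows of the full-batch output.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_mlp(sizes):
    """Random weights and zero biases for an MLP with the given layer sizes."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def mlp(params, h):
    """Apply the MLP; ReLU between layers, none after the last."""
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h


def linkx_forward(p, adj_rows, x_rows):
    """LINKX-style forward pass on a set of rows (assumed formulation)."""
    h_a = mlp(p["mlp_a"], adj_rows)   # embed adjacency rows
    h_x = mlp(p["mlp_x"], x_rows)     # embed node features
    # Linear map of the concatenation plus a residual-style sum (assumption).
    combined = np.concatenate([h_a, h_x], axis=1) @ p["w"] + h_a + h_x
    return mlp(p["mlp_f"], np.maximum(combined, 0.0))


# Hypothetical toy problem: 6 nodes, 4 features, 3 classes.
n, d, hidden, classes = 6, 4, 8, 3
adj = (rng.random((n, n)) < 0.4).astype(float)
x = rng.normal(size=(n, d))
params = {
    "mlp_a": init_mlp([n, hidden]),
    "mlp_x": init_mlp([d, hidden]),
    "w": rng.normal(0, 0.1, (2 * hidden, hidden)),
    "mlp_f": init_mlp([hidden, hidden, classes]),
}

full = linkx_forward(params, adj, x)

# Row-wise minibatch: only the batch's adjacency and feature rows are needed.
batch = np.array([4, 1])
mini = linkx_forward(params, adj[batch], x[batch])
print(np.allclose(mini, full[batch]))
```

No neighbor sampling or subgraph construction appears anywhere in the batch path, which is the practical benefit of treating adjacency rows as plain input features.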

Experimental Validation

Experiments across several new and previously used datasets consistently show LINKX outperforming other methods, including state-of-the-art GNNs, in non-homophilous settings. Its advantage is most notable where scalable training is required: LINKX maintains strong performance under simple minibatching strategies. These results indicate that LINKX is robust on diverse, large, non-homophilous graphs and suggest potential for practical applications beyond small benchmark datasets.

Implications and Future Research

The introduction of LINKX and the accompanying datasets represents a significant advance in non-homophilous graph learning. Practically, this work calls for the reevaluation of existing GNN architectures and urges consideration of non-homophilous settings in future benchmark designs. It also underscores the importance of method scalability for real-world applications, especially in non-homophilous contexts such as anti-fraud systems and certain social network analyses.

From a theoretical standpoint, the results highlight the insufficiency of relying purely on homophily assumptions in graph learning and open avenues for developing methods that accurately capture and exploit the complex node-label relationships present in non-homophilous graphs.

Overall, the work provides a solid foundation on which future research can build more inclusive, scalable, and efficient graph learning methods. Continued exploration in this area will likely yield further insights, helping to close the gap between homophily-driven theory and the needs of complex, real-world graph structures.

GitHub

GitHub - CUAI/Non-Homophily-Large-Scale: [NeurIPS 2021] Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods (114 stars)