A critical look at the evaluation of GNNs under heterophily: Are we really making progress? (2302.11640v2)

Published 22 Feb 2023 in cs.LG

Abstract: Node classification is a classical graph machine learning task on which Graph Neural Networks (GNNs) have recently achieved strong results. However, it is often believed that standard GNNs only work well for homophilous graphs, i.e., graphs where edges tend to connect nodes of the same class. Graphs without this property are called heterophilous, and it is typically assumed that specialized methods are required to achieve strong performance on such graphs. In this work, we challenge this assumption. First, we show that the standard datasets used for evaluating heterophily-specific models have serious drawbacks, making results obtained by using them unreliable. The most significant of these drawbacks is the presence of a large number of duplicate nodes in the datasets Squirrel and Chameleon, which leads to train-test data leakage. We show that removing duplicate nodes strongly affects GNN performance on these datasets. Then, we propose a set of heterophilous graphs of varying properties that we believe can serve as a better benchmark for evaluating the performance of GNNs under heterophily. We show that standard GNNs achieve strong results on these heterophilous graphs, almost always outperforming specialized models. Our datasets and the code for reproducing our experiments are available at https://github.com/yandex-research/heterophilous-graphs

Authors (5)
  1. Oleg Platonov (3 papers)
  2. Denis Kuznedelev (21 papers)
  3. Michael Diskin (6 papers)
  4. Artem Babenko (43 papers)
  5. Liudmila Prokhorenkova (26 papers)
Citations (140)

Summary

  • The paper shows that flawed benchmarks, most notably ones affected by train-test data leakage, have misled evaluations of GNNs under heterophily.
  • It demonstrates through empirical analysis that standard GNNs often outperform specialized models on heterophilous datasets.
  • The authors propose a novel suite of diverse benchmarks to drive more accurate and robust future GNN research.

A Critical Analysis of GNN Performance on Heterophilous Graph Datasets

The paper "A critical look at the evaluation of GNNs under heterophily: Are we really making progress?" examines how Graph Neural Networks (GNNs) are evaluated on heterophilous datasets. It focuses on the reliability of the benchmarks used to assess heterophily-specific models, identifies problems with the standard datasets, and proposes a new set of benchmarks for future evaluations.

The authors begin with the common assumption that standard GNNs, which are widely believed to rely on graph homophily (edges tending to connect nodes of the same class), are inadequate for heterophilous graphs. They argue against this assumption by exposing flaws in the datasets traditionally used to evaluate models under heterophily.
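To make the notion of homophily concrete, here is a minimal sketch of the edge homophily ratio: the fraction of edges whose endpoints share a class. The function name and the toy graph are illustrative assumptions, not code from the paper's repository.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same class label.

    edges  : list of (u, v) pairs, each undirected edge listed once
    labels : sequence where labels[i] is the class of node i
    """
    if not edges:
        raise ValueError("graph has no edges")
    same_class = sum(1 for u, v in edges if labels[u] == labels[v])
    return same_class / len(edges)


# Hypothetical toy data: a 4-cycle.
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
alternating_labels = [0, 1, 0, 1]  # every edge joins different classes
uniform_labels = [0, 0, 0, 0]      # every edge joins the same class

heterophilous_score = edge_homophily(toy_edges, alternating_labels)
homophilous_score = edge_homophily(toy_edges, uniform_labels)
```

A value near 1 indicates a homophilous graph and a value near 0 a heterophilous one, although this simple ratio is sensitive to the number of classes and their balance.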

Problems in Existing Heterophilous Benchmarks

The researchers examine the widely used Squirrel and Chameleon datasets and uncover a significant issue: a large number of duplicate nodes, which leads to train-test data leakage. With duplicates present, a model can score well by exploiting the leakage rather than by learning anything about heterophily. Removing the duplicates strongly affects GNN performance, which indicates that earlier results on these datasets were unreliable. The paper also discusses datasets such as Texas and Cornell, pointing to extreme class imbalance and very small dataset size, both of which make benchmarking results unreliable.
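The kind of check described above can be sketched as follows. This is an illustration only: it assumes that two nodes count as duplicates when they share both the same neighbor set and the same target value, which may not match the paper's exact criterion, and all names and data are hypothetical.

```python
from collections import defaultdict


def find_duplicate_groups(neighbors, targets):
    """Group nodes that share an identical neighbor set and target.

    ASSUMPTION: this duplicate criterion is illustrative; the paper's
    precise definition may differ.
    neighbors : dict mapping node -> iterable of neighbor nodes
    targets   : dict mapping node -> label or regression target
    """
    groups = defaultdict(list)
    for node, nbrs in neighbors.items():
        key = (frozenset(nbrs), targets[node])
        groups[key].append(node)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def leaking_groups(groups, train_nodes, test_nodes):
    """Return duplicate groups with members in both train and test sets."""
    train, test = set(train_nodes), set(test_nodes)
    return [g for g in groups if train & set(g) and test & set(g)]


# Hypothetical toy graph: nodes 0 and 1 are duplicates of each other.
toy_neighbors = {0: [2, 3], 1: [2, 3], 2: [0, 1], 3: [0, 1], 4: [2]}
toy_targets = {0: 5, 1: 5, 2: 1, 3: 2, 4: 5}

duplicates = find_duplicate_groups(toy_neighbors, toy_targets)
leaks = leaking_groups(duplicates, train_nodes=[0, 2], test_nodes=[1, 3, 4])
no_leaks = leaking_groups(duplicates, train_nodes=[0, 1], test_nodes=[2, 3, 4])
```

If any group straddles the split, a model can effectively memorize a training node and be rewarded for it at test time, which is the leakage the paper warns about.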

A New Set of Heterophilous Datasets

To address these gaps, the authors propose a new set of heterophilous graphs intended as a more reliable benchmark suite. The five datasets (roman-empire, amazon-ratings, minesweeper, tolokers, and questions) come from diverse domains, including text, commerce, and synthetic environments, and so cover a range of heterophilous patterns. Their sizes are chosen so that a wide range of GNN models, including computationally expensive ones, can be evaluated on them.

Empirical Evaluations and Insights

To evaluate GNN performance on the proposed datasets, the authors run extensive experiments covering both standard GNNs and models designed specifically for heterophily. They find that standard GNNs almost always outperform the specialized models. This challenges the assumption that conventional GNN architectures are ineffective under heterophily and suggests that both theory and model design deserve reconsideration.

The paper also highlights the benefit of a simple design choice: separating ego- and neighbor-embeddings within the GNN architecture, so that a node's own representation is not averaged together with those of its neighbors. The authors find this trick consistently effective across datasets.
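A minimal sketch of what this separation means, under stated assumptions: the layer below applies one weight matrix to the node's own features and a different one to the mean of its neighbors' features, then sums the results (equivalent to concatenating the two vectors before a single linear map). The function names, weights, and toy graph are hypothetical, and nonlinearities and normalization are omitted.

```python
def matvec(weight, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def separated_layer(features, neighbors, w_self, w_neigh):
    """Ego and neighbor embeddings get separate weights (illustrative)."""
    output = {}
    for node, own in features.items():
        nbrs = neighbors.get(node, [])
        if nbrs:
            aggregated = mean_vectors([features[m] for m in nbrs])
        else:
            aggregated = [0.0] * len(own)
        ego_part = matvec(w_self, own)
        neigh_part = matvec(w_neigh, aggregated)
        output[node] = [a + b for a, b in zip(ego_part, neigh_part)]
    return output


def mixed_layer(features, neighbors):
    """Baseline for contrast: average a node together with its neighbors."""
    return {
        node: mean_vectors([own] + [features[m] for m in neighbors.get(node, [])])
        for node, own in features.items()
    }


# Hypothetical toy graph: two connected nodes with opposite one-hot features.
toy_features = {0: [1.0, 0.0], 1: [0.0, 1.0]}
toy_neighbors = {0: [1], 1: [0]}
identity = [[1, 0], [0, 1]]
negated = [[-1, 0], [0, -1]]

separated = separated_layer(toy_features, toy_neighbors, identity, negated)
mixed = mixed_layer(toy_features, toy_neighbors)
```

In this toy case the mixed layer maps both nodes to the same vector, while the separated layer keeps them distinct because it can treat a neighbor's features differently from the node's own.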

Implications and Future Directions

The findings call for a re-evaluation of past progress on heterophilous datasets, since reported improvements may owe more to data leakage than to genuine advances in handling heterophily. The authors invite researchers to adopt the newly proposed benchmark so that future progress rests on robust empirical evaluation.

The paper may reorient future GNN research toward architectures that handle the varied structure of heterophilous graphs. Graph measures such as adjusted homophily could further help characterize datasets and guide model choice.
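As an illustration of adjusted homophily, the sketch below assumes the commonly stated form h_adj = (h_edge - S) / (1 - S), where h_edge is the fraction of same-class edges, S is the sum over classes k of D_k^2 / (2|E|)^2, and D_k is the total degree of nodes in class k. This formula is not given in the text above, so treat it, along with the function name and toy data, as an assumption to verify against the original definition.

```python
from collections import Counter


def adjusted_homophily(edges, labels):
    """Edge homophily corrected for the value expected by chance.

    ASSUMPTION: uses h_adj = (h_edge - S) / (1 - S) with
    S = sum_k D_k**2 / (2 * |E|)**2, D_k = total degree of class k.
    edges  : list of (u, v) pairs, each undirected edge listed once
    labels : sequence where labels[i] is the class of node i
    """
    num_edges = len(edges)
    if num_edges == 0:
        raise ValueError("graph has no edges")
    h_edge = sum(1 for u, v in edges if labels[u] == labels[v]) / num_edges

    class_degree = Counter()
    for u, v in edges:
        class_degree[labels[u]] += 1
        class_degree[labels[v]] += 1
    chance = sum(d * d for d in class_degree.values()) / (2 * num_edges) ** 2
    if chance == 1:
        raise ValueError("undefined when all edge endpoints share one class")
    return (h_edge - chance) / (1 - chance)


# Hypothetical toy graphs.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
cycle_labels = [0, 1, 0, 1]   # fully heterophilous 4-cycle
pair_edges = [(0, 1), (2, 3)]
pair_labels = [0, 0, 1, 1]    # fully homophilous pairs

heterophilous_adj = adjusted_homophily(cycle_edges, cycle_labels)
homophilous_adj = adjusted_homophily(pair_edges, pair_labels)
```

Unlike the raw edge ratio, this measure is centered at zero for a graph whose edges ignore class labels, which makes values more comparable across datasets with different class counts and balance.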

Overall, the paper makes the case for better evaluation datasets and for a deeper theoretical understanding of how heterophily interacts with graph learning methods. Its contributions are likely to influence subsequent work in graph machine learning.