OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs (2103.09430v3)

Published 17 Mar 2021 in cs.LG

Abstract: Enabling effective and efficient ML over large-scale graph data (e.g., graphs with billions of edges) can have a great impact on both industrial and scientific applications. However, existing efforts to advance large-scale graph ML have been largely limited by the lack of a suitable public benchmark. Here we present OGB Large-Scale Challenge (OGB-LSC), a collection of three real-world datasets for facilitating the advancements in large-scale graph ML. The OGB-LSC datasets are orders of magnitude larger than existing ones, covering three core graph learning tasks -- link prediction, graph regression, and node classification. Furthermore, we provide dedicated baseline experiments, scaling up expressive graph ML models to the massive datasets. We show that expressive models significantly outperform simple scalable baselines, indicating an opportunity for dedicated efforts to further improve graph ML at scale. Moreover, OGB-LSC datasets were deployed at ACM KDD Cup 2021 and attracted more than 500 team registrations globally, during which significant performance improvements were made by a variety of innovative techniques. We summarize the common techniques used by the winning solutions and highlight the current best practices in large-scale graph ML. Finally, we describe how we have updated the datasets after the KDD Cup to further facilitate research advances. The OGB-LSC datasets, baseline code, and all the information about the KDD Cup are available at https://ogb.stanford.edu/docs/lsc/ .

Authors (6)
  1. Weihua Hu
  2. Matthias Fey
  3. Hongyu Ren
  4. Maho Nakata
  5. Yuxiao Dong
  6. Jure Leskovec
Citations (361)

Summary

  • The paper presents OGB-LSC, introducing three extensive datasets (MAG240M, WikiKG90M, PCQM4M) for tackling core graph learning tasks.
  • It scales expressive GNN architectures to these datasets and integrates textual features into KG embeddings, reaching strong baselines such as 70.02% accuracy on MAG240M and 0.971 MRR on WikiKG90M.
  • The challenge provides practical benchmarks and theoretical insights that drive scalable and expressive model development in graph machine learning.

An Overview of OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs

The paper "OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs" presents a significant addition to the resources available for advancing ML capabilities on sizable graph datasets. The principal focus of this work is the introduction of the OGB Large-Scale Challenge (OGB-LSC), accompanied by three expansive datasets that model real-world problems at scale. These datasets, namely MAG240M, WikiKG90M, and PCQM4M, offer orders-of-magnitude larger graph data than previously available and encapsulate three fundamental graph learning tasks: node classification, link prediction, and graph regression.

Overview of the Datasets

MAG240M deals with node-level prediction in a heterogeneous academic graph drawn from the Microsoft Academic Graph (MAG). The task is to automatically annotate academic papers with their primary subject areas: given a graph with over a billion citation edges, the model predicts the subject area of each arXiv paper. The dataset is split temporally, emulating the realistic deployment scenario of annotating newly published papers.
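
For concreteness, the dataset is distributed through the companion `ogb` Python package, which serves node features as memory-mapped arrays so the full graph need not fit in RAM. Below is a minimal access sketch, assuming the public `ogb.lsc` interface; treat attribute names and shapes as illustrative rather than authoritative:

```python
from ogb.lsc import MAG240MDataset

# Downloads on first use; paper features are memory-mapped on disk.
dataset = MAG240MDataset(root='data/')

split = dataset.get_idx_split()             # temporal train/valid/test split
train_idx = split['train']                  # older arXiv papers for training

x = dataset.paper_feat[train_idx[:1024]]    # 768-dim RoBERTa text features
y = dataset.paper_label[train_idx[:1024]]   # one of 153 arXiv subject areas

# Citation edges of the heterogeneous graph, shape (2, num_edges)
cites = dataset.edge_index('paper', 'cites', 'paper')
```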

WikiKG90M centers on link-level prediction within knowledge graphs (KGs). The task is to complete missing links in a KG derived from Wikidata: given a head entity and a relation, the model must predict the tail entity against a background of over 500 million existing triplets. Evaluation uses Mean Reciprocal Rank (MRR), which rewards ranking the true entity highly among a pool of candidates.
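
Because MRR is the headline metric here, it is worth making concrete: each test query consists of a (head, relation) pair plus a pool of candidate tails, and the score assigned to the true tail determines its rank. A minimal NumPy sketch of the metric itself (array names are hypothetical; the official evaluator performs the equivalent computation over the provided candidate sets):

```python
import numpy as np

def mean_reciprocal_rank(scores, true_idx):
    # scores: (num_queries, num_candidates) model scores per candidate tail
    # true_idx: (num_queries,) column index of the correct tail entity
    true_scores = scores[np.arange(len(scores)), true_idx][:, None]
    ranks = (scores > true_scores).sum(axis=1) + 1   # rank 1 = best
    return float((1.0 / ranks).mean())
```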

PCQM4M focuses on graph-level regression. It uses quantum chemistry data to predict the HOMO-LUMO energy gap of molecules, derived from their 2D molecular graph representations. This dataset pushes toward fast, ML-based approximations of computationally intensive Density Functional Theory (DFT) calculations, presenting a task that is highly relevant to fields such as drug discovery and materials science.
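
The input/output contract is simple: each example pairs a SMILES string with its DFT-computed HOMO-LUMO gap, and molecular graphs can be built from SMILES on the fly. A minimal loading sketch, assuming the `ogb.lsc` and `ogb.utils` helpers behave as documented (treat exact names as assumptions):

```python
from ogb.lsc import PCQM4MDataset
from ogb.utils import smiles2graph

# SMILES-only mode defers graph construction for the ~3.8M molecules.
dataset = PCQM4MDataset(root='data/', only_smiles=True)

smiles, gap = dataset[0]      # 2D molecular string, HOMO-LUMO gap in eV
graph = smiles2graph(smiles)  # dict: edge_index, edge_feat, node_feat, num_nodes
```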

Experimental Insights and Results

Extensive baseline experiments scaled a range of state-of-the-art models up to these massive datasets. On MAG240M, expressive GNN architectures that model the different relation types of the heterogeneous graph outperformed simplified, scalable models. For WikiKG90M, integrating textual features into traditional KG embedding models provided a significant boost in performance. Finally, on PCQM4M, larger and more expressive GNN variants with structural augmentations such as virtual nodes (sketched below) performed considerably better than traditional fingerprint-based approaches.
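
Among the structural augmentations mentioned, the virtual node is the easiest to state precisely: a single extra node is connected to every atom in the molecule, giving message passing a one-hop global shortcut. A hedged NumPy sketch over graph dictionaries of the form shown earlier (the virtual node's input features are omitted; in practice it receives a learned embedding):

```python
import numpy as np

def add_virtual_node(graph):
    # Append one node connected to every original node in both directions.
    n = graph['num_nodes']
    idx = np.arange(n)
    vn = np.full(n, n)  # index of the new virtual node
    extra = np.concatenate([np.stack([vn, idx]),    # virtual -> atoms
                            np.stack([idx, vn])],   # atoms -> virtual
                           axis=1)
    graph['edge_index'] = np.concatenate([graph['edge_index'], extra], axis=1)
    graph['num_nodes'] = n + 1
    # NB: graph['node_feat'] also needs a row for the new node.
    return graph
```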

In terms of experimental results, the paper reports strong numbers across all three tasks. On MAG240M, advanced GNNs reached a top node classification accuracy of 70.02%; on WikiKG90M, an MRR of 0.971 was achieved by combining feature-augmented embedding techniques with model ensembles; and on PCQM4M, the MAE was driven down to 0.1208 eV by deploying deeper, more expressive architectures and larger models.
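
Each dataset also ships with an official evaluator so that reported numbers are computed consistently. The call pattern below follows the standard OGB evaluator convention; the exact dictionary keys are assumptions based on that convention:

```python
import numpy as np
from ogb.lsc import PCQM4MEvaluator

y_true = np.array([5.29, 3.71])  # toy DFT HOMO-LUMO gaps (eV)
y_pred = np.array([5.10, 3.90])  # toy model predictions

evaluator = PCQM4MEvaluator()
print(evaluator.eval({'y_pred': y_pred, 'y_true': y_true}))  # {'mae': 0.19}
```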

Discussion and Future Implications

The implications of these datasets extend into both practical applications and theoretical research. Practically, they provide benchmarks that reflect real-world constraints, fostering the development of solutions applicable to scenarios with comparable data scales and complexities. Theoretically, the datasets offer new challenges by encouraging innovation in scalable modeling and inference methodologies. The potential for significant interdisciplinary advances is evident in both computational efficiency and model expressiveness.

The future of large-scale graph ML will likely see these datasets enabling further breakthroughs, particularly in expressive model development. Graph ML techniques can advance by addressing current limits in scalability and by leveraging multi-modal inputs, such as integrating text with graph representations. Such advances should broaden the range of applications for ML systems that handle densely connected, structurally complex data.

In conclusion, "OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs" introduces valuable resources and sets a standard for future explorations into large-scale, real-world graph data handling. The paper’s contributions lie not only in the scope of the data provided but also in paving the way for future research directions in graph ML at scale.