The Open Catalyst 2020 (OC20) Dataset and Community Challenges

Published 20 Oct 2020 in cond-mat.mtrl-sci and cs.LG | (2010.09990v5)

Abstract: Catalyst discovery and optimization is key to solving many societal and energy challenges including solar fuels synthesis, long-term energy storage, and renewable fertilizer production. Despite considerable effort by the catalysis community to apply machine learning models to the computational catalyst discovery process, it remains an open challenge to build models that can generalize across both elemental compositions of surfaces and adsorbate identity/configurations, perhaps because datasets have been smaller in catalysis than related fields. To address this we developed the OC20 dataset, consisting of 1,281,040 Density Functional Theory (DFT) relaxations (~264,890,000 single point evaluations) across a wide swath of materials, surfaces, and adsorbates (nitrogen, carbon, and oxygen chemistries). We supplemented this dataset with randomly perturbed structures, short timescale molecular dynamics, and electronic structure analyses. The dataset comprises three central tasks indicative of day-to-day catalyst modeling and comes with pre-defined train/validation/test splits to facilitate direct comparisons with future model development efforts. We applied three state-of-the-art graph neural network models (CGCNN, SchNet, Dimenet++) to each of these tasks as baseline demonstrations for the community to build on. In almost every task, no upper limit on model size was identified, suggesting that even larger models are likely to improve on initial results. The dataset and baseline models are both provided as open resources, as well as a public leader board to encourage community contributions to solve these important tasks.

Authors (17)

Citations (438)

Summary

The paper presents the OC20 dataset, containing over 1.28 million DFT relaxations, to support machine learning for catalyst design.
The dataset is supplemented with randomly perturbed structures, short-timescale molecular dynamics, and electronic structure analyses, and three graph neural network baselines are evaluated on its energy and structure prediction tasks.
The results point to a remaining gap between current ML accuracy and practical requirements, and to the potential for better models to accelerate catalyst discovery for renewable energy applications.

An Evaluation of the Open Catalyst 2020 Dataset and its Implications

The manuscript details the creation and utility of the Open Catalyst 2020 (OC20) benchmark, a large dataset designed to accelerate the development of machine learning (ML) models for heterogeneous catalysis. The authors emphasize the role of catalyst optimization in advancing renewable energy technology and highlight how hard it is to build ML models that generalize across diverse elemental compositions and adsorbate configurations, a difficulty they attribute in part to the traditionally small size of catalysis datasets. OC20 addresses this with a much larger and more diverse collection of catalyst-related data.

Dataset Composition and Methodology

OC20 comprises over 1.28 million Density Functional Theory (DFT) relaxations, which together yield roughly 264.89 million single-point evaluations across a wide range of materials, surfaces, and adsorbates involving nitrogen, carbon, and oxygen chemistry. Beyond the relaxation trajectories, the dataset includes randomly perturbed structures and short-timescale molecular dynamics data to broaden its coverage. The variety of adsorbates and material compositions gives a much broader sampling of catalyst behavior than earlier, smaller datasets in the literature.
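As a quick sanity check on the two headline counts, the short sketch below divides the number of single-point evaluations by the number of relaxations. Both figures come from the abstract; the variable names are illustrative only.

```python
# Counts quoted in the abstract.
num_relaxations = 1_281_040
num_single_points = 264_890_000  # approximate, as stated in the abstract

# Average number of DFT single-point evaluations per relaxation trajectory.
avg_steps_per_relaxation = num_single_points / num_relaxations
print(f"{avg_steps_per_relaxation:.1f} single points per relaxation on average")
```

The ratio comes out to about 207, so each relaxation contributes on the order of a couple of hundred labeled energy and force frames.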

Baseline Models and Evaluation

Three prominent graph neural network models (CGCNN, SchNet, and DimeNet++) serve as starting baselines for predictive modeling on OC20. The authors chose them as state-of-the-art approaches for learning from atomic-scale structure. In almost every task, no upper limit on useful model size was identified, so larger models are expected to improve on these baselines. Notably, a significant gap remains between current accuracy and practical utility: few predictions meet the desired accuracy thresholds under the Energy and Forces within Threshold (EFwT) metric.
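To make the idea of a within-threshold metric concrete, here is a minimal sketch of an EFwT-style score: a structure counts as correct only if both its energy error and its largest force error fall under given tolerances. The function name, the default tolerances (0.02 eV and 0.03 eV/Å), and the choice of per-component force error are assumptions made for illustration and are not taken from this summary.

```python
def efwt(pred_energies, true_energies, pred_forces, true_forces,
         e_tol=0.02, f_tol=0.03):
    """Fraction of structures with energy AND force errors within tolerance.

    Forces are nested lists: one [fx, fy, fz] entry per atom, per structure.
    The tolerances are illustrative defaults (assumed units: eV and eV/Angstrom).
    """
    hits = 0
    for e_p, e_t, f_p, f_t in zip(pred_energies, true_energies,
                                  pred_forces, true_forces):
        energy_ok = abs(e_p - e_t) <= e_tol
        # Largest absolute error over every force component of every atom.
        max_force_err = max(
            abs(p - t)
            for atom_p, atom_t in zip(f_p, f_t)
            for p, t in zip(atom_p, atom_t)
        )
        if energy_ok and max_force_err <= f_tol:
            hits += 1
    return hits / len(true_energies)


# Two single-atom toy structures: the first passes both checks,
# the second has an exact energy but a large force error.
score = efwt(
    pred_energies=[-1.00, -2.00],
    true_energies=[-1.01, -2.00],
    pred_forces=[[[0.0, 0.0, 0.10]], [[0.0, 0.0, 0.50]]],
    true_forces=[[[0.0, 0.0, 0.11]], [[0.0, 0.0, 0.00]]],
)
print(score)
```

Because a single bad force component fails the whole structure, a metric of this kind is much stricter than a mean absolute error, which helps explain why baseline scores on it can be low.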

Key Contributions and Implications

A primary contribution of the OC20 dataset, beyond its size, is a set of benchmark tasks that mirror day-to-day catalyst modeling, such as predicting relaxed energies and structures, which are central to making computational catalyst discovery efficient. The three evaluation tasks are predicting energies and forces for a given structure, finding the relaxed structure from an initial structure, and estimating the relaxed energy directly from the initial state. Together they set a demanding target for ML developers and are intended to allow rapid iteration over candidate catalysts.
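The relationship between the three tasks can be sketched with a toy example: a force model drives an iterative relaxation, the final geometry is the relaxed structure, and its energy is the relaxed energy that the third task tries to predict without running the loop. Everything here is hypothetical: `toy_forces` and `toy_energy` are a harmonic stand-in for a learned model, and plain steepest descent is used for simplicity rather than whatever optimizer the authors used.

```python
def toy_energy(positions, k=1.0):
    """Hypothetical energy model: harmonic well centered at the origin."""
    return 0.5 * k * sum(c * c for atom in positions for c in atom)


def toy_forces(positions, k=1.0):
    """Hypothetical force model: negative gradient of toy_energy."""
    return [[-k * c for c in atom] for atom in positions]


def relax(positions, force_fn, step=0.1, fmax=0.01, max_steps=1000):
    """Steepest-descent relaxation: move atoms along forces until they are small."""
    pos = [list(atom) for atom in positions]
    for _ in range(max_steps):
        forces = force_fn(pos)
        # Stop once the largest force component drops below fmax.
        if max(abs(f) for atom in forces for f in atom) < fmax:
            break
        pos = [[c + step * f for c, f in zip(atom, atom_f)]
               for atom, atom_f in zip(pos, forces)]
    return pos


initial = [[1.0, 0.0, 0.0], [0.0, -2.0, 0.5]]
relaxed = relax(initial, toy_forces)    # relaxed structure from the initial one
relaxed_energy = toy_energy(relaxed)    # relaxed energy of that structure
print(relaxed_energy < toy_energy(initial))
```

In this picture, a sufficiently accurate energy and force model could replace DFT inside the loop, while a direct relaxed-energy predictor would skip the loop entirely at the cost of a harder learning problem.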

Strategically, the dataset lets the community apply data-intensive techniques at a scale previously unattainable in catalysis. The scaling behavior reported in this work suggests that dataset expansion alone is insufficient: substantial advances in model architecture, representation, and optimization are needed to push past current predictive limits.

Future Directions

OC20 points towards future work on complementary datasets, potentially covering more intricate systems, including dynamic behavior under varied reaction conditions. Advances in model interpretation and in transfer learning from this and similar datasets could yield more nuanced insights into catalyst materials science and into broader inorganic-organic material interactions. There is also a clear opportunity to connect these findings with experimental work, both to validate predictions and to drive further hypotheses in catalysis research.

In conclusion, OC20 provides fertile ground for catalysis research, with implications for both ML development and practical applications. The benchmark lays the groundwork not only for advancing theoretical models but also for translating those models into practical catalysts in industrial contexts. The paper marks a noteworthy step in bridging the gap between ML potential and the realities of catalytic application, and it invites further improvements and discoveries in catalyst design.
