
Open Catalyst 2020: ML-Driven Catalyst Discovery

Updated 20 January 2026
  • Open Catalyst 2020 is a large-scale dataset offering diverse, high-fidelity DFT data for heterogeneous catalyst–adsorbate systems to advance machine learning in catalysis.
  • It establishes standardized benchmarking tasks (S2EF, IS2RS, IS2RE) with specific metrics such as MAE and force thresholds to evaluate graph neural network models.
  • OC20 accelerates catalyst discovery workflows by reducing reliance on costly DFT calculations while exposing challenges in achieving chemical accuracy and robust generalization.

Open Catalyst 2020 (OC20) is a large-scale dataset and benchmarking effort that catalyzed progress in applying machine learning to heterogeneous catalyst discovery. OC20 offers high-fidelity Density Functional Theory (DFT) data for a diverse set of catalyst–adsorbate systems, enabling the development, evaluation, and scaling of graph neural network (GNN) models for tasks central to computational catalysis. It codifies key challenges and provides public baselines, leaderboards, and extensible data resources to foster reproducible research and accelerate automated materials design (Chanussot et al., 2020, Kolluru et al., 2022).

1. Motivation and Historical Context

Catalyst discovery is fundamental to energy conversion processes such as solar fuels synthesis, CO₂ reduction, and ammonia synthesis. Traditional approaches rely on the computationally intensive application of DFT for screening candidate catalysts and reaction intermediates—efforts limited by exponential combinatorial complexity and the high cost of DFT calculations. Previous datasets for catalysis were typically orders of magnitude smaller and lacked the chemical diversity required to train generalizable ML models (Chanussot et al., 2020).

OC20 was launched to break this bottleneck, assembling over 1.28 million DFT relaxations and nearly 265 million single-point energy and force evaluations (§1, (Chanussot et al., 2020)). This scale and diversity underpin a new paradigm for surrogate modeling in catalysis, enabling the training of large, transferable graph-based models for both static and dynamical property prediction.

2. Dataset Composition and Scope

OC20 comprises 1,281,040 DFT structure relaxations, 264,890,000 single-point evaluations, 11,451 crystalline surfaces (low-Miller-index facets, h,k,l≤2), and 82 chemically relevant adsorbates with C, N, and O composition. Catalysts span unary, binary, and ternary materials, representing 55 chemical elements. The dataset includes randomly perturbed structures (“rattled” geometries), short-timescale ab initio molecular dynamics (MD) trajectories, and electronic analyses (Bader charges, projected DOS, and COHP) (Chanussot et al., 2020). All calculations were performed with VASP, RPBE functional settings, and strict convergence criteria (max force <0.03 eV/Å).

Data augmentation covers 20% randomly displaced images, ~950,000 MD snapshots at 900 K, and electronic structure analyses for relaxed/MD/rattled configurations.
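The "rattled" augmentation above amounts to adding small random displacements to relaxed geometries. A minimal sketch of the idea (the `rattle` helper and the 0.05 Å standard deviation are illustrative choices, not the parameters used to build OC20):

```python
import numpy as np

def rattle(positions, stdev=0.05, seed=0):
    """Randomly displace atomic positions ("rattling"), as in OC20's
    perturbed-structure augmentation. stdev is in Angstroms and is an
    illustrative value, not the one used for the dataset."""
    rng = np.random.default_rng(seed)
    return positions + rng.normal(scale=stdev, size=positions.shape)

# Toy 3-atom geometry (Angstroms)
pos = np.array([[0.0, 0.0, 0.0],
                [1.5, 0.0, 0.0],
                [0.0, 1.5, 0.0]])
rattled = rattle(pos)
```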

3. Benchmark Tasks and Evaluation Metrics

OC20 defines three canonical machine learning tasks corresponding to practical steps in catalyst modeling (Chanussot et al., 2020, Kolluru et al., 2022):

  • Structure-to-Energy-and-Forces (S2EF): Predict the total energy E and per-atom forces F from a structure S₀. Evaluated by mean absolute error (MAE) for energy and forces, force cosine similarity, and “Energy & Forces within Threshold” (EFwT: fraction of structures with |ΔE| < 0.02 eV and |ΔF| < 0.03 eV/Å).
  • Initial-Structure-to-Relaxed-Structure (IS2RS): Predict DFT-relaxed atomic coordinates from unrelaxed input. Assessed via “Average Distance within Threshold” (ADwT), “Forces below Threshold” (FbT: fraction with max force < 0.05 eV/Å and position MAE < 0.5 Å), and AFbT (FbT averaged over a range of force thresholds).
  • Initial-Structure-to-Relaxed-Energy (IS2RE): Directly predict the adsorption energy of the relaxed configuration, either by regression from the initial structure or via a relaxation pipeline using ML-predicted forces. Metrics: MAE on adsorption energy and “Energies within Threshold” (EwT: |ΔE| < 0.02 eV).
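The EFwT metric above can be sketched in a few lines. This is a simplified reading of the definition (a structure counts only if both its energy error and its worst per-atom force-component error fall under the thresholds); the official OC20 evaluation code may differ in detail:

```python
import numpy as np

def efwt(e_pred, e_true, f_pred, f_true, e_tol=0.02, f_tol=0.03):
    """Fraction of structures with |dE| < 0.02 eV AND max per-atom force
    error < 0.03 eV/A -- a sketch of the OC20 EFwT metric.
    e_*: per-structure energies (eV); f_*: per-structure (n_atoms, 3)
    force arrays (eV/A)."""
    hits = 0
    for ep, et, fp, ft in zip(e_pred, e_true, f_pred, f_true):
        e_ok = abs(ep - et) < e_tol
        f_ok = np.max(np.abs(fp - ft)) < f_tol
        hits += e_ok and f_ok
    return hits / len(e_true)
```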

Strategic train/validation/test splits partition the data by held-out adsorbates, catalyst compositions, or both (OOD Ads, OOD Cat, OOD Both), enabling rigorous assessment of out-of-distribution generalization.

4. Modeling Approaches and Baselines

OC20 standardized reference architectures for graph neural network modeling of catalyst systems:

| Model     | Parameters | Energy MAE (eV) | Force MAE (eV/Å) | Key Features                                                  |
|-----------|------------|-----------------|------------------|---------------------------------------------------------------|
| CGCNN     | ~3.6M      | 0.51            | 0.068            | Crystal graph CNN, Gaussian-smearing edge features            |
| SchNet    | ~7.4M      | 0.44            | 0.049            | Continuous-filter convolution, radial basis, ∂E/∂R for forces |
| DimeNet++ | ~1.8M      | 0.36            | 0.031            | Directional message passing, radial and spherical harmonics   |

All architectures operate on periodic radius graphs (cutoff 6 Å), with edge and node features encompassing chemical and geometric descriptors (Chanussot et al., 2020, Geitner, 2024, Korovin et al., 2022).
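The periodic radius graph these models operate on can be built by brute force: enumerate periodic images of each atom and connect pairs closer than the cutoff. A minimal sketch, assuming a single shell of 27 neighbor cells suffices (production code expands images until the cutoff is fully covered, and uses cell lists rather than the O(N²) loop here):

```python
import itertools
import numpy as np

def radius_graph_pbc(positions, cell, cutoff=6.0):
    """Edge list (i, j, cell_offset) for all atom pairs within `cutoff`
    (Angstroms) under periodic boundary conditions, given Cartesian
    `positions` (n, 3) and a 3x3 lattice `cell`. Brute-force sketch of
    the periodic radius graphs OC20 baselines use."""
    edges = []
    for offset in itertools.product((-1, 0, 1), repeat=3):
        shift = np.array(offset) @ cell          # Cartesian image shift
        for i, ri in enumerate(positions):
            for j, rj in enumerate(positions):
                if i == j and offset == (0, 0, 0):
                    continue                     # no self-loops
                if np.linalg.norm(rj + shift - ri) < cutoff:
                    edges.append((i, j, offset))
    return edges
```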

Relaxation-based pipelines (“force-only + energy-only” model composition) outperform direct regression in IS2RE, yielding best-in-class performance near 0.50 eV MAE and ~6% EwT (Chanussot et al., 2020).
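The relaxation pipeline boils down to iteratively stepping a geometry along model-predicted forces until they fall below a threshold. A toy sketch of that loop, using naive steepest descent and a hypothetical harmonic "model" in place of a trained GNN (real pipelines use ASE optimizers such as LBFGS with the GNN as calculator):

```python
import numpy as np

def relax(positions, force_fn, step=0.05, fmax=0.03, max_steps=500):
    """Steepest-descent relaxation driven by model-predicted forces,
    mimicking the OC20 relaxation pipeline. Stops when the largest
    force component drops below `fmax` (eV/A)."""
    pos = positions.copy()
    for _ in range(max_steps):
        f = force_fn(pos)
        if np.max(np.abs(f)) < fmax:
            break
        pos += step * f                # move along the predicted forces
    return pos

# Hypothetical surrogate: harmonic well centred at the origin, F = -k * r
toy_forces = lambda pos: -1.0 * pos
relaxed = relax(np.array([[0.4, -0.2, 0.1]]), toy_forces)
```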

Scaling model size reliably lowers MAEs, but the gains saturate on OOD splits, and even the largest models fall short of the desired ~0.02 eV chemical accuracy (Kolluru et al., 2022).

5. Advances in Physically-Informed and Efficient GNNs

Subsequent work introduced physically informed graph constructions and priors to improve accuracy and computational efficiency. Voronoi-tessellation graphs replace arbitrary cutoff graphs, enriching edge features with contact solid angles and flags for direct/indirect atom contact. Node features are augmented with Voronoi volumes and chemical properties (electronegativity, group, period) (Korovin et al., 2022).

Korovin et al. demonstrated that these enhancements yield an IS2RE MAE of 651 meV/atom on OC20 and 20 meV/atom on compositionally homogeneous intermetallics, approaching chemical accuracy for the latter (Korovin et al., 2022).

Efforts to democratize model development introduced sub-5M-parameter architectures (GemNet-Mini, MPGNN-Tiny), using geometric/symmetric message passing. These models achieve near-state-of-the-art force prediction accuracy (e.g., GemNet-Mini: 0.0748 eV/Å versus DimeNet++: 0.0741 eV/Å), with dramatic reductions in compute cost and training time (Geitner, 2024).

Supervision schemes such as DR-Label (deconstructing node-level targets into edge-wise projections, then reconstructing via sphere fitting) were shown to increase robustness, remove solution multiplicity, and deliver state-of-the-art IS2RE performance (DRFormer: 0.4509 eV MAE, AEwT 6.48%) (Wang et al., 2023).
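The deconstruct/reconstruct step at the heart of DR-Label can be illustrated with plain linear algebra: project a node-level vector target (e.g., a force) onto its incident edge directions, then recover it by least squares. This is only a sketch of the idea under that simplification; the paper's sphere-fitting formulation differs in detail:

```python
import numpy as np

def project_to_edges(vec, edge_dirs):
    """Deconstruct a per-node 3-vector (e.g. a force) into scalar
    projections along each incident edge direction (rows of edge_dirs)."""
    return edge_dirs @ vec

def reconstruct_from_edges(edge_dirs, projections):
    """Recover the node vector from its edge-wise projections via least
    squares -- an illustrative stand-in for DR-Label's sphere fitting.
    With >= 3 non-coplanar edges the reconstruction is unique, which is
    the multiplicity-removal property the scheme exploits."""
    vec, *_ = np.linalg.lstsq(edge_dirs, projections, rcond=None)
    return vec
```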

6. Impact on Catalyst Discovery and Computational Workflows

Learned GNN force fields (ML-FFs), trained on OC20, can directly replace DFT relaxations for a majority of catalyst–adsorbate systems. Empirical outcomes show ML-FF relaxations yield similar or lower DFT energies than RPBE-relaxed geometries in over 50% of cases, with force errors ≈0.03 eV/Å and scaling linearly with atom count (Schaarschmidt et al., 2022). Hybrid pipelines, e.g., combining locally harmonic “Easy Potentials” with ML-FF, achieve convergence in half the steps and further reduce final energies. Runtime per structure drops to ~1 minute on GPU, contrasted with hours for DFT (Schaarschmidt et al., 2022).

Emergent applications include high-throughput kinetic screening and accelerated reaction network mapping via ML-accelerated Nudged Elastic Band (NEB) calculations. CatTSunami, using OC20-pretrained Equiformer v2, achieves transition state energy prediction within 0.1 eV parity to DFT across 91% of cases (28× speedup). Large reaction networks (e.g., 174 CO-hydrogenation dissociations) can be exhaustively enumerated at near-DFT resolution with 1500× cost savings (Wander et al., 2024).
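An NEB run starts from a band of interpolated images between the initial and final states, which the ML model then relaxes in place of DFT. The standard linear starting guess is trivial to write down (the helper name is ours; tools like CatTSunami handle image generation and relaxation internally):

```python
import numpy as np

def interpolate_band(initial, final, n_images=8):
    """Linearly interpolated geometries between two endpoint structures,
    the usual starting band for a Nudged Elastic Band calculation.
    initial/final: (n_atoms, 3) position arrays."""
    ts = np.linspace(0.0, 1.0, n_images)
    return [initial + t * (final - initial) for t in ts]
```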

ML integration with workflows such as AdsorbML and Catlas now routinely underpins reaction mechanism determination, microkinetic modeling, and active learning on underrepresented configurations (Wander et al., 2024).

7. Limitations and Open Research Directions

OC20 set benchmarks for representation, model generalization, and scalability; however, several challenges persist (Kolluru et al., 2022):

  • Generalization: Errors on nonmetal slabs, N/O/bidentate adsorbates, and OOD splits remain 1.5–2× higher than in-domain or simple monodentate systems.
  • Accuracy: Typical energy MAEs are an order of magnitude above “chemical accuracy”; practical force/geometry thresholds succeed only for a minority of cases.
  • Modeling trade-offs: Energy-conserving forces (F = –∇E) offer physical guarantees but incur a 2–4× overhead versus direct force prediction.
  • Data augmentation: The impact of MD and random perturbations on OOD robustness and accuracy is system-dependent.
  • Uncertainty and active learning: Systematic methods to quantify model confidence and optimize data enrichment are underdeveloped.
  • Physics-informed advances: Incorporation of electronic structure features and hybrid architectures (OrbNet, UNiTE) promise accelerated OOD generalization but are not yet routine.
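The energy-conserving-forces trade-off noted above is concrete: F = −∇E ties forces to the energy head via differentiation, whereas direct-force models predict F independently (cheaper, but the field need not be conservative). A toy illustration with a hypothetical harmonic pair energy, using finite differences in place of a network's autograd:

```python
import numpy as np

def energy(pos, k=1.0, r0=1.0):
    """Toy harmonic pair energy: E = sum over pairs of k/2 (|ri-rj| - r0)^2."""
    e, n = 0.0, len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            e += 0.5 * k * (np.linalg.norm(pos[i] - pos[j]) - r0) ** 2
    return e

def forces_numeric(pos, eps=1e-5):
    """Energy-conserving forces F = -dE/dR via central differences
    (a stand-in for autograd). Each component needs two extra energy
    evaluations -- the source of the overhead relative to a model that
    predicts forces directly."""
    f = np.zeros_like(pos)
    for idx in np.ndindex(pos.shape):
        p = pos.copy(); p[idx] += eps; ep = energy(p)
        p = pos.copy(); p[idx] -= eps; em = energy(p)
        f[idx] = -(ep - em) / (2 * eps)
    return f
```

A useful by-product of deriving forces from an energy is that they automatically sum to zero (Newton's third law), which directly predicted forces do not guarantee.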

The ongoing effort is now coupled to experimental validation and industrially relevant performance benchmarking (OCX24), bridging the gap between DFT descriptors and real cell voltages and product rates (Abed et al., 2024). Discovery workflows increasingly aim at fully AI-driven catalyst design, underpinned by large-scale experimental and computational datasets.


OC20 is a pivotal resource driving advances in scalable, generalizable ML models for heterogeneous catalysis, transforming both the feasibility and methodology of computational catalyst discovery through open, extensible data and standardized benchmarks (Chanussot et al., 2020, Kolluru et al., 2022, Wander et al., 2024).
