Open Catalyst 2020 (OC20)
- OC20 is a large-scale dataset with over 1.2 million DFT relaxations and diverse catalyst-adsorbate configurations designed to accelerate ML-driven discovery in sustainable catalysis.
- It benchmarks three key ML tasks—structure to energy/forces, initial-to-relaxed structure, and initial-to-relaxed energy—using graph neural network architectures.
- OC20 facilitates the development of universal ML interatomic potentials by reducing reliance on expensive DFT computations and enabling rapid catalyst screening.
The Open Catalyst 2020 (OC20) dataset is a large-scale computational resource developed by the Open Catalyst Project (OCP), a collaboration led by Facebook AI Research (FAIR) and Carnegie Mellon University, to accelerate ML-driven discovery of heterogeneous catalysts for energy conversion, storage, and sustainable chemical production. OC20 defines benchmark tasks, provides pre-structured splits, and has catalyzed advances in graph neural network (GNN) modeling for catalysis by supplying more than an order-of-magnitude increase in dataset scale and chemical diversity over prior work. The dataset and challenge framework have provided the basis for developing universal ML interatomic potentials capable of replacing or complementing expensive density functional theory (DFT) calculations in high-throughput catalyst screening, surface–adsorbate energy/geometry prediction, and data-driven mechanistic modeling (Chanussot et al., 2020).
1. Objectives, Scope, and Motivation
The impetus behind OC20 stems from the need to discover and optimize catalytic materials underpinning crucial processes such as ammonia synthesis, renewable hydrogen production, and CO₂ utilization. Traditional DFT relaxations constitute a computational bottleneck for high-throughput screening: accurate determination of ground-state geometries and adsorption energies across a combinatorially vast space of catalysts and intermediates requires on the order of 10³–10⁶ ab initio relaxations per project, each of which scales cubically or worse with system size. OC20 aims to overcome this by supplying (1) a dataset of more than 1.2 million DFT relaxations, corresponding to roughly 265 million atomic environments, (2) challenge tasks requiring generalization across both adsorbates and catalyst compositions, and (3) a consistent testbed for developing graph-based ML surrogates for DFT (Chanussot et al., 2020). The extended motivation for learned force fields and rapid ground-state search workflows is further developed in (Schaarschmidt et al., 2022).
2. Dataset Composition and Properties
OC20 is structured to maximize both compositional and configurational diversity while remaining within the chemical regimes most relevant to sustainable catalysis and energy storage:
- Elements and Materials: 55 elements (primarily transition metals, plus main-group elements) and 11,451 bulk compositions spanning unary, binary, and ternary materials, with systematically enumerated low-Miller-index facets.
- Adsorbates: 82 unique adsorbate species, spanning H-, C₁-, C₂-, N-, and O-containing molecules and radicals.
- Scale: 1,281,040 DFT relaxations, approximately 264.9 million single-point DFT energy/force evaluations, and additional off-equilibrium data generated via rattling and ab initio molecular dynamics.
- Augmentation and Diversity: The dataset includes both "rattled" structures (20% of steps, σ=0.05 Å Gaussian displacements) and hundreds of thousands of high-temperature MD snapshots, broadening sampled PES regions beyond equilibrium minima.
- Data Organization: All data is organized into training, validation, and held-out test splits, further divided into In-Domain (ID), Out-of-Domain Adsorbate (OOD Ads), Out-of-Domain Catalyst (OOD Cat), and the most challenging combined OOD Both subsplits. Test-set ground truth is hidden for leaderboard evaluation (a minimal split-loading sketch follows at the end of this section).
This scale, combined with rigorous structuring, enables system-agnostic ML models to be explicitly benchmarked for interpolation and extrapolation performance, a key requirement for universal catalyst potentials (Chanussot et al., 2020).
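For concreteness, OCP distributes these splits as LMDB files of pickled graph samples. The sketch below reads one entry, assuming the OCP key layout (ASCII integer keys plus a pickled "length" record) and a hypothetical local path; both should be verified against the release you download.

```python
# Minimal sketch of reading an OC20 LMDB split (here, a hypothetical
# IS2RE "OOD Both" validation file). Path and key layout are assumptions
# based on the Open Catalyst Project's published LMDB format.
import lmdb
import pickle

db_path = "is2re/all/val_ood_both/data.lmdb"  # hypothetical local path

env = lmdb.open(db_path, subdir=False, readonly=True,
                lock=False, readahead=False)
with env.begin() as txn:
    length = pickle.loads(txn.get("length".encode("ascii")))
    sample = pickle.loads(txn.get("0".encode("ascii")))  # first structure

# Each sample carries atomic numbers, positions, cell, tags, and the DFT
# targets (relaxed energy for IS2RE; per-frame energy/forces for S2EF).
print(length, sample)
env.close()
```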
3. Benchmark Tasks and Evaluation Protocols
OC20 formalizes three central ML tasks for catalyst modeling that mirror DFT-driven workflows:
- S2EF ("Structure to Energy and Forces"): Given the atomic structure (positions, identities) of a slab+adsorbate system, predict the total energy and per-atom forces. Target loss:
- IS2RS ("Initial Structure to Relaxed Structure"): Predict final, DFT-relaxed positions for a given unrelaxed slab+adsorbate configuration. Metrics include Average Distance within Threshold (ADwT) and Forces-below-Threshold (FbT).
- IS2RE ("Initial Structure to Relaxed Energy"): Predict the relaxed structure's total adsorption energy directly from the initial (unrelaxed) coordinates. Evaluated by MAE and fraction of cases within an absolute energy threshold.
The tasks are designed to test not only in-distribution interpolation, but also extrapolation to entirely new adsorbates, catalysts, or both, challenging ML models to develop robust representations of chemistry and structure (Chanussot et al., 2020).
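To make the evaluation protocol concrete, the following numpy sketch computes the S2EF metrics. The 0.02 eV and 0.03 eV/Å thresholds follow the OC20 paper; the exact averaging conventions are assumptions to be checked against the official evaluator.

```python
# Sketch of the S2EF metrics: energy MAE, force MAE, and the stricter
# Energy-and-Forces-within-Threshold (EFwT) success fraction.
import numpy as np

def s2ef_metrics(e_pred, e_true, f_pred, f_true, e_thr=0.02, f_thr=0.03):
    """e_*: per-structure energies, shape (S,); f_*: lists of (N_i, 3) forces."""
    e_err = np.abs(np.asarray(e_pred) - np.asarray(e_true))
    f_err = [np.abs(fp - ft) for fp, ft in zip(f_pred, f_true)]
    return {
        "energy_mae": e_err.mean(),
        # Force MAE averaged over every atom and Cartesian component.
        "force_mae": np.concatenate([fe.ravel() for fe in f_err]).mean(),
        # EFwT: a structure counts only if its energy error AND all of its
        # force errors fall within threshold simultaneously.
        "efwt": np.mean([(ee < e_thr) and (fe.max() < f_thr)
                         for ee, fe in zip(e_err, f_err)]),
    }
```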
4. Graph Neural Network Baselines and Methodological Advances
OC20 released consistent baselines in the form of GNN architectures that map periodic, 3D systems into graph inputs with:
- Node Features: Atomic number, electronegativity, periodic group, period.
- Edge Features: Interatomic distances expanded in a Gaussian radial basis; adjacency determined by a 6 Å cutoff (see the featurization sketch after this list).
- Model Classes:
- CGCNN: Standard message passing on undirected edges with scalar node/edge embeddings.
- SchNet: Continuous-filter convolutions; forces are obtained by differentiating the predicted energy with respect to positions, enforcing energy conservation (see the autograd sketch at the end of this section).
- DimeNet++: Angular features included via triplet messaging, enhancing sensitivity to directionality and local geometry.
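The following sketch illustrates this baseline featurization on a non-periodic toy cell (real OC20 slabs require periodic images); the basis hyperparameters (50 Gaussians, γ = 10) are illustrative assumptions.

```python
# Neighbor edges from a 6 A radial cutoff, with interatomic distances
# expanded in a Gaussian radial basis as GNN edge features.
import numpy as np

def gaussian_basis(d, d_max=6.0, num=50, gamma=10.0):
    """Expand distances d, shape (E,), into (E, num) Gaussian features."""
    centers = np.linspace(0.0, d_max, num)
    return np.exp(-gamma * (d[:, None] - centers[None, :]) ** 2)

def build_graph(pos, cutoff=6.0):
    """Return edge index (2, E) and distances (E,), ignoring periodicity."""
    diff = pos[:, None, :] - pos[None, :, :]             # (N, N, 3)
    dist = np.linalg.norm(diff, axis=-1)                 # (N, N)
    src, dst = np.nonzero((dist < cutoff) & (dist > 0))  # no self-loops
    return np.stack([src, dst]), dist[src, dst]

pos = np.random.rand(8, 3) * 4.0   # toy coordinates (angstrom)
edges, d = build_graph(pos)
edge_feats = gaussian_basis(d)     # input edge features for the GNN
```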
Initial results showed that DimeNet++ achieved the lowest force MAEs (0.044–0.046 eV/Å), with substantial error increases in OOD settings. Importantly, model performance did not saturate as parameter counts and training-data volume increased, indicating both the practical need for massive datasets and the high sample complexity of catalysis tasks (Chanussot et al., 2020).
Subsequent work has demonstrated that larger and more sophisticated models (e.g., GemNet-OC, equivariant GNNs, SE(3)-GNNs), energy-conserving versus direct-force targets, and data-driven augmentation (e.g., off-equilibrium MD sampling) further lower errors and expand transferable regimes (Kolluru et al., 2022; Schaarschmidt et al., 2022; Geitner, 2024).
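To make the energy-conserving force construction concrete: forces are computed as the negative gradient of the predicted energy with respect to positions via autograd, rather than by a separate output head. The small MLP below is a stand-in for an actual GNN such as SchNet.

```python
import torch

# Tiny MLP mapping 3-D positions to per-atom energy contributions;
# a placeholder for a real GNN energy model.
energy_model = torch.nn.Sequential(
    torch.nn.Linear(3, 32), torch.nn.SiLU(), torch.nn.Linear(32, 1)
)

pos = torch.randn(10, 3, requires_grad=True)  # toy atomic positions
energy = energy_model(pos).sum()              # scalar total energy
# Conservative forces: F = -dE/dpos. create_graph=True keeps the graph
# so a force loss can itself be backpropagated during training.
forces = -torch.autograd.grad(energy, pos, create_graph=True)[0]
```

Direct-force models skip this gradient and regress forces with a dedicated output head, trading energy conservation for cheaper and often more accurate force predictions on OC20-style benchmarks.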
5. Workflow Integration and ML-Driven Relaxation Protocols
Real-world catalyst discovery requires both accurate energies/forces and efficient geometry optimization. Recent research established that learned force fields can serve as drop-in replacements for DFT in relaxation workflows, provided that the model's PES minima align with those of reference DFT:
- Architecture: Use of GNS-like GNNs, joint energy/force losses, cutoff-based neighbor graphs, and decoders for node-wise force/energy readouts.
- Optimizers: Relaxations are implemented via AdamW on atomic positions, with force convergence criteria for both the ML-FF (0.005 eV/Å) and an auxiliary quadratic "Easy Potential" surrogate (see the relaxation sketch at the end of this section).
- Efficiency: ML-FF relaxations exhibit linear scaling, yielding median relaxation times ∼1 min on commodity GPU hardware for ∼200-atom systems, versus hours for standard DFT on CPU clusters.
- Hybrid Protocols: The two-stage Easy Potential→ML-FF further accelerates convergence and increases the success fraction (within 0.1 eV or lower energy vs. DFT) to nearly 89%. Hybrid protocols combining ML-FF relaxation with subsequent DFT refinement balance speed with DFT-level accuracy.
Empirically, over 70% of ML-FF relaxed structures are within 0.1 eV of DFT relaxations (or lower in energy), demonstrating practical readiness for deployment in high-throughput screening (Schaarschmidt et al., 2022).
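A minimal sketch of such a relaxation loop follows, with a toy harmonic pair potential standing in for a trained ML-FF; the 0.005 eV/Å convergence criterion follows the protocol above, while the learning rate and step cap are illustrative assumptions.

```python
import torch

def toy_energy(pos):
    # Harmonic pair potential with a 2.0 A rest length; placeholder ML-FF.
    i, j = torch.triu_indices(len(pos), len(pos), offset=1)
    d = (pos[i] - pos[j]).norm(dim=-1)
    return ((d - 2.0) ** 2).sum()

def relax(energy_fn, pos, fmax=0.005, lr=1e-2, max_steps=1000):
    pos = pos.clone().detach().requires_grad_(True)
    opt = torch.optim.AdamW([pos], lr=lr)
    for _ in range(max_steps):
        opt.zero_grad()
        energy_fn(pos).backward()  # fills pos.grad = dE/dpos = -forces
        if pos.grad.norm(dim=-1).max() < fmax:  # all |F_i| < fmax: converged
            break
        opt.step()
    return pos.detach()

relaxed = relax(toy_energy, torch.randn(8, 3) * 2.0)
```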
6. Generalization, Limitations, and Methodological Challenges
Despite substantial progress, OC20-centered ML potentials face explicit limitations:
- Generalization Deficits: Notable underperformance persists for nonmetal surfaces, large/bidentate or polyoxygenated adsorbates, and chemistries absent from the OC20 training data; errors increase by 30–50% in such regimes (Kolluru et al., 2022).
- Data Efficiency and Uncertainty: Model improvement with additional data is sublinear: MAE scales roughly as a power law in dataset size with exponent α ≈ 0.1–0.2 for OC20 tasks, so, for example, halving the MAE at α = 0.15 would require roughly 2^(1/0.15) ≈ 100× more data. Order-of-magnitude increases in dataset size therefore only marginally improve chemical accuracy.
- Energy Conservation: Energy-conserving forces are preferable for physical consistency and MD, but direct-force architectures tend to outperform in joint ML-optimization workflows on OC20 (Kolluru et al., 2022).
- Off-Equilibrium Sampling: Augmentation with physically relevant MD trajectories enhances force/structure prediction; random "rattling" is less effective.
- Extrapolation Risks: ML potentials are only reliable near training-set PES minima; excursions into unknown strain or chemistry regimes may yield unphysical predictions, necessitating continued DFT validation (Schaarschmidt et al., 2022).
- Benchmark Limitations: OC20 DFT minima are not guaranteed global minima; benchmarking on single-point DFT energies of ML-relaxed structures remains necessary.
7. Impact, Extensions, and Future Directions
The OC20 initiative has had a transformative impact on computational catalysis, providing the canonical dataset and computational protocols for ML-driven catalyst modeling:
- Community Challenge Infrastructure: A central leaderboard (http://open-catalyst.org), open-source code, and extensive documentation have established a reproducible and extensible platform for global research.
- Downstream Applications: OC20-trained GNNs form the backbone of workflows for massive ML-accelerated DFT screening campaigns, rapid ground-state structure search, and microkinetic network generation. These models also provide off-the-shelf surrogates for transition-state search and thermochemistry via ML-generated Hessians (Wander et al., 2024).
- Expansion: OC20 methodologies and concepts directly motivated subsequent datasets and challenges: OC22 (oxides, spin polarization, total-energy tasks), OC25 (explicit solid–liquid interfaces, solvation/ion effects), and experimental integration/validation efforts (Tran et al., 2022; Sahoo et al., 2025; Abed et al., 2024).
- Ongoing Research: Active learning, uncertainty quantification, hybrid ML–DFT schemes, and incorporation of electronic and environmental features (e.g., charge densities, solvation, electric fields) remain open problems. The move toward data fusion, transfer learning, and integration with experimental benchmarks is ongoing.
OC20, as both dataset and ecosystem, serves as the foundational platform for data-driven catalyst discovery, benchmarking, and methodological innovation in computational materials science (Chanussot et al., 2020, Kolluru et al., 2022, Schaarschmidt et al., 2022).