Open Catalyst 2020 Dataset
- OC20 is a large-scale, chemically diverse dataset that accelerates catalyst discovery by providing hundreds of millions of DFT-evaluated structures.
- The dataset incorporates multiple structure generation protocols, including equilibrium relaxations, random perturbations, and short MD trajectories for varied chemical systems.
- OC20 supports ML tasks such as energy-force prediction and structure relaxation, benchmarked with rigorous splits and advanced graph neural network models.
The Open Catalyst 2020 (OC20) dataset is a large-scale, chemically diverse benchmark for training and evaluating machine learning models in heterogeneous catalysis. Its aim is to accelerate density functional theory (DFT) simulations for catalyst discovery across a broad chemical and structural domain. Developed to overcome the limited generalizability of earlier narrow-scope datasets, OC20 comprises hundreds of millions of DFT-calculated structures, which support the training and rigorous assessment of general-purpose machine learning interatomic potentials (MLIPs) for energy, force, and catalytic activity predictions (Chanussot et al., 2020, Gasteiger et al., 2022, Kolluru et al., 2022, Allam et al., 27 Oct 2025).
1. Composition, Scale, and Diversity
OC20 encompasses approximately 300 million single-point DFT evaluations and over 1.28 million relaxation trajectories, each representing an adsorbate-surface system covering C, H, O, and N chemistry on an extensive set of catalysts, including metals, alloys, and oxides (Allam et al., 27 Oct 2025). The dataset was generated through systematic sampling of 55 unique chemical elements, 82 molecular adsorbates, and around 300,000 unique surface facets, yielding broad coverage in elemental composition, adsorbate types (monodentate, bidentate, and larger intermediates), and crystal orientations (Chanussot et al., 2020, Kolluru et al., 2022). Systems typically contain 7–225 atoms (mean ≈73), with slab models of varied complexity (Gasteiger et al., 2022).
Each OC20 structure is described by atom types, 3D positions, cell geometry, and slab periodic boundary conditions. Four classes of structure generation are included: equilibrium DFT relaxations, random “rattled” perturbations about minima, short ab-initio MD trajectories at elevated temperature, and transition-state–like configurations (Chanussot et al., 2020, Allam et al., 27 Oct 2025).
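The structure description and the "rattled" perturbation protocol can be illustrated with a minimal sketch. The `Structure` class, the `rattle` helper, the Gaussian noise model, and the 0.05 Å displacement width are all assumptions made for illustration; they are not the dataset's actual schema or generation settings.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Structure:
    """Hypothetical container mirroring the fields listed above."""
    atomic_numbers: tuple[int, ...]
    positions: tuple[tuple[float, float, float], ...]  # Cartesian, in Å
    cell: tuple[tuple[float, float, float], ...]       # three lattice vectors, in Å
    pbc: tuple[bool, bool, bool] = (True, True, True)  # periodic directions


def rattle(structure: Structure, sigma: float = 0.05, seed: int = 0) -> Structure:
    """Return a copy with Gaussian noise (std. dev. sigma, in Å) added to every coordinate.

    The noise model and sigma are illustrative assumptions.
    """
    rng = random.Random(seed)
    moved = tuple(
        tuple(coord + rng.gauss(0.0, sigma) for coord in pos)
        for pos in structure.positions
    )
    return Structure(structure.atomic_numbers, moved, structure.cell, structure.pbc)


# Toy system: two Pt atoms and one O atom in an orthorhombic cell.
slab = Structure(
    atomic_numbers=(78, 78, 8),
    positions=((0.0, 0.0, 0.0), (2.8, 0.0, 0.0), (1.4, 0.0, 1.5)),
    cell=((5.6, 0.0, 0.0), (0.0, 5.6, 0.0), (0.0, 0.0, 20.0)),
)
rattled = rattle(slab, sigma=0.05, seed=42)
```

Fixing the seed makes the perturbation reproducible, which is convenient when the perturbed geometry is later paired with a DFT label.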
2. Data Generation and DFT Protocols
All OC20 data were generated using plane-wave DFT with the revised Perdew–Burke–Ernzerhof (RPBE) exchange–correlation functional. Key numerical settings include:
- Plane-wave cutoff (350 eV; higher values in successor datasets)
- Smearing width (0.2 eV)
- Non-spin-polarized calculations for nearly all systems to maximize computational throughput (Allam et al., 27 Oct 2025)
- Forces converged below 0.03 eV/Å for final relaxations
Adsorption energies are referenced to gas-phase molecules with standard thermochemical corrections (Chanussot et al., 2020). The dataset includes “rattled” structures (random atomic displacements) and ab-initio MD snapshots for out-of-equilibrium sampling, with explicit metadata indicating structure type and configuration (Kolluru et al., 2022).
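As a rough illustration of gas-phase referencing, the sketch below computes an adsorption energy as the energy of the combined system minus the clean slab and a gas-phase reference for the adsorbate. It assumes the reference is a sum of per-element energies derived from gas-phase molecules and leaves out the thermochemical corrections; the function name and all numeric values are placeholders, not OC20 values.

```python
def adsorption_energy(e_system, e_slab, adsorbate_counts, reference_energies):
    """E_ads = E(slab + adsorbate) - E(slab) - sum_i n_i * e_ref(element_i), in eV."""
    e_gas = sum(n * reference_energies[el] for el, n in adsorbate_counts.items())
    return e_system - e_slab - e_gas


# Placeholder per-element reference energies (eV), for illustration only.
refs = {"H": -3.5, "O": -7.0}

# Hypothetical *OH adsorbate on a slab.
e_ads = adsorption_energy(
    e_system=-311.2,
    e_slab=-300.0,
    adsorbate_counts={"O": 1, "H": 1},
    reference_energies=refs,
)
```

With these placeholder numbers the result is about -0.7 eV; a negative value indicates binding relative to the chosen gas-phase reference.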
3. Tasks, Splits, and Evaluation Metrics
OC20 is organized around three core ML tasks:
- S2EF (Structure to Energy and Forces): Predict total energy and per-atom forces given atomic structure.
- IS2RS (Initial Structure to Relaxed Structure): Predict final relaxed geometry from an initial, unrelaxed configuration.
- IS2RE (Initial Structure to Relaxed Energy): Predict the DFT-relaxed energy directly from the initial structure (Chanussot et al., 2020, Gasteiger et al., 2022, Kolluru et al., 2022).
Each task supports rigorous benchmarking across four predefined data splits—In-Domain (ID), Out-of-Distribution Adsorbate (OOD-Ads), Out-of-Distribution Catalyst (OOD-Cat), and Out-of-Distribution Both (OOD-Both)—assessing generalization to unseen chemistries and surfaces. Evaluation metrics include energy mean absolute error (MAE), force MAE, force cosine similarity, EFwT (energy and forces within threshold), DwT/ADwT (distance within threshold), and EwT (energy within threshold) (Chanussot et al., 2020, Gasteiger et al., 2022, Kolluru et al., 2022, Sriram et al., 2022).
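Several of the metrics listed above are simple to state in code. The sketch below is one plausible reading, not the official evaluation code: the function names are hypothetical, the default tolerances (0.02 eV and 0.03 eV/Å) are assumptions, and the force criterion is taken to be the largest per-component error.

```python
import math


def energy_mae(e_pred, e_true):
    """Mean absolute energy error over a set of structures (eV)."""
    return sum(abs(p - t) for p, t in zip(e_pred, e_true)) / len(e_true)


def force_mae(f_pred, f_true):
    """Mean absolute error over all force components of one structure (eV/Å)."""
    errors = [abs(p - t) for fp, ft in zip(f_pred, f_true) for p, t in zip(fp, ft)]
    return sum(errors) / len(errors)


def force_cosine(f_pred, f_true):
    """Mean cosine similarity between predicted and true per-atom force vectors.

    Atoms with a zero-length vector contribute 0 to the mean.
    """
    total = 0.0
    for fp, ft in zip(f_pred, f_true):
        norm = math.hypot(*fp) * math.hypot(*ft)
        if norm > 0.0:
            total += sum(p * t for p, t in zip(fp, ft)) / norm
    return total / len(f_true)


def within_threshold(e_pred, e_true, f_pred, f_true, e_tol=0.02, f_tol=0.03):
    """EFwT-style check for one structure: energy error and the largest
    force-component error must both fall below their (assumed) tolerances."""
    max_f_err = max(abs(p - t) for fp, ft in zip(f_pred, f_true) for p, t in zip(fp, ft))
    return abs(e_pred - e_true) < e_tol and max_f_err < f_tol


# Toy two-atom example: the second predicted force points in the wrong direction.
f_true = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
f_pred = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
cos = force_cosine(f_pred, f_true)
mae = force_mae(f_pred, f_true)
ok = within_threshold(-1.00, -1.01, f_pred, f_true)
```

In this toy case the energy is within tolerance but the force check fails, which shows why a joint threshold metric is much stricter than either MAE alone.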
4. Modeling Frameworks and Baseline Architectures
OC20 has become the reference benchmark for deep learning models in catalysis, particularly graph neural networks (GNNs) that encode atomic connectivity and spatial relationships. Baseline architectures include:
- CGCNN: Crystal Graph Convolutional Neural Network; uses handcrafted elemental features and edge updates via Gaussian-smeared interatomic distances.
- SchNet: Continuous-filter convolutional network with atom-wise energy predictions and analytic force consistency.
- DimeNet++: Directional message-passing GNN capturing both radial and angular information; forms the basis for several high-performing models.
- GemNet and GemNet-OC: Capture higher-order (triplet/quadruplet) interactions for improved accuracy and are scalable to very large parameter counts via graph parallelism (Gasteiger et al., 2022, Sriram et al., 2022).
These models are evaluated on the full OC20 and its subsets (notably OC-2M for rapid method development), using combined energy-force loss, domain-specific splits, and standard optimizer configurations (Adam/AdamW). Direct and gradient-derived force prediction paradigms are compared, with direct-force models empirically yielding superior performance for relaxation and structure prediction tasks (Kolluru et al., 2022).
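The difference between gradient-derived and direct forces, and the combined energy-force loss, can be sketched on a toy system. Everything below is illustrative: `toy_energy` is a harmonic pair potential standing in for a learned model, the gradient is taken by central finite differences rather than automatic differentiation, the "direct" forces are made-up numbers, and the loss weights are placeholders.

```python
import math


def toy_energy(positions, k=1.0, r0=1.0):
    """Harmonic pair energy summed over all atom pairs (stand-in for a model)."""
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = math.dist(positions[i], positions[j])
            energy += 0.5 * k * (r - r0) ** 2
    return energy


def gradient_forces(energy_fn, positions, h=1e-5):
    """Forces as the negative energy gradient, F = -dE/dr, by central differences."""
    forces = []
    for i in range(len(positions)):
        f_atom = []
        for axis in range(3):
            plus = [list(p) for p in positions]
            minus = [list(p) for p in positions]
            plus[i][axis] += h
            minus[i][axis] -= h
            f_atom.append(-(energy_fn(plus) - energy_fn(minus)) / (2.0 * h))
        forces.append(tuple(f_atom))
    return forces


def combined_loss(e_pred, e_true, f_pred, f_true, w_energy=1.0, w_force=10.0):
    """Weighted sum of absolute energy error and mean absolute force error."""
    f_errs = [abs(p - t) for fp, ft in zip(f_pred, f_true) for p, t in zip(fp, ft)]
    return w_energy * abs(e_pred - e_true) + w_force * sum(f_errs) / len(f_errs)


positions = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
energy = toy_energy(positions)                    # stretched bond
f_grad = gradient_forces(toy_energy, positions)   # consistent with the energy
# Hypothetical output of a direct force head; it need not sum to zero.
f_direct = [(0.4, 0.0, 0.0), (-0.5, 0.0, 0.0)]
loss = combined_loss(energy, 0.125, f_direct, f_grad)
```

Gradient-derived forces are consistent with the predicted energy by construction, while a direct head is free to violate that constraint; the empirical comparison between the two paradigms is the one reported above.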
5. Limitations, Challenges, and Extensions
While OC20 has enabled substantial increases in accuracy and generality, several limitations are recognized:
- The original OC20 lacks spin polarization for 12 strongly magnetic elements (Fe, Co, Ni, Mn, V, Cr, Cu, Ru, Os, Mo, Ce, W), which introduces systematic errors in binding energy predictions for these chemistries. Spin effects can shift binding energies by several tenths of an eV (Allam et al., 27 Oct 2025).
- DFT settings (ENCUT ~350 eV, coarse smearing) can produce incomplete basis convergence and smearing artifacts, particularly for non-metal surfaces or mixed-anion systems.
- Certain chemistries (e.g., Li, Ba, La, Ce, Mg, F) and transition-state intermediates relevant for C/H/O/N catalysis are under-represented, leading to weak model performance in these domains (Allam et al., 27 Oct 2025, Kolluru et al., 2022).
To address these gaps, datasets such as AQCat25 have been developed, providing 13.5 million high-fidelity DFT calculations with explicit spin-polarization and extended chemical coverage. Integration methodologies highlighted include joint (co-)training with feature-wise linear modulation (FiLM) conditioning on system-specific metadata (spin state, DFT fidelity), enabling combined models to yield reliable predictions for both original and extended domains without catastrophic forgetting (Allam et al., 27 Oct 2025).
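Feature-wise linear modulation scales and shifts each hidden feature using parameters computed from conditioning metadata. The sketch below shows only that mechanism: the metadata encoding (`[spin_polarized, high_fidelity]`), the weight matrices, and the "1 + ..." parameterization that reduces to the identity for all-zero metadata are assumptions, not details taken from the cited work.

```python
def film(features, metadata, w_gamma, w_beta):
    """Apply h'_k = gamma_k(c) * h_k + beta_k(c) to each feature h_k.

    gamma_k(c) = 1 + sum_m w_gamma[k][m] * c[m]   (assumed parameterization)
    beta_k(c)  =     sum_m w_beta[k][m]  * c[m]
    """
    out = []
    for k, h in enumerate(features):
        gamma = 1.0 + sum(w * c for w, c in zip(w_gamma[k], metadata))
        beta = sum(w * c for w, c in zip(w_beta[k], metadata))
        out.append(gamma * h + beta)
    return out


# Two hidden features and two metadata flags: [spin_polarized, high_fidelity].
features = [1.0, -2.0]
w_gamma = [[0.5, 0.0], [0.0, 0.25]]   # placeholder weights
w_beta = [[0.1, 0.0], [0.0, -0.2]]    # placeholder weights

base = film(features, [0.0, 0.0], w_gamma, w_beta)       # features unchanged
spin = film(features, [1.0, 0.0], w_gamma, w_beta)       # only feature 0 modulated
spin_hifi = film(features, [1.0, 1.0], w_gamma, w_beta)  # both features modulated
```

Because all-zero metadata leaves the features untouched in this parameterization, one set of shared weights can serve the original domain while the modulation terms absorb the differences introduced by the extended one.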
6. Scientific and Practical Impact
OC20 has shifted ML for catalysis from chemistry-specific potentials toward universal potential energy surface (PES) surrogates, with clear benchmarks and open leaderboards facilitating community-driven progress (Chanussot et al., 2020, Kolluru et al., 2022). Models trained on OC20 (and its extensions) have demonstrated:
- Energy MAE down to ~0.35 eV and force MAE down to ~0.02 eV/Å on ID data (GemNet-OC, DimeNet++), with force cosine similarity ≥0.66 in the best large models (Sriram et al., 2022, Gasteiger et al., 2022).
- Gradual performance degradation (up to about 2× the in-domain error) on OOD splits, emphasizing the challenge of generalization across chemical space (Kolluru et al., 2022).
- Strong scaling behavior, with models exceeding 300M parameters providing ~15% relative reduction in force MAE compared to previously state-of-the-art models (Sriram et al., 2022).
Advances in graph parallelism and model scaling have enabled the training of billion-parameter GNNs, cementing OC20 as a cornerstone for the evaluation of foundation models in heterogeneous catalysis (Sriram et al., 2022). The dataset has also inspired methodological extensions, such as disconnected GNN variants that probe performance when explicit 3D adsorbate–surface geometry is omitted, and incorporation of electronic descriptors (Bader charges, pDOS/pCOHP) as model features (Carbonero et al., 2023, Chanussot et al., 2020).
7. Outlook and Future Directions
Ongoing research is focused on augmenting OC20 with additional data modalities and tasks—such as spin-polarized calculations, transition-state and kinetic information, charge density, and orbital-based descriptors—to systematically close domain gaps and improve model extensibility (Allam et al., 27 Oct 2025, Kolluru et al., 2022). Active learning for under-represented chemical regions, multi-task objectives (spectra, kinetics), and improved regularization (e.g., physics-informed constraints) are priority areas.
Model error on OC20 improves only slowly with data volume (learning exponent ≃ –0.11), which underscores the need for architectural innovation and careful data curation, not just dataset expansion (Gasteiger et al., 2022). The OC20 framework establishes the foundation for the next generation of ML-based catalyst discovery: scalable, accurate, and generalizable PES models for chemically diverse, industrially relevant catalytic systems.
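To make the quoted exponent concrete, the sketch below assumes a pure power law, error ∝ N^α with α = –0.11 and N the number of training samples. That functional form and the helper names are assumptions for illustration; the source reports only the exponent.

```python
def error_ratio(data_factor, exponent=-0.11):
    """Relative error after multiplying the training set size by data_factor."""
    return data_factor ** exponent


def data_factor_for(target_ratio, exponent=-0.11):
    """Data multiplier needed to scale the error by target_ratio."""
    return target_ratio ** (1.0 / exponent)


ten_fold = error_ratio(10.0)       # about 0.78: 10x the data removes ~22% of the error
to_halve = data_factor_for(0.5)    # about 545x the data to cut the error in half
```

Under this assumed power law, a tenfold increase in data lowers the error by only about a fifth, and halving it would take several hundred times more data, which is the arithmetic behind the call for better architectures and curation.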