Open Catalyst 2025 (OC25) Dataset
- OC25 is a comprehensive dataset featuring 7,801,261 DFT calculations across explicit solvent and ion environments to benchmark ML in catalysis.
- It incorporates off-equilibrium geometries and high-quality DFT force labels; baseline models trained on it reach energy MAEs as low as 0.060 eV.
- OC25 facilitates realistic simulations of solid–liquid interfaces, advancing catalyst discovery and energy storage research with expanded chemical diversity.
The Open Catalyst 2025 (OC25) dataset is a large-scale, open-access resource designed to accelerate the development of ML models for simulating catalytic processes at solid–liquid interfaces. Building on the foundation of previous Open Catalyst datasets (OC20 and OC22), OC25 addresses key gaps by incorporating explicit representations of solvents and ions, expanded chemical diversity, and off-equilibrium geometries. OC25 establishes a new benchmark for interatomic potential development in heterogeneous catalysis and energy storage research (Sahoo et al., 22 Sep 2025).
1. Dataset Scope and Composition
OC25 consists of 7,801,261 single-point density functional theory (DFT) calculations performed across 1,511,270 unique explicit solvent environments. This resource covers:
- Elemental Diversity: 88 unique elements, including a broad range of transition metals, main group elements, and oxide-forming species.
- Solvent and Ion Representation: 8 commonly used solvents (predominantly water but also including various organics) and 9 different ion types (cations and anions, spanning a range of sizes and charges).
- System and Geometric Diversity: Configurations average 144 atoms per system, with solvent layers systematically varied (typically 5–10 layers, average 5.6). In total, 98 distinct adsorbates are represented, including both those found in OC20 and new reactive intermediates.
- Off-Equilibrium Sampling: Many configurations are generated by brief high-temperature (~1000 K) molecular dynamics (MD) simulations so that the data covers off-equilibrium states with a broad distribution of forces. This reduces the redundancy of a dataset built exclusively from relaxed structures and promotes ML model robustness.
OC25 is currently the most comprehensive and diverse dataset available for studying solid–liquid catalytic interfaces (Sahoo et al., 22 Sep 2025).
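As a rough sense of scale, the headline counts above can be combined into two back-of-the-envelope figures. This is only a sketch: the constant names are illustrative, and the force-label estimate assumes the 144-atom average applies uniformly across all calculations, which the source does not state.

```python
# Headline counts quoted in the text above.
N_CALCULATIONS = 7_801_261   # single-point DFT calculations
N_ENVIRONMENTS = 1_511_270   # unique explicit solvent environments
AVG_ATOMS_PER_SYSTEM = 144   # average system size

# Average number of labeled frames per solvent environment.
frames_per_environment = N_CALCULATIONS / N_ENVIRONMENTS

# Rough count of per-atom force vectors (one 3-vector per atom per frame).
# Assumption: the average system size holds across all calculations.
approx_force_vectors = N_CALCULATIONS * AVG_ATOMS_PER_SYSTEM

print(f"frames per environment ~ {frames_per_environment:.2f}")
print(f"per-atom force labels  ~ {approx_force_vectors:.3e}")
```

On these numbers each solvent environment contributes about five labeled frames on average, and the force labels number on the order of a billion vectors.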
2. Model Benchmarks and ML Performance
State-of-the-art graph neural network (GNN) baselines trained on OC25 are evaluated on three properties relevant to catalyst modeling: total energies, atomic forces, and solvation energies.
Model | Energy MAE (eV) | Force MAE (eV/Å) | Solvation Energy MAE (eV) |
---|---|---|---|
eSEN-S-cons. | 0.105 | 0.015 | 0.08 |
eSEN-M-d. | 0.060 | 0.009 | 0.04 |
UMA-S-1.1 | 0.170 | 0.027 | 0.13 |
The eSEN-M-d. model, a scaled-up variant, achieves the lowest errors (energy MAE: 0.060 eV; force MAE: 0.009 eV/Å; solvation energy MAE: 0.04 eV) on the OC25 Test split, outperforming prior Universal Models for Atoms (UMA-OC20)—especially for solvation energies and force predictions. Both energy-conserving and direct-force approaches perform robustly on OC25, reflecting the underlying diversity and complexity of the dataset (Sahoo et al., 22 Sep 2025).
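To make the comparison concrete, the table values can be loaded into a small structure and compared directly. This is a minimal sketch: the dictionary layout and the helper names `relative_reduction` and `best_model` are illustrative and not part of any OC25 tooling.

```python
# MAE values copied from the benchmark table above.
BENCHMARKS = {
    "eSEN-S-cons.": {"energy": 0.105, "force": 0.015, "solvation": 0.08},
    "eSEN-M-d.":    {"energy": 0.060, "force": 0.009, "solvation": 0.04},
    "UMA-S-1.1":    {"energy": 0.170, "force": 0.027, "solvation": 0.13},
}

def relative_reduction(model, reference, metric):
    """Fractional MAE reduction of `model` relative to `reference`."""
    ref = BENCHMARKS[reference][metric]
    return (ref - BENCHMARKS[model][metric]) / ref

def best_model(metric):
    """Name of the model with the lowest MAE for the given metric."""
    return min(BENCHMARKS, key=lambda name: BENCHMARKS[name][metric])

for metric in ("energy", "force", "solvation"):
    r = relative_reduction("eSEN-M-d.", "UMA-S-1.1", metric)
    print(f"{metric:>9}: best={best_model(metric)}, reduction vs UMA-S-1.1 = {r:.0%}")
```

With the tabulated values, eSEN-M-d. has the lowest error on all three metrics, and each of its MAEs is roughly two-thirds lower than the corresponding UMA-S-1.1 value.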
3. Scientific and Methodological Advances
OC25 fundamentally advances the atomistic modeling of catalysis in several ways:
- Explicit Solvent and Ion Effects: By including detailed solvent/ion environments, OC25 enables simulation of interfacial phenomena such as solvation, electric double layers, and ion-mediated surface processes that are inaccessible in datasets without explicit solvent, such as OC20 or OC22.
- Combinatorial Chemistry: The combination of expanded adsorbate, substrate, solvent, and ionic conditions substantially increases the coverage of catalytically relevant reactions.
- Off-Equilibrium Force Sampling: The intentional inclusion of high-temperature MD-generated geometries ensures a broader and more informative sampling of the potential energy surface, improving ML generalization and transferability.
- High-Quality DFT Force Labels: DFT labels are computed with tight electronic convergence criteria (EDIFF=10⁻⁴ eV for training, 10⁻⁶ eV for validation/test), and frames whose force “drift” exceeds 1 eV/Å are excluded. Even models trained on moderately noisy force labels remain resilient and maintain high test accuracy.
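The drift filter mentioned in the last bullet can be sketched as follows. The source gives only the 1 eV/Å cutoff; the definition of drift used here (the magnitude of the net force summed over all atoms) is an assumption, and the function names and example frames are hypothetical.

```python
import math

DRIFT_THRESHOLD = 1.0  # eV/Å, the cutoff quoted in the text

def force_drift(forces):
    """Magnitude of the net force over all atoms (assumed drift definition).

    `forces` is a list of (fx, fy, fz) tuples in eV/Å. Without an external
    field the exact net force is zero, so a large value signals numerical
    noise in the DFT labels.
    """
    net = [sum(f[axis] for f in forces) for axis in range(3)]
    return math.sqrt(sum(c * c for c in net))

def keep_frame(forces, threshold=DRIFT_THRESHOLD):
    """True if the frame passes the drift filter."""
    return force_drift(forces) <= threshold

# Hypothetical two-atom frames: one with balanced forces, one with a large net force.
frames = {
    "clean": [(0.10, 0.00, -0.20), (-0.10, 0.00, 0.20)],
    "noisy": [(0.90, 0.00, 0.00), (0.80, 0.00, 0.00)],
}
kept = [name for name, forces in frames.items() if keep_frame(forces)]
print(kept)
```

Only the frame with balanced forces survives; the second frame has a net force of 1.7 eV/Å along x and is discarded.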
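An important metric introduced is the pseudo solvation energy:
$$\Delta E_{\mathrm{solv}} = E_{\mathrm{ads}}^{\mathrm{solv}} - E_{\mathrm{ads}}^{\mathrm{vac}}$$
where $E_{\mathrm{ads}}^{\mathrm{solv}}$ and $E_{\mathrm{ads}}^{\mathrm{vac}}$ are adsorption energies in solvated and vacuum environments, respectively. This metric quantifies the solvent influence on adsorbate binding.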
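As a small worked example, the pseudo solvation energy can be computed from the two adsorption energies. The sign convention (solvated minus vacuum) is an assumption here, and the function name and numerical values are hypothetical and chosen only for illustration.

```python
def pseudo_solvation_energy(e_ads_solvated, e_ads_vacuum):
    """Difference between solvated and vacuum adsorption energies, in eV.

    Assumed convention: solvated minus vacuum, so a negative value means the
    solvent stabilizes (strengthens) adsorbate binding.
    """
    return e_ads_solvated - e_ads_vacuum

# Hypothetical adsorption energies for one adsorbate on one surface.
e_ads_vac = -1.00   # eV, adsorption energy in vacuum
e_ads_solv = -1.25  # eV, adsorption energy with explicit solvent

delta = pseudo_solvation_energy(e_ads_solv, e_ads_vac)
print(f"pseudo solvation energy = {delta:+.2f} eV")
```

Under this convention the example adsorbate binds 0.25 eV more strongly in the solvated environment than in vacuum.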
4. Implications for Catalyst Screening and Energy Applications
OC25 is purpose-built to enable:
- Accurate, Long-Timescale Simulations: The scale and chemical realism of OC25, combined with the accuracy of the best baseline (eSEN-M-d.: energy MAE 0.060 eV, force MAE 0.009 eV/Å), facilitate MD simulations of interfacial phenomena over time spans and system sizes that were previously computationally impractical.
- Realistic Modeling of Solid–Liquid Interfaces: Practical catalysis—in electrochemical cells or environmental systems—occurs at solid–liquid interfaces, where solvent and ion effects are critical determinants of functionality. OC25 advances the field by providing data-driven models capable of addressing these complexities directly.
- Improved Catalyst Discovery Pipelines: High accuracy and transferability foster high-throughput screening, reducing the reliance on costly DFT relaxations and improving candidate selection for experimental validation.
A plausible implication is that, as with OC20, “proxy” subsets or distilled models (cf. OC-2M findings from previous work) may facilitate rapid development cycles and tuning for domain-specific applications (Gasteiger et al., 2022).
5. Technical and Computational Considerations
OC25’s size and diversity have several practical consequences:
- Memory and Compute Requirements: Training baseline models to convergence on OC25 involves significant GPU resources, in line with trends observed for OC20 (hundreds of GPUs for large-scale jobs). However, architectural choices such as nearest-neighbor or Voronoi-based graph construction (Korovin et al., 2022) and efficient interaction hierarchies (cf. GemNet-OC; Gasteiger et al., 2022) can mitigate overhead.
- Data Redundancy and Filtering: To optimize learning efficiency and label quality, redundant fully relaxed configurations are minimized and DFT drift thresholds strictly enforced.
- Flexible Model Architectures: Both energy-conserving models (forces obtained as the negative gradient of the predicted energy) and direct force-prediction models are viable; architectural and hyperparameter choices should be tailored to the dataset (as demonstrated by contrasting performance of nearest-neighbor graphs, Gaussian vs. Bessel radial functions, and the inclusion of off-equilibrium geometries).
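The distinction between energy-conserving and direct-prediction models can be illustrated with a toy example. In an energy-conserving model, forces are the negative gradient of a single scalar energy; a direct model instead predicts forces as a separate output that is not guaranteed to be the gradient of any energy. The sketch below uses a hypothetical harmonic pair energy and central finite differences as a stand-in for the automatic differentiation a real model would use; none of the names correspond to OC25 or fairchem code.

```python
import math

def toy_energy(positions, k=1.0, r0=1.0):
    """Toy scalar energy: harmonic springs between all atom pairs (hypothetical)."""
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = math.dist(positions[i], positions[j])
            energy += 0.5 * k * (r - r0) ** 2
    return energy

def conservative_forces(energy_fn, positions, h=1e-5):
    """F = -dE/dx via central differences (stand-in for autodiff)."""
    forces = []
    for i in range(len(positions)):
        f = []
        for axis in range(3):
            plus = [list(p) for p in positions]
            minus = [list(p) for p in positions]
            plus[i][axis] += h
            minus[i][axis] -= h
            f.append(-(energy_fn(plus) - energy_fn(minus)) / (2 * h))
        forces.append(tuple(f))
    return forces

positions = [(0.0, 0.0, 0.0), (1.3, 0.0, 0.0), (0.0, 0.8, 0.0)]
forces = conservative_forces(toy_energy, positions)

# Because the energy depends only on interatomic distances, the net force vanishes.
net_force = [sum(f[axis] for f in forces) for axis in range(3)]
print(forces)
print(net_force)
```

Forces derived this way inherit the symmetries of the energy, here a vanishing net force, which is one reason conservative models tend to be preferred for long MD runs; direct models skip the gradient computation and are typically cheaper per step.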
The dataset and all model baselines are openly accessible via HuggingFace and GitHub, thereby supporting broad reproducibility and further benchmarking efforts (Sahoo et al., 22 Sep 2025).
6. Future Directions and Research Challenges
OC25 highlights several ongoing research directions:
- Generalizability and Universal Models: The challenge of accurately transferring models across broad chemical and configurational space persists; model choices may have strongly dataset-dependent impacts (Gasteiger et al., 2022).
- Uncertainty Quantification and Active Learning: There is continued need for rigorous uncertainty metrics and active model–experiment feedback to guide further data generation (Kolluru et al., 2022).
- Physics-Based Descriptors: Integration of additional physical features—such as charge densities or orbital occupations—may further improve transferability and model interpretability in forthcoming datasets.
- Interfacial Reaction Mechanisms: The unique capacity of OC25 to probe solvent/ion-induced effects enables mechanistic studies of charge transfer, electric field effects, and double layer dynamics that were previously inaccessible in large benchmark datasets.
Taken together, these directions suggest that OC25 will serve as both a benchmark and a springboard for next-generation ML potentials and simulation methodologies relevant to energy conversion, storage, and sustainable chemical manufacturing.
7. Data Access and Community Involvement
OC25 is released under an open-access policy, with both the raw dataset and all baseline models/code available for immediate use and further extension. The primary data repository is hosted on HuggingFace, and source code for training and evaluation is accessible via GitHub under github.com/facebookresearch/fairchem. This infrastructure is intended to promote widespread participation and accelerate progress in ML-driven catalyst discovery and solid–liquid interface modeling (Sahoo et al., 22 Sep 2025).