
Open Catalyst 2025 (OC25) Dataset

Updated 23 September 2025
  • OC25 is a comprehensive dataset featuring 7,801,261 DFT calculations across explicit solvent and ion environments to benchmark ML in catalysis.
  • It incorporates off-equilibrium geometries and high-quality DFT force labels, achieving energy MAEs as low as 0.060 eV for enhanced model performance.
  • OC25 facilitates realistic simulations of solid–liquid interfaces, advancing catalyst discovery and energy storage research with expanded chemical diversity.

The Open Catalyst 2025 (OC25) dataset is a large-scale, open-access resource designed to accelerate the development of machine learning (ML) models for simulating catalytic processes at solid–liquid interfaces. Building on the earlier Open Catalyst datasets (OC20 and OC22), OC25 addresses key gaps by adding explicit representations of solvents and ions, expanded chemical diversity, and off-equilibrium geometries. OC25 establishes a new benchmark for interatomic potential development in heterogeneous catalysis and energy storage research (Sahoo et al., 22 Sep 2025).

1. Dataset Scope and Composition

OC25 consists of 7,801,261 single-point density functional theory (DFT) calculations performed across 1,511,270 unique explicit solvent environments. This resource covers:

  • Elemental Diversity: 88 unique elements, including a broad range of transition metals, main group elements, and oxide-forming species.
  • Solvent and Ion Representation: 8 commonly used solvents (predominantly water but also including various organics) and 9 different ion types (cations and anions, spanning a range of sizes and charges).
  • System and Geometric Diversity: Configurations average 144 atoms per system, with solvent layers systematically varied (typically 5–10 layers, average 5.6). In total, 98 distinct adsorbates are represented, including both those found in OC20 and new reactive intermediates.
  • Off-Equilibrium Sampling: Many configurations are generated by brief high-temperature (~1000 K) molecular dynamics (MD) simulations so that the dataset samples off-equilibrium states with a broad distribution of forces. This reduces the redundancy that comes from using only relaxed structures and promotes ML model robustness.

OC25 is currently the most comprehensive and diverse dataset available for studying solid–liquid catalytic interfaces (Sahoo et al., 22 Sep 2025).

2. Model Benchmarks and ML Performance

State-of-the-art graph neural network (GNN) baselines trained on OC25 demonstrate significant improvements in multiple properties relevant to catalyst modeling:

Model         | Energy MAE (eV) | Force MAE (eV/Å) | Solvation Energy MAE (eV)
eSEN-S-cons.  | 0.105           | 0.015            | 0.08
eSEN-M-d.     | 0.060           | 0.009            | 0.04
UMA-S-1.1     | 0.170           | 0.027            | 0.13

The eSEN-M-d. model, a scaled-up variant, achieves the lowest errors (energy MAE: 0.060 eV; force MAE: 0.009 eV/Å; solvation energy MAE: 0.04 eV) on the OC25 Test split, outperforming prior Universal Models for Atoms (UMA-OC20)—especially for solvation energies and force predictions. Both energy-conserving and direct-force approaches perform robustly on OC25, reflecting the underlying diversity and complexity of the dataset (Sahoo et al., 22 Sep 2025).
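The error metrics in the table can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation code: the function names are hypothetical, and averaging the force error over every Cartesian component of every atom is an assumed convention that the text above does not specify.

```python
def energy_mae(pred, ref):
    """Mean absolute error over per-structure total energies (eV)."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and equally long")
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)


def force_mae(pred, ref):
    """MAE over all force components (eV/Å).

    pred and ref are lists of structures; each structure is a list of
    (fx, fy, fz) tuples, one per atom. Assumption: the error is averaged
    component-wise across all atoms of all structures.
    """
    total, count = 0.0, 0
    for struct_pred, struct_ref in zip(pred, ref):
        for atom_pred, atom_ref in zip(struct_pred, struct_ref):
            for a, b in zip(atom_pred, atom_ref):
                total += abs(a - b)
                count += 1
    if count == 0:
        raise ValueError("no force components to compare")
    return total / count


# Illustrative numbers only, not taken from OC25.
print(energy_mae([1.0, 2.0], [1.1, 1.8]))
print(force_mae([[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]],
                [[(0.1, 0.0, 0.0), (1.0, 0.0, -0.2)]]))
```

Other averaging conventions (for example, per-atom force-vector norms) would give different numbers, so reported values should be compared only under the same definition.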

3. Scientific and Methodological Advances

OC25 fundamentally advances the atomistic modeling of catalysis in several ways:

  • Explicit Solvent and Ion Effects: By including detailed solvent/ion environments, OC25 enables simulation of interfacial phenomena such as solvation, electric double layers, and ion-mediated surface processes that are inaccessible in gas-phase datasets like OC20 or OC22.
  • Combinatorial Chemistry: The combination of expanded adsorbate, substrate, solvent, and ionic conditions substantially increases the coverage of catalytically relevant reactions.
  • Off-Equilibrium Force Sampling: The intentional inclusion of high-temperature MD-generated geometries ensures a broader and more informative sampling of the potential energy surface, improving ML generalization and transferability.
  • High-Quality DFT Force Labels: Labels are computed with tight electronic convergence criteria (EDIFF = 10⁻⁴ eV for training, 10⁻⁶ eV for validation/test), and structures with force “drift” above 1 eV/Å are excluded as outliers. Even models trained on moderately noisy force labels remain resilient and maintain high test accuracy.
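The drift-based filtering mentioned in the last item can be sketched as below. This is an illustration under stated assumptions: "drift" is taken here to mean the magnitude of the net force summed over all atoms, which is one common definition but is not confirmed by the text; the frame layout (a dict with a "forces" key) and the function names are hypothetical. Only the 1 eV/Å threshold comes from the description above.

```python
import math

DRIFT_THRESHOLD = 1.0  # eV/Å, the exclusion threshold quoted in the text


def force_drift(forces):
    """Magnitude of the net force over all atoms (eV/Å).

    Assumption: drift is the norm of the vector sum of per-atom forces,
    which should be close to zero for a well-converged calculation.
    """
    sx = sum(f[0] for f in forces)
    sy = sum(f[1] for f in forces)
    sz = sum(f[2] for f in forces)
    return math.sqrt(sx * sx + sy * sy + sz * sz)


def filter_by_drift(frames, threshold=DRIFT_THRESHOLD):
    """Keep only frames whose force drift does not exceed the threshold."""
    return [fr for fr in frames if force_drift(fr["forces"]) <= threshold]


# Illustrative frames, not taken from OC25.
frames = [
    {"id": "ok", "forces": [(0.5, 0.0, 0.0), (-0.5, 0.0, 0.0)]},
    {"id": "outlier", "forces": [(2.0, 0.0, 0.0), (0.0, 0.0, 0.0)]},
]
kept = filter_by_drift(frames)
print([fr["id"] for fr in kept])
```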

An important metric introduced is the pseudo solvation energy:

$$\Delta E_\mathrm{solv} = \Delta E_\mathrm{ads}^{(\mathrm{solv})} - \Delta E_\mathrm{ads}^{(\mathrm{vac})}$$

where $\Delta E_\mathrm{ads}^{(\mathrm{solv})}$ and $\Delta E_\mathrm{ads}^{(\mathrm{vac})}$ are adsorption energies in solvated and vacuum environments, respectively. This metric quantifies the solvent influence on adsorbate binding.
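A small worked example may make the sign convention concrete. The sketch below assumes the common definition of an adsorption energy as the energy of the system with the adsorbate minus the energy without it minus an adsorbate reference energy; the text does not state which reference OC25 uses, so that definition, the function names, and all numbers are illustrative assumptions.

```python
def adsorption_energy(e_with_adsorbate, e_without_adsorbate, e_adsorbate_ref):
    """Adsorption energy (eV) under an assumed, commonly used convention."""
    return e_with_adsorbate - e_without_adsorbate - e_adsorbate_ref


def pseudo_solvation_energy(de_ads_solv, de_ads_vac):
    """Pseudo solvation energy: solvated minus vacuum adsorption energy (eV)."""
    return de_ads_solv - de_ads_vac


# Made-up total energies in eV, for illustration only.
de_solv = adsorption_energy(-105.2, -100.0, -4.5)  # solvated interface
de_vac = adsorption_energy(-54.9, -50.0, -4.5)     # same surface in vacuum
print(pseudo_solvation_energy(de_solv, de_vac))
```

With these numbers the solvated adsorption energy is more negative than the vacuum one, so the pseudo solvation energy is negative, meaning the solvent stabilizes the adsorbate in this toy case.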

4. Implications for Catalyst Screening and Energy Applications

OC25 is purpose-built to enable:

  • Accurate, Long-Timescale Simulations: The scale and chemical realism of OC25, combined with the accuracy of the best baseline (energy and force MAEs below 0.1 eV and 0.015 eV/Å for eSEN-M-d.), facilitate MD simulations of interfacial phenomena over time spans and system sizes that were previously computationally impractical.
  • Realistic Modeling of Solid–Liquid Interfaces: Practical catalysis—in electrochemical cells or environmental systems—occurs at solid–liquid interfaces, where solvent and ion effects are critical determinants of functionality. OC25 advances the field by providing data-driven models capable of addressing these complexities directly.
  • Improved Catalyst Discovery Pipelines: High accuracy and transferability foster high-throughput screening, reducing the reliance on costly DFT relaxations and improving candidate selection for experimental validation.

A plausible implication is that, as with OC20, “proxy” subsets or distilled models (cf. OC-2M findings from previous work) may facilitate rapid development cycles and tuning for domain-specific applications (Gasteiger et al., 2022).

5. Technical and Computational Considerations

OC25’s size and diversity necessitate:

  • Memory and Compute Requirements: Training baseline models to convergence on OC25 involves significant GPU resources, in line with trends observed for OC20 (hundreds of GPUs for large-scale jobs). However, architectural choices such as nearest-neighbor or Voronoi-based graph construction (Korovin et al., 2022) and efficient interaction hierarchies (cf. GemNet-OC; Gasteiger et al., 2022) can mitigate overhead.
  • Data Redundancy and Filtering: To optimize learning efficiency and label quality, redundant fully relaxed configurations are minimized and DFT drift thresholds are strictly enforced.
  • Flexible Model Architectures: Both energy-conserving models (which obtain forces as the negative gradient of the predicted energy) and direct force-prediction models are viable. Architectural and hyperparameter choices should be tailored to the dataset, as demonstrated by the contrasting performance of nearest-neighbor graphs, Gaussian vs. Bessel radial functions, and the inclusion of off-equilibrium geometries.
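The distinction between energy-conserving and direct force prediction can be illustrated with a toy example. The sketch below is not an OC25 baseline: it uses a Lennard-Jones pair potential as a stand-in for a learned energy model and central finite differences as a stand-in for the automatic differentiation a real model would use. All names and parameters are hypothetical. It shows forces computed as the negative gradient of the energy, which is what makes a model energy-conserving.

```python
import math


def lj_energy(positions, epsilon=1.0, sigma=1.0):
    """Total Lennard-Jones energy of a set of 3D points (toy energy model)."""
    energy = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            sr6 = (sigma / r) ** 6
            energy += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return energy


def conservative_forces(energy_fn, positions, h=1e-5):
    """Forces as F = -dE/dx, approximated by central finite differences."""
    forces = []
    for i in range(len(positions)):
        f = []
        for k in range(3):
            plus = [list(p) for p in positions]
            minus = [list(p) for p in positions]
            plus[i][k] += h
            minus[i][k] -= h
            f.append(-(energy_fn(plus) - energy_fn(minus)) / (2.0 * h))
        forces.append(tuple(f))
    return forces


# Two atoms closer than the potential minimum repel each other.
forces = conservative_forces(lj_energy, [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
print(forces)
```

Because both forces come from one scalar energy, they sum to zero and are consistent with the energy surface by construction. A direct-prediction model outputs forces from a separate head and carries no such guarantee, typically in exchange for cheaper inference.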

The dataset and all model baselines are openly accessible via HuggingFace and GitHub, thereby supporting broad reproducibility and further benchmarking efforts (Sahoo et al., 22 Sep 2025).

6. Future Directions and Research Challenges

OC25 highlights several ongoing research directions:

  • Generalizability and Universal Models: The challenge of accurately transferring models across broad chemical and configurational space persists; model choices may have strongly dataset-dependent impacts (Gasteiger et al., 2022).
  • Uncertainty Quantification and Active Learning: There is continued need for rigorous uncertainty metrics and active model–experiment feedback to guide further data generation (Kolluru et al., 2022).
  • Physics-Based Descriptors: Integration of additional physical features—such as charge densities or orbital occupations—may further improve transferability and model interpretability in forthcoming datasets.
  • Interfacial Reaction Mechanisms: The unique capacity of OC25 to probe solvent/ion-induced effects enables mechanistic studies of charge transfer, electric field effects, and double layer dynamics that were previously inaccessible in large benchmark datasets.

This suggests OC25 will serve as both a benchmark and springboard for next-generation ML potentials and simulation methodologies relevant to energy conversion, storage, and sustainable chemical manufacturing.

7. Data Access and Community Involvement

OC25 is released under an open-access policy, with both the raw dataset and all baseline models/code available for immediate use and further extension. The primary data repository is hosted on HuggingFace, and source code for training and evaluation is accessible via GitHub under github.com/facebookresearch/fairchem. This infrastructure is intended to promote widespread participation and accelerate progress in ML-driven catalyst discovery and solid–liquid interface modeling (Sahoo et al., 22 Sep 2025).
