
The Open Molecules 2025 (OMol25) Dataset, Evaluations, and Models (2505.08762v1)

Published 13 May 2025 in physics.chem-ph

Abstract: Machine learning (ML) models hold the promise of transforming atomic simulations by delivering quantum chemical accuracy at a fraction of the computational cost. Realization of this potential would enable high-throughput, high-accuracy molecular screening campaigns to explore vast regions of chemical space and facilitate ab initio simulations at sizes and time scales that were previously inaccessible. However, a fundamental challenge to creating ML models that perform well across molecular chemistry is the lack of comprehensive data for training. Despite substantial efforts in data generation, no large-scale molecular dataset exists that combines broad chemical diversity with a high level of accuracy. To address this gap, Meta FAIR introduces Open Molecules 2025 (OMol25), a large-scale dataset composed of more than 100 million density functional theory (DFT) calculations at the $\omega$B97M-V/def2-TZVPD level of theory, representing billions of CPU core-hours of compute. OMol25 uniquely blends elemental, chemical, and structural diversity including: 83 elements, a wide range of intra- and intermolecular interactions, explicit solvation, variable charge/spin, conformers, and reactive structures. There are ~83M unique molecular systems in OMol25 covering small molecules, biomolecules, metal complexes, and electrolytes, including structures obtained from existing datasets. OMol25 also greatly expands on the size of systems typically included in DFT datasets, with systems of up to 350 atoms. In addition to the public release of the data, we provide baseline models and a comprehensive set of model evaluations to encourage community engagement in developing the next-generation ML models for molecular chemistry.

Summary

  • The paper introduces a dataset with over 100 million high-level DFT calculations covering extensive elemental, chemical, and structural diversity to improve ML interatomic potentials.
  • It introduces evaluation tasks, including ligand strain, conformer prediction, and protonation energies, to benchmark ML models beyond standard energy and force metrics.
  • Results reveal that while baseline models perform well on energy and force metrics, challenges remain in predicting ionization energies, spin gaps, and long-range interactions.

The paper "The Open Molecules 2025 (OMol25) Dataset, Evaluations, and Models" (2505.08762) introduces a new large-scale molecular dataset designed to address the limitations of existing datasets for training machine learning interatomic potentials (MLIPs). The primary motivation is the lack of comprehensive, high-accuracy data covering broad chemical diversity, which hinders the development of ML models that can act as reliable surrogates for Density Functional Theory (DFT) across molecular chemistry.

The OMol25 Dataset

OMol25 consists of over 100 million DFT single-point calculations performed at a high level of theory, $\omega$B97M-V/def2-TZVPD. This dataset aims to capture the behavior of atoms across diverse chemistry domains by blending elemental, chemical, and structural diversity.

Key characteristics of the dataset:

  • Scale: Over 100 million DFT calculations.
  • Accuracy: Computed using the $\omega$B97M-V functional with the def2-TZVPD basis set, known for its high accuracy for a broad range of quantum chemistry tasks.
  • Elemental Diversity: Includes the first 83 elements of the periodic table.
  • Chemical and Structural Diversity: Covers small molecules, biomolecules (proteins, DNA, RNA, interactions), metal complexes (transition metals, main group, lanthanides, diverse ligands), electrolytes (aqueous, non-aqueous, ionic liquids, molten salts, solvation, interfaces), and reactive structures. Includes systems with varying charge and spin states, explicit solvation, and conformers.
  • System Size: Ranges from 2 to 350 atoms, with an average of 50 atoms, significantly larger than many previous datasets.
  • Data Generation: Structures were generated using a variety of methods, including classical and MLIP-based molecular dynamics (MD), conformer sampling (e.g., CREST, RDKit, MacroModel), automated structure building (Architector for metal complexes and small molecules), reaction path generation (AFIR, Popcornn, geodesic interpolation), and recomputation of existing datasets (ANI-2X, Transition-1X, ANI-1xBB, OrbNet Denali, SPICE2, Solvated Protein Fragments, GEOM, RGD1, RMechDB, PMechDB). ML-based MD was used to explore configurations less accessible by classical methods.
  • Calculation Details: DFT calculations were performed using ORCA 6.0.0 with specific settings for integral thresholds and grids (DEFGRID3) to ensure tight consistency between energy and forces. Quality control filters were applied to remove problematic calculations (e.g., high energies/forces, unphysical $S^2$ values, convergence errors).
  • Computed Properties: Each data point includes total energy, forces, total charge, spin multiplicity, number of atoms/electrons, basis set information, convergence details, expectation value of $S^2$, and various partial charges (Mulliken, Loewdin, NBO), among others. Orbital energies, Fock matrices, and densities are planned for future release.
  • Data Splits: Divided into training (full "All" set, a ~4M subset "4M", and a "Neutral" subset of charge-neutral singlets from community data), validation (out-of-distribution compositions), and multiple targeted out-of-distribution (OOD) test sets (OOD compositions, metal-ligand pairs, metal-containing protein structures, reactivity, experimental crystal structures from COD, unique anions/cations/solvents, TorsionNet500, Wiggle150).
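
The quality-control step mentioned under Calculation Details can be pictured with a minimal sketch. It is not the authors' pipeline: the `Calculation` record, its field names, and both thresholds (`max_force`, `max_s2_dev`) are hypothetical and chosen only for illustration. The one fixed fact it relies on is that a pure spin state with multiplicity 2S+1 has an ideal $S^2$ expectation value of S(S+1), so a large deviation signals spin contamination.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Calculation:
    """Hypothetical record for one DFT single point (field names are assumptions)."""
    forces_ev_per_ang: np.ndarray  # shape (n_atoms, 3)
    s2: float                      # computed expectation value of S^2
    multiplicity: int              # spin multiplicity 2S + 1
    converged: bool                # whether the SCF converged


def ideal_s2(multiplicity: int) -> float:
    """Ideal <S^2> = S(S+1) for a pure spin state with multiplicity 2S+1."""
    s = (multiplicity - 1) / 2.0
    return s * (s + 1.0)


def passes_quality_control(calc: Calculation,
                           max_force: float = 50.0,   # eV/Angstrom, illustrative only
                           max_s2_dev: float = 0.5    # illustrative only
                           ) -> bool:
    """Return True if the calculation survives three simple filters."""
    if not calc.converged:
        return False
    if np.abs(calc.forces_ev_per_ang).max() > max_force:
        return False
    if abs(calc.s2 - ideal_s2(calc.multiplicity)) > max_s2_dev:
        return False
    return True
```

For example, a converged doublet with a computed $S^2$ of 0.76 passes, while the same record with $S^2$ of 1.5 is rejected as spin-contaminated under these illustrative thresholds.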

Evaluations

To assess the practical utility of MLIPs trained on OMol25, the paper introduces a comprehensive suite of evaluation tasks beyond standard energy and force metrics on random splits. These tasks are designed to probe model performance on common computational chemistry objectives:

  • Protein-ligand Interaction Energy and Forces: Measures the accuracy of predicting the interaction energy and forces between a protein pocket and a ligand in a fixed geometry.
  • Ligand Strain: Evaluates the ability to predict the strain energy of a ligand's bioactive conformation relative to its global minimum energy conformer. Metrics include strain energy MAE and RMSD of predicted global minima.
  • Conformers: Assesses the capability to correctly predict the relative energies and structures of molecular conformers and identify low-energy conformers. Metrics include ensemble RMSD, Boltzmann-weighted RMSD, $\Delta E$ MAE, and reoptimization RMSD/$\Delta E$.
  • Protonation Energies: Evaluates the accuracy of predicting energy differences between different protonation states of a molecule. Metrics include RMSD and $\Delta E$ MAE for optimized structures, and reoptimization metrics.
  • Unoptimized IE/EA and Spin Gap: Measures the ability to predict vertical ionization energies, electron affinities, and spin energy gaps between different electronic states at a fixed geometry. Metrics include $\Delta E$ MAE, $\Delta \vec{F}$ MAE, and $\Delta \vec{F}$ cosine similarity.
  • Distance Scaling: Probes the model's ability to capture short-range and long-range intermolecular interactions by scaling distances between molecular components in clusters. Metrics include $\Delta E$ and $\Delta \vec{F}$ errors for short-range (SR) and long-range (LR) distances relative to a reference structure.
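
The metrics named in the tasks above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's evaluation code: the function names are hypothetical, the cosine similarity is taken over the flattened force array of a single structure (the paper may average per atom instead), and the Boltzmann weights assume energies in eV at a default temperature of 298.15 K.

```python
import numpy as np

K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def energy_mae(pred, ref) -> float:
    """Mean absolute error between predicted and reference energies (or energy differences)."""
    return float(np.mean(np.abs(np.asarray(pred, float) - np.asarray(ref, float))))


def force_mae(pred, ref) -> float:
    """Mean absolute error over all Cartesian force components."""
    return float(np.mean(np.abs(np.asarray(pred, float) - np.asarray(ref, float))))


def force_cosine_similarity(pred, ref, eps: float = 1e-12) -> float:
    """Cosine similarity between flattened force arrays (assumed definition)."""
    p = np.asarray(pred, float).ravel()
    r = np.asarray(ref, float).ravel()
    return float(p @ r / (np.linalg.norm(p) * np.linalg.norm(r) + eps))


def boltzmann_weights(energies_ev, temperature_k: float = 298.15) -> np.ndarray:
    """Normalized Boltzmann weights for a conformer ensemble, shifted for numerical stability."""
    e = np.asarray(energies_ev, float)
    w = np.exp(-(e - e.min()) / (K_B_EV_PER_K * temperature_k))
    return w / w.sum()
```

For the IE/EA and spin-gap tasks, the same functions would be applied to differences between two states at a fixed geometry, for example `energy_mae(e_cation_pred - e_neutral_pred, e_cation_ref - e_neutral_ref)` for arrays of per-molecule energies. A Boltzmann-weighted RMSD then follows as `np.sum(boltzmann_weights(energies) * rmsds)`.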

Baseline Models and Results

The paper evaluates several state-of-the-art MLIPs trained on OMol25, including eSEN, GemNet-OC, and MACE (on the neutral split), serving as baselines for community comparison. These models are message-passing graph neural networks modified to accept total charge and spin as inputs via a simple embedding.
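
One way to picture conditioning a graph network on total charge and spin via a simple embedding is a pair of lookup tables whose rows are added to every atom's feature vector. The sketch below is an assumption about the general idea, not the baseline models' actual code: the class name, the supported charge and multiplicity ranges, and the random initialization are all hypothetical, and in a real model the tables would be learned parameters.

```python
import numpy as np


class ChargeSpinEmbedding:
    """Hypothetical lookup-table embedding for total charge and spin multiplicity."""

    def __init__(self, dim: int, max_abs_charge: int = 10,
                 max_multiplicity: int = 11, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.max_abs_charge = max_abs_charge
        self.max_multiplicity = max_multiplicity
        # One row per integer charge in [-max_abs_charge, +max_abs_charge].
        self.charge_table = rng.normal(scale=0.1, size=(2 * max_abs_charge + 1, dim))
        # One row per multiplicity in [1, max_multiplicity]; row 0 is unused.
        self.spin_table = rng.normal(scale=0.1, size=(max_multiplicity + 1, dim))

    def __call__(self, node_features: np.ndarray, charge: int,
                 multiplicity: int) -> np.ndarray:
        if abs(charge) > self.max_abs_charge:
            raise ValueError("charge outside supported range")
        if not 1 <= multiplicity <= self.max_multiplicity:
            raise ValueError("multiplicity outside supported range")
        charge_vec = self.charge_table[charge + self.max_abs_charge]
        spin_vec = self.spin_table[multiplicity]
        # The same system-level vectors are broadcast to every atom.
        return node_features + charge_vec + spin_vec
```

Because the charge and spin vectors are global, every atom receives the same shift, and the message-passing layers are left to work out how the extra electron or unpaired spin is distributed across the system.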

Key findings from the baseline evaluation:

  • Test Splits (Energy/Force MAE): Models trained on the full OMol25 dataset significantly outperform models trained on the 4M subset. Conservative models, which obtain forces as the negative gradient of the predicted energy, generally outperform direct force prediction models. Larger models (eSEN-md vs. eSEN-sm) show improved performance. Performance varies across the OOD splits: reactivity and COD (experimental crystal structures) are the most challenging, while OOD compositions and metal-ligand pairs have lower errors.
  • Neutral Split: On the charge-neutral, singlet subset, eSEN-sm outperforms MACE in both energy and force prediction.
  • Evaluations:
    • Ligand Strain & Conformers: Models show good performance on predicting relative energies and structures of conformers, generally within chemical accuracy (~43 meV).
    • Protonation: More challenging than conformers, with higher $\Delta E$ errors (26-52 meV MAE for models on the 'All' dataset), but structures are reasonably well-predicted (RMSD $\leq$ 0.25).
    • IE/EA and Spin Gap: These tasks proved particularly challenging, with large $\Delta E$ errors (264-597 meV MAE for models on the 'All' dataset) and high force errors, indicating significant limitations in current models' ability to accurately describe different charge and spin states.
    • Distance Scaling: Performance is reasonable in the short-range regime but degrades significantly in the long-range regime, especially for energy predictions, highlighting the absence of explicit long-range interaction treatment in the baseline models.
  • Wiggle150: Baseline models trained on OMol25 show errors against the CCSD(T)/CBS reference comparable to or better than highly accurate DFT functionals (like $\omega$B97M-V) benchmarked in the original Wiggle150 paper, demonstrating strong performance on strained organic conformers.
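
The distinction between energy-conserving and direct-force models in the test-split findings comes down to how forces are produced: a conserving model predicts only the energy and takes forces as its negative gradient with respect to atomic positions, while a direct model outputs forces from a separate head. The sketch below illustrates the gradient route on a toy harmonic pair energy. Both `toy_energy` and the central finite-difference scheme are stand-ins chosen for illustration; real MLIPs would use automatic differentiation through the network.

```python
import numpy as np


def toy_energy(positions, r0: float = 1.0, k: float = 1.0) -> float:
    """Toy harmonic pair energy summed over all unique atom pairs (illustrative only)."""
    pos = np.asarray(positions, float)
    energy = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            r = np.linalg.norm(pos[i] - pos[j])
            energy += 0.5 * k * (r - r0) ** 2
    return energy


def conservative_forces(energy_fn, positions, h: float = 1e-5) -> np.ndarray:
    """Forces as the negative energy gradient, via central finite differences."""
    pos = np.asarray(positions, float)
    forces = np.zeros_like(pos)
    for idx in np.ndindex(pos.shape):
        plus = pos.copy()
        minus = pos.copy()
        plus[idx] += h
        minus[idx] -= h
        forces[idx] = -(energy_fn(plus) - energy_fn(minus)) / (2.0 * h)
    return forces
```

Forces obtained this way are consistent with the energy surface by construction. For the translation-invariant toy energy here, they also sum to zero over all atoms, a property that a direct force head has to learn rather than inherit.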

Outlook and Future Directions

The authors acknowledge that OMol25, despite its scale and diversity, still has gaps (e.g., radioactive elements, polymers, limited coverage of specific classes). The baseline results show that while current models achieve strong performance on average energy/force prediction and some evaluation tasks, there are significant challenges remaining, particularly for predicting IE/EA, spin gaps, and long-range interactions. The dataset includes partial charge and spin data to encourage development in these areas. Future work will include releasing additional computed properties (multipole moments, densities) and developing new evaluations (free energy, reactivity path optimization, Hessians, spectroscopic properties). A public leaderboard is planned to foster community innovation.

In conclusion, OMol25 provides a massive, high-quality, and diverse dataset for training MLIPs for molecular chemistry, along with a challenging set of evaluations and baseline models. The open release of this resource aims to accelerate the development of next-generation ML models capable of achieving DFT accuracy across a broad range of chemical problems.
