Open Molecules 2025 (OMol25): Quantum-Chem Data
- OMol25 is a foundational quantum-chemical dataset featuring millions of DFT-calculated molecular configurations that drive progress in machine-learning interatomic potentials and molecular AI.
- It employs a consistent ωB97M-V/def2-TZVPD protocol across diverse chemical domains to provide reproducible, high-fidelity energy and force benchmarks.
- The dataset supports systematic scaling studies and standardized benchmarking splits, integrating seamlessly with universal molecular AI frameworks for cross-domain research.
Open Molecules 2025 (OMol25) is a foundational quantum-chemical dataset specifically constructed to advance machine-learning interatomic potentials (MLIPs) and universal molecular AI. With its unprecedented scale, chemical diversity, and high-fidelity density functional theory (DFT) computations, OMol25 forms the backbone for both large-scale molecular modeling benchmarking and integrative research in computational chemistry, molecular simulation, and generative molecular understanding (Levine et al., 13 May 2025, Elhag et al., 25 Sep 2025, 2502.01074).
1. Origins, Motivations, and Design Goals
OMol25 originated from the need for a dataset that simultaneously addresses (1) the lack of comprehensive, high-accuracy, open-access DFT datasets spanning most of the periodic table (the first 83 elements, through bismuth, Z=83); (2) systematic support for data- and model-scaling studies; and (3) robust benchmarking of MLIPs, including out-of-distribution (OOD) generalization (Elhag et al., 25 Sep 2025, Levine et al., 13 May 2025).
Key design objectives included:
- Generating millions of unique molecular configurations with energies and forces computed at a single, consistent DFT level (ωB97M-V/def2-TZVPD) to avoid bias from methodological heterogeneity.
- Covering a broad swath of chemical domains, including small organics, biomolecules, electrolytes, inorganic/organometallic complexes, and reactivity intermediates.
- Providing standardized data splits for both in-distribution and OOD evaluation, with domain-specific subsets to enable targeted, reproducible benchmarking.
- Ensuring data are accessible and broadly reusable through open-source provision and a unified API (FAIRChem, version 2.2.0).
2. Dataset Composition, Scale, and Organization
The OMol25 dataset comprises approximately 100 million DFT single-point calculations, representing about 6 billion CPU-core hours, and was built to maximize chemical, structural, and elemental diversity (Levine et al., 13 May 2025). It contains approximately 83 million unique systems, ranging in size from diatomics to 350-atom complexes. Coverage by domain is summarized below:
| Domain | Snapshots (N=100M) | Atom-count share | Unique systems (%) |
|---|---|---|---|
| Biomolecules | 18,000,000 | 19% | 12% |
| Metal complexes | 19,000,000 | 19% | 14% |
| Community organics | 22,000,000 | 22% | 20% |
| Electrolytes | 30,000,000 | 30% | 25% |
| Reactivity/Other | 11,000,000 | 10% | 29% |
The domains are deliberately broad: OMol25 includes not only small molecules and biomolecular fragments but also explicit solvation, charge and spin variation (charges −10 to +10, spins 1–11), and a large body of reactive structures (e.g., transition-state-like geometries from AFIR, geodesics, Popcornn).
- Elemental Coverage: Every element Z=1–83 appears in ≥10,000 structures; typical heavy-atom coverage per element spans 10⁵–10⁸ snapshots.
- Conformational and Reactive Sampling: Classical and ML-based MD, CREST, ESMACS, AFIR, and other protocols ensure coverage of equilibrium, off-equilibrium, and highly reactive configurations (Levine et al., 13 May 2025).
3. Data Generation, Preprocessing, and Statistical Characterization
All DFT computations use the ωB97M-V functional with the def2-TZVPD basis set in ORCA 6.0, with RI-J and COSX acceleration, tight convergence, and GRID3 settings. Because every structure is computed with the same protocol, energies and forces are consistent with one another across the dataset, giving uniform reference data for MLIP development.
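As a rough illustration of what a single-point energy-and-gradient job at this level of theory could look like, the sketch below builds an ORCA input string in Python. The helper name `build_orca_input` is hypothetical, and the exact keyword line is an assumption: `DefGrid3` is used here as the ORCA 6 spelling of the grid setting, and `def2/J` and `EnGrad` are added as the usual auxiliary basis and gradient request. The authors' actual input files are not reproduced in this article.

```python
def build_orca_input(symbols, coords, charge=0, multiplicity=1):
    """Return an ORCA single-point energy+gradient input as a string.

    Assumed keyword line (not taken from the OMol25 authors' inputs):
    functional, basis, RI-J/COSX, auxiliary basis, tight SCF, dense grid, gradient.
    """
    if len(symbols) != len(coords):
        raise ValueError("symbols and coords must have the same length")
    header = "! wB97M-V def2-TZVPD RIJCOSX def2/J TightSCF DefGrid3 EnGrad"
    lines = [header, f"* xyz {charge} {multiplicity}"]
    for sym, (x, y, z) in zip(symbols, coords):
        lines.append(f"{sym:<2s} {x:14.8f} {y:14.8f} {z:14.8f}")
    lines.append("*")  # closes the coordinate block
    return "\n".join(lines) + "\n"


# Illustrative water geometry in Å (not taken from the dataset).
water = build_orca_input(
    ["O", "H", "H"],
    [(0.0, 0.0, 0.1173), (0.0, 0.7572, -0.4692), (0.0, -0.7572, -0.4692)],
)
print(water)
```

Charge and spin multiplicity appear on the `* xyz` line, which is where the dataset's charge and spin variation would enter a calculation of this kind.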
Each configuration includes:
- Atomic coordinates centered on the molecular center of mass.
- Atomic numbers, total molecular charge, and spin multiplicity.
- Per-configuration energy (eV or eV/atom) and atom-wise forces (eV/Å).
- Index files with SMILES, InChI keys, and split assignments.
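The fields listed above can be pictured as a small record type. The following is a minimal sketch, not the dataset's actual schema: the `Configuration` class, its field names, the tiny `MASSES` table, and the placeholder energy are all assumptions made for illustration. It also shows the center-of-mass shift that the first bullet describes.

```python
from dataclasses import dataclass

# Approximate atomic masses (u) for a few elements; illustrative only.
MASSES = {1: 1.008, 6: 12.011, 7: 14.007, 8: 15.999}


@dataclass
class Configuration:
    atomic_numbers: list  # length N
    positions: list       # N rows of [x, y, z] in Å
    charge: int           # total molecular charge
    multiplicity: int     # spin multiplicity
    energy: float         # total energy in eV
    forces: list          # N rows of [fx, fy, fz] in eV/Å


def center_of_mass(atomic_numbers, positions):
    """Mass-weighted mean position of the atoms."""
    total = sum(MASSES[z] for z in atomic_numbers)
    return [
        sum(MASSES[z] * p[k] for z, p in zip(atomic_numbers, positions)) / total
        for k in range(3)
    ]


def centered(config):
    """Return a copy whose center of mass sits at the origin.

    Forces are unchanged because a rigid translation does not affect them.
    """
    com = center_of_mass(config.atomic_numbers, config.positions)
    shifted = [[p[k] - com[k] for k in range(3)] for p in config.positions]
    return Configuration(config.atomic_numbers, shifted, config.charge,
                         config.multiplicity, config.energy, config.forces)


water = Configuration(
    atomic_numbers=[8, 1, 1],
    positions=[[1.0, 1.0, 1.1173], [1.0, 1.7572, 0.5308], [1.0, 0.2428, 0.5308]],
    charge=0,
    multiplicity=1,
    energy=-2079.0,  # placeholder value, not a computed result
    forces=[[0.0, 0.0, 0.0] for _ in range(3)],
)
water_c = centered(water)
print(center_of_mass(water_c.atomic_numbers, water_c.positions))
```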
Data diversity is characterized by:
- Bond orders (single to triple, dative/metal–ligand).
- Coordination numbers (metals: 2–12).
- Conformer-energy ranges (up to several eV above minima).
- Elemental, charge, spin, and size distributions, which can be summarized with histograms or with embedding-space coverage metrics.
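The histogram-style summaries mentioned above are straightforward to compute once charge, spin, and size have been read from the index. The sketch below uses hypothetical toy records, not real OMol25 entries, and a hypothetical helper `distribution`; it covers only the histogram route, not embedding-space coverage.

```python
from collections import Counter


def distribution(values):
    """Return {value: fraction of samples} for a list of hashable values."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in sorted(counts.items())}


# Hypothetical toy records: (charge, spin multiplicity, number of atoms).
records = [(0, 1, 12), (0, 1, 30), (-1, 2, 12), (2, 1, 45), (0, 3, 30)]

charge_dist = distribution([r[0] for r in records])
spin_dist = distribution([r[1] for r in records])
size_dist = distribution([r[2] for r in records])
print(charge_dist)
print(spin_dist)
print(size_dist)
```

Comparing such distributions between a training split and a validation split is one simple way to see how far an evaluation set departs from the training data.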
4. Data Access, Splits, and Benchmarking Protocols
OMol25 is openly distributed via Zenodo and GitHub, formatted in HDF5 or sharded NPZ archives, with per-molecule arrays and comprehensive index files. The FAIRChem package provides canonical scripts and loaders for standardized training and evaluation (Elhag et al., 25 Sep 2025).
Canonical splits:
- Training split (“4M split”): ~3.99 million samples.
- Validation split (Val-Comp): ~2.76 million OOD configurations.
- Domain-specific validation sets: metal complexes, electrolytes, biomolecules, neutral organics (e.g., ANI-2X, OrbNet-Denali, GEOM), SPICE, reactivity (typically ~20k samples each).
- Additional test splits for leaderboard evaluations.
Benchmarking support:
Evaluation metrics are standardized via FAIRChem and include:
- Force MAE (eV/Å)
- Force cosine similarity
- Energy per atom MAE (eV/atom)
- Total energy MAE (eV)
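The four metrics can be written down in a few lines. The sketch below uses the usual textbook definitions and is not FAIRChem's implementation: in particular, averaging the force MAE over all Cartesian components and the cosine similarity over atoms are assumptions, and the function names are hypothetical.

```python
import math


def force_mae(pred, ref):
    """Mean absolute error over all force components (eV/Å)."""
    diffs = [abs(p - r) for fp, fr in zip(pred, ref) for p, r in zip(fp, fr)]
    return sum(diffs) / len(diffs)


def force_cosine(pred, ref):
    """Mean per-atom cosine similarity between predicted and reference forces."""
    sims = []
    for fp, fr in zip(pred, ref):
        dot = sum(p * r for p, r in zip(fp, fr))
        norm = math.hypot(*fp) * math.hypot(*fr)
        if norm > 0.0:  # atoms with a zero force vector are skipped
            sims.append(dot / norm)
    return sum(sims) / len(sims) if sims else float("nan")


def energy_mae(pred, ref):
    """Mean absolute error of total energies (eV)."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)


def energy_per_atom_mae(pred, ref, n_atoms):
    """Mean absolute error of energies normalized by atom count (eV/atom)."""
    return sum(abs(p - r) / n for p, r, n in zip(pred, ref, n_atoms)) / len(pred)


# Toy numbers for illustration only.
f_pred = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
f_ref = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
e_pred, e_ref, sizes = [-10.0, -20.0], [-10.2, -19.9], [2, 5]

print(force_mae(f_pred, f_ref))
print(force_cosine(f_pred, f_ref))
print(energy_mae(e_pred, e_ref))
print(energy_per_atom_mae(e_pred, e_ref, sizes))
```

Force MAE and cosine similarity are complementary: the toy prediction above has a nonzero force MAE because one magnitude is wrong, yet its directions are exact.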
Seven additional downstream tasks are defined: protein–ligand interaction, ligand strain, conformer ranking, protonation energies, unoptimized IE/EA and spin gaps, distance-scaling noncovalent interactions, and strained conformers (Wiggle150) (Levine et al., 13 May 2025).
5. Baseline MLIP Models and Evaluations
OMol25 provides baseline models, including three versions of the equivariant Smooth Energy Network (eSEN), GemNet-OC, and MACE (on the neutral split) (Levine et al., 13 May 2025):
- eSEN: E(3)-equivariant Transformer using a learned basis with variable , nodewise normalization, and gated nonlinearity.
- GemNet-OC: Invariant message-passing GNN using atom/edge/triplet/quad features and spherical harmonics.
- MACE: High-order equivariant message passing for explicit three- and four-body interactions.
Key results on the "All" split:
| Model | Energy MAE (meV/atom) | Force MAE (meV/Å) | PL Energy (meV) | IE/EA ΔE (meV) | SR ΔE (meV) | LR ΔE (meV) |
|---|---|---|---|---|---|---|
| eSEN-sm-cons. | 1.35 | 7.39 | 147.3 | 315 | 28.6 | 268 |
| GemNet-OC | 0.57 | 5.85 | 19.4 | 254 | 11.8 | 143 |
Observed limitations include persistent large errors for charge/spin-dependent properties and long-range dissociation, motivating future architectural innovations.
6. Integration with Molecular AI and Universal Models
OMol25 serves as both a training corpus and robust benchmark for universal molecular AI frameworks, such as Omni-Mol (2502.01074). These systems leverage OMol25’s scale and chemical diversity to develop:
- Unified encoding mechanisms that integrate natural-language instructions, SELFIES strings, and molecular graphs via a single encoder.
- Active-learning-based data selection (Omni-Mol iteratively selects a 40% subset judged most informative).
- Advanced stabilization strategies (e.g., adaptive LoRA scaling, anchor-and-reconcile experts in MoE architectures) to manage gradient variance and inter-task conflict.
- Instruction-tuned, multi-task architectures achieving state-of-the-art performance on reaction, regression, description, and action tasks.
- Characterization of scaling laws: average performance is fitted as a function of dataset size (data scaling) and of model size (model scaling), with fit quality near 0.98, indicating robust generalization as model and data scale increase.
The convergence of task spaces and unified representations supports OMol25’s goal of establishing a common platform for integrative molecular AI.
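A minimal sketch of how a scaling fit of this kind might be checked is given below. It assumes, purely for illustration, a power-law form y = a·x^b fitted by least squares in log-log space, with the coefficient of determination computed in that same space. The functional form, the function name `fit_power_law`, and the synthetic data are all assumptions and are not taken from the cited work.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space.

    Returns (a, b, r2), where r2 is the coefficient of determination
    of the linear fit to (log x, log y).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx
    log_a = my - b * mx
    ss_res = sum((v - (log_a + b * u)) ** 2 for u, v in zip(lx, ly))
    ss_tot = sum((v - my) ** 2 for v in ly)
    return math.exp(log_a), b, 1.0 - ss_res / ss_tot


# Synthetic, noise-free data generated from y = 2 * x**-0.5.
sizes = [1e3, 1e4, 1e5, 1e6]
errors = [2.0 * s ** -0.5 for s in sizes]
a, b, r2 = fit_power_law(sizes, errors)
print(a, b, r2)
```

With real measurements the fit would be noisy, and a value such as 0.98 would indicate that the assumed form explains most, but not all, of the observed variation.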
7. Applications, Impact, and Future Directions
OMol25 enables:
- Large-scale and systematic scaling studies for MLIP architectures;
- Cross-domain, cross-architecture benchmarking under standardized protocols;
- Empirical analysis of symmetry-enforcing strategies (e.g., rotational augmentation vs. learned equivariance (Elhag et al., 25 Sep 2025));
- Accelerated high-throughput molecular screening, ML-accelerated ab initio MD, and property prediction across organics, biomolecules, metal complexes, and electrolytes;
- Development of charge/spin- and polarization-aware force fields for batteries, catalysis, drug discovery, and more.
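To make the rotational-augmentation item concrete, the sketch below applies one random rotation to both coordinates and forces, which is the basic operation such a training strategy relies on: the energy label stays the same while force vectors rotate with the structure. It is a generic illustration, not the procedure of the cited study. The function names are hypothetical, and the rotation is sampled from a normalized Gaussian quaternion, a standard way to draw rotations uniformly.

```python
import math
import random


def random_rotation(rng):
    """3x3 rotation matrix from a uniformly sampled unit quaternion."""
    w, x, y, z = (rng.gauss(0.0, 1.0) for _ in range(4))
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def rotate(matrix, vectors):
    """Apply the rotation matrix to each 3-vector in the list."""
    return [[sum(matrix[i][k] * v[k] for k in range(3)) for i in range(3)]
            for v in vectors]


def augment(positions, forces, rng):
    """Rotate positions and forces together; the energy label is unchanged."""
    r = random_rotation(rng)
    return rotate(r, positions), rotate(r, forces)


rng = random.Random(0)
pos = [[0.0, 0.0, 0.1173], [0.0, 0.7572, -0.4692], [0.0, -0.7572, -0.4692]]
frc = [[0.0, 0.0, -0.5], [0.0, 0.3, 0.25], [0.0, -0.3, 0.25]]  # toy values
new_pos, new_frc = augment(pos, frc, rng)

# Interatomic distances and force magnitudes are preserved by the rotation.
print(math.dist(pos[0], pos[1]), math.dist(new_pos[0], new_pos[1]))
print(math.hypot(*frc[1]), math.hypot(*new_frc[1]))
```

A model with built-in equivariance satisfies this relationship by construction, whereas an augmented model has to learn it from rotated copies of the data.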
Planned and suggested extensions include coverage of actinides, multimetallic clusters, computation of Hessians, spectroscopic observables, free-energy workflows, and expansion to new domains such as radical cascades and generative molecular design (Levine et al., 13 May 2025). OMol25 is positioned as an open, scalable resource for academic and industrial research communities aiming to construct the next generation of AI-driven molecular models and simulations.