OMol25: High-Precision Molecular DFT Dataset

Updated 4 October 2025

OMol25 is a large-scale DFT dataset offering high-precision quantum chemistry calculations across 83 million molecular systems with extensive chemical diversity.
It employs a consistent ωB97M-V/def2-TZVPD methodology to compute reliable properties such as energies, forces, and electronic descriptors, ensuring robust ML benchmarking.
Rigorous quality control and diverse sampling—from biomolecules to metal complexes—position OMol25 as a standard for developing generalizable ML interatomic potentials.

The Open Molecules 2025 (OMol25) dataset is a large-scale, high-precision quantum chemistry dataset that provides more than 100 million density functional theory (DFT) calculations at the ωB97M-V/def2-TZVPD level, representing billions of CPU core-hours of computation. OMol25 is designed to address the persistent challenge in molecular machine learning: the absence of a dataset that jointly delivers high-level DFT accuracy and extensive chemical diversity, thereby supporting the development and rigorous benchmarking of generalizable machine learning interatomic potentials (MLIPs).

1. Dataset Scope and Structure

OMol25 comprises approximately 83 million unique molecular systems, each represented by one or more DFT-calculated snapshots. Systems span a wide size range, from diatomics up to structures of 350 atoms, far larger than the systems in most existing DFT datasets. All properties were calculated with the range-separated hybrid meta-GGA functional ωB97M-V and the def2-TZVPD triple-zeta basis set, whose diffuse functions are essential for describing anionic species.

The dataset encapsulates both equilibrium and nonequilibrium configurations, supporting properties critical for molecular simulation:

Total energies
Per-atom forces
Partial atomic charges
Orbital energies (including highest occupied and lowest unoccupied levels)
Multipole moments
Additional auxiliary electronic descriptors

These properties were computed uniformly across the dataset to enable downstream training and evaluation of force fields, property predictors, and simulation tools.

2. Chemical and Structural Diversity

OMol25 is distinguished by breadth across the periodic table and chemical space:

Elements: Covers the first 83 elements (H through Bi), spanning main-group elements, transition metals, lanthanides, metalloids, and representative heavy elements. (The actinides begin at Z = 89 and therefore fall outside this range.)
System Types: Four principal domains ensure coverage:
- Biomolecules: Protein–ligand complexes, binding pocket fragments, protein–protein interfaces, and DNA/RNA fragments, with protonation/tautomeric state distributions generated via Epik and MD protocols.
- Metal Complexes: Produced both algorithmically via the Architector framework (sampling metals, ligands, oxidation/spin states, and coordination environments) and via extraction from the Crystallography Open Database (COD).
- Electrolytes: Both experimental mixtures (from battery literature) and randomly sampled clusters with diverse ionic and solvent compositions, including explicit out-of-distribution (OOD) ions.
- Community Structures: Structures from prior ML datasets (ANI-2X, Transition-1X, OrbNet Denali), recomputed at the OMol25 level of theory to preserve continuity with preexisting benchmarks.
Charge and Spin Multiplicities: Explicitly sampled (e.g., high-spin, low-spin states for transition metal complexes) to promote transferability across redox and electronically open-shell domains.
Configuration Diversity: Conformer ensembles and reactive snapshots (e.g., ring-opening, proton transfer, electron transfer intermediates) are included. Structures are sampled from both conventional MD and ring-polymer molecular dynamics (RPMD), a path-integral method used here mainly to capture the quantum fluctuations of light atoms.
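Sampling charge and spin multiplicity explicitly only makes sense for combinations that are physically allowed. As a minimal sketch (not part of the OMol25 tooling), the parity rule linking electron count and multiplicity can be checked as follows; the `ATOMIC_NUMBER` table is an illustrative subset and the function names are hypothetical.

```python
# Illustrative subset of atomic numbers; extend as needed.
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8, "Fe": 26}


def electron_count(symbols, charge):
    """Total electrons for a list of element symbols and a net charge."""
    return sum(ATOMIC_NUMBER[s] for s in symbols) - charge


def is_valid_multiplicity(symbols, charge, multiplicity):
    """Check that a spin multiplicity (2S + 1) is compatible with the electron count.

    The number of unpaired electrons is multiplicity - 1. It cannot exceed the
    total electron count, and the remaining electrons must pair up evenly.
    """
    n_electrons = electron_count(symbols, charge)
    n_unpaired = multiplicity - 1
    if n_electrons < 0 or multiplicity < 1:
        return False
    return n_unpaired <= n_electrons and (n_electrons - n_unpaired) % 2 == 0


# Neutral water has 10 electrons: singlet and triplet are allowed, doublet is not.
water = ["O", "H", "H"]
print(is_valid_multiplicity(water, 0, 1), is_valid_multiplicity(water, 0, 2))
```

For a bare Fe(III) ion (23 electrons), the same check accepts both the low-spin doublet and the high-spin sextet, which is the kind of pair the spin-state sampling above is meant to cover.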

3. Computational Protocols and Quality Control

The calculations ran on Meta’s Elastic Compute, using preemptible, heterogeneous clusters, and amounted to billions of core-hours. Dataset curation was enforced at every stage:

Force and Energy Screening: Snapshots whose maximum per-atom force exceeded 50 eV/Å, or whose energy fell outside a ±150 eV window, were discarded.
Spin Treatment: Expectation values of S² were monitored in open-shell calculations; calculations with severe spin contamination were removed.
Numerical Precision: The DEFGRID3 setting was used in ORCA 6.0.0 (590 angular points for exchange-correlation, 302 for COSX), reducing numerical inconsistencies between the energies and their gradients (forces).
Systematic Error Removal: A pretrained ML model was used to flag problematic cases in early data releases, for example, isolating metal centers converging to unintended electronic states.
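The force and energy screen above amounts to a simple per-snapshot filter. The sketch below is illustrative rather than the dataset's actual pipeline: it assumes "per-atom force" means the Euclidean norm of each atom's force vector, and that the ±150 eV window applies to a reference-subtracted energy, which the text does not specify. Function names are hypothetical.

```python
import math

MAX_FORCE_EV_PER_A = 50.0  # force threshold quoted in the text (eV/Å)
ENERGY_WINDOW_EV = 150.0   # half-width of the ±150 eV window quoted in the text


def max_force_norm(forces):
    """Largest per-atom force magnitude; forces is a list of (fx, fy, fz) in eV/Å."""
    return max(math.sqrt(fx * fx + fy * fy + fz * fz) for fx, fy, fz in forces)


def passes_screening(energy_ev, forces,
                     max_force=MAX_FORCE_EV_PER_A, window=ENERGY_WINDOW_EV):
    """Keep a snapshot only if both the force and the energy criteria hold.

    energy_ev is assumed (not confirmed by the source) to be measured
    relative to some reference energy.
    """
    return max_force_norm(forces) <= max_force and abs(energy_ev) <= window


snapshots = [
    {"energy": -12.0, "forces": [(0.0, 0.0, 1.0), (3.0, 4.0, 0.0)]},
    {"energy": -12.0, "forces": [(0.0, 0.0, 60.0)]},  # force too large
    {"energy": 200.0, "forces": [(0.1, 0.0, 0.0)]},   # energy outside window
]
kept = [s for s in snapshots if passes_screening(s["energy"], s["forces"])]
print(len(kept))
```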

For systems with broken bonds or high spin, calculations were explicitly performed in the unrestricted Kohn–Sham (UKS) formalism, ensuring correct energetics for radicals and transition states.
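Unrestricted calculations are where spin contamination arises, so the S² monitoring mentioned above is typically a comparison against the ideal value S(S + 1). The following is a minimal sketch of such a check; the source does not state what counts as "severe", so `tolerance` is left as a required argument and the value in the usage line is an arbitrary placeholder.

```python
def ideal_s_squared(multiplicity):
    """Exact <S^2> = S(S + 1) for a pure spin state with multiplicity 2S + 1."""
    s = (multiplicity - 1) / 2.0
    return s * (s + 1.0)


def spin_contamination(s_squared, multiplicity):
    """Deviation of the computed <S^2> from the ideal value for the target state."""
    return s_squared - ideal_s_squared(multiplicity)


def is_severely_contaminated(s_squared, multiplicity, tolerance):
    """Flag a calculation whose <S^2> deviates by more than a chosen tolerance.

    The tolerance is a user choice; the source gives no numeric cutoff.
    """
    return abs(spin_contamination(s_squared, multiplicity)) > tolerance


# A doublet should give <S^2> = 0.75; a computed 1.50 signals heavy mixing
# with higher spin states. The tolerance of 0.5 is a placeholder only.
print(is_severely_contaminated(1.50, 2, tolerance=0.5))
```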

4. Baseline Models and Benchmarks

OMol25 includes an array of robust baseline evaluations using recent state-of-the-art equivariant graph neural network (GNN) architectures:

eSEN (with multiple model sizes and hyperparameters)
GemNet-OC
MACE

All baseline models were made charge- and spin-aware by adding embeddings for the total charge and spin multiplicity of each system.
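One common way to condition a GNN on global quantities is to look up a learned vector for each value and add it to every atom's feature vector. The sketch below illustrates that idea only; it is not the baselines' actual implementation. The class name is hypothetical, and random vectors stand in for learned parameters.

```python
import random


class GlobalStateEmbedding:
    """Toy lookup tables mapping total charge and spin multiplicity to vectors.

    In a trained model the table entries would be learned; here they are
    random placeholders drawn with a fixed seed.
    """

    def __init__(self, dim, charges, multiplicities, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.charge_table = {
            q: [rng.gauss(0.0, 1.0) for _ in range(dim)] for q in charges
        }
        self.spin_table = {
            m: [rng.gauss(0.0, 1.0) for _ in range(dim)] for m in multiplicities
        }

    def __call__(self, atom_features, charge, multiplicity):
        """Return new per-atom features with both embeddings added to each atom."""
        q_vec = self.charge_table[charge]
        m_vec = self.spin_table[multiplicity]
        return [
            [x + q + m for x, q, m in zip(row, q_vec, m_vec)]
            for row in atom_features
        ]


embed = GlobalStateEmbedding(dim=4, charges=[-1, 0, 1], multiplicities=[1, 2, 3])
atoms = [[0.0] * 4, [1.0] * 4]       # two atoms, four features each
neutral_singlet = embed(atoms, 0, 1)
cation_doublet = embed(atoms, 1, 2)  # same geometry, different electronic state
print(neutral_singlet != cation_doublet)
```

Because the same vectors are added to every atom, the model can distinguish two copies of one geometry that differ only in charge or spin state, which a purely geometric input cannot do.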

Metrics emphasize generalization and transferability, with comprehensive reporting:

Model	Metric	Out-of-Distribution (OOD) Test Error
eSEN-md	Energy MAE (meV/atom)	~1–2
eSEN-md	Force MAE (meV/Å)	Comparable to energy MAE
GemNet-OC, MACE	Energy/Force MAE	Full comparisons reported in paper
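The two metrics in the table can be computed as below. This is a sketch under stated assumptions, since the exact averaging used in the benchmark is not given here: energy errors are normalized by atom count per structure and then averaged over structures, and force errors are averaged over all Cartesian components. Function names are hypothetical.

```python
def energy_mae_mev_per_atom(pred_ev, ref_ev, n_atoms):
    """Mean absolute per-atom energy error in meV/atom.

    pred_ev, ref_ev: total energies per structure in eV.
    n_atoms: number of atoms in each structure.
    """
    errors = [abs(p - r) / n for p, r, n in zip(pred_ev, ref_ev, n_atoms)]
    return 1000.0 * sum(errors) / len(errors)


def force_mae_mev_per_angstrom(pred, ref):
    """Mean absolute force-component error in meV/Å.

    pred, ref: lists of per-atom (fx, fy, fz) tuples in eV/Å, with the atoms
    of all structures concatenated in the same order.
    """
    diffs = [
        abs(pc - rc)
        for p_atom, r_atom in zip(pred, ref)
        for pc, rc in zip(p_atom, r_atom)
    ]
    return 1000.0 * sum(diffs) / len(diffs)


# Two structures with 10 and 20 atoms, off by 10 meV and 40 meV in total energy.
e_mae = energy_mae_mev_per_atom([-10.0, -20.0], [-10.01, -20.04], [10, 20])
f_mae = force_mae_mev_per_angstrom([(0.0, 0.0, 0.0)], [(0.003, 0.0, -0.003)])
print(round(e_mae, 3), round(f_mae, 3))
```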

Tasks extend beyond direct property prediction, incorporating:

Conformer ensemble ranking (using Boltzmann-weighted RMSD and ΔE)
Ligand strain (energy difference between local and global minima)
Protein–ligand interaction energy:

$E_{\text{interaction}} = E_{\text{complex}} - \left( E_{\text{ligand}} + E_{\text{receptor}} \right)$
Vertical ionization energy (IE) and electron affinity (EA) under fixed geometry
Spin-gap (energetic difference between high- and low-spin states)
Scaling of intermolecular forces with distance (enforcing correct $1/r$ and $1/r^6$ physics)
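Most of the tasks above reduce to differences between single-point energies. The sketch below spells them out. The sign conventions for IE, EA, strain, and spin gap are the usual ones and are assumed rather than taken from the source, and the function names are hypothetical. A Boltzmann-weight helper is included because the conformer task weights conformers by their relative energies.

```python
import math

K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def interaction_energy(e_complex, e_ligand, e_receptor):
    """E_interaction = E_complex - (E_ligand + E_receptor)."""
    return e_complex - (e_ligand + e_receptor)


def ligand_strain(e_local_min, e_global_min):
    """Energy of a local minimum relative to the global minimum (>= 0 if consistent)."""
    return e_local_min - e_global_min


def vertical_ie(e_neutral, e_cation):
    """Vertical ionization energy: both energies at the same fixed geometry."""
    return e_cation - e_neutral


def vertical_ea(e_neutral, e_anion):
    """Vertical electron affinity: both energies at the same fixed geometry."""
    return e_neutral - e_anion


def spin_gap(e_high_spin, e_low_spin):
    """Energy of the high-spin state relative to the low-spin state."""
    return e_high_spin - e_low_spin


def boltzmann_weights(energies_ev, temperature_k=298.15):
    """Normalized Boltzmann populations for a set of conformer energies in eV."""
    e_min = min(energies_ev)  # shift by the minimum for numerical stability
    kt = K_B_EV_PER_K * temperature_k
    factors = [math.exp(-(e - e_min) / kt) for e in energies_ev]
    total = sum(factors)
    return [f / total for f in factors]


# Illustrative numbers only (eV): a complex bound by 0.3 eV.
print(round(interaction_energy(-100.5, -40.2, -60.0), 3))
```

A Boltzmann-weighted RMSD would then be the sum of each conformer's RMSD multiplied by its weight from `boltzmann_weights`.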

Splits include “training,” “validation,” and stress-testing “out-of-distribution” test sets, directly measuring model ability to extrapolate to new chemistries and sizes.

5. Generation Workflows and Data Domains

Domain-specific protocols underpin the OMol25 sampling:

Architector Workflow: For metal complexes, metal centers are drawn at random from nearly all transition and main-group metals, coordination numbers are sampled according to experimental frequencies, and complexes are assembled from a curated library of 723 distinct ligands, maximizing diversity in denticity, coordinating atom identities, and total molecular size.
MD and RPMD Sampling: Generates chemically and structurally diverse geometries, especially for systems needing nuclear quantum effect coverage.
Community Datasets: Full recomputation ensures property alignment (level-of-theory, grid settings) with OMol25, even for externally sourced datasets.
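The metal-complex sampling described above can be pictured as three random draws: a metal, a coordination number weighted by how often it occurs, and ligands whose denticities fill that coordination number. The sketch below illustrates only this logic and does not call Architector. All data in it (metals, frequencies, the three-ligand library) are made-up placeholders, and the function name is hypothetical.

```python
import random

# Placeholder inputs: NOT the real OMol25 frequencies or ligand library.
EXAMPLE_METALS = ["Fe", "Cu"]
EXAMPLE_CN_FREQUENCIES = {"Fe": {4: 0.2, 6: 0.8}, "Cu": {4: 0.7, 5: 0.3}}
EXAMPLE_LIGANDS = [
    {"name": "water", "denticity": 1},
    {"name": "bipyridine", "denticity": 2},
    {"name": "terpyridine", "denticity": 3},
]


def sample_complex_spec(metals, cn_frequencies, ligand_library, rng):
    """Draw a metal, a frequency-weighted coordination number, and ligands."""
    metal = rng.choice(metals)
    cns = list(cn_frequencies[metal])
    weights = [cn_frequencies[metal][cn] for cn in cns]
    target_cn = rng.choices(cns, weights=weights, k=1)[0]

    ligands, remaining = [], target_cn
    while remaining > 0:
        # Only ligands that still fit in the open coordination sites.
        candidates = [l for l in ligand_library if l["denticity"] <= remaining]
        if not candidates:
            break  # cannot fill the remaining sites with this library
        ligand = rng.choice(candidates)
        ligands.append(ligand["name"])
        remaining -= ligand["denticity"]

    return {
        "metal": metal,
        "coordination_number": target_cn - remaining,
        "ligands": ligands,
    }


rng = random.Random(0)  # fixed seed so the draw is reproducible
print(sample_complex_spec(EXAMPLE_METALS, EXAMPLE_CN_FREQUENCIES, EXAMPLE_LIGANDS, rng))
```

A real workflow would additionally sample oxidation and spin states and hand the resulting specification to a 3D structure builder.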

All calculations use the same ωB97M-V/def2-TZVPD protocol, which keeps property labels uniform across the chemical and structural diversity covered.

6. Technical Innovations, Impact, and Community Role

OMol25 introduces several technical advances:

Scale: Unprecedented dataset size using high-level quantum chemical methods, scaling up the number and size of systems by 1–2 orders of magnitude compared to previous datasets.
Diversity: Simultaneous coverage of organic, biomolecular, inorganic, and electrolyte chemistries, including underexplored elements and molecular classes.
Quality Control: Stringently defined numeric thresholds and checks, as well as ML-based systematic error detection.
Benchmark Standardization: Facilitates fair comparison for next-generation MLIP methods, providing rigorous split definitions and domain-adapted challenge tasks.

The dataset’s broad chemical and structural coverage positions it as a foundation for ML models with improved transferability—including across subfields such as drug discovery, enzymology, batteries, and heterogeneous/organometallic catalysis. The high-quality, multi-property labels also underpin secondary evaluation metrics such as conformer energies, charge reorganization (IE/EA), binding/free energies, and reactivity.

7. Comparison with Prior Datasets

OMol25 significantly extends prior molecule-focused DFT datasets, such as QM9 (small organic molecules with up to 9 heavy atoms) and community ML datasets (ANI-2X, Transition-1X, OrbNet Denali):

Dataset	Max Atoms/System	Elements	Domains Covered	DFT Level
OMol25	350	83	Organic, bio, inorg, electrolyte	ωB97M-V/def2-TZVPD
QM9	29 (≤9 heavy atoms)	5	Small organics	B3LYP/6-31G(2df,p)
OMC25	300 (crystals)	12	Molecular crystals	PBE-D3/VASP
ANI/Transition/OrbNet	<30–100	≤10	Organics, reactions	Various

Distinctive OMol25 features include the use of a single consistent DFT protocol for all calculations, much broader element inclusion, and rigorous inclusion of charge/spin diversity and non-equilibrium configurations.

8. Outlook and Community Use

OMol25 serves as a new standard for benchmarking and pretraining ML molecular models, offering chemical and physical diversity for the exploration of scaling laws and transferability in MLIPs. Its annotated splits and evaluation standards provide a reproducible substrate for community-driven advances in ML for molecular chemistry and materials science. The inclusion of reference implementations and baseline models encourages open participation and accelerates progress in the development of robust, accurate, and generalizable ML-based force fields and property predictors.
