QCML Dataset: Quantum Chemistry ML Corpus
- QCML Dataset is a large-scale quantum-chemistry machine learning corpus featuring 33.5M molecular geometries computed at a hybrid DFT level with explicit dispersion corrections.
- It encompasses diverse chemical systems spanning nearly the entire periodic table (Z<86) and multiple conformers per molecular graph, facilitating robust ML force field and MD model development.
- The dataset underpins foundation model pretraining and advanced spectroscopic property prediction, driving innovation in computational chemistry and materials science.
The QCML dataset is a quantum-chemistry machine learning corpus of large scale and broad chemical scope, serving as a backbone for general-purpose ML force fields, molecular dynamics (MD) modeling, and spectroscopic property prediction. Comprising up to 33.5 million structurally diverse molecular geometries computed at the hybrid DFT level with explicit dispersion corrections, QCML is used to benchmark representation-learning architectures such as transformers, equivariant GNNs, and kernel methods across chemical, elemental, and structural regimes. Its downstream ecosystem includes force-field pretraining (MD-ET, MACE4IR), biomolecule and fragment expansion (QCell), and integration with multi-fidelity and advanced quantum datasets, which positions QCML as a foundational resource for data-driven computational chemistry and materials science.
1. Dataset Composition and Chemical Coverage
The QCML dataset encompasses approximately 33.5 million reference quantum-chemistry entries, each corresponding to a distinct molecular geometry. Calculations were performed at the PBE0+D3 (dispersion-corrected) hybrid DFT level, using all-electron numeric atom-centered orbitals and tight convergence criteria (Eissler et al., 3 Mar 2025). The molecules span the periodic table up to $Z<86$ (excluding only the heaviest elements), with significant representation of transition metals, main-group elements, organic and inorganic compounds, atmospheric molecules, charged species, species with different spin multiplicities, and biologically relevant molecules (Bhatia et al., 26 Aug 2025).
Each molecular graph is typically associated with multiple conformers, encompassing both equilibrium geometries and out-of-equilibrium structures generated by normal-mode distortions or finite-temperature sampling (Eissler et al., 3 Mar 2025). Molecule sizes range from a few atoms to several dozen.
| Statistic | Value (QCML core) | Element Coverage |
|---|---|---|
| # structures | ∼33.5 million | ~80 elements (H–Rn, Z<86) |
| DFT level | PBE0+D3 | Main group, transition metals, anions, cations |
| Typical size | Few–dozens of atoms | H, C, N, O, P, S dominant; broad periodic table |
| Geometries per graph | Multiple (exact count not reported) | Conformers, normal modes, distortions |
The QCML data pool forms the backbone for pretraining large foundation models such as MD-ET (Eissler et al., 3 Mar 2025) and MACE4IR (Bhatia et al., 26 Aug 2025). Filtering for neutrality (zero net charge), singlet spin, and outlier flags yields ~15.6 million high-quality DFT geometries, from which fixed-size training/validation/test splits (10M/100k/100k in MACE4IR) were generated (Bhatia et al., 26 Aug 2025).
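The filtering step can be illustrated with a minimal sketch. The record layout and field names (`graph_id`, `charge`, `multiplicity`, `is_outlier`) are hypothetical and chosen for illustration only; the actual QCML schema may differ.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical per-structure metadata; field names are assumptions.
    graph_id: str
    charge: int
    multiplicity: int  # 2S + 1, so a singlet has multiplicity 1
    is_outlier: bool


def keep(rec: Record) -> bool:
    """Keep neutral, singlet structures that are not flagged as outliers."""
    return rec.charge == 0 and rec.multiplicity == 1 and not rec.is_outlier


records = [
    Record("g1", 0, 1, False),   # kept
    Record("g1", 0, 1, True),    # dropped: outlier flag
    Record("g2", -1, 1, False),  # dropped: anion
    Record("g3", 0, 3, False),   # dropped: triplet
]
filtered = [r for r in records if keep(r)]
```

Applied to the full pool, a filter of this kind is what would reduce the ~33.5M entries to the ~15.6M neutral singlet geometries quoted above.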
2. Quantum-Chemical Computations and Reference Properties
All structures in QCML are annotated with quantum-chemically computed total energies $E$ and atom-resolved forces $\mathbf{F}_i$, defined as
$$\mathbf{F}_i = -\frac{\partial E}{\partial \mathbf{r}_i},$$
where $\mathbf{r}_i$ is the position of atom $i$ (Bhatia et al., 26 Aug 2025). Dipole moments are also included for each geometry. Force and energy labels are computed directly at the DFT level; no energy conservation or differentiability is assumed for learning purposes in the core MD-ET workflow (Eissler et al., 3 Mar 2025).
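To make the relation between energies and forces concrete, the sketch below checks that analytic forces equal the negative energy gradient, using central finite differences. The harmonic two-atom potential is a toy stand-in chosen for illustration, not a DFT energy, and all function names are hypothetical.

```python
import numpy as np


def energy(pos, k=1.0, r_eq=1.0):
    """Toy harmonic bond between two atoms (not a DFT energy)."""
    d = np.linalg.norm(pos[1] - pos[0])
    return 0.5 * k * (d - r_eq) ** 2


def analytic_forces(pos, k=1.0, r_eq=1.0):
    """F_i = -dE/dr_i for the toy potential, returned as [F_0, F_1]."""
    diff = pos[1] - pos[0]
    d = np.linalg.norm(diff)
    g = k * (d - r_eq) * diff / d  # dE/dr_1
    return np.stack([g, -g])       # F_0 = +g, F_1 = -g


def numerical_forces(pos, h=1e-5):
    """Central finite-difference estimate of -dE/dr_i."""
    f = np.zeros_like(pos)
    for i in range(pos.shape[0]):
        for a in range(3):
            plus, minus = pos.copy(), pos.copy()
            plus[i, a] += h
            minus[i, a] -= h
            f[i, a] = -(energy(plus) - energy(minus)) / (2 * h)
    return f


pos = np.array([[0.0, 0.0, 0.0], [1.3, 0.2, -0.1]])
f_analytic = analytic_forces(pos)
f_numeric = numerical_forces(pos)
```

For an exact gradient the forces also sum to zero over atoms, which is the property that net-force diagnostics on DFT labels rely on.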
DFT calculations utilize FHI-aims ("tier-2", tight integration grid), with ZORA scalar relativity for heavy elements, geometry optimization converged to 1 meV/Å in training subsets, and explicit many-body dispersion (D3) correction (Bhatia et al., 26 Aug 2025).
3. Data Preprocessing, Filtering, and Splits
Minimal preprocessing is applied beyond explicit grouping of conformers by molecular graph, ensuring that all conformers for a molecular graph are assigned to the same split (train/val/test) to avoid data leakage (Eissler et al., 3 Mar 2025). Energy/force normalization is not performed during MD-ET pretraining; elemental or molecular-class stratification is implicit in random draws (Bhatia et al., 26 Aug 2025).
Data augmentation for model pretraining is performed via group operations: for each mini-batch of $N=512$ structures, one duplicate batch is subjected to a random rotation and reflection, yielding an augmented batch of $2N=1024$ structures and encouraging approximate rotational equivariance in the trained architecture (Eissler et al., 3 Mar 2025).
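A minimal NumPy sketch of this kind of augmentation follows. It assumes a single orthogonal matrix is drawn per batch via QR decomposition of a Gaussian matrix and applied to both positions and force labels; the sampling scheme, the per-batch (rather than per-structure) draw, and the function names are assumptions, not details confirmed by the source.

```python
import numpy as np


def random_orthogonal(rng):
    """Sample a 3x3 orthogonal matrix (a rotation, possibly with a reflection)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    return q * np.sign(np.diag(r))  # fix column signs so the draw is uniform


def augment(positions, forces, rng):
    """Duplicate a batch and transform the copy, returning a batch of size 2N.

    positions, forces: arrays of shape (N, n_atoms, 3).
    """
    q = random_orthogonal(rng)
    pos_t = positions @ q.T  # row vectors: x -> Q x
    frc_t = forces @ q.T     # forces transform like positions
    return np.concatenate([positions, pos_t]), np.concatenate([forces, frc_t])


rng = np.random.default_rng(0)
positions = rng.normal(size=(4, 5, 3))  # small demo batch, N = 4
forces = rng.normal(size=(4, 5, 3))
aug_pos, aug_frc = augment(positions, forces, rng)
```

Transforming the labels together with the inputs is what lets a non-equivariant architecture learn the symmetry from data instead of having it built in.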
The canonical split is 90%/5%/5% by molecular graph for train/val/test, corresponding to 27M/1.5M/1.5M structures for the ∼30M subset used in MD-ET (Eissler et al., 3 Mar 2025); for filtered singlet-neutral data in MACE4IR, the split is 10M/100k/100k (Bhatia et al., 26 Aug 2025).
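The graph-grouped split described above can be sketched as follows. This is an illustrative implementation under the assumption that each structure carries a molecular-graph identifier; the function name, the seed, and the use of a shuffled list of unique identifiers are hypothetical choices.

```python
import random
from collections import defaultdict


def split_by_graph(graph_ids, fractions=(0.90, 0.05, 0.05), seed=0):
    """Assign structure indices to train/val/test so that all conformers
    sharing a graph id land in the same split."""
    unique = sorted(set(graph_ids))
    random.Random(seed).shuffle(unique)
    n_train = round(fractions[0] * len(unique))
    n_val = round(fractions[1] * len(unique))
    assignment = {}
    for k, g in enumerate(unique):
        if k < n_train:
            assignment[g] = "train"
        elif k < n_train + n_val:
            assignment[g] = "val"
        else:
            assignment[g] = "test"
    splits = defaultdict(list)
    for idx, g in enumerate(graph_ids):
        splits[assignment[g]].append(idx)
    return dict(splits)


# Demo: 100 graphs with 3 conformers each.
graph_ids = [f"g{i // 3}" for i in range(300)]
splits = split_by_graph(graph_ids)
```

Because the fractions apply to graphs rather than structures, the structure-level proportions match 90/5/5 only approximately when graphs have unequal conformer counts.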
4. Input Representations and Edge-Transformer Architecture
The primary ML input is the full atomic geometry, annotated by atomic numbers $Z_i$, formal charges $q_i$, spins $s_i$, and Cartesian positions $\mathbf{r}_i$. Model architectures tokenize each pair of atoms $(i, j)$, including self-loops, into edge features (Eissler et al., 3 Mar 2025):
- Atomic embeddings: One-hot encodings of atomic number, charge, and spin, mapped through linear layers, plus a precomputed electron-configuration embedding.
- Distance embedding: The interatomic distance $d_{ij} = \lVert \mathbf{r}_j - \mathbf{r}_i \rVert$ is mapped to 128 radial basis functions (RBFs), which are fed to a small MLP.
- Direction embedding: The spherical angles of the interatomic direction are mapped to 128 Fourier features, followed by a linear layer.
The final edge token is the sum of these three embeddings, which is passed to a stack of Edge Transformer layers (12 layers, 12-head attention in MD-ET) (Eissler et al., 3 Mar 2025).
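The geometric part of the edge tokenization can be sketched in NumPy. Only the feature count of 128 comes from the text; the Gaussian RBF form, the cutoff `r_max`, the Fourier frequencies, the token width `d_model`, and the random linear maps standing in for learned layers are all assumptions. The atomic embeddings are omitted, and a single linear map replaces the small MLP.

```python
import numpy as np


def edge_tokens(pos, d_model=64, n_feat=128, r_max=10.0, seed=0):
    """Build (n, n, d_model) edge tokens from pairwise distances and directions.

    Random weights stand in for learned parameters (illustration only).
    """
    rng = np.random.default_rng(seed)
    diff = pos[None, :, :] - pos[:, None, :]  # (n, n, 3), entry [i, j] = r_j - r_i
    dist = np.linalg.norm(diff, axis=-1)      # (n, n), zero on self-loops

    # Distance embedding: Gaussian RBFs on a uniform grid of centers.
    centers = np.linspace(0.0, r_max, n_feat)
    width = centers[1] - centers[0]
    rbf = np.exp(-(((dist[..., None] - centers) / width) ** 2))  # (n, n, 128)

    # Direction embedding: Fourier features of the spherical angles.
    # Self-loops have no direction; this convention gives theta = pi/2, phi = 0.
    safe = np.where(dist > 0, dist, 1.0)
    theta = np.arccos(np.clip(diff[..., 2] / safe, -1.0, 1.0))
    phi = np.arctan2(diff[..., 1], diff[..., 0])
    freqs = np.arange(1, n_feat // 4 + 1)  # 32 frequencies -> 4 * 32 = 128 features
    fourier = np.concatenate(
        [
            np.sin(theta[..., None] * freqs),
            np.cos(theta[..., None] * freqs),
            np.sin(phi[..., None] * freqs),
            np.cos(phi[..., None] * freqs),
        ],
        axis=-1,
    )  # (n, n, 128)

    w_dist = rng.normal(size=(n_feat, d_model)) / np.sqrt(n_feat)
    w_dir = rng.normal(size=(n_feat, d_model)) / np.sqrt(n_feat)
    return rbf @ w_dist + fourier @ w_dir  # summed edge token


demo_pos = np.random.default_rng(1).normal(size=(5, 3))
tokens = edge_tokens(demo_pos)
```

Note that the token count grows quadratically with the number of atoms, since every ordered pair, including each self-loop, receives its own token.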
5. Supervised Training Objective and Loss Metrics
The pretraining objective for MD-ET and similar architectures is pure force regression. The model predicts a force vector $\hat{\mathbf{F}}_i$ for each atom $i$ and is trained with a regression loss on its deviation from the reference DFT force $\mathbf{F}_i$. No energy prediction or force-energy weighting is applied. MAE and RMSE on test-set forces are the primary evaluation metrics:
- Final test MAE on QCML: 0.69 kcal mol⁻¹ Å⁻¹ (MD-ET, Table 1).
- Approximate equivariance error: kcal mol⁻¹ Å⁻¹ (no frame-averaging), $0.02$ kcal mol⁻¹ Å⁻¹ (4-frame average), which is two orders of magnitude below typical force magnitudes (Eissler et al., 3 Mar 2025).
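The metrics in this list can be sketched as below. The component-wise averaging convention and the definition of the equivariance error as the mean absolute deviation between `model(R x)` and `R model(x)` are assumptions, and the two toy models are hypothetical stand-ins for a trained network.

```python
import numpy as np


def force_mae(pred, ref):
    """Mean absolute error over all force components."""
    return float(np.mean(np.abs(pred - ref)))


def force_rmse(pred, ref):
    """Root-mean-square error over all force components."""
    return float(np.sqrt(np.mean((pred - ref) ** 2)))


def equivariance_error(model, pos, rot):
    """Mean absolute deviation between model(R x) and R model(x)."""
    return force_mae(model(pos @ rot.T), model(pos) @ rot.T)


def equivariant_model(pos):
    """Toy model: restoring force toward the centroid (exactly equivariant)."""
    return -(pos - pos.mean(axis=0))


def biased_model(pos):
    """Toy model with a fixed bias along x, which breaks equivariance."""
    return equivariant_model(pos) + np.array([0.1, 0.0, 0.0])


# 90-degree rotation about the z axis.
rot = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
pos = np.random.default_rng(0).normal(size=(6, 3))
err_equivariant = equivariance_error(equivariant_model, pos, rot)
err_biased = equivariance_error(biased_model, pos, rot)
```

Frame averaging corresponds to averaging the back-rotated predictions over several such transformations before comparing.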
QCML's original DFT force labels have measurable residual errors due to SCF and integration convergence. Refined computation using TightSCF and denser grids suggests a per-atom net force mean of 0.45 meV/Å, RMSE 1.10 meV/Å, on par with SPICE but above ANI-1x quality standards (Kuryla et al., 22 Oct 2025).
6. Statistical Summaries, Elemental and Structural Distributions
While the MD-ET and MACE4IR sources do not publish exhaustive atom-count histograms, element frequencies, or full energy/force distributions (mean, variance), they establish that organic (H/C/N/O/P/S) systems predominate numerically, but almost all elements with $Z<86$ appear in the dataset (Bhatia et al., 26 Aug 2025). More granular statistics (elemental breakdown, atom count per molecule, etc.) are available in companion QCML or QCell publications.
Learning-curve analyses on filtered QCML subsets show decreasing MAEs as training set size increases; the best MACE4IR models (Large/float64, 10M samples) achieve 2.1 meV/atom (energy), 30 meV/Å (force), and 23 me·Å (dipole moment) on a held-out 100k test set (Bhatia et al., 26 Aug 2025).
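Because this article quotes force errors in both kcal mol⁻¹ Å⁻¹ and meV/Å, a short unit conversion helps when reading the numbers side by side. The sketch below uses the exact SI values of the Avogadro constant and the elementary charge together with the thermochemical calorie; the variable names are illustrative. The MD-ET and MACE4IR figures come from different test sets and subsets, so the converted values are not a like-for-like comparison.

```python
AVOGADRO = 6.02214076e23             # 1/mol (exact in the 2019 SI)
ELEMENTARY_CHARGE = 1.602176634e-19  # C (exact in the 2019 SI)
JOULE_PER_KCAL = 4184.0              # thermochemical calorie

# 1 kcal/mol expressed in meV per particle.
MEV_PER_KCAL_MOL = JOULE_PER_KCAL / (AVOGADRO * ELEMENTARY_CHARGE) * 1000.0


def kcal_mol_to_mev(x):
    """Convert an energy (or force per Angstrom) from kcal/mol to meV."""
    return x * MEV_PER_KCAL_MOL


# MD-ET test MAE of 0.69 kcal mol^-1 A^-1 expressed in meV/A.
md_et_mae_mev = kcal_mol_to_mev(0.69)
```

With 1 kcal/mol ≈ 43.36 meV, the 0.69 kcal mol⁻¹ Å⁻¹ force MAE corresponds to roughly 29.9 meV/Å.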
7. Impact: Downstream Uses and Limitations
QCML serves as the reference corpus for general-purpose, element-agnostic ML force fields such as MD-ET (Eissler et al., 3 Mar 2025) and MACE4IR (Bhatia et al., 26 Aug 2025), and as a pretraining backbone for biomolecule-augmented datasets. QCell, for example, adds 525k high-fidelity calculations for nucleic acids, lipids, saccharides, and ion clusters to reach 41M unified points (Kabylda et al., 11 Oct 2025). Downstream, these models achieve state-of-the-art performance on canonical MD, IR, and transfer-learning tasks with minimal fine-tuning, saturating at low error with only 10M pretraining samples (Bhatia et al., 26 Aug 2025).
Known limitations and corrections include nonzero net forces due to moderate convergence thresholds and approximate numerical integration in the original QCML entries. Enhanced SCF convergence, exclusion of high-net-force structures, and publication of per-structure quality metrics are recommended when targeting sub-meV/Å accuracy in MLIP development (Kuryla et al., 22 Oct 2025).
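A net-force screen of the kind recommended here can be sketched as follows. The per-atom normalization (magnitude of the summed force vector divided by the atom count) and the threshold value are assumptions made for illustration; the cited study may define its diagnostic differently.

```python
import numpy as np


def net_force_per_atom(forces):
    """Magnitude of the summed force vector divided by the number of atoms.

    forces: (n_atoms, 3) array for a single structure.
    """
    return float(np.linalg.norm(forces.sum(axis=0)) / forces.shape[0])


def filter_by_net_force(structures, threshold):
    """Keep structures whose per-atom net force does not exceed the threshold."""
    return [f for f in structures if net_force_per_atom(f) <= threshold]


balanced = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])   # sums to zero
unbalanced = np.array([[3.0, 0.0, 0.0], [0.0, 4.0, 0.0]])  # net force of magnitude 5
kept = filter_by_net_force([balanced, unbalanced], threshold=1.0)  # hypothetical threshold
```

For an isolated molecule the exact forces sum to zero, so any residual measured this way reflects numerical error in the labels and not physics.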
References
- (Eissler et al., 3 Mar 2025) "How simple can you go? An off-the-shelf transformer approach to molecular dynamics"
- (Bhatia et al., 26 Aug 2025) "MACE4IR: A foundation model for molecular infrared spectroscopy"
- (Kuryla et al., 22 Oct 2025) "How Accurate Are DFT Forces? Unexpectedly Large Uncertainties in Molecular Datasets"
- (Kabylda et al., 11 Oct 2025) "QCell: Comprehensive Quantum-Mechanical Dataset Spanning Diverse Biomolecular Fragments"
Further technical details and raw statistics can be found in the primary QCML and QCell publication series.