SPICE-v2 Quantum Chemistry Dataset

Updated 15 August 2025

SPICE-v2 is a large-scale quantum chemical dataset that nearly doubles molecule count and enhances compositional diversity for broad chemical coverage.
It systematically samples non-covalent and charge-sensitive interactions, enabling machine learning potentials to accurately model subtle energy gradients.
Nutmeg models trained on SPICE-v2 achieve energy prediction errors below 1 kcal/mol, supporting robust molecular simulations for varied chemical systems.

The SPICE-v2 dataset is a large-scale quantum chemical dataset designed for training and benchmarking machine learning potentials across a broad swath of organic and bioorganic chemical space. Developed as an extension and enrichment of the original SPICE quantum chemistry dataset, SPICE-v2 introduces significant increases in both compositional diversity and physical interaction types sampled, with model applications focused on accurate energy prediction and molecular simulation, especially for charged, polar, and non-covalently interacting systems (Eastman et al., 2024).

1. Chemical and Data Space Expansion

SPICE-v2 expands on the original SPICE dataset by nearly doubling the number of molecules and greatly increasing the diversity of chemistries, molecular sizes, and element types represented. The dataset now comprises over 20,000 molecules and roughly 2 million conformations. Notable highlights include:

Inclusion of 9,913 additional PubChem molecules and integration of previously omitted elements—boron (1,562 molecules) and silicon (1,952 molecules)—yielding 17 element types overall and an atom count range from 2 to 110 per molecule.
Specialized subsets targeting important biomolecular and pharmaceutical scenarios: PubChem drug-like molecules (28,039 molecules, 1,398,566 conformations), “Ligand, Amino Acid Pairs” (194,174 conformations), “Solvated PubChem Molecules”, and “Water Clusters”.

This expansion provides a multiplicative increase in accessible chemical space, both for covalently bound species and systems dominated by non-covalent interactions. The data organization extends prior quantum chemical collection methodologies by balancing the conformation count, atom span, and elemental diversity to address broader chemical and biophysical modeling requirements.

2. Non-Covalent Interaction Sampling

A major innovation in SPICE-v2 is its detailed focus on non-covalent interactions. The dataset now contains several data partitions constructed specifically to model weak, long-range, and solvent-mediated interactions:

The “Ligand, Amino Acid Pairs” subset represents nearly 200,000 conformations derived from combinatorial pairings between small molecule ligands (from Ligand Expo) and proteinogenic amino acids, sampling typical protein–ligand contacts.
“Solvated PubChem Molecules” and “Water Clusters” subsets include molecular systems in bulk and cluster solvent environments, where energies are especially sensitive to hydrogen bonding and electrostatic screening.

By providing explicit quantum chemical sampling in these scenarios, SPICE-v2 allows machine learning potentials to fit subtle energy gradients, enabling models trained on it to reproduce thermophysical and interaction properties that are essential for biomolecular, pharmaceutical, and materials simulations.

3. Nutmeg Model Architecture and Charge Injection Mechanism

SPICE-v2 underpins the training of the Nutmeg family of models—Nutmeg-small, Nutmeg-medium, Nutmeg-large—based on the TensorNet architecture, an equivariant message-passing network with hierarchical receptive fields determined by cutoff radii and interaction layer count.

A key innovation in Nutmeg models is the explicit injection of precomputed partial charges. Rather than relying on atomic numbers and coordinates alone, each atom's one-hot element vector $e_\text{atom}$ is concatenated with its Gasteiger partial charge $q$. The input for an atom is thus:

$\text{feature}_\text{atom} = [e_\text{atom}; q]$

This vector undergoes linear mixing via an embedding transformation $W$:

$E_\text{atom} = W \cdot [e_\text{atom}; q]$

This structure supplies Nutmeg models with a reference distribution for large-scale charge effects, enabling markedly improved performance on charged, polar, and highly heterogeneous systems. The approach does not explicitly model global Coulomb interactions at inference, relying instead on the informativeness of the input charge descriptor.
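The charge-injected embedding described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Nutmeg implementation: the element vocabulary `ELEMENTS`, the embedding width `embed_dim`, and the random weights are all assumptions made for the example.

```python
import random

# Hypothetical, shortened element vocabulary (the dataset itself covers 17 element types).
ELEMENTS = ["H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl"]

def one_hot(symbol):
    """Return the one-hot element vector e_atom for a chemical symbol."""
    return [1.0 if s == symbol else 0.0 for s in ELEMENTS]

def atom_feature(symbol, charge):
    """Concatenate the one-hot element vector with the partial charge q."""
    return one_hot(symbol) + [charge]

def embed(feature, weights):
    """Apply the linear map E_atom = W . [e_atom; q]; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

rng = random.Random(0)
embed_dim = 4                      # illustrative embedding width
n_in = len(ELEMENTS) + 1           # one-hot entries plus one charge entry
W = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(embed_dim)]

neutral = embed(atom_feature("O", 0.0), W)   # oxygen with q = 0
charged = embed(atom_feature("O", -0.4), W)  # same element, q = -0.4
```

Because the map is linear, the two embeddings differ by exactly $q$ times the last column of $W$, which is how a single scalar charge shifts the learned representation of an otherwise identical element.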

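Additionally, Nutmeg integrates a Ziegler–Biersack–Littmark (ZBL) short-range repulsive potential, activated via a cutoff function at low interatomic separations:

$E(r) = \tilde{U}(r) \cdot \text{ZBL}(r)$

where $\tilde{U}(r)$ is the cutoff function and

$\text{ZBL}(r) = \frac{1}{4\pi\varepsilon_0} \frac{Z_1 Z_2 e^2}{r} \phi(r/a)$

Here $Z_1$ and $Z_2$ are the atomic numbers of the two atoms, $e$ is the elementary charge, $\phi$ is the ZBL screening function, and $a$ is the screening length.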
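As a rough sketch of how such a term behaves, the code below evaluates a ZBL repulsion using the standard published universal screening function and screening length. The cosine switching function and the value of `r_cut` are assumptions for illustration only, since the text above does not specify the cutoff form used in Nutmeg.

```python
import math

COULOMB = 138.935458   # e^2 / (4 pi eps0) in kJ mol^-1 nm
BOHR_NM = 0.0529177    # Bohr radius in nm

def screening(x):
    """ZBL universal screening function phi(x)."""
    return (0.1818 * math.exp(-3.2 * x)
            + 0.5099 * math.exp(-0.9423 * x)
            + 0.2802 * math.exp(-0.4029 * x)
            + 0.02817 * math.exp(-0.2016 * x))

def zbl(r, z1, z2):
    """Screened nuclear repulsion (kJ/mol) between atomic numbers z1, z2 at distance r (nm)."""
    a = 0.8854 * BOHR_NM / (z1 ** 0.23 + z2 ** 0.23)  # screening length
    return COULOMB * z1 * z2 / r * screening(r / a)

def cosine_cutoff(r, r_cut):
    """Assumed switching function: 1 at r = 0, decaying smoothly to 0 at r_cut."""
    if r >= r_cut:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_cut) + 1.0)

def repulsion(r, z1, z2, r_cut=0.2):
    """E(r) = cutoff(r) * ZBL(r); r_cut (nm) is an illustrative value, not from the source."""
    return cosine_cutoff(r, r_cut) * zbl(r, z1, z2)

# Carbon-oxygen pair at two short separations
e_close = repulsion(0.05, 6, 8)
e_far = repulsion(0.10, 6, 8)
```

The term grows steeply as two nuclei approach and vanishes at and beyond `r_cut`, so it acts only where the learned potential has seen little or no training data.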

4. Model Evaluation and Generalization

Nutmeg models trained on SPICE-v2 report mean absolute energy errors (MAE) on the validation set well below the “chemical accuracy” threshold of 1 kcal/mol ($4.184$ kJ/mol), reaching even greater accuracy when considering energy differences between molecular conformations:

Error metrics for energy differences (mean-subtracted per molecule) show further reduction, critical for thermodynamic ranking applications.
Testing encompasses out-of-sample molecules, including small drug-like systems (40–80 atoms), peptides, and protein–ligand dimers.
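The distinction between absolute and mean-subtracted errors can be made concrete with a small sketch. The function names and the numbers below are made up for illustration and are not results from the dataset or the Nutmeg models.

```python
KJ_PER_KCAL = 4.184  # chemical accuracy threshold: 1 kcal/mol in kJ/mol

def mae(values):
    """Mean absolute value of a list of errors."""
    return sum(abs(v) for v in values) / len(values)

def mean_subtracted(values):
    """Remove the mean so that only relative differences remain."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def energy_errors(reference, predicted):
    """Both arguments map molecule id -> list of conformer energies (kJ/mol).

    Returns (absolute MAE, per-molecule mean-subtracted MAE).
    """
    abs_err, rel_err = [], []
    for mol, ref in reference.items():
        diffs = [p - r for p, r in zip(predicted[mol], ref)]
        abs_err.extend(diffs)
        rel_err.extend(mean_subtracted(diffs))
    return mae(abs_err), mae(rel_err)

# Made-up energies: each molecule carries a large constant offset in the prediction.
reference = {"mol_a": [0.0, 5.0, 12.0], "mol_b": [0.0, 3.0]}
predicted = {"mol_a": [10.0, 15.5, 21.5], "mol_b": [-4.0, -1.0]}
absolute, relative = energy_errors(reference, predicted)
```

In this toy case the absolute MAE is 7.6 kJ/mol while the mean-subtracted MAE is 0.2 kJ/mol: a constant per-molecule offset inflates the first metric but leaves conformer ranking essentially untouched.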

Across these benchmarks, Nutmeg-large in particular demonstrates energy ranking errors up to an order of magnitude lower than the actual conformation energy differences, even for species larger and more highly charged than those in the training set. On charged molecules, absolute energy offsets scale with charge state, but the ranking of conformer energies (the usual target in predictive simulation) remains robust, confirming the utility of precomputed charge feature injection.

5. Molecular Dynamics Utility

Nutmeg models trained on SPICE-v2 are assessed for their ability to produce stable, physically sound molecular dynamics trajectories. Salient findings include:

Short (10-ps) MD simulations confirm no bond distortions, physically reasonable temperature maintenance, and capped force magnitudes (< 5000 kJ/mol/nm).
The ZBL potential prevents collapse at close range, precluding unphysical low-energy states and integration errors typical in force fields lacking hard-core repulsion.
Computational demands allow force calculation in milliseconds per step, rendering these potentials suitable for routine small molecule and peptide simulation.
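A force-magnitude check of the kind listed above might look like the following sketch. It assumes the 5000 kJ/mol/nm threshold applies to the norm of each per-atom force vector, which the text does not state explicitly, and the helper names are hypothetical.

```python
import math

FORCE_CAP = 5000.0  # kJ/mol/nm, threshold quoted in the text

def max_force_magnitude(forces):
    """forces: list of (fx, fy, fz) tuples in kJ/mol/nm for one frame."""
    return max(math.sqrt(fx * fx + fy * fy + fz * fz) for fx, fy, fz in forces)

def trajectory_is_stable(frames, cap=FORCE_CAP):
    """True if every atom in every frame stays below the force cap."""
    return all(max_force_magnitude(frame) < cap for frame in frames)

# Two made-up frames of a three-atom system
calm = [(120.0, -80.0, 40.0), (-300.0, 10.0, 0.0), (180.0, 70.0, -40.0)]
spike = [(120.0, -80.0, 40.0), (6000.0, 0.0, 0.0), (180.0, 70.0, -40.0)]
```

Here `trajectory_is_stable([calm])` passes while `trajectory_is_stable([calm, spike])` fails, flagging the frame in which one atom experiences an unphysically large force.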

However, performance for bulk solvent systems is limited, as indicated by imperfect reproduction of the water radial distribution function. The Nutmeg models are thus best suited for discrete molecules, clusters, and biological complexes rather than condensed phase systems.

6. Dataset Impact and Applicability

SPICE-v2 fundamentally increases available training data for quantum chemistry-based machine learning, supporting accurate modeling from small molecules to peptide-like and charged species. The dataset enables the engineering of potentials that approach quantum chemical accuracy for energy differences and offers robust stability in simulation. By directly sampling non-covalent and charge-sensitive systems, SPICE-v2 closes data gaps impeding transferability and reliability in model development for biochemistry, drug design, and general molecular science.

This enrichment provides an infrastructure for further model development, benchmarks for transfer learning, and a foundation for systematizing machine learning approaches to chemical simulation, bridging the gap between high-fidelity quantum techniques and fast, scalable prediction (Eastman et al., 2024).

References

Eastman et al. (2024). Nutmeg and SPICE: Models and Data for Biomolecular Machine Learning.