MAD Dataset: Universal Atomic Configurations
- The MAD dataset is a compact, systematically constructed collection of diverse atomic configurations for training universal machine-learning interatomic potentials.
- It employs uniform DFT settings and advanced randomization techniques to capture both equilibrium and non-equilibrium structures.
- The dataset’s diverse subsets and latent mapping enable robust, transferable models that rival those trained on much larger datasets.
The Massive Atomic Diversity (MAD) dataset is a compact yet systematically constructed collection of atomic configurations designed for training universal machine-learning interatomic potentials. The underlying philosophy is to enable models to predict energies and forces for arbitrary atomic structures—including those far from equilibrium—which contrasts starkly with the stability-focused sampling strategy typical of previous datasets. MAD incorporates extensive chemical and geometric diversity by deliberately including highly distorted, randomized, and out-of-equilibrium structures, all computed at a consistent level of electronic-structure theory. This resource has proven capable of powering universal models that rival specialized models trained on much larger datasets, representing a substantial step towards “one model fits all” approaches in computational materials science (Mazitov et al., 24 Jun 2025).
1. Design Philosophy and Motivation
The MAD dataset was conceived to train machine-learning models that extrapolate reliably beyond equilibrium geometries. Rather than targeting only low-energy, physically plausible structures, MAD systematically extends its coverage of configuration space. This approach enables robust interpolation and extrapolation for models deployed in molecular dynamics and materials discovery, especially in settings that involve significant atomic displacements, random chemistry, or complex surfaces. A principal design criterion is methodological consistency: all quantum mechanical data are computed under identical DFT conditions, prioritizing internal uniformity even at the expense of compound-specific physical detail.
Distinguishing features include:
- Aggressive inclusion of non-equilibrium, chemically randomized atomic configurations.
- Inclusion of both organic and inorganic matter, blurring distinctions typical of legacy datasets.
- A “one level of theory for all” policy (e.g., consistent functionals, plane-wave cutoffs), even when this omits system-specific physical effects such as explicit magnetism or dispersion.
2. Dataset Composition and Diversity Engineering
MAD consists of fewer than 100,000 total structures, subdivided into well-defined sets engineered for maximal coverage:
| Subset Name | Description | Construction Method |
| --- | --- | --- |
| MC3D | Stable bulk crystals from a 3D database | Direct inclusion |
| MC3D-rattled | Same as MC3D, with large Gaussian noise (20% covariance) applied | Atomic displacement |
| MC3D-random | Random atomic species on bulk sites + isotropic cell-volume rescaling | Random element assignment |
| MC3D-surface | Surfaces from random low-index cuts of MC3D | Cleavage of bulk crystals |
| MC3D-cluster | Small clusters (2–8 atoms) from bulk environments | Extraction of proto-clusters |
| MC2D | Two-dimensional crystals | 2D database extraction |
| SHIFTML subsets | Molecular crystals, molecular fragments | Fragmentation and direct inclusion |
The main diversity mechanisms are as follows:
- Atomic position ‘rattling’ introduces significant geometric distortions.
- Random chemical substitutions enforce broad chemical diversity (a pool of 85 elements; astatine is excluded).
- Surfaces are generated through random slicing of bulk crystals.
- Structural clusters introduce low-coordination environments.
These strategies ensure wide coverage of both chemical and configuration space. As demonstrated in property histograms, energy and force values in MAD include long “tails” into high-energy regions—unlike conventional datasets, which are dominated by near-minimum-energy states.
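A minimal sketch of such diversity-generation steps is shown below, using ASE on a stand-in bulk crystal. The element pool, noise scale, and volume range are illustrative placeholders, not the exact published MAD recipe:

```python
# Hedged sketch of MAD-style diversity generation with ASE; parameters
# below are illustrative, not the published recipe.
import numpy as np
from ase.build import bulk

rng = np.random.default_rng(42)

# Stand-in for a stable MC3D bulk crystal.
atoms = bulk("Si", "diamond", a=5.43, cubic=True)

# "Rattling": Gaussian displacements applied to every atomic position.
rattled = atoms.copy()
rattled.rattle(stdev=0.3, seed=42)  # stdev in Angstrom (illustrative)

# Random chemistry on the same sites, plus an isotropic volume rescale.
pool = ["Si", "Al", "O", "Fe", "Na"]  # stand-in for the 85-element pool
randomized = atoms.copy()
randomized.set_chemical_symbols(list(rng.choice(pool, len(randomized))))
linear_scale = rng.uniform(0.8, 1.2) ** (1.0 / 3.0)  # volume factor^(1/3)
randomized.set_cell(randomized.cell.array * linear_scale, scale_atoms=True)
```

Starting from each relaxed bulk structure, repeated application of such perturbations yields the long high-energy tails described above.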
3. Computational Settings and Uniform Theory Level
MAD employs a uniform density functional theory setup for all electronic structure calculations. All geometries are evaluated nonmagnetically under consistent DFT parameters:
- A single generalized-gradient approximation (GGA) exchange-correlation functional is used.
- High plane-wave cutoffs (≈110 Ry for wavefunctions, ≈1320 Ry for charge density) and SSSP pseudopotentials are employed for all structures.
- This consistent mapping from atomic structure to energy—as opposed to tuning the methodology for each configuration—ensures that machine-learning models trained on MAD are not confounded by inconsistent reference energies or basis sets.
This intentionally precludes the fine-tuning often needed for special materials (e.g., strong correlation, explicit magnetism, non-local interactions), but grants the dataset the internal consistency required for robust model development.
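To illustrate the “one level of theory for all” policy, the sketch below applies one fixed parameter set to every structure, assuming a Quantum ESPRESSO backend driven through ASE (which the SSSP pseudopotential library targets); the functional, k-point mesh, and pseudopotential file name are placeholders rather than the published MAD settings:

```python
# Hedged sketch: a single, fixed DFT setup reused for every structure,
# with no per-compound tuning. Placeholders throughout.
from ase.build import bulk
from ase.calculators.espresso import Espresso

COMMON_INPUT = {
    "control": {"calculation": "scf", "tprnfor": True},       # also print forces
    "system": {"ecutwfc": 110, "ecutrho": 1320, "nspin": 1},   # Ry cutoffs, nonmagnetic
}

def make_calculator(pseudopotentials):
    """Identical settings for every structure in the dataset."""
    return Espresso(
        input_data=COMMON_INPUT,
        pseudopotentials=pseudopotentials,  # e.g. files from the SSSP library
        kpts=(4, 4, 4),                     # placeholder mesh
    )

atoms = bulk("Si", "diamond", a=5.43)
atoms.calc = make_calculator({"Si": "Si.pbesol.UPF"})  # placeholder file name
# energy = atoms.get_potential_energy()  # requires a QE installation
```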
4. Structural Representation and Latent Space Construction
MAD structures are encoded using high-dimensional descriptors derived from the Point Edge Transformer (PET) architecture:

$$\boldsymbol{\xi}_i = \mathrm{PET}(A_i),$$

where $\boldsymbol{\xi}_i$ is the feature vector for the atom-centered environment $A_i$. The overall structure is represented through the concatenation of statistical moments:

$$\boldsymbol{\Xi} = \big[\, \langle \boldsymbol{\xi}_i \rangle_i \,,\; \operatorname{std}_i(\boldsymbol{\xi}_i) \,\big],$$

resulting in a 1024-dimensional descriptor that encodes both the average local environment and the degree of inhomogeneity, and is invariant under periodic replication.
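A minimal numpy sketch of this pooling follows, assuming a 512-dimensional per-atom feature width (so that the two concatenated moments give 1024 dimensions); the random features stand in for actual PET outputs:

```python
# Sketch: pool per-atom features (n_atoms x 512, assumed width) into a
# structure descriptor that is invariant to supercell replication.
import numpy as np

def structure_descriptor(atom_features: np.ndarray) -> np.ndarray:
    """Concatenate the mean and standard deviation over atoms."""
    mean = atom_features.mean(axis=0)
    std = atom_features.std(axis=0)
    return np.concatenate([mean, std])  # 2 x 512 = 1024 dimensions

xi = np.random.default_rng(0).normal(size=(64, 512))  # dummy PET features
Xi = structure_descriptor(xi)
assert Xi.shape == (1024,)

# Replicating the structure (tiling the atom list) leaves the
# descriptor unchanged, demonstrating the invariance noted above.
assert np.allclose(Xi, structure_descriptor(np.tile(xi, (2, 1))))
```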
To interpret and visually compare datasets, MAD introduces a low-dimensional latent “cartography” via the sketch-map algorithm, which minimizes the stress

$$\chi^2 = \sum_{i<j} \big[ F(R_{ij}) - f(r_{ij}) \big]^2,$$

where $R_{ij}$ and $r_{ij}$ are Euclidean distances in high and low dimension, respectively, and $F$ and $f$ are sigmoid functions emphasizing local distance preservation.
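The sketch below implements this objective with the standard sketch-map switching function $s(r) = 1 - \big(1 + (2^{a/b} - 1)(r/\sigma)^a\big)^{-b/a}$; the hyperparameters $\sigma$, $a$, $b$ are placeholders, not the values used for the MAD map:

```python
# Sketch of the sketch-map stress; sigma, a, b are placeholders.
import numpy as np
from scipy.spatial.distance import pdist

def switching(r, sigma=1.0, a=2.0, b=2.0):
    """Standard sketch-map sigmoid: ~0 well below sigma, ~1 well above."""
    return 1.0 - (1.0 + (2.0 ** (a / b) - 1.0) * (r / sigma) ** a) ** (-b / a)

def sketchmap_stress(X_high, x_low):
    """Squared mismatch of transformed pairwise distances."""
    D, d = pdist(X_high), pdist(x_low)  # condensed distance vectors
    return np.sum((switching(D) - switching(d)) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1024))  # high-dimensional descriptors
x = rng.normal(size=(50, 2))     # trial low-dimensional layout
print(sketchmap_stress(X, x))
```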
A multi-layer perceptron (MLP) is trained to perform this out-of-sample mapping, enabling fast projection and effective comparison with other datasets. This latent space is used to identify gaps or redundancies in chemical coverage and serves as a diagnostic for dataset completeness.
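A hedged sketch of such an out-of-sample projector follows, using scikit-learn's MLPRegressor on synthetic stand-in data; the layer sizes and landmark coordinates are illustrative:

```python
# Hedged sketch of the out-of-sample projector: an MLP regresses the
# 2-D map coordinates from the 1024-D descriptors, so new structures
# can be projected without re-running the sketch-map optimization.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X_landmarks = rng.normal(size=(500, 1024))  # descriptors of landmarks
y_landmarks = rng.normal(size=(500, 2))     # their sketch-map coordinates

projector = MLPRegressor(hidden_layer_sizes=(256, 64), max_iter=300)
projector.fit(X_landmarks, y_landmarks)

coords_new = projector.predict(rng.normal(size=(10, 1024)))
```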
5. Applications and Empirical Performance
MAD is specifically structured for training machine-learning interatomic potentials that are:
- Universal: Capable of learning over both organic and inorganic chemical spaces and handling arbitrary element pairs.
- Robust: Effective across both physically plausible and highly distorted geometries, supporting accurate MD simulations in nonequilibrium regimes.
- Parsimonious: Models trained on MAD (such as PET-MAD) match or outperform those trained on much larger traditional datasets—even though MAD contains two to three orders of magnitude fewer configurations.
Empirical evaluations demonstrate:
- Comparable or superior force and energy accuracy relative to models built exclusively on stable structures.
- Enhanced ability to extrapolate to previously unseen chemistries or extreme atomic environments—a feature critical for applications in design, screening, and exploration under extreme conditions (Mazitov et al., 18 Mar 2025).
6. Innovations and Impact on Atomistic Machine Learning
MAD innovates over conventional datasets by blending aggressive configurational and compositional randomization with consistent, high-quality reference data. Its adoption as a universal data resource underpins several modern ML models and workflows:
- The PET-MAD model—a transformer-based graph neural network—exploits MAD to learn universal potentials and, paired with LoRA-based efficient fine-tuning (sketched after this list), bridges the gap between generality and high fidelity.
- MAD’s structural descriptors and latent space mapping provide a toolset for dataset comparison, selection of representative training samples, and informed augmentation.
- The approach streamlines the development of materials “cartography” and data-driven design by demarcating regions of chemical and geometric interest.
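As a minimal illustration of the LoRA idea referenced above, the generic PyTorch sketch below freezes a pretrained weight and learns a trainable low-rank update $W + (\alpha/r)\,BA$ instead; this is a generic illustration, not the PET-MAD implementation:

```python
# Generic LoRA sketch: frozen base layer plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        # A is small-random, B is zero, so training starts at the base model.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))  # shape: (4, 512)
```

Only the low-rank factors are updated during fine-tuning, which is what keeps the procedure cheap relative to full retraining.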
MAD thus represents a foundational component for developing ML potentials aiming for broad transferability and resilience against dataset bias, while minimizing redundancy and computational waste.
7. Outlook and Future Development
MAD’s philosophy suggests several directions for further expansion:
- Incorporating additional physics, such as explicit magnetism, strong correlation, or van der Waals interactions, where relevant.
- Targeting edge cases (e.g., reactive intermediates, complex reconstructions, amorphous systems) to further broaden the boundary of configuration space.
- Refining latent projection methods to increase interpretability and facilitate automated selection of landmarks for active learning.
- Extending the universal scope toward next-generation “one model fits all” simulations, accelerating the deployment of robust interatomic potentials for heterostructure, battery, and catalysis research.
A plausible implication is that as MAD’s cartographic approach matures, materials space can be mapped, sampled, and modeled with increasing automation and reliability, further reducing the gap between ab initio accuracy and practical simulation scale.