Locally Trained SOAP-GAP Model
- The paper introduces a machine-learned interatomic potential that uses SOAP descriptors and Gaussian process regression to predict energies, forces, and virials with quantum accuracy.
- The methodology leverages active learning and local filtering of datasets to enhance accuracy and transferability within specific configurational regions.
- Practical applications include large-scale molecular dynamics and Monte Carlo simulations, balancing computational efficiency with high-fidelity results.
A Locally Trained SOAP-GAP (Smooth Overlap of Atomic Positions - Gaussian Approximation Potential) model is a machine-learned interatomic potential constructed by fitting Gaussian process regression to local atomic environments, defined via the SOAP descriptor. This model targets specific configurational or spatial regions (such as defect cores, surfaces, or alloy concentrations) to achieve enhanced accuracy and transferability within the selected local domain. The model enables quantum-accurate prediction of energies, forces, and virials for atomistic simulations, with practical deployment in large-scale molecular dynamics, Monte Carlo sampling, and related techniques.
1. Theoretical Foundations
A SOAP-GAP model is predicated on the principle that the total potential energy of an atomic system can be decomposed into a sum of local atomic contributions, each a function of a feature vector derived from the local environment:

$$E_{\text{total}} = \sum_i \varepsilon_i,$$

where each $\varepsilon_i$ is modeled as a function of the local atomic neighborhood using the SOAP representation and a kernel-based regression within the Gaussian process (GP) framework.
Neighbor Density and Basis Expansion
The local environment of atom $i$ is encoded via a smeared neighbor density

$$\rho_i^{Z}(\mathbf{r}) = \sum_{j \in Z} \exp\!\left(-\frac{\lvert \mathbf{r} - \mathbf{r}_{ij} \rvert^2}{2\sigma_{\text{at}}^2}\right) f_{\text{cut}}(r_{ij}),$$

where $Z$ is the chemical identity of the neighbors, $\sigma_{\text{at}}$ is the atomic Gaussian width, and $f_{\text{cut}}$ is a smooth radial cutoff function. This density is projected onto a set of orthogonal radial basis functions $g_n(r)$ and spherical harmonics $Y_{lm}(\hat{\mathbf{r}})$:

$$\rho_i^{Z}(\mathbf{r}) = \sum_{nlm} c^{\,iZ}_{nlm}\, g_n(r)\, Y_{lm}(\hat{\mathbf{r}}).$$
SOAP Power Spectrum
Permutationally, translationally, and rotationally invariant features are constructed from the expansion coefficients as the SOAP power spectrum:

$$p^{\,i}_{Z_1 Z_2 n n' l} = \pi \sqrt{\frac{8}{2l+1}} \sum_m \left(c^{\,iZ_1}_{nlm}\right)^{*} c^{\,iZ_2}_{n'lm}.$$

All components $p^{\,i}_{Z_1 Z_2 n n' l}$ are concatenated into a single descriptor vector $\mathbf{q}_i$ for each atomic environment.
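For illustration, the power-spectrum vector $\mathbf{q}_i$ can be computed with the independent dscribe library; this is a minimal sketch assuming dscribe is installed (its parameter names, shown here following recent releases, are not part of the GAP code itself):

```python
# Sketch: SOAP power-spectrum descriptors via the dscribe library.
from ase.build import bulk
from dscribe.descriptors import SOAP

soap = SOAP(
    species=["Ag", "Pd"],   # chemical identities Z
    r_cut=5.0,              # cutoff radius (Angstrom)
    n_max=8, l_max=6,       # radial / angular basis resolution
    sigma=0.5,              # atomic Gaussian width (Angstrom)
    periodic=True,
)

atoms = bulk("Ag", "fcc", a=4.09) * (2, 2, 2)  # toy fcc cell
Q = soap.create(atoms)      # (n_atoms, n_features) descriptor matrix
print(Q.shape)
```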
SOAP-GAP Kernel
Similarity between environments is measured by a normalized dot-product polynomial kernel:

$$k(\mathbf{q}_i, \mathbf{q}_j) = \left(\frac{\mathbf{q}_i \cdot \mathbf{q}_j}{\lvert \mathbf{q}_i \rvert\, \lvert \mathbf{q}_j \rvert}\right)^{\zeta},$$

with typical exponents $\zeta = 2$–$4$. Normalization ensures $k(\mathbf{q}, \mathbf{q}) = 1$.
Gaussian Process Regression
Each local atomic energy is expanded over $M$ "sparse" (inducing) environments as

$$\varepsilon_i = \sum_{s=1}^{M} \alpha_s \, k(\mathbf{q}_i, \mathbf{q}_s),$$

with weights $\boldsymbol{\alpha}$ found by solving the regularized linear system

$$\left(K_{MM} + K_{MN} \Lambda^{-1} K_{NM}\right) \boldsymbol{\alpha} = K_{MN} \Lambda^{-1} \mathbf{y},$$

where $K_{MM}$ and $K_{NM}$ are the kernel matrices among sparse points and between training and sparse points, $\mathbf{y}$ contains target observables (energies, forces, virials), and $\Lambda$ encapsulates regularization parameters for the different observables.
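To make the sparse regression concrete, here is a minimal numpy sketch with toy descriptors and energy-only targets (variable names such as `K_NM`, `Lambda_inv`, and `sigma_E` are illustrative, not GAP API; a real fit also includes force and virial rows via derivative kernels):

```python
# Minimal sketch of the SOAP-GAP kernel and sparse GP solve.
import numpy as np

def soap_kernel(Qa, Qb, zeta=4):
    """Normalized dot-product polynomial kernel between descriptor rows."""
    Qa = Qa / np.linalg.norm(Qa, axis=1, keepdims=True)
    Qb = Qb / np.linalg.norm(Qb, axis=1, keepdims=True)
    return (Qa @ Qb.T) ** zeta

rng = np.random.default_rng(0)
Q, Q_s = rng.random((500, 60)), rng.random((50, 60))  # toy training / sparse descriptors
y = rng.random(500)                                   # toy per-atom energy targets

K_NM, K_MM = soap_kernel(Q, Q_s), soap_kernel(Q_s, Q_s)

sigma_E = 1e-3                                        # energy regularization (eV/atom)
Lambda_inv = np.full(len(y), 1.0 / sigma_E**2)        # diagonal of Lambda^{-1}

# (K_MM + K_MN Lambda^{-1} K_NM) alpha = K_MN Lambda^{-1} y, plus sparse jitter
A = K_MM + (K_NM.T * Lambda_inv) @ K_NM + 1e-8 * np.eye(len(Q_s))
alpha = np.linalg.solve(A, (K_NM.T * Lambda_inv) @ y)

epsilon = soap_kernel(Q, Q_s) @ alpha                 # predicted local energies
```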
2. Data Preparation and Active Learning Workflow
Locally trained SOAP-GAP models require an appropriately curated and labeled dataset. The principal steps in database construction, as implemented for Ag-Pd alloys (Rosenbrock et al., 2019), are:
- Initial Dataset Generation
- Enumerate fcc- and bcc-based supercells with up to 4 atoms per cell for multiple compositions.
- Preliminary MTP Fit
- Fit a moment tensor potential (MTP), a polynomial-basis interatomic potential, to DFT-computed structures.
- Active Learning Iteration
- Relax each structure; high extrapolation grades trigger DFT calculation and inclusion in the training set.
- Expansion
- Extend to larger cells (up to 12 atoms), yielding 10,850 structures and 774 unique DFT-relaxed configurations.
- High-Precision DFT Evaluation
- All 774 training configurations were re-evaluated with tight DFT settings (increased k-point density and a strict EDIFF electronic-convergence threshold).
- Locality Filtering
- Restrict training references by "config_type" (e.g., "surf" for surfaces, or selected regions in alloys) when fitting a local model (Klawohn et al., 2023); a scripting sketch follows this list.
No off-lattice or liquid data is included during training for solid-phase models.
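Tagging and locality filtering can be scripted with ASE; the sketch below is illustrative (the file names and the vacuum-based tagging rule are assumptions, not from the source workflow):

```python
# Sketch: tag frames with a "config_type" label and filter the local
# training subset for a locally trained model.
import numpy as np
from ase.io import read, write

frames = read("all_dft_frames.xyz", index=":")  # hypothetical master dataset

for atoms in frames:
    # Problem-specific rule (illustrative): frames with >8 Angstrom of
    # vacuum along z are labeled surface configurations, the rest bulk.
    vacuum = atoms.cell.lengths()[2] - np.ptp(atoms.positions[:, 2])
    atoms.info["config_type"] = "surf" if vacuum > 8.0 else "bulk"

# Keep only the surface region as training references.
write("my_local_dataset.xyz",
      [a for a in frames if a.info["config_type"] == "surf"])
```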
3. Descriptor, Sparsification, and Kernel Construction
Descriptor construction and sparsification are the computational bottlenecks for SOAP-GAP models. The procedure is as follows:
- SOAP Descriptor Evaluation
- For each atom in each frame, evaluate the descriptor $\mathbf{q}_i$ within a cutoff radius (e.g., $4.5$–$5.0$ Å).
- Descriptor dimensionality scales as $\mathcal{O}(S^2 n_{\max}^2 l_{\max})$, where $S$ is the number of chemical elements.
- Compression Techniques
- Apply an embedding that maps chemical (and optionally radial) indices onto a lower-dimensional representation.
- Use reduced (compressed) forms of the power spectrum to achieve linear, rather than quadratic, scaling with $S$.
- Sparse Point Selection
- Choose $M$ representative environments (e.g., $M = 500$–$2000$) via CUR decomposition, $k$-means clustering, random, or uniform sampling (see the sketch after this list).
- Remove near-duplicates according to a jitter threshold.
- Kernel Assembly
- Build kernel blocks $K_{MM}$ (sparse–sparse) and $K_{NM}$ (training–sparse).
- Compute derivatives for force and stress learning.
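A leverage-score CUR selection, one common variant of the CUR decomposition named above, can be sketched as follows (illustrative numpy implementation; GAP's internal `cur_points` method may differ in detail):

```python
# Sketch: CUR-style selection of M representative environments from the
# descriptor matrix Q (n_env x n_feat), using row leverage scores.
import numpy as np

def cur_select(Q, M, k=20, seed=0):
    """Sample M rows with probability proportional to leverage scores
    computed from the top-k left singular vectors of Q."""
    rng = np.random.default_rng(seed)
    U, _, _ = np.linalg.svd(Q, full_matrices=False)
    lev = (U[:, :k] ** 2).sum(axis=1)
    return rng.choice(len(Q), size=M, replace=False, p=lev / lev.sum())

rng = np.random.default_rng(1)
Q = rng.random((5000, 200))   # toy descriptor matrix
idx = cur_select(Q, M=500)
Q_sparse = Q[idx]             # environments entering K_MM
```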
4. Model Fitting, Regularization, and Validation
The fitting procedure specifically addresses the high-dimensional kernel regression while ensuring stability and generalization:
- Regularization
- Distinct regularization $\sigma$ per observable: typical settings are $\sigma_E = 10^{-3}$ eV/atom for energies, $\sigma_F = 0.05$ eV/Å for forces, and $\sigma_V = 0.1$ eV/atom for virials (matching the configuration example in Section 5).
- An additional small jitter (e.g., $10^{-8}$) is added to the kernel diagonal for numerical stability.
- Parallel Fit and Hyperparameter Optimization
- Distributed descriptor computation and linear algebra across MPI ranks. Use ScaLAPACK QR for scalable solves (Klawohn et al., 2023).
- Hyperparameters (cutoff, basis resolution, kernel exponent, regularization strength) tuned by cross-validation on held-out frames (10–20% test split).
- Model Validation
- Reported benchmarks (Rosenbrock et al., 2019):
- On a liquid MD test set (6000 snapshots): low RMSEs for energies (meV/atom scale), forces (meV/Å scale), and virials.
- Integrated phonon-spectrum RMSE (in THz) shows train/test parity, i.e., no overfitting.
- Transition path "swap" test: SOAP-GAP correctly predicts physical barriers without spurious minima.
- Posterior Variance Diagnostics
- The GP predictive variance, $\sigma^2(\mathbf{q}) = k(\mathbf{q}, \mathbf{q}) - \mathbf{k}_M^\top K_{MM}^{-1} \mathbf{k}_M + \mathbf{k}_M^\top \left(K_{MM} + K_{MN} \Lambda^{-1} K_{NM}\right)^{-1} \mathbf{k}_M$, where $\mathbf{k}_M \equiv \mathbf{k}_M(\mathbf{q})$ collects the kernel values to the sparse points, identifies out-of-sample regions where model uncertainty is large (see the sketch below).
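Continuing the numpy sketch from Section 1 (same illustrative names; `soap_kernel`, `Q_s`, `K_MM`, and the regularized matrix `A` are assumed in scope), the posterior variance for a batch of query descriptors can be written as:

```python
# Sketch: DTC/projected-process predictive variance for query descriptors.
import numpy as np

def posterior_variance(Q_new, Q_s, K_MM, A, jitter=1e-8):
    k_new = soap_kernel(Q_new, Q_s)         # (n_new, M) cross-kernels
    eye = jitter * np.eye(len(K_MM))        # stabilize the K_MM solve
    prior = np.ones(len(Q_new))             # k(q, q) = 1 for the normalized kernel
    reduction = np.einsum("ij,ij->i", k_new,
                          np.linalg.solve(K_MM + eye, k_new.T).T)
    correction = np.einsum("ij,ij->i", k_new,
                           np.linalg.solve(A, k_new.T).T)
    return prior - reduction + correction   # large values flag extrapolation
```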
5. Implementation and Practical Usage
A locally trained SOAP-GAP model is typically implemented using the QUIP/GAP suite (Klawohn et al., 2023), with a workflow as follows:
- Configuration File Example:
```
atoms_filename = my_local_dataset.xyz
gap_file = local_SOAPGAP.xml
sparse_jitter = 1e-8
default_kernel_regularisation = {0.001 0.05 0.1 0.0}
config_type_parameter_name = config_type
config_type_kernel_regularisation = {surf:0.0005:0.02:0.05:0.0}
gap = {
    soap
    cutoff = 5.0
    cutoff_transition_width = 0.5
    n_max = 8
    l_max = 6
    atom_gaussian_width = 0.5
    soap_exponent = 4
    n_sparse = 2000
    sparse_method = cur_points
    covariance_type = dot_product
    energy_scale = 1.0
}
```
- Fitting Invocation:
```
mpirun -np 64 gap_fit config_file=config
```
When focusing on a local region, ensure that configurations are tagged accordingly (e.g., config_type="surf").
- Hardware and Scaling:
- Descriptor memory scales roughly with the number of environments times the descriptor dimension ($\mathcal{O}(Nd)$), and kernel memory with the number of training rows times sparse points ($\mathcal{O}(NM)$).
- Fitting large training databases is feasible using MPI parallelism and descriptor compression.
- Training is typically completed within hours on a parallel CPU cluster.
- Post-Fitting Deployment:
- Validated XML models are compatible with ASE, LAMMPS, and other atomistic simulation environments (see the usage sketch below).
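For example, a fitted model can be loaded as an ASE calculator via quippy; this minimal sketch assumes the quippy-ase package is installed and reuses the XML file name from the configuration above:

```python
# Sketch: deploy the fitted GAP as an ASE calculator via quippy.
from ase.build import bulk
from quippy.potential import Potential

calc = Potential(param_filename="local_SOAPGAP.xml")  # file produced by gap_fit
atoms = bulk("Pd", "fcc", a=3.89, cubic=True)
atoms.calc = calc

print(atoms.get_potential_energy())  # eV
print(atoms.get_forces())            # eV/Angstrom
```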
6. Computational Cost and Comparative Performance
SOAP-GAP offers DFT-like accuracy but at significantly reduced computational cost relative to ab initio methods:
| Model | Typical RMSE (energy) | Relative Evaluation Speed | Suitable for |
|---|---|---|---|
| SOAP-GAP | $15.4$ meV/atom | 10³–10⁴× faster than DFT | Off-lattice modeling |
| MTP | $15$–$20$ meV/atom | several-fold faster than GAP | Phase diagrams, large sampling |
Evaluating a phase-diagram slice, which can require on the order of a billion potential calls, is feasible with MTP but remains prohibitive with SOAP-GAP. GAP evaluations scale linearly in the number of sparse points $M$ and are several times slower per atom than MTP, but orders of magnitude faster than on-the-fly DFT (Rosenbrock et al., 2019).
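To see why, consider a rough operation count per atom per evaluation using the Section 5 parameters ($M = 2000$ sparse points; descriptor dimension $d \approx 10^3$ for two species with $n_{\max} = 8$, $l_{\max} = 6$); the totals below are illustrative, not values from the paper:

$$t_{\text{atom}} \propto M \cdot d \approx 2000 \times 10^{3} = 2 \times 10^{6} \ \text{multiply–adds},$$

so $10^9$ calls on $\sim$100-atom configurations imply roughly $2 \times 10^{17}$ operations for the kernel products alone, which is why such sampling campaigns favor cheaper polynomial potentials.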
7. Limitations and Domain of Applicability
Locally trained SOAP-GAP models, while excelling in local accuracy and transferability within their training domain, are limited in their extrapolation ability outside of that space. The exclusion of off-lattice and liquid configurations in training restricts the model’s generalization to high-temperature or disordered phases. Model quality is strongly dependent on choice and diversity of training data, regularization strength, and descriptor completeness. The computational requirements for fitting SOAP-GAP scale steeply with descriptor and sparse set size, motivating compression and parallelism techniques introduced in recent GAP frameworks (Klawohn et al., 2023).
A plausible implication is that, for applications requiring extensive sampling (e.g., nested sampling, large-scale Monte Carlo), alternative polynomial potentials such as MTP offer a better cost-precision trade-off, while SOAP-GAP is preferred for simulations where off-lattice or local accuracy is paramount, and model uncertainty needs quantification.