
Locally Trained SOAP-GAP Model

Updated 11 November 2025
  • The paper introduces a machine-learned interatomic potential that uses SOAP descriptors and Gaussian process regression to predict energies, forces, and virials with quantum accuracy.
  • The methodology leverages active learning and local filtering of datasets to enhance accuracy and transferability within specific configurational regions.
  • Practical applications include large-scale molecular dynamics and Monte Carlo simulations, balancing computational efficiency with high-fidelity results.

A Locally Trained SOAP-GAP (Smooth Overlap of Atomic Positions - Gaussian Approximation Potential) model is a machine-learned interatomic potential constructed by fitting Gaussian process regression to local atomic environments, defined via the SOAP descriptor. This model targets specific configurational or spatial regions (such as defect cores, surfaces, or alloy concentrations) to achieve enhanced accuracy and transferability within the selected local domain. The model enables quantum-accurate prediction of energies, forces, and virials for atomistic simulations, with practical deployment in large-scale molecular dynamics, Monte Carlo sampling, and related techniques.

1. Theoretical Foundations

A SOAP-GAP model is predicated on the principle that the total potential energy of an atomic system can be decomposed as a sum over local atomic contributions, each determined by a feature vector derived from the local environment:

$$E_\text{tot} = \sum_i E_i\,,$$

where each $E_i$ is modeled as a function of the local atomic neighborhood using the SOAP representation and a kernel-based regression within the Gaussian process (GP) framework.

Neighbor Density and Basis Expansion

The local environment of atom $i$ is encoded via a smeared neighbor density

$$\rho_i^\alpha(\mathbf{r}) = \sum_{j \in \text{nbrs}} \delta_{z_j,\alpha}\, \exp\!\left( -\frac{\left|\mathbf{r} - \mathbf{r}_{ij}\right|^2}{2\sigma^2} \right) f_\text{cut}\!\left(|\mathbf{r}_{ij}|\right),$$

where $z_j$ is the chemical identity of neighbor $j$, $\sigma$ is the atomic Gaussian width, and $f_\text{cut}$ is a smooth radial cutoff. This density is projected onto a set of orthogonal radial basis functions $g_n(r)$ and spherical harmonics $Y_{lm}(\hat{\mathbf{r}})$:

$$\rho_i^\alpha(\mathbf{r}) = \sum_{nlm} c_{nlm}^\alpha\, g_n(r)\, Y_{lm}(\hat{\mathbf{r}})$$
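As an illustration of the density above, the following NumPy sketch evaluates the smeared neighbor density of a single species at a few probe points. The cosine-shaped cutoff with transition width d is one common choice assumed here for illustration only, and the function and variable names are placeholders rather than part of any GAP code.

import numpy as np

def f_cut(r, r_cut=5.0, d=0.5):
    # Smooth radial cutoff: 1 inside r_cut - d, cosine decay to 0 at r_cut.
    r = np.asarray(r, dtype=float)
    out = np.zeros_like(r)
    out[r < r_cut - d] = 1.0
    mask = (r >= r_cut - d) & (r < r_cut)
    out[mask] = 0.5 * (np.cos(np.pi * (r[mask] - r_cut + d) / d) + 1.0)
    return out

def neighbor_density(probe_points, neighbor_vectors, sigma=0.5, r_cut=5.0, d=0.5):
    # Sum one Gaussian per neighbor, weighted by the cutoff, at each probe point.
    rho = np.zeros(len(probe_points))
    for r_ij in neighbor_vectors:
        dist = np.linalg.norm(r_ij)
        w = f_cut(np.array([dist]), r_cut, d)[0]
        diff = probe_points - r_ij  # broadcast over probe points
        rho += w * np.exp(-np.sum(diff**2, axis=1) / (2.0 * sigma**2))
    return rho

# two neighbors of a central atom at the origin; density sampled at three probe points
neighbors = np.array([[2.0, 0.0, 0.0], [0.0, 2.5, 0.0]])
probes = np.array([[1.9, 0.0, 0.0], [0.0, 0.0, 0.0], [4.9, 0.0, 0.0]])
print(neighbor_density(probes, neighbors))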

SOAP Power Spectrum

Permutationally, translationally, and rotationally invariant features are constructed from the expansion coefficients as the SOAP power spectrum:

$$p_{nn'l}^{\alpha\beta} = \sum_{m=-l}^{l} \left[c_{nlm}^\alpha\right]^* c_{n'lm}^\beta$$

All $p_{nn'l}^{\alpha\beta}$ are concatenated into a single descriptor vector $\mathbf{p}$ for each atomic environment.
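The contraction from expansion coefficients to power spectrum can be sketched in a few lines of NumPy. The coefficient layout c[species, n, l, m] is an assumption made here for illustration; production codes use optimized, symmetry-aware layouts.

import numpy as np

def power_spectrum(c, n_max, l_max):
    # Contract c[species, n, l, m] (m stored in 2*l_max + 1 slots, centered at l_max)
    # into the flattened, unit-normalized SOAP power spectrum vector.
    n_species = c.shape[0]
    p = []
    for a in range(n_species):
        for b in range(n_species):
            for n1 in range(n_max):
                for n2 in range(n_max):
                    for l in range(l_max + 1):
                        m = np.arange(-l, l + 1) + l_max  # valid m slots for this l
                        # vdot conjugates its first argument: sum_m [c^a_{n1 l m}]* c^b_{n2 l m}
                        p.append(np.real(np.vdot(c[a, n1, l, m], c[b, n2, l, m])))
    p = np.array(p)
    return p / np.linalg.norm(p)  # unit norm, so the dot-product kernel gives k(p, p) = 1

# random complex coefficients standing in for a real basis-set projection (one species)
rng = np.random.default_rng(0)
c = rng.normal(size=(1, 4, 5, 9)) + 1j * rng.normal(size=(1, 4, 5, 9))
print(power_spectrum(c, n_max=4, l_max=4).shape)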

SOAP-GAP Kernel

Similarity between environments is measured by a normalized dot-product polynomial kernel:

$$k(\mathbf{p}, \mathbf{p}') = \left(\mathbf{p} \cdot \mathbf{p}'\right)^\zeta\,,$$

with typical exponents $\zeta = 2$ or $4$. The power-spectrum vectors are normalized to unit length, which ensures $k(\mathbf{p}, \mathbf{p}) = 1$.
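A minimal sketch of this kernel, assuming the descriptor vectors are normalized as described above:

import numpy as np

def soap_kernel(p1, p2, zeta=4):
    # Dot-product polynomial kernel on unit-normalized power-spectrum vectors.
    p1 = p1 / np.linalg.norm(p1)
    p2 = p2 / np.linalg.norm(p2)
    return float(np.dot(p1, p2)) ** zeta

rng = np.random.default_rng(1)
a, b = rng.normal(size=64), rng.normal(size=64)
print(soap_kernel(a, a))  # 1.0 by construction
print(soap_kernel(a, b))  # close to 0 for unrelated random vectors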

Gaussian Process Regression

Each local atomic energy is expanded over $M$ "sparse" (inducing) environments $X_m$ as

$$E_i = \sum_{m=1}^{M} \alpha_m\, k(\mathbf{p}_i, \mathbf{p}_m),$$

with weights $\alpha_m$ found by solving the regularized linear system:

$$(K + \epsilon I)\,\boldsymbol{\alpha} = \mathbf{y},$$

where $K_{mm'} = k(\mathbf{p}_m, \mathbf{p}_{m'})$, $\mathbf{y}$ contains the target observables (energies, forces, virials), and $\epsilon$ encapsulates the regularization parameters for the different observables.
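The regression step can be illustrated with a toy NumPy sketch of the simplified system above, in which every training environment is retained as a sparse point (M = N). The production gap_fit solver instead works with the full sparse/inducing-point system and per-observable regularization, so this is a conceptual sketch only.

import numpy as np

def kernel_matrix(P, Q, zeta=4):
    # Kernel block K[i, j] = (p_i . q_j)^zeta between rows of P and Q
    # (rows assumed already unit-normalized).
    return (P @ Q.T) ** zeta

def fit_weights(P_train, y, eps=1e-3, zeta=4):
    # Solve the regularized linear system (K + eps*I) alpha = y.
    K = kernel_matrix(P_train, P_train, zeta)
    return np.linalg.solve(K + eps * np.eye(len(y)), y)

def predict(P_new, P_train, alpha, zeta=4):
    # Predict local energies E_i = sum_m alpha_m k(p_i, p_m).
    return kernel_matrix(P_new, P_train, zeta) @ alpha

# toy data: 50 unit-normalized descriptor vectors with a synthetic per-atom target
rng = np.random.default_rng(2)
P = rng.normal(size=(50, 32))
P /= np.linalg.norm(P, axis=1, keepdims=True)
y = np.sin(3.0 * P[:, 0])
alpha = fit_weights(P, y)
print(np.max(np.abs(predict(P, P, alpha) - y)))  # small training residual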

2. Data Preparation and Active Learning Workflow

Locally trained SOAP-GAP models require an appropriately curated and labeled dataset. The principal steps in database construction, as implemented for Ag-Pd alloys (Rosenbrock et al., 2019), are:

  1. Initial Dataset Generation
    • Enumerate fcc- and bcc-based supercells with up to 4 atoms per cell for multiple compositions.
  2. Preliminary MTP Fit
    • Fit a polynomial moment tensor potential (MTP) to the DFT-computed structures.
  3. Active Learning Iteration
    • Relax each structure with the MTP; configurations with high extrapolation grades trigger new DFT calculations and are added to the training set.
  4. Expansion
    • Extend to larger cells (up to 12 atoms), yielding 10,850 structures and 774 unique DFT-relaxed configurations.
  5. High-Precision DFT Evaluation
    • Re-evaluate all 774 training configurations at tight DFT settings (k-point density $\sim 0.015$ Å$^{-1}$, EDIFF $= 10^{-8}$).
  6. Locality Filtering
    • Restrict the training references by "config_type" (e.g., "surf" for surfaces, or selected composition regions in alloys) when fitting a local model (Klawohn et al., 2023); a tagging sketch follows this list.

No off-lattice or liquid data is included during training for solid-phase models.
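A minimal sketch of such locality filtering with ASE, assuming the frames are stored in extended-XYZ, where per-frame key-value pairs live in atoms.info; the file names and the "origin" selection key are placeholders for illustration.

from ase.io import read, write

# Tag surface frames so gap_fit's per-config_type regularization can select them.
frames = read("all_dft_frames.xyz", index=":")
local = []
for atoms in frames:
    if atoms.info.get("origin") == "slab_calculation":  # hypothetical selection criterion
        atoms.info["config_type"] = "surf"
        local.append(atoms)

write("my_local_dataset.xyz", local)  # extended-XYZ preserves the config_type field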

3. Descriptor, Sparsification, and Kernel Construction

Descriptor construction and sparsification are the computational bottlenecks for SOAP-GAP models. The procedure is as follows:

  • SOAP Descriptor Evaluation
    • For each atom in each frame, evaluate $p_{nn'l}^{\alpha\beta}$ within a cutoff $r_\text{cut}$ (e.g., 4.5–5.0 Å).
    • Descriptor dimensionality scales as $S^2 n_\text{max}^2 l_\text{max}$, where $S$ is the number of chemical species.
  • Compression Techniques
    • Apply embedding (e.g., $Z_\text{mix}$, $R_\text{mix}$) to map the chemical and radial indices to a lower, $K$-dimensional representation.
    • Use reduced forms such as the $\nu_R$ and $\nu_S$ schemes to achieve linear scaling with $K$.
  • Sparse Point Selection
    • Choose $M$ representative environments (e.g., $M = 500$–$2000$) via CUR decomposition, $k$-means, random, or uniform sampling (a simplified CUR-style sketch follows this list).
    • Remove near-duplicates according to a jitter threshold.
  • Kernel Assembly
    • Build kernel blocks $K_{MM}$ ($M \times M$) and $K_{NM}$ ($N_\text{train} \times M$).
    • Compute derivatives for force and stress learning.
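A simplified CUR-style selection of sparse points, using leverage scores from an SVD of the descriptor matrix, might look as follows; gap_fit provides this internally via sparse_method = cur_points, so this sketch only illustrates the idea.

import numpy as np

def cur_select(P, m, rank=20, seed=0):
    # Pick m representative environments (rows of P) with probability proportional
    # to their leverage scores on the top-`rank` left singular vectors.
    rng = np.random.default_rng(seed)
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    leverage = np.sum(U[:, :rank] ** 2, axis=1)
    prob = leverage / leverage.sum()
    return rng.choice(len(P), size=m, replace=False, p=prob)

rng = np.random.default_rng(3)
P = rng.normal(size=(5000, 64))  # stand-in descriptor matrix (N_env x dim(p))
sparse_idx = cur_select(P, m=500)
print(sparse_idx[:10])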

4. Model Fitting, Regularization, and Validation

The fitting procedure specifically addresses the high-dimensional kernel regression while ensuring stability and generalization:

  • Regularization
    • Distinct regularization ($\sigma$) per observable: typical settings are $\sigma_\text{energy} = 10^{-3}$ eV/atom, $\sigma_\text{force} = 10^{-3}$ eV/Å, $\sigma_\text{virial} = 0.02$ eV/atom.
    • Additional small jitter ($\sim 10^{-8}$) added to the kernel diagonal.
  • Parallel Fit and Hyperparameter Optimization
    • Distributed descriptor computation and linear algebra across $\mathcal{O}(10^2)$ MPI ranks. Use ScaLAPACK QR for scalable solves (Klawohn et al., 2023).
    • Hyperparameters (cutoff, basis resolution, kernel exponent, regularization strength) tuned by cross-validation on held-out frames (10–20% test split).
  • Model Validation
    • Reported benchmarks (Rosenbrock et al., 2019):
      • On a liquid MD test set (6000 snapshots): energy RMSE $= 15.4$ meV/atom, force RMSE $= 224$ meV/Å, virial RMSE $= 8.3$ meV/Å$^3$.
      • Phonon spectrum integrated RMSE: $\sim 0.13$ THz, with train/test parity.
      • Transition path "swap" test: SOAP-GAP correctly predicts physical barriers without spurious minima.
  • Posterior Variance Diagnostics
    • The predictive variance $\sigma(x)$ identifies out-of-sample regions where model uncertainty is large (a simplified sketch follows).
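In the simplified dense setting of the regression sketch in Section 1, the predictive variance can be evaluated directly; QUIP/GAP computes the analogous quantity for the sparse model, so the formula below is illustrative rather than the production expression.

import numpy as np

def predictive_variance(p_new, P_train, eps=1e-3, zeta=4):
    # sigma^2(x) = k(x, x) - k_x^T (K + eps*I)^(-1) k_x  (dense GP, dot-product kernel)
    K = (P_train @ P_train.T) ** zeta
    k_x = (P_train @ p_new) ** zeta
    k_xx = float(np.dot(p_new, p_new)) ** zeta  # equals 1 for unit-normalized descriptors
    return k_xx - k_x @ np.linalg.solve(K + eps * np.eye(len(K)), k_x)

rng = np.random.default_rng(4)
P = rng.normal(size=(200, 32))
P /= np.linalg.norm(P, axis=1, keepdims=True)
seen = P[0]                       # environment present in the training set
unseen = rng.normal(size=32)
unseen /= np.linalg.norm(unseen)  # random direction far from the training data
print(predictive_variance(seen, P), predictive_variance(unseen, P))  # small vs. near 1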

5. Implementation and Practical Usage

A locally trained SOAP-GAP model is typically implemented using the QUIP/GAP suite (Klawohn et al., 2023), with a workflow as follows:

  • Configuration File Example:

atoms_filename    = my_local_dataset.xyz
gap_file          = local_SOAPGAP.xml
sparse_jitter     = 1e-8
default_kernel_regularisation = {0.001 0.05 0.1 0.0}
config_type_parameter_name    = config_type
config_type_kernel_regularisation = {surf:0.0005:0.02:0.05:0.0}
gap = {
  soap
    cutoff                   = 5.0
    cutoff_transition_width  = 0.5
    n_max                    = 8
    l_max                    = 6
    atom_gaussian_width      = 0.5
    soap_exponent            = 4
    n_sparse                 = 2000
    sparse_method            = cur_points
    covariance_type          = dot_product
    energy_scale             = 1.0
}

  • Fitting Invocation:

mpirun -np 64 gap_fit config_file=config

When focusing on a local region, ensure that configurations are tagged accordingly (e.g., config_type="surf").

  • Hardware and Scaling:
    • Descriptor memory scales as $N_\text{env} \cdot \dim(\mathbf{p})$, and kernel memory as $N_\text{targets} \cdot M$.
    • Fitting $N_\text{env} \sim 10^5$ environments with $M \sim 10^4$ sparse points is feasible using MPI parallelism and descriptor compression.
    • Training is typically completed within hours on $\mathcal{O}(100)$ CPU cores.
  • Post-Fitting Deployment:
    • Validated XML models are compatible with ASE, LAMMPS, and other atomistic simulation environments.
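A minimal deployment sketch with ASE, assuming the quippy-ase Python bindings are installed; the structure file name is a placeholder.

from ase.io import read
from quippy.potential import Potential  # provided by the quippy-ase package

# Load the fitted GAP model and attach it as an ASE calculator.
pot = Potential(param_filename="local_SOAPGAP.xml")

atoms = read("surface_slab.xyz")  # placeholder structure from the targeted local region
atoms.calc = pot
print(atoms.get_potential_energy(), atoms.get_forces().shape)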

6. Computational Cost and Comparative Performance

SOAP-GAP offers DFT-like accuracy but at significantly reduced computational cost relative to ab initio methods:

| Model | Typical RMSE (energy) | Relative evaluation speed | Suitable for |
| --- | --- | --- | --- |
| SOAP-GAP | 15.4 meV/atom | 10³–10⁴× faster than DFT | Off-lattice modeling |
| MTP | 15–20 meV/atom | >10× faster than GAP | Phase diagrams, large-scale sampling |

Evaluating a phase-diagram slice requiring $\sim 2$ billion potential calls is feasible with MTP but remains prohibitive with SOAP-GAP. GAP evaluations scale as $\mathcal{O}(N_\text{basis} \times N_\text{sparse})$ and are several times slower per atom than MTP, but orders of magnitude faster than on-the-fly DFT (Rosenbrock et al., 2019).

7. Limitations and Domain of Applicability

Locally trained SOAP-GAP models, while excelling in local accuracy and transferability within their training domain, are limited in their extrapolation ability outside of that space. The exclusion of off-lattice and liquid configurations in training restricts the model’s generalization to high-temperature or disordered phases. Model quality is strongly dependent on choice and diversity of training data, regularization strength, and descriptor completeness. The computational requirements for fitting SOAP-GAP scale steeply with descriptor and sparse set size, motivating compression and parallelism techniques introduced in recent GAP frameworks (Klawohn et al., 2023).

A plausible implication is that, for applications requiring extensive sampling (e.g., nested sampling, large-scale Monte Carlo), alternative polynomial potentials such as MTP offer a better cost-precision trade-off, while SOAP-GAP is preferred for simulations where off-lattice or local accuracy is paramount or where model uncertainty must be quantified.
