Locally Trained SOAP-GAP Model
- The paper introduces a machine-learned interatomic potential that uses SOAP descriptors and Gaussian process regression to predict energies, forces, and virials with quantum accuracy.
- The methodology leverages active learning and local filtering of datasets to enhance accuracy and transferability within specific configurational regions.
- Practical applications include large-scale molecular dynamics and Monte Carlo simulations, balancing computational efficiency with high-fidelity results.
A Locally Trained SOAP-GAP (Smooth Overlap of Atomic Positions - Gaussian Approximation Potential) model is a machine-learned interatomic potential constructed by fitting Gaussian process regression to local atomic environments, defined via the SOAP descriptor. This model targets specific configurational or spatial regions (such as defect cores, surfaces, or alloy concentrations) to achieve enhanced accuracy and transferability within the selected local domain. The model enables quantum-accurate prediction of energies, forces, and virials for atomistic simulations, with practical deployment in large-scale molecular dynamics, Monte Carlo sampling, and related techniques.
1. Theoretical Foundations
A SOAP-GAP model is predicated on the principle that the total potential energy of an atomic system can be decomposed into a sum of local atomic contributions, each a function of a feature vector derived from the local environment:

$$E_{\text{total}} = \sum_i \varepsilon_i,$$

where each $\varepsilon_i$ is modeled as a function of the local atomic neighborhood using the SOAP representation and a kernel-based regression within the Gaussian process (GP) framework.
Neighbor Density and Basis Expansion
The local environment of atom $i$ is encoded via a smeared neighbor density

$$\rho_i^{Z}(\mathbf{r}) = \sum_{j \in Z} \exp\!\left(-\frac{\lvert \mathbf{r} - \mathbf{r}_{ij} \rvert^2}{2\sigma_{\text{at}}^2}\right) f_{\text{cut}}(r_{ij}),$$

where $Z$ is the chemical identity of the neighbors, $\sigma_{\text{at}}$ is the atomic Gaussian width, and $f_{\text{cut}}$ is a smooth radial cutoff function. This density is projected onto a set of orthogonal radial basis functions $g_n(r)$ and spherical harmonics $Y_{lm}(\hat{\mathbf{r}})$:

$$\rho_i^{Z}(\mathbf{r}) = \sum_{nlm} c^{\,iZ}_{nlm}\, g_n(r)\, Y_{lm}(\hat{\mathbf{r}}).$$
SOAP Power Spectrum
Permutationally, translationally, and rotationally invariant features are constructed from the expansion coefficients as the SOAP power spectrum:

$$p^{\,i}_{Z_1 Z_2 n n' l} = \pi \sqrt{\frac{8}{2l+1}} \sum_m \left(c^{\,iZ_1}_{nlm}\right)^{*} c^{\,iZ_2}_{n'lm}.$$

All components $p^{\,i}_{Z_1 Z_2 n n' l}$ are concatenated into a single descriptor vector $\mathbf{q}_i$ for each atomic environment.
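For illustration, the power-spectrum vector $\mathbf{q}_i$ can be computed with the independent dscribe library; this is a minimal sketch assuming dscribe is installed (its parameter names, shown here following recent releases, are not part of the GAP code itself):

```python
# Sketch: SOAP power-spectrum descriptors via the dscribe library.
from ase.build import bulk
from dscribe.descriptors import SOAP

soap = SOAP(
    species=["Ag", "Pd"],   # chemical identities Z
    r_cut=5.0,              # cutoff radius (Angstrom)
    n_max=8, l_max=6,       # radial / angular basis resolution
    sigma=0.5,              # atomic Gaussian width (Angstrom)
    periodic=True,
)

atoms = bulk("Ag", "fcc", a=4.09) * (2, 2, 2)  # toy fcc cell
Q = soap.create(atoms)      # (n_atoms, n_features) descriptor matrix
print(Q.shape)
```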
SOAP-GAP Kernel
Similarity between environments is measured by a normalized dot-product polynomial kernel:

$$k(\mathbf{q}_i, \mathbf{q}_j) = \left(\frac{\mathbf{q}_i \cdot \mathbf{q}_j}{\lvert \mathbf{q}_i \rvert\, \lvert \mathbf{q}_j \rvert}\right)^{\zeta},$$

with typical exponents $\zeta = 2$–$4$. Normalization ensures $k(\mathbf{q}, \mathbf{q}) = 1$.
Gaussian Process Regression
Each local atomic energy is expanded over $M$ "sparse" (inducing) environments as

$$\varepsilon_i = \sum_{s=1}^{M} \alpha_s \, k(\mathbf{q}_i, \mathbf{q}_s),$$

with weights $\boldsymbol{\alpha}$ found by solving the regularized linear system

$$\left(K_{MM} + K_{MN} \Lambda^{-1} K_{NM}\right) \boldsymbol{\alpha} = K_{MN} \Lambda^{-1} \mathbf{y},$$

where $K_{MM}$ and $K_{NM}$ are the kernel matrices among sparse points and between training and sparse points, $\mathbf{y}$ contains target observables (energies, forces, virials), and $\Lambda$ encapsulates regularization parameters for the different observables.
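To make the sparse regression concrete, here is a minimal numpy sketch with toy descriptors and energy-only targets (variable names such as `K_NM`, `Lambda_inv`, and `sigma_E` are illustrative, not GAP API; a real fit also includes force and virial rows via derivative kernels):

```python
# Minimal sketch of the SOAP-GAP kernel and sparse GP solve.
import numpy as np

def soap_kernel(Qa, Qb, zeta=4):
    """Normalized dot-product polynomial kernel between descriptor rows."""
    Qa = Qa / np.linalg.norm(Qa, axis=1, keepdims=True)
    Qb = Qb / np.linalg.norm(Qb, axis=1, keepdims=True)
    return (Qa @ Qb.T) ** zeta

rng = np.random.default_rng(0)
Q, Q_s = rng.random((500, 60)), rng.random((50, 60))  # toy training / sparse descriptors
y = rng.random(500)                                   # toy per-atom energy targets

K_NM, K_MM = soap_kernel(Q, Q_s), soap_kernel(Q_s, Q_s)

sigma_E = 1e-3                                        # energy regularization (eV/atom)
Lambda_inv = np.full(len(y), 1.0 / sigma_E**2)        # diagonal of Lambda^{-1}

# (K_MM + K_MN Lambda^{-1} K_NM) alpha = K_MN Lambda^{-1} y, plus sparse jitter
A = K_MM + (K_NM.T * Lambda_inv) @ K_NM + 1e-8 * np.eye(len(Q_s))
alpha = np.linalg.solve(A, (K_NM.T * Lambda_inv) @ y)

epsilon = soap_kernel(Q, Q_s) @ alpha                 # predicted local energies
```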
2. Data Preparation and Active Learning Workflow
Locally trained SOAP-GAP models require an appropriately curated and labeled dataset. The principal steps in database construction, as implemented for Ag-Pd alloys (Rosenbrock et al., 2019), are:
- Initial Dataset Generation
- Enumerate fcc- and bcc-based supercells with up to 4 atoms per cell for multiple compositions.
- Preliminary MTP Fit
- Fit a moment tensor potential (MTP), a polynomial-basis interatomic potential, to DFT-computed structures.
- Active Learning Iteration
- Relax each structure; high extrapolation grades trigger DFT calculation and inclusion in the training set.
- Expansion
- Extend to larger cells (up to 12 atoms), yielding 10,850 structures and 774 unique DFT-relaxed configurations.
- High-Precision DFT Evaluation
- All 774 training configurations were re-evaluated with tight DFT settings (increased k-point density and a strict EDIFF electronic-convergence threshold).
- Locality Filtering
- Restrict training references by "config_type" (e.g., "surf" for surfaces, or selected regions in alloys) when fitting a local model (Klawohn et al., 2023); a scripting sketch follows this list.
No off-lattice or liquid data is included during training for solid-phase models.
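Tagging and locality filtering can be scripted with ASE; the sketch below is illustrative (the file names and the vacuum-based tagging rule are assumptions, not from the source workflow):

```python
# Sketch: tag frames with a "config_type" label and filter the local
# training subset for a locally trained model.
import numpy as np
from ase.io import read, write

frames = read("all_dft_frames.xyz", index=":")  # hypothetical master dataset

for atoms in frames:
    # Problem-specific rule (illustrative): frames with >8 Angstrom of
    # vacuum along z are labeled surface configurations, the rest bulk.
    vacuum = atoms.cell.lengths()[2] - np.ptp(atoms.positions[:, 2])
    atoms.info["config_type"] = "surf" if vacuum > 8.0 else "bulk"

# Keep only the surface region as training references.
write("my_local_dataset.xyz",
      [a for a in frames if a.info["config_type"] == "surf"])
```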
3. Descriptor, Sparsification, and Kernel Construction
Descriptor construction and sparsification are the computational bottlenecks for SOAP-GAP models. The procedure is as follows:
- SOAP Descriptor Evaluation
- For each atom in each frame, evaluate the descriptor $\mathbf{q}_i$ within a cutoff radius (e.g., $4.5$–$5.0$ Å).
- Descriptor dimensionality scales as $\mathcal{O}(S^2 n_{\max}^2 l_{\max})$, where $S$ is the number of chemical elements.
- Compression Techniques
- Apply an embedding that maps chemical (and optionally radial) indices onto a lower-dimensional representation.
- Use reduced (compressed) forms of the power spectrum to achieve linear, rather than quadratic, scaling with $S$.
- Sparse Point Selection
- Choose $M$ representative environments (e.g., $M = 500$–$2000$) via CUR decomposition, $k$-means clustering, random, or uniform sampling (see the sketch after this list).
- Remove near-duplicates according to a jitter threshold.
- Kernel Assembly
- Build kernel blocks $K_{MM}$ (sparse–sparse) and $K_{NM}$ (training–sparse).
- Compute derivatives for force and stress learning.
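A leverage-score CUR selection, one common variant of the CUR decomposition named above, can be sketched as follows (illustrative numpy implementation; GAP's internal `cur_points` method may differ in detail):

```python
# Sketch: CUR-style selection of M representative environments from the
# descriptor matrix Q (n_env x n_feat), using row leverage scores.
import numpy as np

def cur_select(Q, M, k=20, seed=0):
    """Sample M rows with probability proportional to leverage scores
    computed from the top-k left singular vectors of Q."""
    rng = np.random.default_rng(seed)
    U, _, _ = np.linalg.svd(Q, full_matrices=False)
    lev = (U[:, :k] ** 2).sum(axis=1)
    return rng.choice(len(Q), size=M, replace=False, p=lev / lev.sum())

rng = np.random.default_rng(1)
Q = rng.random((5000, 200))   # toy descriptor matrix
idx = cur_select(Q, M=500)
Q_sparse = Q[idx]             # environments entering K_MM
```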
4. Model Fitting, Regularization, and Validation
The fitting procedure specifically addresses the high-dimensional kernel regression while ensuring stability and generalization:
- Regularization
- Distinct regularization $\sigma$ per observable: typical settings are $\sigma_E = 10^{-3}$ eV/atom for energies, $\sigma_F = 0.05$ eV/Å for forces, and $\sigma_V = 0.1$ eV/atom for virials (matching the configuration example in Section 5).
- An additional small jitter (e.g., $10^{-8}$) is added to the kernel diagonal for numerical stability.
- Parallel Fit and Hyperparameter Optimization
- Distributed descriptor computation and linear algebra across MPI ranks. Use ScaLAPACK QR for scalable solves (Klawohn et al., 2023).
- Hyperparameters (cutoff, basis resolution, kernel exponent, regularization strength) tuned by cross-validation on held-out frames (10–20% test split).
- Model Validation
- Reported benchmarks (Rosenbrock et al., 2019):
- On a liquid MD test set (6000 snapshots): low RMSEs for energies (meV/atom scale), forces (meV/Å scale), and virials.
- Integrated phonon-spectrum RMSE (in THz) shows train/test parity, i.e., no overfitting.
- Transition path "swap" test: SOAP-GAP correctly predicts physical barriers without spurious minima.
- Posterior Variance Diagnostics
- The GP predictive variance, $\sigma^2(\mathbf{q}) = k(\mathbf{q}, \mathbf{q}) - \mathbf{k}_M^\top K_{MM}^{-1} \mathbf{k}_M + \mathbf{k}_M^\top \left(K_{MM} + K_{MN} \Lambda^{-1} K_{NM}\right)^{-1} \mathbf{k}_M$, where $\mathbf{k}_M \equiv \mathbf{k}_M(\mathbf{q})$ collects the kernel values to the sparse points, identifies out-of-sample regions where model uncertainty is large (see the sketch below).
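Continuing the numpy sketch from Section 1 (same illustrative names; `soap_kernel`, `Q_s`, `K_MM`, and the regularized matrix `A` are assumed in scope), the posterior variance for a batch of query descriptors can be written as:

```python
# Sketch: DTC/projected-process predictive variance for query descriptors.
import numpy as np

def posterior_variance(Q_new, Q_s, K_MM, A, jitter=1e-8):
    k_new = soap_kernel(Q_new, Q_s)         # (n_new, M) cross-kernels
    eye = jitter * np.eye(len(K_MM))        # stabilize the K_MM solve
    prior = np.ones(len(Q_new))             # k(q, q) = 1 for the normalized kernel
    reduction = np.einsum("ij,ij->i", k_new,
                          np.linalg.solve(K_MM + eye, k_new.T).T)
    correction = np.einsum("ij,ij->i", k_new,
                           np.linalg.solve(A, k_new.T).T)
    return prior - reduction + correction   # large values flag extrapolation
```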
5. Implementation and Practical Usage
A locally trained SOAP-GAP model is typically implemented using the QUIP/GAP suite (Klawohn et al., 2023), with a workflow as follows:
- Configuration File Example:
```
atoms_filename = my_local_dataset.xyz
gap_file = local_SOAPGAP.xml
sparse_jitter = 1e-8
default_kernel_regularisation = {0.001 0.05 0.1 0.0}
config_type_parameter_name = config_type
config_type_kernel_regularisation = {surf:0.0005:0.02:0.05:0.0}
gap = {
    soap
    cutoff = 5.0
    cutoff_transition_width = 0.5
    n_max = 8
    l_max = 6
    atom_gaussian_width = 0.5
    soap_exponent = 4
    n_sparse = 2000
    sparse_method = cur_points
    covariance_type = dot_product
    energy_scale = 1.0
}
```
- Fitting Invocation:
```
mpirun -np 64 gap_fit config_file=config
```
When focusing on a local region, ensure that configurations are tagged accordingly (e.g., config_type="surf").
- Hardware and Scaling:
- Descriptor memory scales roughly with the number of environments times the descriptor dimension ($\mathcal{O}(Nd)$), and kernel memory with the number of training rows times sparse points ($\mathcal{O}(NM)$).
- Fitting large training databases is feasible using MPI parallelism and descriptor compression.
- Training is typically completed within hours on a parallel CPU cluster.
- Post-Fitting Deployment:
- Validated XML models are compatible with ASE, LAMMPS, and other atomistic simulation environments (see the usage sketch below).
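For example, a fitted model can be loaded as an ASE calculator via quippy; this minimal sketch assumes the quippy-ase package is installed and reuses the XML file name from the configuration above:

```python
# Sketch: deploy the fitted GAP as an ASE calculator via quippy.
from ase.build import bulk
from quippy.potential import Potential

calc = Potential(param_filename="local_SOAPGAP.xml")  # file produced by gap_fit
atoms = bulk("Pd", "fcc", a=3.89, cubic=True)
atoms.calc = calc

print(atoms.get_potential_energy())  # eV
print(atoms.get_forces())            # eV/Angstrom
```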
6. Computational Cost and Comparative Performance
SOAP-GAP offers DFT-like accuracy but at significantly reduced computational cost relative to ab initio methods:
| Model | Typical RMSE (energy) | Relative Evaluation Speed | Suitable for |
|---|---|---|---|
| SOAP-GAP | $15.4$ meV/atom | 10³–10⁴× faster than DFT | Off-lattice modeling |
| MTP | $15$–$20$ meV/atom | several-fold faster than GAP | Phase diagrams, large sampling |
Evaluating a phase-diagram slice, which can require on the order of a billion potential calls, is feasible with MTP but remains prohibitive with SOAP-GAP. GAP evaluations scale linearly in the number of sparse points $M$ and are several times slower per atom than MTP, but orders of magnitude faster than on-the-fly DFT (Rosenbrock et al., 2019).
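To see why, consider a rough operation count per atom per evaluation using the Section 5 parameters ($M = 2000$ sparse points; descriptor dimension $d \approx 10^3$ for two species with $n_{\max} = 8$, $l_{\max} = 6$); the totals below are illustrative, not values from the paper:

$$t_{\text{atom}} \propto M \cdot d \approx 2000 \times 10^{3} = 2 \times 10^{6} \ \text{multiply–adds},$$

so $10^9$ calls on $\sim$100-atom configurations imply roughly $2 \times 10^{17}$ operations for the kernel products alone, which is why such sampling campaigns favor cheaper polynomial potentials.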
7. Limitations and Domain of Applicability
Locally trained SOAP-GAP models, while excelling in local accuracy and transferability within their training domain, are limited in their extrapolation ability outside of that space. The exclusion of off-lattice and liquid configurations in training restricts the model’s generalization to high-temperature or disordered phases. Model quality is strongly dependent on choice and diversity of training data, regularization strength, and descriptor completeness. The computational requirements for fitting SOAP-GAP scale steeply with descriptor and sparse set size, motivating compression and parallelism techniques introduced in recent GAP frameworks (Klawohn et al., 2023).
A plausible implication is that, for applications requiring extensive sampling (e.g., nested sampling, large-scale Monte Carlo), alternative polynomial potentials such as MTP offer a better cost-precision trade-off, while SOAP-GAP is preferred for simulations where off-lattice or local accuracy is paramount, and model uncertainty needs quantification.