SOAP: Smooth Overlap of Atomic Positions

Updated 16 April 2026

SOAP is a rigorous atom-centered descriptor that transforms local 3D atomic environments into high-dimensional, rotation-, translation-, and permutation-invariant feature vectors.
Its formulation expands atom-centered neighbor densities in radial functions and spherical harmonics, then constructs a power spectrum that captures up to three-body correlations.
Applications of SOAP span molecular energy prediction, phase mapping, and interatomic potential modeling, with adjustable hyperparameters balancing accuracy and computational cost.

The Smooth Overlap of Atomic Positions (SOAP) is an atom-centered structural descriptor that encodes the three-dimensional local environment around an atom as a high-dimensional, rotation-, translation-, and permutation-invariant feature vector suitable for quantitative comparison and efficient machine learning of atomic-scale properties of molecules and solids. SOAP is an established representation for atomistic machine learning across chemistry, materials science, and molecular physics, offering a systematic and differentiable mapping from geometric configurations to feature space; because the power spectrum captures at most three-body correlations, the mapping does not resolve every higher-order feature (see Section 7).

1. Mathematical Construction of SOAP

The foundational object in SOAP is the atom-centered local neighbor density: $\rho_i(\mathbf{r}) = \sum_{j \in \mathrm{neigh}(i)} \exp\left(-\frac{|\mathbf{r} - \mathbf{r}_{ij}|^2}{2\sigma^2}\right) f_\mathrm{cut}(|\mathbf{r}_{ij}|)$, where $\mathbf{r}_{ij} = \mathbf{r}_j - \mathbf{r}_i$ is the vector from atom $i$ to neighbor $j$, $\sigma$ is the Gaussian width, and $f_\mathrm{cut}$ is a smooth cutoff function limiting the range to $r_\mathrm{cut}$ (McCorkindale et al., 2020, De et al., 2015, Rosenbrock et al., 2019, Himanen et al., 2019).
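The density defined above can be evaluated pointwise with a few lines of code. The following is a minimal sketch, not a reference implementation: the cosine form of $f_\mathrm{cut}$ is an assumption (the text only requires a smooth cutoff), and the function names, neighbor positions, and parameter values are illustrative.

```python
import math

def cosine_cutoff(r, r_cut):
    """Assumed smooth cutoff: 1 at r = 0, decaying to 0 at r = r_cut."""
    if r >= r_cut:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * r / r_cut))

def neighbor_density(r, neighbor_vectors, sigma, r_cut):
    """Evaluate rho_i at point r, given relative to the central atom i.

    neighbor_vectors holds r_ij = r_j - r_i as (x, y, z) tuples.
    """
    total = 0.0
    for r_ij in neighbor_vectors:
        dist_to_center = math.sqrt(sum(c * c for c in r_ij))
        weight = cosine_cutoff(dist_to_center, r_cut)
        if weight == 0.0:
            continue  # neighbor lies outside the cutoff sphere
        sq_dist = sum((a - b) ** 2 for a, b in zip(r, r_ij))
        total += math.exp(-sq_dist / (2.0 * sigma ** 2)) * weight
    return total

# Illustrative environment: the third neighbor lies beyond r_cut and is ignored.
neighbors = [(1.0, 0.0, 0.0), (0.0, 1.5, 0.0), (0.0, 0.0, 7.0)]
rho_at_neighbor = neighbor_density((1.0, 0.0, 0.0), neighbors, sigma=0.5, r_cut=5.0)
```

Evaluating the density at a neighbor position gives a value close to that neighbor's cutoff weight, since the other Gaussians contribute little at this $\sigma$.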

This neighbor density is then expanded in an orthonormal basis formed by radial functions $g_n(r)$ and spherical harmonics $Y_{lm}(\hat{\mathbf{r}})$: $\rho_i(\mathbf{r}) = \sum_{n=1}^{N} \sum_{l=0}^{L} \sum_{m=-l}^{l} c^i_{nlm}\;g_n(r)\;Y_{lm}(\hat{\mathbf{r}})$ with expansion coefficients obtained by the projection

$c^i_{nlm} = \int \mathrm{d}\mathbf{r}\; g_n(r)\, Y^*_{lm}(\hat{\mathbf{r}})\, \rho_i(\mathbf{r})$

To obtain rotational invariance, the so-called power spectrum is constructed: $p^i_{nn'l} = \sum_{m=-l}^{l} c^i_{nlm}\,\left(c^i_{n'lm}\right)^*$ (up to a convention-dependent normalization factor). This transformation eliminates explicit orientation dependence, producing a descriptor vector $\mathbf{p}_i$ that encodes up to three-body correlations of the atomic density around the central atom (McCorkindale et al., 2020, De et al., 2015, Bartók et al., 2012, Willatt et al., 2018).
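The contraction over $m$ is what removes orientation dependence, and this can be checked numerically. The sketch below assumes the power spectrum takes the form $p_{nn'l} = \sum_m c_{nlm} c^*_{n'lm}$ without a normalization prefactor, and it uses random complex numbers as stand-ins for real expansion coefficients. It tests only rotations about the $z$ axis, which multiply each $c_{nlm}$ by a phase $e^{-im\alpha}$ (the sign depends on convention); a check over arbitrary rotations would need Wigner D-matrices.

```python
import cmath
import random

def power_spectrum(c, n_radial, l_max):
    """p[(n, n2, l)] = sum over m of c[n][l][m] * conj(c[n2][l][m])."""
    p = {}
    for n in range(n_radial):
        for n2 in range(n_radial):
            for l in range(l_max + 1):
                p[(n, n2, l)] = sum(
                    a * b.conjugate() for a, b in zip(c[n][l], c[n2][l])
                )
    return p

def rotate_about_z(c, alpha):
    """Apply the phase exp(-i m alpha) that a z-rotation induces on c_{nlm}."""
    return [
        [
            [c_m * cmath.exp(-1j * m * alpha)
             for m, c_m in zip(range(-l, l + 1), c_l)]
            for l, c_l in enumerate(c_n)
        ]
        for c_n in c
    ]

# Random stand-in coefficients: c[n][l] has 2l + 1 entries for m = -l..l.
random.seed(0)
n_radial, l_max = 3, 2
c = [
    [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(2 * l + 1)]
     for l in range(l_max + 1)]
    for _ in range(n_radial)
]

p = power_spectrum(c, n_radial, l_max)
p_rot = power_spectrum(rotate_about_z(c, 0.7), n_radial, l_max)
max_diff = max(abs(p[k] - p_rot[k]) for k in p)  # zero up to round-off
```

The phases cancel term by term in each product $c_{nlm} c^*_{n'lm}$, so the rotated and unrotated spectra agree to floating-point precision.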

Comparison between two atomic environments is performed using the dot product of their normalized power spectra, forming a positive-definite kernel: $k(i,j) = \left(\frac{\mathbf{p}_i \cdot \mathbf{p}_j}{|\mathbf{p}_i|\,|\mathbf{p}_j|}\right)^{\zeta}$ where $\zeta$ (typically a small positive integer) acts as a sharpening exponent (Rosenbrock et al., 2019, Himanen et al., 2019).
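As a minimal sketch of this environment kernel, the function below normalizes two real-valued power-spectrum vectors, takes their dot product, and raises it to a sharpening exponent. The name `soap_kernel`, the default `zeta=2`, and the toy vectors are illustrative assumptions rather than values taken from the cited works.

```python
import math

def soap_kernel(p_a, p_b, zeta=2):
    """Normalized dot-product kernel raised to the power zeta."""
    dot = sum(x * y for x, y in zip(p_a, p_b))
    norm_a = math.sqrt(sum(x * x for x in p_a))
    norm_b = math.sqrt(sum(y * y for y in p_b))
    return (dot / (norm_a * norm_b)) ** zeta

# Toy vectors standing in for flattened power spectra.
p_a = [1.0, 2.0, 2.0]
p_b = [2.0, 1.0, 2.0]
k_ab = soap_kernel(p_a, p_b)          # (8/9)**2
k_aa = soap_kernel(p_a, p_a)          # self-similarity is 1
k_scaled = soap_kernel(p_a, [3.0 * x for x in p_a])  # scale-invariant
```

Because of the normalization, the kernel depends only on the angle between the two vectors; larger `zeta` pushes values for dissimilar environments toward zero.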

2. Hyperparameters, Basis Choices, and Implementation

SOAP's expressiveness and computational cost are governed by several key hyperparameters:

Gaussian width $\sigma$: Controls spatial resolution/smoothness (McCorkindale et al., 2020, De et al., 2015). Smaller values resolve fine structure but require higher angular cutoffs.
Cutoff radius $r_\mathrm{cut}$: Sets the environment size and is critical for capturing the relevant correlations; typical applications use cutoffs of up to about 6 Å, and ice polymorph discrimination uses larger cutoffs (McCorkindale et al., 2020, Maity et al., 2024).
Radial basis size $N$ and angular cutoff $L$: The numbers of radial and angular functions limit the completeness of the descriptor; typical values are reported in (McCorkindale et al., 2020, Rosenbrock et al., 2019).
Choice of radial basis functions: Either orthonormalized Gaussians or polynomials, with the projection evaluated analytically or by numerical quadrature (Himanen et al., 2019).

Implementations such as DScribe rely on analytical integrals for GTO radial bases and exploit recursive relations for efficient generation of both radial overlap integrals and angular functions (Himanen et al., 2019, Caro, 2019). Caro's separable radial-angular approximation enables a tenfold speedup and improves numerical stability for distant neighbors without degrading regression power (Caro, 2019).

3. Multi-Species Extensions and Compression Strategies

SOAP straightforwardly generalizes to multi-component systems by constructing separate densities $\rho_i^{\alpha}(\mathbf{r})$ for each element $\alpha$ and expanding in combined species-radial-angular bases. The resulting power spectrum includes cross-element terms $p^{i,\alpha\beta}_{nn'l} = \sum_{m} c^{i,\alpha}_{nlm}\,\left(c^{i,\beta}_{n'lm}\right)^*$ (De et al., 2015, Darby et al., 2021). The dimensionality grows as $\mathcal{O}(S^2)$ ($S$ is the number of species), motivating compression schemes.
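The quadratic growth with the number of species can be made concrete by counting power-spectrum components. The sketch below assumes a common counting convention in which the species and radial indices are merged into one channel index and symmetric pairs are stored once; the function name and the basis sizes used in the loop are illustrative, not recommendations.

```python
def power_spectrum_size(n_radial, l_max, n_species=1):
    """Count unique components p^{ab}_{n n' l}, storing symmetric pairs once."""
    channels = n_radial * n_species  # combined species-radial index
    return channels * (channels + 1) // 2 * (l_max + 1)

# Illustrative basis sizes; only the trend with n_species matters here.
sizes = {s: power_spectrum_size(n_radial=8, l_max=6, n_species=s)
         for s in (1, 2, 4, 8)}
# sizes == {1: 252, 2: 952, 4: 3696, 8: 14560}
```

Doubling the number of species roughly quadruples the feature count, which is the scaling that the compression schemes discussed next aim to reduce.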

Recent work has demonstrated rank-deficient structure in the power spectrum, enabling lossless compression to $\mathcal{O}(S)$ scaling in the number of species $S$ via Gram-matrix techniques and projection (Darby et al., 2021). Further, reduced or agnostic variants combine element-specific and element-agnostic densities to lower scaling at small additional loss in fidelity, facilitating high-element-count applications (e.g., HEAs) (Darby et al., 2021).

4. Algorithms, Kernels, and Dataset-Level Similarity

SOAP kernels can be composed to compare whole molecules or periodic structures. The REMatch (Regularized Entropy Match) kernel provides a principled means of aggregating per-environment similarities using an optimal transport plan $\mathbf{P}^\gamma$ over all atomic environments in two structures $A$ and $B$: $K^\gamma(A,B) = \sum_{ij} P^\gamma_{ij}\, C_{ij}(A,B)$, with $\mathbf{P}^\gamma = \mathrm{argmin}_{\mathbf{P} \in \mathcal{U}(N_A, N_B)} \sum_{ij} P_{ij}\left(1 - C_{ij} + \gamma \ln P_{ij}\right)$, where $\mathbf{C}(A,B)$ is the normalized environment-kernel matrix and $\mathcal{U}(N_A, N_B)$ constrains $\mathbf{P}$ to be doubly stochastic (De et al., 2015, McCorkindale et al., 2020). The entropic regularization $\gamma$ tunes the assignment sharpness, and the Sinkhorn–Knopp algorithm is used for practical solution.
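A compact way to see how the regularization strength changes the result is to run the Sinkhorn iteration on a small environment-kernel matrix. This is a minimal sketch under stated assumptions: the cost is taken as $1 - C_{ij}$, the target marginals are uniform ($1/N_A$ per row, $1/N_B$ per column), and the iteration count is fixed instead of using a convergence test. The function name and the toy matrix are hypothetical.

```python
import math

def rematch_kernel(C, gamma, n_iter=500):
    """Entropy-regularized match between two sets of environments.

    C[i][j] is the normalized kernel between environment i of structure A
    and environment j of structure B; gamma is the entropic regularization.
    """
    n_a, n_b = len(C), len(C[0])
    # Gibbs kernel of the assumed cost 1 - C_ij.
    K = [[math.exp(-(1.0 - C[i][j]) / gamma) for j in range(n_b)]
         for i in range(n_a)]
    a, b = 1.0 / n_a, 1.0 / n_b  # uniform target marginals
    u = [1.0] * n_a
    v = [1.0] * n_b
    for _ in range(n_iter):
        # Alternate row and column rescaling (Sinkhorn-Knopp).
        u = [a / sum(K[i][j] * v[j] for j in range(n_b)) for i in range(n_a)]
        v = [b / sum(K[i][j] * u[i] for i in range(n_a)) for j in range(n_b)]
    P = [[u[i] * K[i][j] * v[j] for j in range(n_b)] for i in range(n_a)]
    return sum(P[i][j] * C[i][j] for i in range(n_a) for j in range(n_b))

# Toy case: a structure compared with itself, two distinct environments.
C = [[1.0, 0.2],
     [0.2, 1.0]]
k_sharp = rematch_kernel(C, gamma=0.05)   # near best-match limit, close to 1
k_smooth = rematch_kernel(C, gamma=100.0)  # near average-kernel limit, close to 0.6
```

Small $\gamma$ concentrates the plan on the best one-to-one matches, while large $\gamma$ spreads it uniformly so the result approaches the mean of all entries of $\mathbf{C}$. For very small $\gamma$ the exponentials can underflow, and a log-domain formulation would be more robust.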

Kernel ridge regression (KRR) models based on SOAP-REMatch kernels have achieved sub-kcal/mol errors for molecular atomization energies—matching or exceeding previous Coulomb-matrix and deep neural network models (De et al., 2015).

5. Advanced and Generalized SOAP Variants

Tensorial (λ-SOAP): SOAP can be generalized to predict vector or tensor properties by constructing symmetry-adapted kernels that transform covariantly with the learning target under rotations. The $\lambda$-SOAP(2) formalism contracts Clebsch–Gordan coupled spherical harmonics and supports accurate regression of molecular polarizabilities, dipoles, and response tensors (Grisafi et al., 2019).
Time-Resolved and Structure-Dynamics Analysis: TimeSOAP and combinations of SOAP with dynamical descriptors (e.g., LENS) quantify time-dependent changes in local environment, enabling detection of rare events, interface dynamics, and phase transitions (Caruso et al., 2023, Crippa et al., 2023).
Anisotropic SOAP (AniSOAP): For coarse-grained or orientational degrees of freedom, SOAP can incorporate oriented multivariate Gaussians, yielding power spectra and kernels that encode both position and orientation (with analytic integration for ellipsoids, coarse-grained mesogens, etc.) (Lin et al., 2024).
Compression for Scalability: Lossless and controlled-lossy compressions reduce the SOAP power spectrum dimension from $\mathcal{O}(S^2)$ to $\mathcal{O}(S)$ in the number of species $S$, enabling fitting ML potentials for systems with up to 40 chemical species without significant loss in regression accuracy (Darby et al., 2021).
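The time-resolved analysis listed above can be illustrated by tracking how much one atom's SOAP vector changes between consecutive frames. The sketch below is an assumption-laden illustration, not the published TimeSOAP implementation: it uses one common SOAP distance, $d = \sqrt{2 - 2\,\hat{\mathbf{p}}_a \cdot \hat{\mathbf{p}}_b}$ on normalized vectors, and the toy trajectory is made up and does not come from real SOAP spectra.

```python
import math

def soap_distance(p_a, p_b):
    """Distance between two power-spectrum vectors after normalization."""
    dot = sum(x * y for x, y in zip(p_a, p_b))
    norm_a = math.sqrt(sum(x * x for x in p_a))
    norm_b = math.sqrt(sum(y * y for y in p_b))
    cosine = dot / (norm_a * norm_b)
    return math.sqrt(max(0.0, 2.0 - 2.0 * cosine))  # clamp round-off

def time_variation(trajectory):
    """Per-step change of one atom's SOAP vector along a trajectory."""
    return [soap_distance(trajectory[t], trajectory[t + 1])
            for t in range(len(trajectory) - 1)]

# Toy trajectory: the environment is static, changes abruptly, then is static.
trajectory = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
signal = time_variation(trajectory)
```

A spike in the resulting signal marks a frame where the local environment rearranges, which is the kind of event such analyses aim to detect.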

6. Practical Applications and Empirical Performance

SOAP descriptors serve as input features for a variety of atomic-scale regression tasks:

Potential energy surfaces (PES): SOAP-driven Gaussian Approximation Potentials (GAP) achieve near-DFT accuracy for energies, forces, and virials of alloys and complex bulk systems (Rosenbrock et al., 2019).
Molecular property prediction: REMatch–SOAP kernels deliver chemical-accuracy (<1 kcal/mol MAE) in atomization energies across organic molecule datasets and generalize better in scaffold splits compared to both 2D graph and 3D radial-fingerprint baselines (McCorkindale et al., 2020, De et al., 2015).
Phase mapping and polymorph discrimination: High-dimensional SOAP vectors combined with dimensionality reduction (e.g., variational autoencoders) unambiguously resolve all known ice polymorphs and liquids, outperforming linear PCA and conventional order parameters (Maity et al., 2024).
Zeolites and complex frameworks: Kernel PCA of SOAP descriptors maps the structural landscape to correlatively highlight motif contributions to density and lattice energy, outperforming classical order-parameter sets (Helfrecht et al., 2019).
Structure-dynamics coupling: SOAP+LENS and TimeSOAP approaches reveal dynamical domains, transition pathways, and rare-events in molecular systems at single-atom resolution (Caruso et al., 2023, Crippa et al., 2023).

7. Limitations, Information Content, and Best-Practice Recommendations

SOAP features are sensitive to at most three-body correlations by design. Finite “quasi-constant” manifolds in SOAP feature space make some four-body (e.g., torsional) interactions invisible, which manifests as failures of machine-learned force fields for torsional energetics (Parsaeifard et al., 2021). Introducing explicit four-body terms or switching to many-body (Overlap Matrix) descriptors is required for rigorous learning in torsion-sensitive systems.

Systematic tests reveal that SOAP similarity is not strictly equivalent to traditional geometric order parameters (bond angles, coordination, hydrogen bonding) in water and other disordered systems; each descriptor encodes complementary aspects of the structure. Reducing the Gaussian width in SOAP increases information content, but at the computational cost of larger feature vectors (Donkor et al., 2022). Including angular information (angular cutoff $L > 0$) is essential; purely radial (E3FP-like) descriptors are sub-optimal (McCorkindale et al., 2020).

For optimal accuracy and efficiency, recommended practice is to (1) tune (or cross-validate) $\sigma$, $r_\mathrm{cut}$, $N$, and $L$ to the complexity of the target environments, (2) start from the standard ranges reported in the cited works for molecular and condensed-phase systems, covering $N$, $L$, $\sigma$ (in Å), and $r_\mathrm{cut}$ (in Å), (3) exploit lossless compression for multi-element systems, and (4) normalize feature vectors prior to any regression or classification (McCorkindale et al., 2020, Darby et al., 2021, Maity et al., 2024).

References

"Investigating 3D Atomic Environments for Enhanced QSAR" (McCorkindale et al., 2020)
"Comparing molecules and solids across structural and alchemical space" (De et al., 2015)
"Machine-learned Interatomic Potentials for Alloys and Alloy Phase Diagrams" (Rosenbrock et al., 2019)
"TimeSOAP: Tracking high-dimensional fluctuations in complex molecular systems via time-variations of SOAP spectra" (Caruso et al., 2023)
"IceCoder: Identification of Ice phases in molecular simulation using variational autoencoder" (Maity et al., 2024)
"On representing chemical environments" (Bartók et al., 2012)
"A New Kind of Atlas of Zeolite Building Blocks" (Helfrecht et al., 2019)
"Quantum chemical roots of machine-learning molecular similarity descriptors" (Gugler et al., 2022)
"Manifolds of quasi-constant SOAP and ACSF fingerprints and the resulting failure to machine learn four-body interactions" (Parsaeifard et al., 2021)
"Do Machine-Learning Atomic Descriptors and Order Parameters Tell the Same Story? The Case of Liquid Water" (Donkor et al., 2022)
"Atom-Density Representations for Machine Learning" (Willatt et al., 2018)
"Compressing local atomic neighbourhood descriptors" (Darby et al., 2021)
"Machine-learning of atomic-scale properties based on physical principles" (Ceriotti et al., 2019)
"Atomic-scale representation and statistical learning of tensorial properties" (Grisafi et al., 2019)
"Machine learning of microscopic structure-dynamics relationships in complex molecular systems" (Crippa et al., 2023)
"DScribe: Library of Descriptors for Machine Learning in Materials Science" (Himanen et al., 2019)
"Optimizing many-body atomic descriptors for enhanced computational performance of machine learning based interatomic potentials" (Caro, 2019)
"Expanding Density-Correlation Machine Learning Representations for Anisotropic Coarse-Grained Particles" (Lin et al., 2024)