
Permutationally Invariant Polynomial Regression

Updated 3 December 2025
  • PIP Regression is a framework that constructs regression models ensuring invariance under input permutations, which is critical for symmetric physical systems.
  • It employs algebraic techniques like minimal algebra generators and Hironaka decomposition to generate, purify, and prune polynomial bases for accurate potential energy surface fitting.
  • The method supports linear and neural network implementations, yielding high accuracy with efficient computation across applications from small clusters to condensed-phase systems.

Permutationally Invariant Polynomial (PIP) Regression is a rigorous framework for constructing regression models and machine-learned representations that enforce invariance under permutations of input variables, typically identical atoms or features. The methodology is foundational in high-dimensional modeling of symmetric physical systems, especially for potential energy surface (PES) fitting in quantum chemistry, but its abstract formulation renders it broadly applicable in any domain requiring permutation symmetry.

1. Algebraic Structure and Mathematical Foundations

Let $V\cong \mathbb{C}^d$ be a complex vector space carrying an action of a group $G$ (often $G$ is a product of symmetry groups, e.g., $SO(d,\mathbb{C})\times S_n$). The coordinate algebra $\mathbb{C}[V^n]$ consists of polynomials in $n$ vectors, and the subalgebra of $G$-invariant polynomials is

$$\mathbb{C}[V^n]^G = \{ f \in \mathbb{C}[V^n] \mid f(g\cdot p_1, \ldots, g\cdot p_n) = f(p_1,\ldots,p_n),\;\forall g\in G \}.$$

Two central constructions organize the generators of these invariant polynomials:

  • Minimal Algebra Generators (MAG): A finite set $\{g_1,\dots,g_r\}$ such that every $f\in\mathbb{C}[V^n]^G$ is a polynomial function of the $g_i$ and none of the $g_i$ can be generated from the others. For physical applications (e.g., momenta with Lorentz and permutation invariance), the existence of such generators follows from Weyl’s theorem and algebraic geometry.
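  • Hironaka Decomposition (HD): A free-module decomposition of the invariant algebra:

$$\mathbb{C}[V^n]^G = \bigoplus_{k=1}^{s} \eta_k\,\mathbb{C}[\theta_1,\dots,\theta_m],$$

where the primaries $\theta_1,\dots,\theta_m$ are algebraically independent invariants and the secondaries $\eta_1,\dots,\eta_s$ form a module basis.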

Such generators and decompositions allow every $G$-invariant polynomial to be written in terms of a fixed, finite set of invariants (uniquely so in the Hironaka form), facilitating regression or function approximation while ensuring exact symmetry.
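As a small illustration, consider a standard textbook case that is not taken from the cited papers: the group $\mathbb{Z}_2$ acting on $\mathbb{C}^2$ by $(x,y)\mapsto(-x,-y)$. The invariant polynomials are exactly those containing only even-degree terms. The symbols $\theta_i$ (primaries) and $\eta_k$ (secondaries) below are labels chosen for this example.

```latex
% Minimal algebra generators: every invariant is a polynomial in these three.
\{\, x^2,\; y^2,\; xy \,\}

% Hironaka decomposition with primaries \theta_1 = x^2, \theta_2 = y^2
% and secondaries \eta_1 = 1, \eta_2 = xy:
\mathbb{C}[x,y]^{\mathbb{Z}_2}
  = \mathbb{C}[x^2, y^2] \;\oplus\; xy\,\mathbb{C}[x^2, y^2]

% Every invariant f therefore has a unique expression
f(x,y) = a(x^2, y^2) + xy\, b(x^2, y^2)
% with a and b ordinary polynomials in two variables.
```

In a regression setting, the coefficients of $a$ and $b$ would be the fit parameters, and the symmetry holds for any choice of them.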

2. Basis Construction, Purification, and Pruning

Given a system with $N$ symmetric sites (e.g., atoms), the canonical choice for input features is the set of inter-site distances $r_{ij}$, often mapped via Morse-type transforms:

$$y_{ij} = \exp(-r_{ij}/a),$$

with $a$ a range parameter. For a target polynomial degree $D$, monomials $\prod_{i<j} y_{ij}^{b_{ij}}$ in the $y_{ij}$ are constructed, subject to $\sum_{i<j} b_{ij} \le D$, and then symmetrized using the group-average:

$$p_{\mathbf{b}}(\mathbf{y}) = \frac{1}{|G|}\sum_{\sigma\in G}\;\prod_{i<j} y_{\sigma(i)\sigma(j)}^{b_{ij}},$$

where $\sigma$ permutes indices as per symmetry. In molecular applications, the symmetry group $G$ is a product of symmetric groups corresponding to chemically equivalent atoms (e.g., H: $S_{n_\mathrm{H}}$, O: $S_{n_\mathrm{O}}$, etc.) (Houston et al., 2024).
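The group-average construction can be sketched in a few lines of NumPy. This is a minimal illustration, not the generator software used in the cited work: the function names, the range parameter `a=2.0`, the geometry, and the A2B labelling (atoms 0 and 1 equivalent) are all assumptions made for the example.

```python
import itertools
import numpy as np

def morse_variables(coords, a=2.0):
    """Return y_ij = exp(-r_ij / a) for all atom pairs i < j, in lexicographic order."""
    pairs = itertools.combinations(range(len(coords)), 2)
    r = np.array([np.linalg.norm(coords[i] - coords[j]) for i, j in pairs])
    return np.exp(-r / a)

def symmetrized_monomial(coords, exponents, group, a=2.0):
    """Average the monomial prod_{i<j} y_ij^{b_ij} over the atom permutations in `group`."""
    total = 0.0
    for perm in group:
        # Relabel the atoms, then recompute the pair variables in the new labelling.
        y = morse_variables(coords[list(perm)], a)
        total += np.prod(y ** exponents)
    return total / len(group)

# Hypothetical A2B geometry: atoms 0 and 1 are equivalent, atom 2 is distinct.
coords = np.array([[0.0, 0.76, 0.59],
                   [0.0, -0.70, 0.50],
                   [0.0, 0.0, 0.0]])
group = [(0, 1, 2), (1, 0, 2)]      # identity and the swap of atoms 0 and 1
exponents = np.array([0, 2, 1])     # powers for pairs (0,1), (0,2), (1,2); total degree 3

# The symmetrized polynomial is unchanged when the equivalent atoms are swapped ...
p = symmetrized_monomial(coords, exponents, group)
p_swapped = symmetrized_monomial(coords[[1, 0, 2]], exponents, group)

# ... while the raw monomial is not.
raw = np.prod(morse_variables(coords) ** exponents)
raw_swapped = np.prod(morse_variables(coords[[1, 0, 2]]) ** exponents)
```

Explicit summation over the group scales with the group order, so it is only practical for small groups; it is used here because it mirrors the defining formula directly.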

A key refinement, purification, removes basis polynomials that do not vanish in the correct dissociation limits (e.g., when clusters separate into noninteracting fragments). This is essential for many-body expansions and transferability (Nandi et al., 2021). In practice, raw PIP bases, which can be very large (e.g., for 20 atoms at degree 3), are pruned by ranking terms according to their maximal value or variance over the dataset, and then compacted for numerical efficiency (Houston et al., 2021, Houston et al., 2024).
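The pruning step can be illustrated as follows. This is a sketch under stated assumptions: `prune_basis` and `keep_fraction` are hypothetical names, the basis matrix is synthetic, and only ranking by maximal dataset value is shown (purification is not).

```python
import numpy as np

def prune_basis(P, keep_fraction=0.5):
    """Rank basis columns by their maximum absolute value over the dataset
    and keep the top `keep_fraction` of them (indices returned in ascending order)."""
    scores = np.max(np.abs(P), axis=0)
    n_keep = max(1, int(round(keep_fraction * P.shape[1])))
    keep = np.sort(np.argsort(scores)[::-1][:n_keep])
    return keep, P[:, keep]

# Synthetic basis matrix: 50 geometries x 8 basis polynomials, where column k
# is scaled by 10^-k so that later columns contribute less and less.
rng = np.random.default_rng(0)
scales = 10.0 ** -np.arange(8)
P = (0.5 + 0.5 * rng.random((50, 8))) * scales

keep, P_pruned = prune_basis(P, keep_fraction=0.5)
```

Ranking by column variance instead would only require replacing the `scores` line with `np.var(P, axis=0)`.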

3. Regression Procedures and Analytical Differentiation

Linear PIP Regression

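Let the PIP basis contain $M$ symmetrized polynomials. The regression model for a scalar target (e.g., energy) is:

$$V(\mathbf{x}) = \sum_{m=1}^{M} c_m\, p_m(\mathbf{y}(\mathbf{x})),$$

with $p_m$ as purified, symmetrized monomials in Morse or inverse-distance variables. The regression problem minimizes a combined loss over energies and, potentially, forces:

$$L(\mathbf{c}) = w_E \sum_{k} \bigl(V(\mathbf{x}_k) - E_k\bigr)^2 + w_F \sum_{k} \bigl\lVert \nabla_{\mathbf{x}} V(\mathbf{x}_k) - \mathbf{g}_k \bigr\rVert^2 + \lambda \lVert \mathbf{c} \rVert^2,$$

where $E_k$ and $\mathbf{g}_k$ are the reference energies and gradients (negative forces), and the weights $w_E$, $w_F$ and the regularization $\lambda$ are chosen according to dataset scale and feature range (Pandey et al., 2024, Houston et al., 2024). Forces are analytically available via reverse-mode differentiation on the computational graph for $V$, enabling rapid and precise gradient evaluation (Houston et al., 2021).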
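Because the model is linear in its coefficients, a combined energy-plus-gradient fit reduces to a single stacked least-squares solve. The sketch below assumes a one-dimensional toy problem (a Morse curve expressed in powers of $y=\exp(-r/a)$); `fit_pip_linear` and the weights `w_E`, `w_F`, `ridge` are illustrative names and values, not taken from the cited papers.

```python
import numpy as np

def fit_pip_linear(P, E, dP=None, G=None, w_E=1.0, w_F=1.0, ridge=1e-12):
    """Minimize w_E*||P c - E||^2 + w_F*||dP c - G||^2 + ridge*||c||^2 over c."""
    A = np.sqrt(w_E) * P
    b = np.sqrt(w_E) * E
    if dP is not None:
        # Gradient rows are stacked under the energy rows with their own weight.
        A = np.vstack([A, np.sqrt(w_F) * dP])
        b = np.concatenate([b, np.sqrt(w_F) * G])
    n_basis = A.shape[1]
    # Ridge term written as extra rows sqrt(ridge)*I with zero targets.
    A_aug = np.vstack([A, np.sqrt(ridge) * np.eye(n_basis)])
    b_aug = np.concatenate([b, np.zeros(n_basis)])
    c, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)
    return c

# Toy target: Morse curve E(r) = D_e*(1 - u)^2 - D_e with u = exp(-(r - r0)).
a, D_e, r0 = 1.0, 1.0, 1.0
r = np.linspace(0.8, 3.0, 40)
u = np.exp(-(r - r0))
E = D_e * (1.0 - u) ** 2 - D_e
G = 2.0 * D_e * (1.0 - u) * u            # dE/dr (the force is -G)

# Basis: y, y^2, y^3 with y = exp(-r/a), and its derivative with respect to r.
y = np.exp(-r / a)
powers = np.arange(1, 4)
P = y[:, None] ** powers
dP = -(powers / a) * P                   # d(y^m)/dr = -(m/a) * y^m

c_fit = fit_pip_linear(P, E, dP, G, w_E=1.0, w_F=0.1)

# For this toy curve the exact coefficients are known in closed form.
c_true = np.array([-2.0 * D_e * np.exp(r0), D_e * np.exp(2.0 * r0), 0.0])
```

Writing the ridge penalty as extra rows keeps the solve in `lstsq` rather than forming the normal equations, which tends to be better conditioned for nearly collinear polynomial bases.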

Analytical Gradients via Reverse-Mode

If $V = \sum_m c_m p_m$, with each $p_m$ built from the Cartesian coordinates through a chain of intermediates $t_1, t_2, \dots$ (distances, transformed variables, monomials), adjoints $\bar{t}_j = \partial V/\partial t_j$ are computed recursively:

$$\bar{t}_j = \sum_{i\,:\,t_i \text{ depends on } t_j} \bar{t}_i\,\frac{\partial t_i}{\partial t_j}.$$

Forces (Cartesian derivatives) follow by one extra loop over the chain of intermediates:

$$F_k = -\frac{\partial V}{\partial x_k} = -\sum_{j} \bar{t}_j\,\frac{\partial t_j}{\partial x_k},$$

where the sum runs over the intermediates that depend directly on $x_k$. This yields a cost per geometry for energy and force that scales linearly with the number of basis terms (Houston et al., 2021).
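A hand-written reverse sweep for a small graph (coordinates → distances → Morse variables → monomials → energy) might look like the sketch below. It uses plain, unsymmetrized monomials for brevity; a symmetrized polynomial is a sum of such monomials sharing one coefficient, so the same sweep applies. The function names, exponents, coefficients, and geometry are assumptions for the example, and a finite-difference check is included only to validate the adjoint code.

```python
import itertools
import numpy as np

def energy_and_gradient(coords, exponents, coeffs, a=2.0):
    """Return V and dV/dx using one forward and one reverse sweep."""
    pairs = list(itertools.combinations(range(len(coords)), 2))
    # Forward sweep: store every intermediate.
    diff = np.array([coords[i] - coords[j] for i, j in pairs])   # (n_pairs, 3)
    r = np.linalg.norm(diff, axis=1)
    y = np.exp(-r / a)
    mono = np.prod(y[None, :] ** exponents, axis=1)              # (n_terms,)
    V = coeffs @ mono
    # Reverse sweep: propagate adjoints from V back to the coordinates.
    mono_bar = coeffs                                            # dV/dmono
    # d mono_m / d y_k = exponents[m, k] * mono_m / y_k
    y_bar = (mono_bar * mono) @ (exponents / y[None, :])         # dV/dy
    r_bar = y_bar * (-y / a)                                     # dV/dr
    grad = np.zeros_like(coords)
    for k, (i, j) in enumerate(pairs):
        g = r_bar[k] * diff[k] / r[k]                            # dr/dx_i = diff / r
        grad[i] += g
        grad[j] -= g
    return V, grad

def finite_difference_gradient(coords, exponents, coeffs, h=1e-5):
    """Central-difference gradient, used only to check the analytic result."""
    grad = np.zeros_like(coords)
    for idx in np.ndindex(coords.shape):
        plus, minus = coords.copy(), coords.copy()
        plus[idx] += h
        minus[idx] -= h
        grad[idx] = (energy_and_gradient(plus, exponents, coeffs)[0]
                     - energy_and_gradient(minus, exponents, coeffs)[0]) / (2 * h)
    return grad

coords = np.array([[0.0, 0.76, 0.59], [0.0, -0.70, 0.50], [0.1, 0.0, 0.0]])
exponents = np.array([[1, 0, 0], [0, 1, 1], [2, 1, 0]])   # one row per monomial
coeffs = np.array([0.5, -1.2, 0.3])

V, grad = energy_and_gradient(coords, exponents, coeffs)
grad_fd = finite_difference_gradient(coords, exponents, coeffs)
forces = -grad
```

The reverse sweep touches each intermediate once, so the gradient costs a small constant multiple of the energy evaluation, whereas the finite-difference check needs two energy evaluations per coordinate.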

Neural-Network Extensions

Networks incorporating PIP inputs (PIP-NN) use the symmetric basis as input features followed by one or more hidden layers with nonlinear activation (e.g., Swish), further increasing flexibility while preserving symmetry (Finenko, 2022).
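A forward pass of such a network can be sketched with NumPy alone. The example assumes an A2B system with three hand-picked invariants as inputs, a single hidden layer of width 8, and random untrained weights; none of these choices come from the cited work.

```python
import numpy as np

def pip_features(coords, a=2.0):
    """Three invariants for an A2B system (atoms 0 and 1 equivalent)."""
    y01 = np.exp(-np.linalg.norm(coords[0] - coords[1]) / a)
    y02 = np.exp(-np.linalg.norm(coords[0] - coords[2]) / a)
    y12 = np.exp(-np.linalg.norm(coords[1] - coords[2]) / a)
    # y01 is already invariant; y02 and y12 enter only via their sum and product.
    return np.array([y01, y02 + y12, y02 * y12])

def swish(z):
    """Swish activation, z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def pip_nn(coords, params):
    """One hidden layer acting on the invariant features, linear output."""
    W1, b1, w2, b2 = params
    hidden = swish(W1 @ pip_features(coords) + b1)
    return float(w2 @ hidden + b2)

# Random (untrained) parameters: the invariance holds for any weights.
rng = np.random.default_rng(1)
params = (rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=8), 0.0)

coords = np.array([[0.0, 0.76, 0.59], [0.0, -0.70, 0.50], [0.0, 0.0, 0.0]])
e = pip_nn(coords, params)
e_swapped = pip_nn(coords[[1, 0, 2]], params)
```

Because the symmetry is built into the inputs, no data augmentation over permutations is needed during training.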

4. Approximation Theorems and Universality

The approximation power of PIP-based regressors is underpinned by symmetry-adapted versions of the Stone–Weierstrass theorem. For $G$ a reductive group acting on $V^n$, any continuous $G$-invariant function can be uniformly approximated by PIP polynomials, and networks using MAGs or HDs as input generators can achieve uniform approximation over compact sets:

  • With MAG input, universal approximation holds for single-layer networks in MAGs.
  • With HD, the two-stage construction (primaries in hidden layers, secondary combinations in output) yields superior performance (Haddadin, 2021).

5. Representative Applications and Benchmarks

5.1 Molecular Potential Energy Surfaces

  • Water Tetramer 4-body Potential: PIP regression using a purified basis fits the 4-body interaction energies (RMS errors reported in cm$^{-1}$), outperforming MB-pol's TTM4-F in hexamer, heptamer, and larger cluster tests (Nandi et al., 2021).
  • Aspirin (21 atoms): A pruned third-order PIP basis (49,978 terms) fit to energies and forces achieves energy (cm$^{-1}$) and force (cm$^{-1}$/bohr) RMSEs superior to ACE and PaiNN and nearly matching Allegro (Houston et al., 2024).
  • Ethanol MD17: PIP regression is assessed by energy RMSE (kcal/mol) and force RMSE (kcal/mol·Å$^{-1}$); evaluation of energy plus forces is 10–1000× faster than sGDML, ANI, PhysNet, ACE, and related methods (Houston et al., 2021, Finenko, 2022).
  • H$_3$O$_2^-$ Anion: PIP (degree 6) trained on 6000 configurations gives test-set RMSE 0.06 kcal/mol (energy), 0.12 kcal/mol·Å$^{-1}$ (forces), single-point evaluation 2–5 μs, on the order of 300× faster than sGDML (Pandey et al., 2024).
  • Many-body Water (MB-pol): PIP-MB-pol 2B and 3B corrections employ fourth-degree polynomials, yielding 2B and 3B test RMSEs (in kcal/mol) below chemical accuracy (Nguyen et al., 2018).

5.2 Δ-Machine Learning

PIP regression enables efficient correction of a DFT-level PES to CCSD(T) accuracy by fitting the difference surface with a much smaller PIP basis; high fidelity is reached using hundreds (not thousands) of expensive high-level reference points (Nandi et al., 2020).
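The idea can be sketched on a one-dimensional toy problem: a cheap "low-level" curve is corrected by a small fit to the difference at a handful of "high-level" points. The two Morse curves stand in for DFT and CCSD(T) and are purely illustrative, as are the function names and the choice of eight training points.

```python
import numpy as np

def morse(r, D_e, r0):
    """Morse curve D_e*(1 - exp(-(r - r0)))^2 - D_e."""
    return D_e * (1.0 - np.exp(-(r - r0))) ** 2 - D_e

def v_low(r):
    return morse(r, 1.00, 1.00)     # stand-in for the low-level (DFT) surface

def v_high(r):
    return morse(r, 1.05, 0.98)     # stand-in for the high-level (CCSD(T)) surface

def basis(r, degree=2):
    """Powers y, y^2, ... of the Morse variable y = exp(-r)."""
    y = np.exp(-r)
    return np.stack([y ** m for m in range(1, degree + 1)], axis=1)

# Fit only the difference surface, using a few high-level reference points.
r_train = np.linspace(0.8, 3.0, 8)
delta = v_high(r_train) - v_low(r_train)
c_delta, *_ = np.linalg.lstsq(basis(r_train), delta, rcond=None)

def v_corrected(r):
    """Low-level surface plus the fitted correction."""
    return v_low(r) + basis(r) @ c_delta

r_test = np.linspace(0.8, 3.0, 200)
max_err = np.max(np.abs(v_corrected(r_test) - v_high(r_test)))
```

In this toy case the difference lies exactly in the span of the basis, so the fit is exact; for real surfaces the correction is only approximately low-order, which is what makes a small basis and a small high-level dataset sufficient.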

5.3 Regression without Correspondences

Permutation-invariant polynomial regression facilitates linear regression where the correspondence between observations and inputs is unknown (the order of the labels is scrambled). By extracting permutation-invariant constraints (e.g., using power-sum polynomials), the regression parameters are obtained as solutions of a zero-dimensional polynomial system (with at most $n!$ solutions for $n$ parameters). Algorithmic refinement via EM renders this feasible for moderately sized problems, offering theoretical guarantees on uniqueness and robustness to noise (Tsakiris et al., 2018).
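For two parameters and noise-free data, the power-sum constraints can be solved in closed form, which makes the structure of the polynomial system visible. The sketch below is an illustration of the idea rather than the algorithm of the cited paper: the first power sum gives a line in parameter space, the second a conic, and the (at most two) intersections are disambiguated by comparing sorted predictions. All names and data are assumptions.

```python
import numpy as np

def solve_shuffled_regression_2d(X, y):
    """Recover theta (2 parameters) from y = shuffle(X @ theta), noise-free."""
    s = X.sum(axis=0)                 # power sum 1: s . theta = sum(y)
    Q = X.T @ X                       # power sum 2: theta^T Q theta = sum(y^2)
    p1, p2 = y.sum(), (y ** 2).sum()
    theta0 = s * p1 / (s @ s)         # one point on the line s . theta = p1
    d = np.array([-s[1], s[0]])       # direction of that line (s . d = 0)
    # Substitute theta = theta0 + t*d into the quadratic constraint.
    qa = d @ Q @ d
    qb = 2.0 * (theta0 @ Q @ d)
    qc = theta0 @ Q @ theta0 - p2
    roots = np.roots([qa, qb, qc])
    candidates = [theta0 + t.real * d for t in roots if abs(t.imag) < 1e-9]
    # Both candidates match the two power sums; pick the one whose sorted
    # predictions match the sorted observations.
    return min(candidates,
               key=lambda th: np.linalg.norm(np.sort(X @ th) - np.sort(y)))

rng = np.random.default_rng(2)
X = rng.uniform(0.5, 1.5, size=(30, 2))
theta_true = np.array([2.0, -1.0])
y_shuffled = rng.permutation(X @ theta_true)   # correspondences are lost

theta_hat = solve_shuffled_regression_2d(X, y_shuffled)
```

With noise the constraints no longer hold exactly, which is where the iterative refinement mentioned above comes in; this sketch does not attempt that step.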

6. Model Selection, Training Protocols, and Bayesian Evidence

PIP-based models naturally accommodate energy and force constraints through overdetermined linear least-squares, with or without regularization (ridge). For neural architectures, Bayesian model comparison via nested sampling confirms that HD-based networks outperform both MAGs and basic invariants in loss and evidence; higher parameter-counts favor HD (Haddadin, 2021).

Table: validation loss (MSE) and Bayesian evidence $Z$ for the Weyl, MAG, and HD models (Haddadin, 2021).

Empirical validation consistently finds that HD networks generalize best when the layer width is at least comparable to the number of secondaries; shallow-and-wide architectures suffice once the basis size exceeds a critical threshold (Haddadin, 2021).

7. Computational Scaling, Implementation, and Best Practices

For moderate-sized molecules (up to about 25 atoms, polynomial degree 3–4), purified PIP bases yield numerically stable least-squares problems, and evaluation costs (energy and analytic forces) are essentially linear in the number of basis terms on commodity CPUs (Houston et al., 2021, Houston et al., 2024). Force regression is efficiently handled via reverse-mode autodiff (Houston et al., 2021). For large systems, screening and parallelization (over tetramers, etc.) keep many-body expansions, whose $k$-body terms formally scale as $O(N^k)$ in the number of monomers $N$, tractable in the condensed phase.

Basis construction best practices include:

  • Transforming raw distances to Morse/inverse variables prior to polynomial generation.
  • Screening by variance, followed by aggressive pruning to remove negligible or ill-conditioned terms.
  • Purifying to enforce correct fragmentation limits for transferability.
  • Selecting weights for force versus energy regression to balance accuracy.
  • Applying a small ridge regularization for high-dimensional solves (Pandey et al., 2024).

8. Scope, Limitations, and Future Directions

The combinatorial growth in basis size with the number of symmetric atoms and the polynomial degree constitutes the principal scalability constraint; efficient pruning, purification, and modularization (e.g., many-body decomposition) remain critical for deployment beyond moderate numbers of atoms.

Group-theoretical generalizations (MAGs, HDs), tailored symmetrization (e.g., partial symmetry), and hybrid models (PIP-NN, PIP with physics-informed regularization) are active areas. Recent approximation theorems guarantee completeness; empirical analyses validate universal accuracy and computational tractability. Extensions to other symmetry groups and data domains remain open, alongside improved software and integration with quantum simulation workflows.

References

  • "Invariant polynomials and machine learning" (Haddadin, 2021)
  • "A CCSD(T)-based permutationally invariant polynomial 4-body potential for water" (Nandi et al., 2021)
  • "No Headache for PIPs: A PIP Potential for Aspirin Outperforms Other Machine-Learned Potentials" (Houston et al., 2024)
  • "Δ-Machine Learning for Potential Energy Surfaces: A PIP approach to bring a DFT-based PES to CCSD(T) Level of Theory" (Nandi et al., 2020)
  • "Assessing PIP and sGDML Potential Energy Surfaces for H3O2-" (Pandey et al., 2024)
  • "Comparison of permutationally invariant polynomials, neural networks, and Gaussian approximation potentials in representing water interactions through many-body expansions" (Nguyen et al., 2018)
  • "An algebraic-geometric approach for linear regression without correspondences" (Tsakiris et al., 2018)
  • "Accurate neural-network-based fitting of full-dimensional two-body potential energy surfaces" (Finenko, 2022)
  • "Permutationally invariant polynomial regression for energies and gradients...speed-up with high precision..." (Houston et al., 2021)
