
Permutationally Invariant Polynomial Regression

Updated 3 December 2025
  • PIP Regression is a framework that constructs regression models ensuring invariance under input permutations, which is critical for symmetric physical systems.
  • It employs algebraic techniques like minimal algebra generators and Hironaka decomposition to generate, purify, and prune polynomial bases for accurate potential energy surface fitting.
  • The method supports linear and neural network implementations, yielding high accuracy with efficient computation across applications from small clusters to condensed-phase systems.

Permutationally Invariant Polynomial (PIP) Regression is a rigorous framework for constructing regression models and machine-learned representations that enforce invariance under permutations of input variables, typically identical atoms or features. The methodology is foundational in high-dimensional modeling of symmetric physical systems, especially for potential energy surface (PES) fitting in quantum chemistry, but its abstract formulation renders it broadly applicable in any domain requiring permutation symmetry.

1. Algebraic Structure and Mathematical Foundations

Let $V \cong \mathbb{C}^d$ be a complex vector space carrying an action of a group $G$ (often $G$ is a product of symmetry groups, e.g., $SO(d,\mathbb{C}) \times S_n$). The coordinate algebra $\mathbb{C}[V^n]$ consists of polynomials in $n$ vectors, and the subalgebra of $G$-invariant polynomials is

$$\mathbb{C}[V^n]^G = \{\, f \in \mathbb{C}[V^n] \mid f(g \cdot p_1, \ldots, g \cdot p_n) = f(p_1, \ldots, p_n) \;\; \forall g \in G \,\}.$$

Two central constructions organize the generators of these invariant polynomials:

  • Minimal Algebra Generators (MAG): A finite set $\{g_1, \dots, g_r\}$ such that every $f \in \mathbb{C}[V^n]^G$ can be written as a polynomial in the $g_i$, and no generator is a polynomial in the others. For physical applications (e.g., momenta with Lorentz and permutation invariance), the existence of such finite generating sets follows from Weyl's theorem and algebraic geometry.
  • Hironaka Decomposition (HD): A free-module decomposition of the invariant algebra:

$$\mathbb{C}[V^n]^G = \bigoplus_{k=1}^{p} \eta_k \cdot \mathbb{C}[\theta_1, \ldots, \theta_m],$$

where the primaries $\{\theta_j\}$ are algebraically independent invariants and the secondaries $\{\eta_k\}$ form a module basis.

Such generators and decompositions allow every $G$-invariant polynomial to be represented in a canonical form (uniquely, in the Hironaka case), facilitating regression or function approximation while ensuring exact symmetry.
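
As a concrete toy illustration (not drawn from the cited papers), take $d = 1$, $n = 2$, and $G = S_2$ acting by swapping the coordinates $x_1 \leftrightarrow x_2$. A MAG is given by the elementary symmetric polynomials

$$g_1 = x_1 + x_2, \qquad g_2 = x_1 x_2,$$

and every symmetric polynomial is a polynomial in them, e.g., $x_1^2 + x_2^2 = g_1^2 - 2 g_2$. The Hironaka decomposition is trivial in this case: $\theta_1 = g_1$ and $\theta_2 = g_2$ serve as primaries, with the single secondary $\eta_1 = 1$.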

2. Basis Construction, Purification, and Pruning

Given a system with $N$ symmetric sites (e.g., atoms), the canonical choice of input features is the set of inter-site distances, often mapped via Morse-type transforms:

$$\tau_{ij} = \exp(-r_{ij}/a), \qquad a \sim 2\ \text{bohr}.$$

For a target polynomial degree $d$, monomials $m_\alpha(y) = \prod_{i<j} y_{ij}^{\alpha_{ij}}$ in the variables $y_{ij} = \tau_{ij}$ are constructed subject to $\sum_{i<j} \alpha_{ij} \leq d$, and then symmetrized using the group average:

$$\Phi_p(R) = \frac{1}{|G|} \sum_{g \in G} m_{\alpha}(g \cdot y),$$

where $g \cdot y$ permutes the indices of $y$ according to the symmetry operation. In molecular applications, the symmetry group $G$ is a product of symmetric groups over sets of chemically equivalent atoms (e.g., $S_n$ for $n$ equivalent H atoms, $S_m$ for $m$ equivalent O atoms) (Houston et al., 17 Jan 2024).
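
A hands-on way to see the group average in action: the following minimal Python sketch (illustrative only, not a reference implementation) symmetrizes a single Morse-variable monomial for a water-like A$_2$B molecule, where $G = S_2$ permutes the two identical H atoms; production codes instead enumerate the symmetrized monomials ahead of time with dedicated generators.

```python
# Group-averaged symmetrization of a Morse-variable monomial (toy sketch).
import itertools
import numpy as np

def morse_variables(X, a=2.0):
    """Morse-transformed inter-site distances tau_ij = exp(-r_ij / a).
    X: (n_atoms, 3) Cartesian coordinates; returns {(i, j): tau_ij}, i < j."""
    n = len(X)
    return {(i, j): np.exp(-np.linalg.norm(X[i] - X[j]) / a)
            for i in range(n) for j in range(i + 1, n)}

def symmetrized_monomial(tau, alpha, identical=(0, 1)):
    """Average the monomial prod_{ij} tau_ij^alpha_ij over all permutations
    of the identical atoms (here just the S_2 swap of atoms 0 and 1)."""
    perms = list(itertools.permutations(identical))
    total = 0.0
    for perm in perms:
        relabel = dict(zip(identical, perm))
        val = 1.0
        for (i, j), power in alpha.items():
            gi, gj = relabel.get(i, i), relabel.get(j, j)
            val *= tau[tuple(sorted((gi, gj)))] ** power
        total += val
    return total / len(perms)

# Water: atoms 0 and 1 are H, atom 2 is O.  The bare monomial tau_02^2*tau_12
# is not H-swap symmetric; its group average is.
X = np.array([[0.76, 0.59, 0.0], [-0.76, 0.59, 0.0], [0.0, 0.0, 0.0]])
phi = symmetrized_monomial(morse_variables(X), {(0, 2): 2, (1, 2): 1})
```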

A key refinement, purification, removes basis polynomials that do not vanish in the correct dissociation limits (e.g., when clusters separate into noninteracting fragments). This is essential for many-body expansions and transferability (Nandi et al., 2021). In practice, raw PIP bases, which can be vast ($\sim 10^5$ terms for 20 atoms at degree 3), are pruned by ranking terms by their maximal value or variance over the dataset, then compacted for numerical efficiency (Houston et al., 2021, Houston et al., 17 Jan 2024).
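
Purification admits a simple numerical caricature, sketched below under the assumption that dissociated geometries can be sampled on demand (the helper `make_dissociated_geometry` is hypothetical); actual implementations operate symbolically on the monomial exponents rather than numerically screening values.

```python
# Numerical purification screen (toy sketch): retain only basis functions
# that vanish when the fragments are pulled far apart.
def purify(basis_fns, make_dissociated_geometry, n_checks=10, tol=1e-8):
    """basis_fns: callables f(X) -> float on Cartesian geometries X.
    make_dissociated_geometry: returns a geometry with fragments displaced
    to a large separation (e.g., two monomers ~100 bohr apart)."""
    kept = []
    for f in basis_fns:
        vals = [abs(f(make_dissociated_geometry())) for _ in range(n_checks)]
        if max(vals) < tol:  # vanishes in the dissociation limit -> keep
            kept.append(f)
    return kept
```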

3. Regression Procedures and Analytical Differentiation

Linear PIP Regression

Let the PIP basis contain $N_\mathrm{basis}$ symmetrized polynomials. The regression model for a scalar target (e.g., energy) is:

$$E(\mathbf{x}; \mathbf{c}) = \sum_{k=1}^{N_\mathrm{basis}} c_k\, B_k(\mathbf{x})$$

with $B_k$ the purified, symmetrized monomials (in Morse or inverse-distance variables as above). The regression problem minimizes a combined loss over energies and, optionally, forces:

$$L(\mathbf{c}) = \sum_n w_E^{(n)} \big( E(\mathbf{x}^{(n)}; \mathbf{c}) - E^{(n)}_\mathrm{ref} \big)^2 + \sum_n w_F^{(n)} \big\| \nabla_{\mathbf{x}} E(\mathbf{x}^{(n)}; \mathbf{c}) - \mathbf{F}^{(n)}_\mathrm{ref} \big\|^2 + \lambda \|\mathbf{c}\|^2,$$

where the weights and regularization are chosen according to the dataset scale and feature range (Pandey et al., 17 Feb 2024, Houston et al., 17 Jan 2024). Forces are analytically available via reverse-mode differentiation on the computational graph for $E$, enabling rapid and precise gradient evaluation (Houston et al., 2021).
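
A minimal sketch of this solve, assuming the PIP design matrix `B` (energies) and its Cartesian gradient tensor `dB` (for the force residuals) are precomputed, and collapsing the per-sample weights to scalars for brevity:

```python
# Weighted, ridge-regularized linear least squares over energies and forces.
import numpy as np

def fit_pip(B, E_ref, dB, F_ref, w_E=1.0, w_F=0.1, lam=1e-7):
    """B: (n_data, n_basis); dB: (n_data, n_cart, n_basis);
    E_ref: (n_data,); F_ref: (n_data, n_cart)."""
    n_basis = B.shape[1]
    A = np.vstack([np.sqrt(w_E) * B,
                   np.sqrt(w_F) * dB.reshape(-1, n_basis),
                   np.sqrt(lam) * np.eye(n_basis)])   # ridge via extra rows
    b = np.concatenate([np.sqrt(w_E) * E_ref,
                        np.sqrt(w_F) * F_ref.reshape(-1),
                        np.zeros(n_basis)])
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    return c
```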

Analytical Gradients via Reverse-Mode

If $V = \sum_i c_i p_i$, with each $p_i$ built from intermediates $t_j$, the adjoints $a_j = \partial V / \partial t_j$ are computed recursively:

$$a_j = \sum_{i>j} a_i \frac{\partial t_i}{\partial t_j} + \begin{cases} c_k & \text{if } t_j = p_k, \\ 0 & \text{otherwise}. \end{cases}$$

Forces (Cartesian derivatives) follow by one extra loop over the chain of intermediates:

$$\frac{\partial V}{\partial x_{n,\alpha}} = \sum_{(i,j)} a_{ij}\, \frac{\partial \tau_{ij}}{\partial r_{ij}}\, \frac{\partial r_{ij}}{\partial x_{n,\alpha}}, \qquad \frac{\partial \tau_{ij}}{\partial r_{ij}} = -\frac{\tau_{ij}}{a}.$$

This yields cost scaling of $\mathcal{O}(N_\mathrm{basis})$ per geometry for the energy and forces together (Houston et al., 2021).
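
The displayed recursions combine into a compact sketch; the version below (illustrative only, over unsymmetrized Morse-variable monomials rather than the full symmetrized basis) accumulates the adjoints $a_{ij}$ on the $\tau_{ij}$ intermediates, then converts them to Cartesian gradients in one extra loop.

```python
# Energy and analytic gradient of V = sum_k c_k p_k, with each monomial p_k
# a product of Morse variables tau_ij = exp(-r_ij / a)  (toy sketch).
import numpy as np

def energy_and_gradient(X, coeffs, monomials, a=2.0):
    """monomials: list of dicts {(i, j): power}; coeffs: matching weights."""
    n = len(X)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    r = {p: np.linalg.norm(X[p[0]] - X[p[1]]) for p in pairs}
    tau = {p: np.exp(-r[p] / a) for p in pairs}

    V, adj = 0.0, {p: 0.0 for p in pairs}    # adj[p] accumulates dV/dtau_p
    for c, alpha in zip(coeffs, monomials):
        val = np.prod([tau[p] ** k for p, k in alpha.items()])
        V += c * val
        for p, k in alpha.items():           # reverse sweep over the monomial
            adj[p] += c * k * val / tau[p]   # d(tau^k)/dtau = k tau^(k-1)

    grad = np.zeros_like(X, dtype=float)
    for (i, j), a_ij in adj.items():         # push adjoints to Cartesians
        dtau_dr = -tau[(i, j)] / a           # dtau/dr = -tau/a
        dr_dxi = (X[i] - X[j]) / r[(i, j)]   # dr/dx_i; dr/dx_j is its negative
        grad[i] += a_ij * dtau_dr * dr_dxi
        grad[j] -= a_ij * dtau_dr * dr_dxi
    return V, grad

X = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [1.9, 0.1, 0.0]])
V, g = energy_and_gradient(X, [1.0, -0.5], [{(0, 1): 2}, {(0, 1): 1, (1, 2): 1}])
```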

Neural-Network Extensions

Networks incorporating PIP inputs (PIP-NN) use the symmetric basis as input features followed by one or more hidden layers with nonlinear activation (e.g., Swish), further increasing flexibility while preserving symmetry (Finenko, 2022).
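
A minimal sketch of a PIP-NN forward pass (a generic illustration, not the architecture of any cited paper): since the PIP features are already permutation invariant, the network output inherits exact invariance for free.

```python
# Tiny PIP-NN: one Swish hidden layer on top of invariant PIP features.
import numpy as np

def swish(z):
    return z / (1.0 + np.exp(-z))   # x * sigmoid(x)

def pip_nn_energy(pip_features, params):
    """pip_features: (n_basis,) invariant inputs; params: (W1, b1, w2, b2)."""
    W1, b1, w2, b2 = params
    h = swish(W1 @ pip_features + b1)   # hidden layer acts on invariants only
    return w2 @ h + b2                  # scalar, hence invariant, energy

rng = np.random.default_rng(0)
n_basis, width = 50, 32
params = (rng.normal(size=(width, n_basis)), np.zeros(width),
          rng.normal(size=width), 0.0)
E = pip_nn_energy(rng.random(n_basis), params)
```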

4. Approximation Theorems and Universality

The approximation power of PIP-based regressors is underpinned by symmetry-adapted versions of the Stone–Weierstrass theorem. For $G$ a reductive group acting on $V \cong \mathbb{C}^n$, any continuous $G$-invariant function can be uniformly approximated by PIP polynomials, and networks using MAGs or HDs as input generators achieve uniform approximation on compact sets:

  • With MAG inputs, universal approximation holds already for single-hidden-layer networks on the generators.
  • With HD, the two-stage construction (primaries feeding hidden layers, secondary combinations at the output) yields superior performance (Haddadin, 2021).

5. Representative Applications and Benchmarks

5.1 Molecular Potential Energy Surfaces

  • Water Tetramer 4-body Potential: PIP regression using a purified basis yields an RMS error of 6.2 cm$^{-1}$ for 4-body interactions, outperforming MB-pol's TTM4-F component in hexamer, heptamer, and larger cluster tests (Nandi et al., 2021).
  • Aspirin (21 atoms): A pruned third-order PIP basis (49,978 terms) fit to $>200{,}000$ constraints (energies plus forces) achieves an energy RMSE of 27 cm$^{-1}$ and a force RMSE of 53 cm$^{-1}$/bohr, superior to ACE and PaiNN and nearly matching Allegro (Houston et al., 17 Jan 2024).
  • Ethanol (MD17): PIP regression achieves an energy RMSE of 0.023 kcal/mol and a force RMSE of 0.14 kcal/(mol·Å); the evaluation time of 0.5 μs per energy-plus-force call is 10–1000× faster than sGDML, ANI, PhysNet, ACE, and related methods (Houston et al., 2021, Finenko, 2022).
  • H$_3$O$_2^-$ anion: A PIP fit (degree 6, $\sim$1200 terms) trained on 6000 configurations gives test-set RMSEs of 0.06 kcal/mol (energy) and 0.12 kcal/(mol·Å) (forces), with single-point evaluation in 2–5 μs, $\sim$300× faster than sGDML (Pandey et al., 17 Feb 2024).
  • Many-body Water (MB-pol): The PIP-based MB-pol 2-body and 3-body corrections employ fourth-degree polynomials, yielding test RMSEs of 0.049 kcal/mol (2B) and 0.046 kcal/mol (3B), below chemical accuracy (Nguyen et al., 2018).

5.2 Δ-Machine Learning

PIP regression enables efficient correction of a DFT-level PES to CCSD(T) accuracy by fitting the difference surface with a much smaller PIP basis; high fidelity is reached using hundreds, rather than thousands, of expensive high-level reference points (Nandi et al., 2020).
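
A sketch of the Δ-fit itself, assuming PIP features `B_small` evaluated at the high-level reference geometries (names are illustrative): the corrected surface is the low-level PES plus the fitted difference.

```python
# Fit a small PIP basis to the high-level minus low-level energy difference.
import numpy as np

def fit_delta(B_small, E_high, E_low, lam=1e-7):
    """B_small: (n_points, n_small_basis) PIP features; returns coefficients
    c such that the corrected PES is E_low(x) + B_small(x) @ c."""
    n_b = B_small.shape[1]
    A = np.vstack([B_small, np.sqrt(lam) * np.eye(n_b)])   # ridge rows
    b = np.concatenate([E_high - E_low, np.zeros(n_b)])
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    return c
```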

5.3 Regression without Correspondences

Permutation-invariant polynomial regression also enables linear regression when the correspondence between inputs and labels is unknown (i.e., the labels are scrambled). By extracting permutation-invariant constraints (e.g., using power-sum polynomials), the regression parameters become the solutions of a zero-dimensional polynomial system with at most $n!$ solutions for $n$ parameters. Algorithmic refinement via expectation-maximization makes this feasible for moderately sized problems, with theoretical guarantees on uniqueness and robustness to noise (Tsakiris et al., 2018).
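
A toy numerical sketch of these constraints (the cited work solves the polynomial system with exact algebraic methods; here a generic root finder stands in, and the recovered root depends on the initial guess):

```python
# Shuffled linear regression via permutation-invariant power-sum constraints.
import numpy as np
from scipy.optimize import fsolve

rng = np.random.default_rng(1)
n, d = 50, 2
X = rng.normal(size=(n, d))
beta_true = np.array([1.5, -0.7])
y = rng.permutation(X @ beta_true)          # responses in scrambled order

def power_sum_residuals(beta):
    """Power sums of y are invariant to the unknown permutation, so they
    must match the power sums of the predictions X @ beta."""
    pred = X @ beta
    return [np.sum(y ** k) - np.sum(pred ** k) for k in (1, 2)]

beta_hat = fsolve(power_sum_residuals, x0=np.ones(d))   # ~ [1.5, -0.7]
```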

6. Model Selection, Training Protocols, and Bayesian Evidence

PIP-based models naturally accommodate energy and force constraints through overdetermined linear least-squares, with or without ridge regularization. For neural architectures, Bayesian model comparison via nested sampling shows that HD-based networks outperform both MAG-based networks and networks on basic (Weyl) invariants in loss and evidence; higher parameter counts favor HD (Haddadin, 2021).

Model | Validation loss (MSE) | Bayesian evidence $Z$
--- | --- | ---
Weyl | $10^{-2}$–$10^{-3}$ | $10^{-30}$–$10^{-50}$
MAG | $10^{-3}$–$10^{-4}$ | $10^{-25}$–$10^{-45}$
HD | $10^{-4}$–$10^{-5}$ | $10^{-20}$–$10^{-42}$

Empirical validation consistently finds that HD networks generalize best when the hidden-layer width is at least the number of secondaries; shallow-and-wide architectures suffice once the basis size exceeds a critical threshold (Haddadin, 2021).

7. Computational Scaling, Implementation, and Best Practices

For moderate-sized molecules (up to $\sim$25 atoms, degree $\leq$ 3–4), purified PIP bases yield numerically stable least-squares problems, and evaluation costs (energy plus analytic forces) are essentially linear in the number of basis terms, with timings on the order of $10^{-1}$ μs on commodity CPUs (Houston et al., 2021, Houston et al., 17 Jan 2024). Force regression is handled efficiently via reverse-mode automatic differentiation (Houston et al., 2021). For large systems, screening and parallelization (over tetramers, etc.) keep $\mathcal{O}(N^4)$ many-body expansions tractable in the condensed phase.

Basis construction best practices include:

  • Transforming raw distances to Morse or inverse variables prior to polynomial generation.
  • Screening terms by variance, followed by aggressive pruning of negligible or ill-conditioned terms (see the sketch after this list).
  • Purifying the basis to enforce correct fragmentation limits and transferability.
  • Selecting the relative weights of force and energy residuals to balance accuracy.
  • Applying small ridge regularization for high-dimensional solves ($\lambda \sim 10^{-8}$–$10^{-6}$) (Pandey et al., 17 Feb 2024).
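
The screening step above admits a one-function sketch, assuming a matrix of candidate basis values evaluated over the training set (names illustrative):

```python
# Rank candidate PIP basis functions by variance (or max magnitude) and
# keep the top `keep` columns.
import numpy as np

def prune_basis(B, keep=5000, by="variance"):
    """B: (n_data, n_candidates) values of candidate basis functions."""
    score = B.var(axis=0) if by == "variance" else np.abs(B).max(axis=0)
    order = np.argsort(score)[::-1]   # best-scoring candidates first
    return np.sort(order[:keep])      # column indices of the retained basis
```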

8. Scope, Limitations, and Future Directions

The combinatorial growth of the basis size with the number of symmetric atoms and the polynomial degree constitutes the principal scalability constraint; efficient pruning, purification, and modularization (e.g., many-body decomposition) remain critical for deployment beyond moderate $N$.

Group-theoretical generalizations (MAGs, HDs), tailored symmetrization (e.g., partial symmetry), and hybrid models (PIP-NN, PIP with physics-informed regularization) are active areas. Recent approximation theorems guarantee completeness; empirical analyses validate universal accuracy and computational tractability. Extensions to other symmetry groups and data domains remain open, alongside improved software and integration with quantum simulation workflows.

References

  • "Invariant polynomials and machine learning" (Haddadin, 2021)
  • "A CCSD(T)-based permutationally invariant polynomial 4-body potential for water" (Nandi et al., 2021)
  • "No Headache for PIPs: A PIP Potential for Aspirin Outperforms Other Machine-Learned Potentials" (Houston et al., 17 Jan 2024)
  • "Δ\Delta-Machine Learning for Potential Energy Surfaces: A PIP approach to bring a DFT-based PES to CCSD(T) Level of Theory" (Nandi et al., 2020)
  • "Assessing PIP and sGDML Potential Energy Surfaces for H3O2-" (Pandey et al., 17 Feb 2024)
  • "Comparison of permutationally invariant polynomials, neural networks, and Gaussian approximation potentials in representing water interactions through many-body expansions" (Nguyen et al., 2018)
  • "An algebraic-geometric approach for linear regression without correspondences" (Tsakiris et al., 2018)
  • "Accurate neural-network-based fitting of full-dimensional two-body potential energy surfaces" (Finenko, 2022)
  • "Permutationally invariant polynomial regression for energies and gradients...speed-up with high precision..." (Houston et al., 2021)