Sparse Multivariate Polynomial Identification
- Sparse multivariate polynomial identification is the process of reconstructing minimal-support multivariate polynomials using limited evaluations, bridging algebra, signal processing, and computational complexity.
- Key algorithmic strategies include randomized Kronecker substitutions, modular reductions, and derivative-based methods that mitigate term collisions and optimize query complexity.
- Practical implementations span computer algebra systems, cryptography, and sparse regression models, demonstrating the topic's broad impact on both theory and applied sciences.
Sparse multivariate polynomial identification is the problem of reconstructing the minimal-support representation of a multivariate polynomial, typically expressed in the monomial basis or another sparse-friendly basis, using as few resources—such as black-box evaluations or algebraic queries—as possible. This area bridges algorithmic algebra, signal processing, compressed sensing, and computational complexity, driven by the challenge of learning high-degree, high-dimensional polynomials when only a small subset of monomials is nonzero.
1. Problem Formulation and Mathematical Model
Sparse multivariate polynomial identification involves the recovery of a polynomial

f(x_1, …, x_n) = Σ_{i=1}^T c_i · x_1^{e_{i,1}} ⋯ x_n^{e_{i,n}}

from limited, typically black-box, information. Here T is the (unknown but small) support size, and the exponents e_i = (e_{i,1}, …, e_{i,n}) and coefficients c_i are unknown.
Two key settings are encountered:
- Black-box/Oracle Model: f is only available via evaluations at chosen points in its domain (a finite field or another ring), possibly modulo a chosen modulus or at floating-point accuracy.
- Compressed Representation Model: f is provided as a straight-line program, modular black box, or arithmetic circuit, without explicit monomial enumeration.
This problem generalizes sparse univariate interpolation, but the multivariate setting presents significant challenges due to combinatorial explosion of potential monomials, term collisions under substitution, and complexity-theoretic barriers.
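As a toy illustration of the black-box model (all names and the example polynomial are hypothetical), a sparse polynomial can be stored internally as a map from exponent vectors to coefficients, while an identification algorithm sees only point evaluations:

```python
# Toy black-box oracle for a sparse polynomial over the integers.
# The internal representation (dict of exponent tuples -> coefficients)
# is hidden behind `evaluate`; a learner only sees values at query points.

def prod_pow(point, exps):
    """Evaluate the monomial x_1^{e_1} * ... * x_n^{e_n} at a point."""
    result = 1
    for x, e in zip(point, exps):
        result *= x ** e
    return result

def make_black_box(terms):
    """terms: {(e_1, ..., e_n): coefficient} for a T-sparse polynomial."""
    def evaluate(point):
        return sum(c * prod_pow(point, exps) for exps, c in terms.items())
    return evaluate

# f(x, y, z) = 7*x^5*z^2 - 3*y^4 + 1: a 3-sparse polynomial in 3 variables
f = make_black_box({(5, 0, 2): 7, (0, 4, 0): -3, (0, 0, 0): 1})
print(f((1, 1, 1)))   # 7 - 3 + 1 = 5
print(f((2, 1, 1)))   # 7*32 - 3 + 1 = 222
```

The identification problem is to recover the hidden dictionary from as few calls to `evaluate` as possible.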
2. Algorithmic Strategies
A spectrum of algorithmic techniques addresses sparse multivariate polynomial identification, each with distinct structural and computational trade-offs:
2.1 Randomized Kronecker Substitutions and Term Diversification
- Randomized Kronecker Substitution: Rather than the classical substitution x_i → z^{(D+1)^{i-1}} (leading to a univariate image of degree up to (D+1)^n − 1, where D bounds each partial degree), randomized projections use x_i → z^{s_i} with the s_i selected from intervals or sets of primes dependent on T, substantially lowering the degree of the resulting univariate image, at the cost of potential term collisions. Multiple independent substitutions and collision analysis (e.g., via Chernoff bounds) ensure each term appears collision-free in at least half the images, enabling support recovery (Arnold et al., 2014, Huang et al., 2017).
- Diversification: To uniquely tag terms and resolve ambiguous collisions, each variable x_i is additionally scaled by a randomly chosen nonzero field element α_i, so coefficient collisions in different images can be distinguished by their (now distinct, randomized) values (Arnold et al., 2014, Huang, 2020).
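The collision behavior of randomized substitutions can be sketched in a few lines; the support and parameter ranges below are illustrative, not taken from the cited papers:

```python
import random

# Illustrative support of a 3-variable sparse polynomial: exponent vectors only.
support = [(5, 0, 2), (1, 7, 0), (0, 3, 4), (2, 2, 2)]

def univariate_images(support, s):
    """Degree of each term's image under x_i -> z**s[i].  Two terms collide
    when distinct exponent vectors map to the same univariate degree."""
    return [sum(e * si for e, si in zip(exps, s)) for exps in support]

def collision_free_counts(support, trials, rng):
    """For each term, count images (over random substitutions) in which
    it appears collision-free."""
    counts = [0] * len(support)
    for _ in range(trials):
        s = [rng.randrange(1, 100) for _ in range(len(support[0]))]
        imgs = univariate_images(support, s)
        for i, deg in enumerate(imgs):
            if imgs.count(deg) == 1:      # term i survived this substitution
                counts[i] += 1
    return counts

counts = collision_free_counts(support, trials=20, rng=random.Random(0))
# With random s_i a collision requires a random linear form to vanish, so
# each term is collision-free in well over half the images -- the property
# the majority-vote support-recovery step relies on.
print(all(c > 10 for c in counts))
```

This only models the exponent arithmetic; a full algorithm would also evaluate the black box at the substituted points and interpolate each univariate image.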
2.2 Substitution and Modular Reduction
- Wrapping and Modular Reduction: Large exponents are compressed by reducing the univariate image modulo z^p − 1 for carefully chosen primes p, which facilitates collision avoidance and modular arithmetic over small rings, enabling quasi-optimal algorithms in the modular black-box model (Arnold et al., 2014, Hoeven et al., 2023). Chinese Remaindering then combines the modular images to recover the true exponents.
- Cyclic Modular Projection: The multivariate f is mapped to a univariate problem via cyclic projections x_i → z^{s_i} mod (z^p − 1), with randomly chosen exponents s_i and modulus p; this enables efficient DFT-based recovery of exponents and coefficients (Hoeven et al., 2023).
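A minimal sketch of the Chinese Remaindering step, recovering a large exponent from its residues modulo several small primes (the exponent and moduli are chosen for illustration):

```python
from math import prod

def crt(residues, moduli):
    """Reconstruct the unique x mod prod(moduli) with x = r_j (mod m_j).
    The moduli are assumed pairwise coprime."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # pow(Mi, -1, m): modular inverse
    return x % M

# A term's true exponent, observed only modulo small primes p (as produced
# by wrapping the univariate image modulo z^p - 1):
e = 123456
moduli = [101, 103, 107]           # product 1113121 > e, so recovery is exact
residues = [e % p for p in moduli]
print(crt(residues, moduli))       # 123456
```

Recovery is exact as soon as the product of the moduli exceeds the degree bound, which is what guides the choice of primes in the cited algorithms.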
2.3 Term Testing and Recursive Peeling
- Recursion and Term Testing: Given the ability to reconstruct a "T-approximation" of f (a candidate polynomial with at most T terms capturing part of the support of f), a recursive algorithm identifies a portion of the terms, subtracts their contribution, and repeats until all monomials are found. Deterministic and probabilistic criteria—such as a drop in support size upon removal of candidate monomials—are used to robustly test candidate terms (Huang et al., 2017).
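The peeling loop can be sketched schematically; here the polynomial is represented directly as an exponent-to-coefficient dict, and the "approximation" oracle is faked by a function that reveals only part of the residual's support each round (all names hypothetical):

```python
def peel(target_terms, approximate):
    """Recover all terms of a sparse polynomial by repeatedly taking a
    partial approximation, subtracting the found terms, and recursing on
    the residual until nothing is left."""
    found = {}
    residual = dict(target_terms)
    while residual:
        batch = approximate(residual)        # partial "T-approximation"
        for exps, c in batch.items():
            found[exps] = found.get(exps, 0) + c
            residual[exps] = residual.get(exps, 0) - c
            if residual[exps] == 0:
                del residual[exps]           # term fully explained: peel it off
    return found

# Fake approximation oracle: reveals at most two of the residual's terms.
def two_terms(poly):
    return dict(list(poly.items())[:2])

g = {(3, 0): 5, (1, 2): -2, (0, 4): 7, (2, 2): 1, (0, 0): 9}
print(peel(g, two_terms) == g)   # True: all five terms recovered
```

In the real algorithm the subtraction happens through the black box (evaluating f minus the found terms), and the approximation step is a randomized interpolation; the control flow is the same.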
2.4 Derivative-Based and Prony Methods
- Derivative-Based Identification: For polynomials given by straight-line programs, suitable evaluations of derivatives and shifted variables allow the construction of Hankel or Toeplitz systems whose solutions yield the term exponents and coefficients, especially when the field characteristic is large (Huang, 2020).
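For a univariate image, the Prony/Hankel idea can be worked exactly over the rationals: evaluations at successive powers of a base ω satisfy a linear recurrence whose characteristic roots are ω^{e_i}, from which exponents and then coefficients follow. A two-term worked sketch (the hidden polynomial is illustrative):

```python
from fractions import Fraction
from math import isqrt

# Hidden polynomial f(x) = 3*x^5 + 2*x^2; we only see f at powers of w = 2.
w = 2
values = [3 * w ** (5 * k) + 2 * w ** (2 * k) for k in range(4)]
v0, v1, v2, v3 = (Fraction(v) for v in values)   # 5, 104, 3104, 98432

# Step 1: solve the 2x2 Hankel system  v_{k+2} = l1*v_{k+1} + l0*v_k.
det = v1 * v1 - v0 * v2
l1 = (v1 * v2 - v0 * v3) / det
l0 = (v1 * v3 - v2 * v2) / det

# Step 2: the roots of z^2 - l1*z - l0 are w**e_1 and w**e_2.
disc = l1 * l1 + 4 * l0
s = isqrt(int(disc))
r1, r2 = (l1 + s) / 2, (l1 - s) / 2
exponents = [int(r).bit_length() - 1 for r in (r1, r2)]  # log_2 of a power of 2

# Step 3: coefficients from the Vandermonde system
#         c1 + c2 = v0,  r1*c1 + r2*c2 = v1.
c1 = (v1 - r2 * v0) / (r1 - r2)
c2 = v0 - c1
print({exponents[0]: int(c1), exponents[1]: int(c2)})   # {5: 3, 2: 2}
```

The derivative-based method of (Huang, 2020) produces structured systems of the same Hankel/Toeplitz shape; this sketch shows only the linear-algebra core for T = 2.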
3. Complexity, Sparsity, and Hardness Barriers
The complexity of sparse multivariate polynomial identification depends critically on the sparsity parameter T, the number of variables n, the degree bound D, the bit-size of the coefficients, and the algebraic structure (e.g., finite fields, integers, rationals):
- Randomized algorithms achieve bit-complexity polynomial in T, n, log D, and the bit-size of the coefficients, with a total number of probes quasi-linear in T—near the information-theoretic optimum (Arnold et al., 2014, Hoeven et al., 2023).
- Deterministic algorithms generally require more evaluations or higher arithmetic complexity; modulus-changing Kronecker substitutions can achieve linear or near-linear dependence on T in favorable parameter regimes (Huang et al., 2017).
- Randomized algorithms relying on term diversification and random projection achieve expected complexity quasi-linear in the sparsity T (Huang, 2020), improving on earlier approaches with higher dependence on the degree and sparsity parameters.
- Over finite fields, deriving algorithms matching the information-theoretic minimum number of probes—and with polynomial bit-complexity in input size—requires careful handling of collisions, use of sufficiently large fields, and derandomization of certain projection steps.
- For highly structured polynomials or in the presence of constraints (such as known coefficient sets (Huang et al., 2017), or invariance under root systems (Hubert et al., 2020)), further complexity reductions are possible via problem-specific reductions.
Some identification problems—when the sparsity bound T is part of the input—are provably NP-hard, by direct reduction from classical minimum-distance and syndrome decoding problems (coding theory) or subset-sum problems (Giesbrecht et al., 2010).
4. Extensions: Special Bases, System Settings, and Structural Sparsity
Sparse identification is not limited to monomial bases. Key extensions include:
- Generalized Polynomial Bases: Identification in the Schubert basis or multivariate Chebyshev polynomials leverages group invariance and explicit Hankel operator methods for efficient support recovery (Mukhopadhyay et al., 2015, Hubert et al., 2020).
- Volterra and Polynomial Regression Models: In nonlinear system identification and inference, polynomial expansions are applied to time series with large combinatorial feature sets. Sparse regularization (Lasso, weighted Lasso, and RLS-type adaptive algorithms) enables recovery with far fewer measurements than unknowns, provided Restricted Isometry Properties (RIP) are satisfied with high probability (Kekatos et al., 2011).
- Optimization with Encoded Sparsity: For polynomials representable via a small number of linear forms, gradient-rank and subspace analysis allow both detection of intrinsic sparsity and reduction of the optimization problem's dimensionality—from the ambient number of variables n down to the (much smaller) number of underlying linear forms—greatly lowering computational cost for polynomial optimization over convex domains (Lasserre, 2022).
- Noisy and Tensor Models: Weighted tensor decomposition and canonical polyadic decomposition (CPD) expand sparse polynomial identification to settings where coefficients are estimated with uncertainty. Covariance-informed weighting and alternated least squares provide robustness for decoupling multivariate polynomials into sums of univariate functions, with applications such as parallel Wiener-Hammerstein system identification (Hollander et al., 2016).
5. Practical Implementations and Experimental Demonstrations
A number of prototype and scalable implementations highlight the practical tractability of the described techniques:
- Voxelization and Subdivision Algorithms: For zero-set identification in sparse systems, the Compressed Sparse Fiber (CSF) data structure supports quasi-linear amortized evaluation and efficient subdivision, enabling enclosure of solution sets even for trivariate degree-100 systems previously intractable by standard methods (Moroz, 14 Jun 2024).
- Software Prototypes: Implementations in Maple, Python, and specialized frameworks validate the theoretical complexity bounds, demonstrating scaling to high-degree, high-dimensional polynomials.
- System Identification: Applications to automotive electronic throttle control and other biomedical or engineering systems leverage high-order derivative reconstruction, sparse polynomial fitting, and model selection via statistical error percentiles to overcome traditional overparameterization in black-box modeling (Alamir, 22 Sep 2025).
- Meta-Modeling and Uncertainty Quantification: Canonical low-rank approximations and sparse polynomial chaos expansions serve as surrogate models for complex simulations, with empirical results confirming superior conditional generalization error in predicting extreme responses (Konakli et al., 2015).
6. Applications and Implications
Sparse multivariate polynomial identification is fundamental to multiple domains:
- Computational Algebra and Symbolic Computation: Algorithms for sparse interpolation, system solving, and factorization underpin modern computer algebra systems and are crucial in the development of efficient algebraic processors.
- Coding Theory and Cryptography: Recovery of sparse multiples and identification of sparse factors are intimately linked to syndrome decoding, minimum weight codeword finding, and the security analysis of stream ciphers based on LFSRs (Giesbrecht et al., 2010).
- Signal Processing and Compressive Sensing: The identification of sparse models aligns naturally with compressed sampling and limited-data recovery paradigms, with direct consequences for real-time inference and sparse regression (Kekatos et al., 2011, Alamir, 22 Sep 2025).
- Optimization, Machine Learning, and Statistics: Detection and exploitation of low-dimensional structure in polynomials accelerate optimization routines in high-dimensional probabilistic models, facilitate manifold learning (Lasserre, 2022), and support genome-wide association analyses (Kekatos et al., 2011).
7. Open Problems and Research Frontiers
Key research directions and open questions include:
- Derandomization: Many state-of-the-art algorithms are Monte Carlo; derandomizing Kronecker substitution, irreducibility-preserving projections, and collision avoidance in the black-box model remain challenging (Huang et al., 2017, Dutta et al., 26 Nov 2024).
- Unconditional Hardness: Proving NP-hardness for general sparse identification problems when the sparsity bound T is part of the input, particularly in the multivariate case, is of central complexity-theoretic interest (Giesbrecht et al., 2010).
- Structured/Group Sparsity and Hierarchical Bases: Exploiting structured forms of sparsity (group, block, hierarchical) or tailored bases (e.g., Schubert, Chebyshev) could further improve efficiency and accuracy.
- Optimal Sample and Query Complexity: Tight lower and upper bounds on the minimal number of evaluations, especially in noisy and finite-field settings, remain an ongoing research topic (Huang, 2020).
- Large-Scale and Parallel Implementations: Adapting amortized and recursive schemes for implementation on modern parallel hardware and exploring the limits of existing number-theoretic structures in large dimensions remain open practical challenges (Hoeven et al., 2023, Moroz, 14 Jun 2024).
- Factorization and Sparse Factor Recovery: Reducing sparse factor recovery to polynomial identity and divisibility testing—particularly derandomizing irreducibility-preserving projections in the sparse setting—remains an open problem with broad implications for pseudorandomness and circuit complexity (Dutta et al., 26 Nov 2024).
Sparse multivariate polynomial identification thus remains a vibrant research area at the intersection of symbolic computation, algorithmic mathematics, and high-dimensional data science, with impactful ongoing advances in theoretical, algorithmic, and applied dimensions.