
CP Rank Selection via Sparsity-Inducing Priors

Updated 2 August 2025
  • The paper presents a sparsity-inducing framework using submodular functions and convex surrogates to accurately recover the CP tensor rank.
  • It details hierarchical Bayesian techniques and ARD priors that enable group-level shrinkage and automatic relevance determination in multiway data.
  • The approach balances sparse model order selection with robust estimation through cross-validation, nonconvex regularizers, and scalable variational inference.

A sparsity-inducing prior for CP (CANDECOMP/PARAFAC) rank selection is a probabilistic or regularization framework designed to identify a minimal number of nonzero CP tensor components, thus effectively determining the tensor's rank by inducing sparsity in the set of candidate factors. The construction and implementation of such priors connect convex and nonconvex optimization, hierarchical Bayes, submodular analysis, and modern information criteria, and enable automatic or adaptive model order determination in high-dimensional multiway data.

1. Submodular Functions, Convex Envelopes, and Structured Norms

The selection of a small number of active CP tensor factors—a proxy for CP rank—can be approached as a combinatorial selection problem. Instead of directly minimizing the number of nonzero components, structured sparsity-inducing penalties use a set function $F: 2^V \to \mathbb{R}_+$, where $V$ indexes candidate factors, to encode both sparsity and prior structural constraints. $F$ is chosen to be nondecreasing and submodular (i.e., $F(A) + F(B) \geq F(A \cup B) + F(A \cap B)$ for all $A, B \subseteq V$), which generalizes the cardinality function.

The relaxation of the combinatorial penalty $F(\operatorname{supp}(w))$ is achieved through its convex envelope on the $\ell_\infty$ ball, constructed as the Lovász extension $f(w)$:

$$f(w) = \sum_{k=1}^{p} w_{j_k} \left[ F(\{j_1, \ldots, j_k\}) - F(\{j_1, \ldots, j_{k-1}\}) \right]$$

for $w \in \mathbb{R}_+^p$ with components ordered as $w_{j_1} \geq w_{j_2} \geq \cdots \geq w_{j_p} \geq 0$ (1008.4220). The resulting polyhedral norm $\Omega(w) = f(|w|)$ provides a convex surrogate tailored to the desired interaction between sparsity and structure.

In the context of CP rank selection, $F$ can specifically penalize the inclusion of additional rank-one components (e.g., $F(A) = |A|$ yields the $\ell_1$ norm), or encode block/group structure, hierarchies, or couplings among factors to favor more structured low-rank solutions (1008.4220). The theoretical advantage is that the support of the minimizer under such penalties is a stable set for $F$, with support recovery consistency under general conditions.
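
As a minimal illustration, the sketch below (Python; the helper names `lovasz_extension` and `omega` are hypothetical) evaluates the Lovász extension by sorting the entries of $|w|$ and accumulating the marginal gains of $F$; with the cardinality function it reduces to the $\ell_1$ norm, as noted above.

```python
import numpy as np

def lovasz_extension(w, F):
    """Lovász extension f(w) of a nondecreasing submodular set function F
    (with F(empty set) = 0), evaluated at a nonnegative vector w."""
    w = np.asarray(w, dtype=float)
    order = np.argsort(-w)                   # indices j_1, ..., j_p with w_{j_1} >= ... >= w_{j_p}
    f, prev, active = 0.0, 0.0, []
    for j in order:
        active.append(int(j))
        gain = F(frozenset(active)) - prev   # marginal gain F({j_1..j_k}) - F({j_1..j_{k-1}})
        f += w[j] * gain
        prev += gain
    return f

def omega(w, F):
    """Polyhedral norm Omega(w) = f(|w|)."""
    return lovasz_extension(np.abs(w), F)

# With F(A) = |A| (cardinality) the norm reduces to the L1 norm:
card = lambda A: float(len(A))
w = np.array([0.5, -2.0, 0.0, 1.5])
print(omega(w, card), np.sum(np.abs(w)))     # both equal 4.0
```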

2. Hierarchical Bayesian and Group Priors

Hierarchical Bayesian frameworks provide sparse priors through Gaussian scale mixtures, where each parameter (e.g., a vector of CP factor weights or loadings) $\beta_j$ receives a prior:

  • $\beta_j \mid \tau_j^2 \sim \mathcal{N}(0, \tau_j^2)$,
  • $\tau_j^2 \sim \operatorname{Exp}(\lambda^2/2)$ (Exponential or Inverse-Gamma) (1009.1914).

Marginalizing over the scale parameter yields Laplace or generalized t-distributions. Placing an additional hierarchy—such as an inverse-gamma prior at the next layer—produces heavy-tailed, nonconvex sparsity-inducing priors that can adapt to both large signals and noise, introducing less bias for significant CP components (1009.1914). For grouped variables (e.g., all weights for a given CP component), group-level scales allow entire components to be “turned on/off,” inducing block sparsity and facilitating group-level CP rank selection.

The maximum a posteriori (MAP) estimate under such priors is computed by iterative reweighting or expectation-maximization (EM), with adaptive penalties calibrated by current parameter estimates. This approach is computationally efficient and naturally extends to the group sparse CP decomposition setting (1009.1914).
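A minimal sketch of this reweighting idea, assuming a plain linear model with grouped coefficients (one group per CP component); the routine name `map_group_sparse` is hypothetical, and the cited EM/MAP schemes differ in detail, but the alternation between a weighted ridge solve and adaptive penalties refreshed from current group norms is the same.

```python
import numpy as np

def map_group_sparse(y, X, groups, lam=1.0, n_iter=50, eps=1e-8):
    """MAP estimate under a hierarchical group prior via iterative reweighting:
    each pass solves a ridge problem whose per-group weights lam / (||b_g|| + eps)
    are refreshed from the current estimate -- the standard MM/EM-style surrogate
    for a block-sparse (group-L1) penalty that can switch whole groups off."""
    groups = np.asarray(groups)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        w = np.empty(X.shape[1])
        for g in np.unique(groups):
            idx = np.flatnonzero(groups == g)
            w[idx] = lam / (np.linalg.norm(b[idx]) + eps)    # adaptive penalty per group
        b = np.linalg.solve(X.T @ X + np.diag(w), X.T @ y)   # weighted ridge solve
    return b

# Toy check: 3 groups of 2 coefficients, only group 0 truly active.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = X @ np.array([1.5, -2.0, 0, 0, 0, 0]) + 0.1 * rng.normal(size=100)
print(np.round(map_group_sparse(y, X, [0, 0, 1, 1, 2, 2]), 3))
```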

3. Automatic Relevance Determination and Fully Bayesian CP Rank Estimation

Automatic relevance determination (ARD) priors penalize columns of the factor matrices across all CP modes by assigning each component a common latent precision $\lambda_r$, with hierarchical Gamma hyperpriors:

  • For each mode $n$,

$$p(A^{(n)} \mid \lambda) = \prod_{i_n=1}^{I_n} \mathcal{N}\!\left(a_{i_n}^{(n)} \mid 0, \operatorname{diag}(\lambda_1^{-1}, \ldots, \lambda_R^{-1})\right)$$

  • $p(\lambda_r) = \operatorname{Gamma}(c_0, d_0)$ (Zhao et al., 2014).

As $\lambda_r \to \infty$, the $r$th component is shrunk to zero jointly across all modes. This yields a fully Bayesian CP decomposition with automatic rank selection: initialize with $R$ large, and the posterior inference prunes superfluous columns. Deterministic variational Bayesian (VB) algorithms provide closed-form parameter updates for all posteriors, scaling linearly with the number of observed entries. Predictive Student-t posteriors are naturally produced for imputation and uncertainty quantification in missing-data scenarios (Zhao et al., 2014).
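The following sketch is a simplified, MAP-style analogue of the ARD mechanism (alternating ridge-regularized factor updates with per-component precisions shared across modes, then pruning); it is not the full variational Bayes algorithm of the cited work, and the helper names (`cp_ard`, `khatri_rao`, `unfold`) are illustrative.

```python
import numpy as np

def unfold(T, n):
    """Mode-n unfolding (C-order) of a dense tensor."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

def khatri_rao(mats):
    """Column-wise Khatri-Rao product, consistent with the C-order unfolding above."""
    out = mats[0]
    for M in mats[1:]:
        out = np.einsum('ir,jr->ijr', out, M).reshape(-1, out.shape[1])
    return out

def cp_ard(T, R=10, n_iter=100, a0=1e-6, b0=1e-6, prune_thresh=1e6):
    """Simplified MAP-style analogue of ARD-regularized CP: alternate ridge-type
    factor updates with per-component precisions lambda_r shared across all modes,
    then prune components whose precision has diverged (column shrunk to ~0)."""
    N = T.ndim
    A = [np.random.randn(dim, R) for dim in T.shape]
    lam = np.ones(R)
    for _ in range(n_iter):
        for n in range(N):
            Z = khatri_rao([A[m] for m in range(N) if m != n])
            G = Z.T @ Z + np.diag(lam)                        # ARD precisions act as a ridge
            A[n] = np.linalg.solve(G, Z.T @ unfold(T, n).T).T
        col_sq = sum((M ** 2).sum(axis=0) for M in A)         # energy of component r over all modes
        lam = (a0 + 0.5 * sum(T.shape)) / (b0 + 0.5 * col_sq)
    keep = lam < prune_thresh
    return [M[:, keep] for M in A], int(keep.sum())

# Toy run: a noiseless rank-3 tensor, started deliberately over-complete with R=8;
# redundant components are typically driven toward zero and pruned.
rng = np.random.default_rng(0)
true = [rng.normal(size=(d, 3)) for d in (20, 18, 16)]
T = np.einsum('ir,jr,kr->ijk', *true)
factors, rank_est = cp_ard(T, R=8)
print(rank_est)
```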

4. Generalized Hyperbolic Priors for Robust Rank Estimation

Rigid choices of sparsity priors (e.g., Gaussian-gamma or Laplace) may fail for high-rank or low-SNR tensors. The generalized hyperbolic (GH) prior introduces additional flexibility, including parameters that tune both the sharpness at the origin and the tail behavior. The prior for each group (the $l$th column across all mode factor matrices for a CP component) is written as

$$\operatorname{GH}\left(\{U^{(n)}_{(:,l)}\}_{n=1}^N \mid a_l^0, b_l^0, \lambda_l^0\right)$$

with the Gaussian scale mixture representation:

$$p(\{U^{(n)}_{(:,l)}\}) = \int \mathcal{N}\!\left(\operatorname{vec}(\{U^{(n)}_{(:,l)}\}) \mid 0, z_l I\right) \operatorname{GIG}(z_l \mid a_l^0, b_l^0, \lambda_l^0) \, dz_l$$

This allows consistent and robust automatic pruning of components even in highly challenging settings, outperforming Gaussian-gamma priors in recovering both low and high tensor ranks at varying SNRs (Cheng et al., 2020). Variational Bayes inference yields closed-form updates for all parameters and latent variables, ensuring scalability.
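Because the scale-mixture updates hinge on posterior moments of the GIG variables, a small sketch of those moments is given below, assuming the common $(a, b, p)$ parameterization with density proportional to $z^{p-1} e^{-(az + b/z)/2}$ (the cited paper's argument ordering may differ; the function name is hypothetical).

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind K_nu

def gig_moments(a, b, p):
    """E[z] and E[1/z] of GIG(z | a, b, p); the expected inverse scale E[1/z_l]
    is what acts as the joint ridge weight on column l of every factor matrix,
    so a large value shrinks that CP component toward zero across all modes."""
    s = np.sqrt(a * b)
    ratio = kv(p + 1, s) / kv(p, s)
    e_z = np.sqrt(b / a) * ratio
    e_inv_z = np.sqrt(a / b) * ratio - 2 * p / b
    return e_z, e_inv_z

print(gig_moments(a=2.0, b=3.0, p=-0.5))
```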

5. Nonconvex, Polyhedral, and Adaptive Regularizers

Beyond convex relaxations, nonconvex sparsity-inducing regularizers such as group penalties $R(B) = \sum_i \xi_i \rho(\|b_i\|_2)$, with choices like the Geman function $\rho_{\text{GM}}(|x|) = |x|/(\theta + |x|)$, further sharpen support recovery and remove estimation bias for large coefficients (Zhao et al., 2018). These penalties more decisively eliminate unnecessary CP components when an over-complete parameterization is provided, and can be solved efficiently with alternating minimization and majorization–minimization algorithms.
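A brief sketch of the majorization-minimization step for the Geman group penalty: since $\rho_{\text{GM}}$ is concave on $[0, \infty)$, its tangent majorizer turns each step into a weighted group-$\ell_2$ problem whose weights vanish for strong components (function names are illustrative, not from the cited work).

```python
import numpy as np

def geman(x, theta=1.0):
    """Geman penalty rho_GM(x) = x / (theta + x), x >= 0: steep near the origin
    (sharp thresholding) but saturating for large x (little bias on strong components)."""
    return x / (theta + x)

def mm_group_weights(B, theta=1.0, xi=1.0):
    """One MM step for R(B) = sum_i xi_i * rho_GM(||b_i||_2): the tangent majorizer
    yields a weighted group-L2 problem with per-group weight
    xi_i * rho_GM'(||b_i||) = xi_i * theta / (theta + ||b_i||)**2."""
    norms = np.linalg.norm(B, axis=1)            # rows of B are the groups (CP components)
    return xi * theta / (theta + norms) ** 2

# Small components get weight ~ xi/theta (strong shrinkage, candidates for pruning);
# large components get a vanishing weight and hence negligible bias.
B = np.array([[2.0, -1.0, 0.5], [0.02, 0.01, 0.0]])
print(geman(np.linalg.norm(B, axis=1), theta=0.5))
print(mm_group_weights(B, theta=0.5))
```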

Recent work also develops frameworks to systematically generate sparsity-inducing regularizers with closed-form proximity or thresholding operators, enabling scalable optimization in both matrix and tensor (low-tubal-rank) completion problems. When applied to the singular values (or tubes), such regularizers act as nonconvex but computationally efficient rank surrogates, outperforming convex nuclear norm-based surrogates in many settings (Wang et al., 2023, Wang et al., 2023).

6. Calibration, Cross-Validation, and Information-Theoretic Model Selection

Model selection criteria addressing both sparsity and rank minimization must correctly adjust for the data-driven selection effect. Fixing regularization parameters (e.g., $\lambda$ in Lasso-type penalties) across CV folds can yield inconsistent selection of sparsity patterns and ranks (She et al., 2018). This motivates cross-validation on the structural selection-projection pattern (e.g., active and projected CP factors) rather than on penalty magnitude, with minimax-optimal and scale-free information criteria calibrated to match the theoretical error bound:

$$\text{Risk} \asymp \sigma^2 \left\{ [\min(q, J) + m - r] \cdot r + J \log (ep/J) \right\}$$

where $J$ is the active support size and $r$ is the (CP) rank. This framework ensures principled and reproducible rank and sparsity selection, bypassing the need for separate noise estimation (She et al., 2018).
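As a hedged illustration only (the exact criterion of She et al., 2018 differs in derivation and constants), one scale-free way to rank candidate selection-projection patterns is to trade a log residual term, which absorbs the unknown noise scale, against the complexity expression from the bound above; here $q$, $m$, $p$ are the same dimension parameters appearing in that bound, and the function names are hypothetical.

```python
import numpy as np

def complexity(J, r, q, m, p):
    """Complexity term matching the risk bound:
    [min(q, J) + m - r] * r + J * log(e * p / J)."""
    return (min(q, J) + m - r) * r + J * np.log(np.e * p / J)

def scale_free_ic(rss, n, J, r, q, m, p, A=2.0):
    """Illustrative scale-free criterion: log residual sum of squares (no separate
    noise estimate needed) plus A times the bound-matched complexity of the pattern."""
    return 0.5 * n * np.log(rss / n) + A * complexity(J, r, q, m, p)

# Candidates are (support size J, rank r, residual sum of squares from the refit):
# best = min(candidates, key=lambda c: scale_free_ic(c[2], n, c[0], c[1], q, m, p))
```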

7. Trade-offs, Limitations, and Practical Recommendations

Practical deployment of sparsity-inducing priors for CP rank selection must address the trade-off between sparsity (support size, parsimony) and the risk of discarding significant components. Iterative or cutting-plane strategies that incrementally enforce rank constraints or refine penalties enable exploration of the Pareto front between sparsity and model complexity (Fampa et al., 2020). While convex (polyhedral) penalties afford strong theoretical guarantees and scalable algorithms, nonconvex approaches can further enhance estimation accuracy but pose challenges with local minima and initialization sensitivity.

The choice of prior or penalty should be informed by empirical testing: moment-based or kurtosis-based tests can diagnose deviation from Laplace ($\ell_1$) assumptions and prompt adaptive switching to $\ell_q$ or other generalized power priors (Griffin et al., 2017).
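A crude illustration of such a diagnostic, using the fact that the Laplace distribution has excess kurtosis 3; this is not the formal test of the cited work, and the thresholding rule below is an assumption for demonstration.

```python
import numpy as np
from scipy.stats import kurtosis

def laplace_kurtosis_check(coefs, tol=1.0):
    """Moment-based sanity check: compare sample excess kurtosis of coefficient
    estimates with the Laplace value of 3 to flag mismatch with an L1-type prior."""
    ek = kurtosis(coefs, fisher=True)   # excess kurtosis: 0 for Gaussian, 3 for Laplace
    if ek > 3 + tol:
        return ek, "heavier tails than Laplace: consider an l_q prior with q < 1"
    if ek < 3 - tol:
        return ek, "lighter tails than Laplace: L1 may over-penalize large components"
    return ek, "consistent with the Laplace (L1) assumption"

rng = np.random.default_rng(1)
print(laplace_kurtosis_check(rng.laplace(size=5000)))
```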

Summary Table: Methodological Approaches

| Technique | Principal Feature | CP Rank Selection Mechanism |
| --- | --- | --- |
| Submodular/Lovász extensions | Structured convex surrogate via set function $F$ | Polyhedral-norm sparsity, support recovery |
| Hierarchical Bayesian (HAL, ARD) | Gaussian scale mixtures; group and adaptive penalties | Group-wise shrinkage, MAP, Bayesian pruning |
| Generalized hyperbolic (GH) | Flexible, heavy-tailed prior via Gaussian scale mixtures | Robust ARD; improved high-rank/low-SNR recovery |
| Nonconvex regularizers | Bias-reduced, sharp thresholding (e.g., Geman, closed-form prox) | Aggressive component elimination, efficiency |
| Cross-validation/information criteria | Calibrated, scale-free, selection-pattern-based CV | Ranking by structural error, minimax optimal |

The construction and calibration of sparsity-inducing priors for CP rank selection synthesizes convex geometry, submodular analysis, Bayesian inference, and algorithmic optimization. These techniques collectively enable both accurate rank estimation and robust, interpretable CP decompositions in practical multiway data analysis.