Posterior contraction in sparse Bayesian factor models for massive covariance matrices

Published 16 Jun 2012 in math.ST | (1206.3627v4)

Abstract: Sparse Bayesian factor models are routinely implemented for parsimonious dependence modeling and dimensionality reduction in high-dimensional applications. We provide theoretical understanding of such Bayesian procedures in terms of posterior convergence rates in inferring high-dimensional covariance matrices where the dimension can be larger than the sample size. Under relevant sparsity assumptions on the true covariance matrix, we show that commonly-used point mass mixture priors on the factor loadings lead to consistent estimation in the operator norm even when $p\gg n$. One of our major contributions is to develop a new class of continuous shrinkage priors and provide insights into their concentration around sparse vectors. Using such priors for the factor loadings, we obtain similar rate of convergence as obtained with point mass mixture priors. To obtain the convergence rates, we construct test functions to separate points in the space of high-dimensional covariance matrices using insights from random matrix theory; the tools developed may be of independent interest. We also derive minimax rates and show that the Bayesian posterior rates of convergence coincide with the minimax rates upto a $\sqrt{\log n}$ term.

Abstract PDF Upgrade to Chat

Summary

The paper derives precise posterior contraction rates for estimating high-dimensional covariance matrices using sparse Bayesian factor models, achieving near minimax optimality.
It compares point mass mixture and continuous shrinkage priors, revealing that continuous shrinkage provides practical and robust performance in ultra-high-dimensional settings.
The study utilizes novel test constructions based on random matrix theory and simulation experiments to validate its theoretical guarantees and computational effectiveness.

Posterior Contraction in Sparse Bayesian Factor Models for Massive Covariance Matrices

Introduction and Problem Setting

This work addresses the theoretical properties of posterior contraction in sparse Bayesian factor models for estimating high-dimensional covariance matrices, specifically focusing on situations where the ambient dimension $p$ significantly exceeds sample size $n$ ( $p \gg n$ ). The focus is on latent factor models for covariance estimation under sparsity: each observed vector $y_i \in \mathbb{R}^p$ is modeled as

$y_i = \Lambda \eta_i + \varepsilon_i, \quad \varepsilon_i \sim N_p(0, \Omega),$

where $\Lambda$ is a $p \times k$ factor loading matrix (with $k \ll p$ ), $\eta_i \sim N_k(0,I_k)$ , and $\Omega$ is diagonal. Marginally, the covariance takes the reduced form $\Sigma = \Lambda \Lambda^T + \Omega$ , drastically reducing the number of free parameters from $O(p^2)$ to $O(pk)$ . The key practical and theoretical challenge is to obtain non-trivial estimation guarantees when $p$ grows much faster than $n$ .

The analysis emphasizes "ultra"-high-dimensional settings and provides precise posterior contraction rates under sparsity assumptions and prior constructions that are relevant for genomic, neuroscience, and other modern high-dimensional datasets.

Assumptions and Prior Structures

The theoretical development rests on several assumptions:

Factor Structure on Truth: The true covariance has the form $\Sigma_{0n} = \Lambda_{0n} \Lambda_{0n}^T + \Omega_{0n}$ , with growing $p_n$ and possibly growing $k_{0n}$ .
Column Sparsity: Each column of $\Lambda_{0n}$ has at most $s_n$ nonzero components ( $s_n \ll p_n$ ), reflecting factor sparsity.
Pervasiveness and Conditioning: Spectral conditions on $\Lambda_{0n}$ , echoing "pervasive" factors in random matrix theory, with mild restrictions on the growth of the largest eigenvalue $c_n$ and minimal conditions on the residual variance.
Sample Size and Model Complexity: Conditions such that $c_n k_{0n}^{3/2} \sqrt{(s_n \log p_n/n) \sqrt{\log n}} \to 0$ , which allow $p_n$ to be of order $\exp(n^\alpha)$ for some $\alpha \in (0,1/5)$ under typical sparsity regimes.

Two types of priors for loadings are analyzed:

Point Mass Mixture Priors (Spike-and-Slab): Each entry is zero with high probability, otherwise drawn from a heavy-tailed distribution. This matches frequentist penalization with exact sparsity but is computationally challenging.
Continuous Shrinkage Priors: Hierarchical scale mixtures of Laplace (double exponential) distributions with global-local scales (motivated by the Horseshoe, Dirichlet-Laplace, etc.). These enable efficient MCMC without exact zeros.

Priors on $k$ favor small numbers of factors, and on $\sigma^2$ (residual variance) are diffuse but proper.

Main Theoretical Results

Posterior Contraction Rates

A central contribution is the derivation of explicit posterior contraction rates for estimating the covariance $\Sigma_{0n}$ in the operator (spectral) norm, under both point mass mixture and continuous shrinkage priors. The results can be summarized as follows:

Operator Norm Consistency: If $s_n \gtrsim \log p_n$ and $k_{0n} = O(1)$ , the posterior contracts at rate

$\varepsilon_n = c_n \sqrt{\frac{s_n \log p_n}{n}} \sqrt{\log n}$

in operator norm, i.e.,

$\Pi_n \left( \| \Sigma_n - \Sigma_{0n} \|_2 > M \varepsilon_n \mid y^{(n)} \right) \to 0$

in probability, for any sufficiently large $M$ . The dependence on $c_n$ encodes the "energy" in the largest eigenvalue.

General $k_{0n}$ Growth: If $k_{0n}$ is allowed to grow, the contraction rate becomes

$c_n k_{0n}^{3/2} \sqrt{\frac{s_n \log p_n}{n}} \sqrt{\log n}.$

Thus, sparsity and factor proliferation both contribute to statistical complexity.

Matching Minimax Lower Bounds: For fixed $k_{0n}$ , the posterior rates coincide with a new minimax lower bound (proved via Fano's method) up to an explicit $\sqrt{\log n}$ factor:

$\inf_{\hat{\Sigma}_n} \sup_{\Sigma_{0n}} \mathbb{E}\| \hat{\Sigma}_n - \Sigma_{0n} \|_2 \ge c_n \sqrt{ s_n \log p_n / n }.$

This demonstrates the theoretical efficiency of the Bayesian procedures under the specified conditions.

Robustness to Prior Specification: The optimal contraction rates are obtained for both point mass mixture and the proposed shrinkage priors, provided the priors allocate sufficient mass to neighborhoods of the true sparse loading vectors.

Priors and High-Dimensional Properties

The paper develops properties of the analyzed continuous shrinkage priors that guarantee:

Prior Concentration: Lower bounds for the prior probability of small balls around arbitrary $s$ -sparse vectors comparable to those for point mass mixture priors.
Effective Dimensionality Control: Exponential decay of the probability that the number of "large" (> $\delta$ ) entries in the loading is much bigger than order $s_n$ .
Tail Control: Subexponential deviation bounds on the $\ell_1$ norm of the priors, ensuring concentration on "reasonable" model sizes.

These results are nontrivial as traditional global-local shrinkage mechanisms do not guarantee the necessary localized prior mass in ultra-high dimensions.

Test Construction and Proof Techniques

The authors advance the theory of posterior contraction in non-Hellinger metrics by constructing nonparametric tests for covariance matrices using techniques from random matrix concentration inequalities. The construction leverages the fact that, under the factor model, the dominant contribution to the operator norm comes from the low-rank term, enabling projection-based tests with exponentially decaying type I and II errors in high dimensions.

Proofs build on detailed metric entropy calculations, sharp bounds on prior masses, and control of test errors, encompassing random matrix theory and empirical process theory tools.

Numerical Experiments

The simulation studies compare the proposed Bayesian methods with frequentist techniques such as POET and adaptive thresholding, evaluating covariance estimation accuracy in operator norm under varying $p, n, k_{0n}, s_n$ , for both well-specified and misspecified noise structures.

The continuous shrinkage prior outperforms or matches other methods in all settings, especially as model complexity increases, and is robust to the absence of exact sparsity or diagonal noise. Point mass mixture priors deteriorate in very large models, primarily due to computational mixing issues. The empirical results thus reinforce the theoretical findings.

Implications and Future Directions

The theoretical advances provide a framework for principled Bayesian inference for covariance estimation under realistic high-dimensional settings with latent structure and sparsity. The results justify the use of continuous shrinkage priors as computational surrogates for spike-and-slab approaches, with guarantees matching minimax frequentist rates up to log-factors.

Several avenues for further research are apparent:

Extension to Approximate Factor Models: Relaxing structural assumptions to allow for non-diagonal idiosyncratic variance and weakly sparse low-rank structure.
Adaptive or Empirical Bayes Procedures: Prior hyperparameter tuning for optimal adaptation to unknown sparsity and eigenvalue growth.
Posterior Convergence for Functional Parameters: Extension to linear or quadratic functionals of high-dimensional covariance (e.g., prediction under factor models).
Sharper Minimax Adaptivity: Investigation of potential improvements to eliminate the $\sqrt{\log n}$ factor and extensions to more general sparsity regimes.

Conclusion

This work provides a rigorous analysis of posterior contraction properties for a broad class of Bayesian factor models under sparsity, for covariance estimation in "ultra"-high-dimensional settings. By establishing precise rates, matching minimax lower bounds, and demonstrating practical computational robustness, it strongly supports the use of Bayesian shrinkage methods, both theoretically and empirically, in modern high-dimensional statistical inference (1206.3627).

Markdown