Global–Local Shrinkage Priors in High-Dimensional Models
- Global–local shrinkage priors are continuous Bayesian priors that combine a global parameter with local scales, enabling adaptive sparse estimation.
- They employ hierarchical representations with a shared global factor and coefficient-specific local factors, enabling efficient Gibbs sampling in high dimensions.
- Recent advances like Dirichlet–Laplace priors improve concentration near sparse vectors and outperform traditional methods such as the Bayesian Lasso in simulations.
Global–local shrinkage priors are a class of continuous Bayesian priors designed for parameter estimation and variable selection in high-dimensional models, particularly under sparsity. These priors induce strong shrinkage of most coefficients toward zero via a global parameter, while local scaling parameters allow a minority of truly nonzero effects to escape over-shrinkage. Developed as computable, continuous alternatives to discrete spike-and-slab priors, global–local shrinkage priors include many well-known models—such as the Bayesian Lasso, horseshoe, and Dirichlet–Laplace—and have had significant impact on Bayesian regression, large-scale inference, and structured variable selection across scientific disciplines.
1. Mathematical Formulation and Hierarchical Structure
A global–local (GL) shrinkage prior represents each coefficient $\theta_j$ as a scale mixture of normals:
$$\theta_j \mid \psi_j, \tau \sim \mathrm{N}(0, \psi_j \tau), \qquad \psi_j \sim f, \qquad \tau \sim g,$$
where:
- $\tau$ is a global scale (shared shrinkage parameter),
- $\psi_j$ is a local scale (coefficient-specific, allowing “escape” from shrinkage).
This separates the total prior variance into a product of a global factor and coefficient-specific local factors. Well-known instances include:
- Bayesian Lasso: $\psi_j \sim \mathrm{Exp}(\lambda^2/2)$ gives a Laplace (double-exponential) marginal on $\theta_j$.
- Horseshoe: $\psi_j^{1/2} \sim \mathrm{C}^{+}(0,1)$, a half–Cauchy local scale, with the global scale typically half–Cauchy as well.
- Dirichlet–Laplace: local scales constrained to the simplex; see below.
Such hierarchical representations, especially when all pieces are chosen to admit Gibbs updating, are critical for scalable inference in high-dimensional scenarios.
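For intuition, the following minimal Python sketch draws coefficients from the Bayesian Lasso and horseshoe hierarchies described above, by sampling local and global scales and then conditionally normal coefficients. The parameter values and distributional defaults are illustrative assumptions of this sketch, not settings taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000      # number of coefficients (illustrative)
lam = 1.0     # Bayesian Lasso rate parameter (assumed value)
tau = 1.0     # fixed global scale for the Lasso example (assumed value)

# Bayesian Lasso: psi_j ~ Exp(lam^2/2) on the variance, theta_j | psi_j ~ N(0, psi_j * tau)
psi = rng.exponential(scale=2.0 / lam**2, size=n)   # Exp with rate lam^2/2 has mean 2/lam^2
theta_lasso = rng.normal(0.0, np.sqrt(psi * tau))

# Horseshoe: local scales lambda_j ~ C+(0,1) and global tau ~ C+(0,1),
# with theta_j | lambda_j, tau ~ N(0, lambda_j^2 * tau^2)
lam_local = np.abs(rng.standard_cauchy(size=n))     # half-Cauchy local scales
tau_hs = np.abs(rng.standard_cauchy())              # half-Cauchy global scale
theta_hs = rng.normal(0.0, lam_local * tau_hs)

# Horseshoe draws typically show many near-zero values plus a few very large ones,
# reflecting the spike-at-zero / heavy-tail behavior discussed in later sections.
print(np.median(np.abs(theta_lasso)), np.median(np.abs(theta_hs)))
```

The global draw (or prior on $\tau$) controls the overall level of shrinkage, while the coefficient-specific local draws determine which coordinates escape it.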
2. Theoretical Properties and Concentration Behavior
A principal theoretical concern is the prior’s and posterior’s concentration around vectors with only a few nonzero components—i.e., sparse recovery. The paper (Bhattacharya et al., 2012) shows:
- If only global shrinkage is used (e.g., $\theta_j \sim \mathrm{N}(0, \tau)$ with no local scales), even with optimally heavy-tailed priors on $\tau$, the mass the prior assigns to sparse vectors is only polynomially small (Theorem 3.2).
- For typical local-scale choices (e.g., the Bayesian Lasso with $\psi_j \sim \mathrm{Exp}(\lambda^2/2)$), even when the true vector has a single nonzero entry, the prior mass in an $\epsilon$-ball around that sparse vector decays exponentially in the ambient dimension $n$ (Theorem 3.3).
This reveals that traditional choices, such as Bayesian Lasso, are suboptimal for high-dimensional sparse settings: the prior fails to assign enough mass near sparse configurations, hampering posterior contraction at minimax rates.
3. Dirichlet–Laplace Priors: Construction and Optimality
To overcome these limitations, a new class—the Dirichlet–Laplace (DL) priors—is introduced (Bhattacharya et al., 2012). The key innovation is to “split” the global scale among coefficients by assigning a simplex-constrained vector $\phi = (\phi_1, \ldots, \phi_n)$:
$$\theta_j \mid \phi, \tau \sim \mathrm{DE}(\phi_j \tau), \qquad \phi \sim \mathrm{Dir}(a, \ldots, a),$$
where $\mathrm{DE}(b)$ is a double-exponential (Laplace) distribution with scale $b$, and $a$ is typically set to $1/n$ or $1/2$ for concentrated shrinkage.
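A rough simulation sketch of this construction is shown below. The dimension, the choice $a = 1/n$, and the gamma prior placed on $\tau$ are illustrative assumptions of the sketch rather than prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500        # ambient dimension (illustrative)
a = 1.0 / n    # Dirichlet concentration; a = 1/n as mentioned in the text

# Simplex-constrained local scales and a gamma global scale (assumed hyperparameters)
phi = rng.dirichlet(np.full(n, a))
phi = np.clip(phi, 1e-300, None)            # guard against numerical underflow when a is tiny
tau = rng.gamma(shape=n * a, scale=2.0)     # gamma prior on the global scale (assumed choice)

# theta_j | phi, tau ~ DE(phi_j * tau): double-exponential (Laplace) with scale phi_j * tau
theta = rng.laplace(loc=0.0, scale=phi * tau)

# With small a, most phi_j are tiny, so most theta_j are shrunk hard toward zero,
# while the few coordinates carrying most of the Dirichlet mass retain large scales.
print(np.sort(np.abs(theta))[-5:])
```

Because the $\phi_j$ must sum to one, giving one coordinate a large share necessarily shrinks the others, which is the mechanism behind “splitting” the global scale.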
Optimality: The DL prior achieves a lower bound on the prior mass within a small ball around a sparse vector of the form
$$\Pi\bigl(\|\theta - \theta_0\|_2 < t_n\bigr) \gtrsim \exp\bigl\{-c\sqrt{n \log n}\bigr\}$$
(substantially better than the $\exp\{-cn\}$-type exponential decay of other priors; see Theorem 4.5). The marginal prior on each $\theta_j$ exhibits a singularity at zero for $a < 1$ (Proposition 4.6), concentrating substantial mass near zero without sacrificing heavy tails, analogous to the horseshoe.
4. Posterior Computation: Joint Updates and Sampling
Efficient inference under these priors leverages Gibbs sampling with the hierarchical normal–exponential mixture representation
$$\theta_j \mid \psi_j, \phi_j, \tau \sim \mathrm{N}(0, \psi_j \phi_j^2 \tau^2), \qquad \psi_j \sim \mathrm{Exp}(1/2), \qquad \phi \sim \mathrm{Dir}(a, \ldots, a),$$
while $\tau$ receives a prior of its own (typically gamma or inverse-gamma).
A critical technical component is the joint update of the Dirichlet vector $\phi$ using normalized random measure theory:
- For each $j$, sample $T_j$ independently from a generalized inverse Gaussian (giG) distribution, then set $\phi_j = T_j / \sum_{j'} T_{j'}$.
- This enables efficient block updating of $\phi$ and avoids the slow mixing of coordinate-wise updates, which is especially important in high dimensions (Theorem 4.7).
The overall Gibbs sampler cycles through updates for $\theta$, $\psi$, $\phi$, and $\tau$.
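The sketch below implements the normalize-a-giG-draw step with `scipy.stats.geninvgauss`. The specific giG parameters used here ($p = a - 1$, rate $1$, and $2|\theta_j|$ in the reciprocal term) reflect one standard statement of this conditional and should be treated as an assumption of the sketch; consult the source for the exact full conditionals.

```python
import numpy as np
from scipy.stats import geninvgauss

def sample_phi_block(theta, a, rng):
    """Joint (block) update of the simplex-constrained local scales phi.

    Draws independent generalized inverse Gaussian variables and normalizes them,
    i.e. the 'sample T_j from a giG, then set phi_j = T_j / sum_j T_j' step.
    The giG parameters (p = a - 1, chi = 2|theta_j|, rho = 1) are an assumption
    of this sketch, not a verbatim transcription of the paper's conditionals.
    """
    chi = 2.0 * np.abs(theta) + 1e-12       # small jitter guards against chi = 0
    rho = 1.0
    p = a - 1.0
    # GIG(p, rho, chi) with density ∝ x^{p-1} exp(-(rho*x + chi/x)/2) maps to
    # scipy.stats.geninvgauss(p, b) via b = sqrt(rho*chi), scale = sqrt(chi/rho).
    b = np.sqrt(rho * chi)
    scale = np.sqrt(chi / rho)
    T = geninvgauss.rvs(p, b, scale=scale, random_state=rng)
    return T / T.sum()

rng = np.random.default_rng(2)
theta = rng.laplace(size=50)                # placeholder coefficient draws (illustrative)
phi = sample_phi_block(theta, a=0.5, rng=rng)
print(phi.sum(), phi.min(), phi.max())      # phi lies on the probability simplex
```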
5. Comparative Performance and Simulation Evidence
Finite-sample and simulation results in (Bhattacharya et al., 2012) show:
- In settings with $n = 100$ or $n = 200$, DL priors yield squared errors roughly half (or less) that of the Bayesian Lasso.
- DL priors are competitive with the horseshoe and point mass mixtures, and may outperform the empirical Bayes median and Lasso for certain signal strengths and sparsity levels.
- In higher-dimensional settings with one or a few large signals, DL priors continue to improve over traditional shrinkage priors, especially when the hyperparameter $a$ is tuned (e.g., $a = 1/2$ as a robust default).
This aligns with their theoretical concentration properties and supports their use in high-dimensional, sparse estimation.
6. Regular Variation, Default Prior Design, and Robustness
A central feature of effective global–local shrinkage priors, as elucidated in (Bhadra et al., 2015), is regular variation: the mixing distribution’s tails decay polynomially (not exponentially). This property (e.g., half–Cauchy for horseshoe priors) ensures:
- An infinite spike at zero, promoting strong shrinkage of noise.
- Heavy, regularly varying tails, avoiding bias in the estimation of large signals.
Furthermore, the class of functions with regular variation is closed under many nonlinear transformations: e.g., the prior for a sum, maximum, product, or ratio of coefficients inherits regular variation from the base prior. This resolves the “noninformative prior paradox” for nonlinear functions of high-dimensional parameter vectors.
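As a small, purely illustrative numerical check (not taken from the cited papers), the snippet below contrasts the polynomially decaying tail of a half-Cauchy mixing distribution with the exponentially decaying tail of an exponential one, and evaluates the ratio $S(2t)/S(t)$ that regular variation keeps bounded away from zero.

```python
import numpy as np
from scipy.stats import halfcauchy, expon

# Survival probabilities P(scale > t) at increasing thresholds
ts = np.array([1.0, 10.0, 100.0])
print("half-Cauchy tails :", halfcauchy.sf(ts))   # ~ 2/(pi*t): polynomial decay
print("exponential tails :", expon.sf(ts))        # = exp(-t): exponential decay

# Regular variation: S(2t)/S(t) tends to a positive constant (about 1/2 here)
# for the half-Cauchy, but vanishes rapidly for the exponential.
print("half-Cauchy S(2t)/S(t):", halfcauchy.sf(2 * ts) / halfcauchy.sf(ts))
print("exponential S(2t)/S(t):", expon.sf(2 * ts) / expon.sf(ts))
```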
7. Extensions: Multivariate Structures, Group Priors, and Generalizations
Global–local shrinkage priors have been successfully adapted to:
- Grouped predictors and hierarchical structures: Grouped global–local models extend the basic hierarchy by introducing group shrinkage factors, allowing for structured sparsity with overlapping or multilevel grouping (Xu et al., 2017).
- Dynamic models: For time-series, dynamic shrinkage processes allow the local scale parameters to evolve temporally via latent autoregressive processes (Kowal et al., 2017).
- Non-Gaussian observations: Adaptations for count data, spatial autoregression, and gamma observations require tailored prior construction (e.g., shape–scale inverse gamma mixtures (Hamura et al., 2022), global–local shrinkage for Poisson models (Hamura et al., 2019)).
In all cases, the core logic persists: a global scale strongly shrinks coefficients overall; local scales, chosen with heavy-tailed distributions or simplex constraints, enable essential flexibility and adaptivity for accurate signal recovery.
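To make the grouped extension concrete, here is a hypothetical sketch (not the specific model of Xu et al., 2017) that inserts a group-level scale between the global and local levels; all distributional choices are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
groups = np.repeat(np.arange(5), 20)   # 5 groups of 20 coefficients (illustrative layout)
n = groups.size

tau = np.abs(rng.standard_cauchy())               # global scale (half-Cauchy, assumed)
gamma_g = np.abs(rng.standard_cauchy(size=5))     # one group-level scale per group (assumed)
lam = np.abs(rng.standard_cauchy(size=n))         # coefficient-specific local scales (assumed)

# theta_j | scales ~ N(0, (tau * gamma_{g(j)} * lambda_j)^2):
# the group factor shrinks entire blocks of coefficients at once, while the
# local factor lets individual coefficients inside an active group escape shrinkage.
theta = rng.normal(0.0, tau * gamma_g[groups] * lam)
print(theta.reshape(5, 20).std(axis=1))           # per-group spread tracks gamma_g
```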
Summary Table: Typical Hierarchical Forms of GL Shrinkage Priors
| Prior | Local Scale Prior | Global Scale Prior | Key Features |
|---|---|---|---|
| Bayesian Lasso | Exponential, $\psi_j \sim \mathrm{Exp}(\lambda^2/2)$ | Fixed or gamma | Computationally simple; suboptimal for very sparse problems |
| Horseshoe | Half–Cauchy | Half–Cauchy | Infinite spike at zero, heavy regularly varying tails, adapts to sparsity |
| Dirichlet–Laplace (DL) | Dirichlet (simplex-constrained) | Gamma | Optimal prior/posterior concentration, joint block updates, simplex structure |
| Grouped Global–Local | Heavy-tailed (e.g., Cauchy) or Dirichlet | Global scale plus group-level scales | Allows for grouped/structured sparsity |
References
- For foundational formulation, theory, and Dirichlet–Laplace priors: (Bhattacharya et al., 2012)
- For regular variation, default priors, and nonlinear functionals: (Bhadra et al., 2015)
- For grouped models and multilevel shrinkage: (Xu et al., 2017)
- For dynamic shrinkage in time series: (Kowal et al., 2017)
- For multivariate regression with posterior consistency: (Bai et al., 2017)
- For variable selection consistency and penalized credible regions: (Zhang et al., 2016)
- For practical posterior computation formulations and simulation evidence: (Bhattacharya et al., 2012, Pfarrhofer et al., 2018, Womack et al., 2019)