Global–Local Shrinkage Priors in High-Dimensional Models

Updated 26 August 2025
  • Global–local shrinkage priors are continuous Bayesian priors that combine a global parameter with local scales, enabling adaptive sparse estimation.
  • They employ hierarchical representations with a shared global factor and coefficient-specific local factors, which enables efficient Gibbs sampling in high dimensions.
  • Recent advances like Dirichlet–Laplace priors improve concentration near sparse vectors and outperform traditional methods such as the Bayesian Lasso in simulations.

Global–local shrinkage priors are a class of continuous Bayesian priors designed for parameter estimation and variable selection in high-dimensional models, particularly under sparsity. These priors induce strong shrinkage of most coefficients toward zero via a global parameter, while local scaling parameters allow a minority of truly nonzero effects to escape over-shrinkage. Developed as computable, continuous alternatives to discrete spike-and-slab priors, global–local shrinkage priors include many well-known models—such as the Bayesian Lasso, horseshoe, and Dirichlet–Laplace—and have had significant impact on Bayesian regression, large-scale inference, and structured variable selection across scientific disciplines.

1. Mathematical Formulation and Hierarchical Structure

A global–local (GL) shrinkage prior represents each coefficient $\theta_j$ as a scale mixture of normals:

$$\theta_j \mid \psi_j, \tau \sim N(0, \psi_j \tau), \qquad \psi_j \sim f, \qquad \tau \sim g$$

where:

  • $\tau$ is a global scale (shared shrinkage parameter),
  • $\psi_j$ is a local scale (coefficient-specific, allowing “escape” from shrinkage).

This separates the total prior variance into a product of a global factor and coefficient-specific local factors. Well-known instances include:

  • Bayesian Lasso: $\psi_j \sim \mathrm{Exp}(\lambda)$ gives a Laplace (double-exponential) marginal on $\theta_j$.
  • Horseshoe: $\psi_j^{1/2}$ and $\tau^{1/2}$ are half–Cauchy.
  • Dirichlet–Laplace: local scales $\phi_j$ constrained to the simplex; see below.

Such hierarchical representations, especially when every level is chosen to admit conjugate Gibbs updates, are critical for scalable inference in high-dimensional settings.
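As a concrete illustration of the hierarchy above, the following minimal sketch draws coefficients from two GL hierarchies with NumPy. The scale settings (unit half–Cauchy scales for the horseshoe, $\lambda = \tau = 1$ for the Lasso-type prior) are illustrative assumptions, not values prescribed by the cited papers.

```python
# Minimal sketch (not from the cited papers): prior draws from two global-local
# hierarchies of the form theta_j | psi_j, tau ~ N(0, psi_j * tau).
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of coefficients

# Bayesian-Lasso-type hierarchy: psi_j ~ Exp(lambda), tau held fixed (assumed values).
lam, tau = 1.0, 1.0
psi = rng.exponential(scale=1.0 / lam, size=n)       # local variance components
theta_lasso = rng.normal(0.0, np.sqrt(psi * tau))    # theta_j ~ N(0, psi_j * tau)

# Horseshoe hierarchy: psi_j^(1/2) and tau^(1/2) are half-Cauchy (unit scales assumed).
local_hc = np.abs(rng.standard_cauchy(size=n))       # psi_j^(1/2)
global_hc = np.abs(rng.standard_cauchy())            # tau^(1/2)
theta_hs = rng.normal(0.0, local_hc * global_hc)     # std dev = psi_j^(1/2) * tau^(1/2)

# Both priors pull most coordinates toward zero; the heavy-tailed horseshoe
# scales occasionally produce draws far from zero.
print(np.mean(np.abs(theta_lasso) < 0.1), np.mean(np.abs(theta_hs) < 0.1))
```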

2. Theoretical Properties and Concentration Behavior

A principal theoretical question is how much mass the prior and posterior concentrate around vectors with only a few nonzero components, i.e., sparse recovery. The paper (Bhattacharya et al., 2012) shows:

  • If only global shrinkage (e.g., $\psi_j \equiv 1$) is used, even with optimally heavy-tailed priors on $\tau$, the mass the prior assigns to sparse vectors is only polynomially small (Theorem 3.2).
  • For typical choices (e.g., Bayesian Lasso: $\psi_j \sim \mathrm{Exp}(\lambda)$), even with only one nonzero $\theta_j$, the prior mass in an $L_2$-ball around the sparse vector decays exponentially in the ambient dimension (Theorem 3.3).

This reveals that traditional choices, such as Bayesian Lasso, are suboptimal for high-dimensional sparse settings: the prior fails to assign enough mass near sparse configurations, hampering posterior contraction at minimax rates.
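A back-of-the-envelope heuristic (not the proof in the paper) conveys why fixed-scale independent priors fail: the $L_2$-ball is contained in a product of intervals, so independence factors the prior mass into per-coordinate terms, and for a fixed marginal scale $b$ each zero coordinate contributes roughly $t_n/b < 1$:

$$P\bigl(\|\theta-\theta_0\|_2 < t_n\bigr) \;\le\; \prod_{j=1}^{n} P\bigl(|\theta_j-\theta_{0j}| < t_n\bigr) \;\approx\; \Bigl(\tfrac{t_n}{b}\Bigr)^{\,n-s},$$

which is exponentially small in the ambient dimension for a vector with $s \ll n$ nonzero entries. Global–local priors with adaptive local scales avoid this collapse because most local scales concentrate near zero, pushing most per-coordinate factors toward one, and the coordinates are no longer independent given the shared scales.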

3. Dirichlet–Laplace Priors: Construction and Optimality

To overcome these limitations, a new class, the Dirichlet–Laplace (DL) priors, is introduced (Bhattacharya et al., 2012). The key innovation is to “split” the global scale among coefficients by assigning a simplex-constrained vector $\phi = (\phi_1, \ldots, \phi_n) \in \mathcal{S}^{n-1}$:

$$\theta_j \mid \phi_j, \tau \sim \mathrm{DE}(\phi_j \tau), \qquad (\phi_1,\ldots,\phi_n) \sim \mathrm{Dir}(a,\ldots,a), \qquad \tau \sim g$$

where $\mathrm{DE}(b)$ denotes the double-exponential (Laplace) distribution with scale $b$, and $a$ is typically set to $1/n$ or $1/2$ for concentrated shrinkage.
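For intuition, a minimal sketch of a single draw from this hierarchy is shown below. The Gamma choice for the prior $g$ on $\tau$ is an assumed placeholder, not a value dictated by the text.

```python
# Minimal sketch (illustrative, not the authors' code): one draw from the
# Dirichlet-Laplace hierarchy theta_j | phi_j, tau ~ DE(phi_j * tau),
# phi ~ Dir(a, ..., a), with an assumed Gamma choice for the prior g on tau.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
a = 0.5  # concentration; a = 1/2 or a = 1/n are the values discussed above

phi = rng.dirichlet(np.full(n, a))             # simplex-constrained local scales
tau = rng.gamma(shape=n * a, scale=2.0)        # assumed g: Gamma(n*a, rate 1/2); placeholder choice
theta = rng.laplace(loc=0.0, scale=phi * tau)  # DE(b) draws with scale b = phi_j * tau

# Most coordinates are heavily shrunk; a few phi_j capture most of tau.
print(np.sort(np.abs(theta))[-5:])
```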

Optimality: For a sparse vector $\theta_0$ and an appropriate radius $t_n$, the DL prior places mass in a small $L_2$-ball around $\theta_0$ that is bounded below as

$$P\left( \|\theta - \theta_0\|_2 < t_n \right) \geq \exp\left\{ -C \sqrt{\log n} \right\}$$

(substantially better than the exponential decay of other priors; see Theorem 4.5). The marginal prior on each $\theta_j$ exhibits a singularity at zero for $a < 1$ (Proposition 4.6), concentrating substantial mass near zero without sacrificing heavy tails, analogous to the horseshoe.

4. Posterior Computation: Joint Updates and Sampling

Efficient inference under these priors leverages Gibbs sampling with the hierarchical normal-exponential mixture representation:

$$\theta_j \mid \psi_j,\phi_j,\tau \sim N(0, \psi_j \phi_j^2 \tau^2),\qquad \psi_j \sim \mathrm{Exp}(1/2),\qquad \phi \sim \mathrm{Dir}(a,\ldots,a)$$

and $\tau$ receives a prior $g$ (typically gamma or inverse-gamma).

A critical technical component is the joint update of the Dirichlet vector $\phi$ using normalized random measure theory:

  • For each $j$, sample $T_j$ independently from a generalized inverse Gaussian (giG) distribution, then set $\phi_j = T_j / \sum_{j'} T_{j'}$.
  • This enables efficient block updating of $\phi$ and avoids the slow mixing of coordinate-wise updates, which is especially important in high dimensions (Theorem 4.7).

The overall Gibbs sampler cycles through updates for $\theta$, $\psi$, $\phi$, and $\tau$.
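A minimal sketch of this block update is given below using SciPy's `geninvgauss`. The three-parameter giG is mapped onto SciPy's two-parameter form by rescaling, and the conditional parameters in the comment are assumptions to be checked against Bhattacharya et al. (2012) rather than a verified transcription.

```python
# Sketch of the joint phi-update described above: draw independent T_j from a
# generalized inverse Gaussian (giG) and normalize onto the simplex.
import numpy as np
from scipy.stats import geninvgauss

def sample_gig(lam, chi, psi, rng):
    """Draw from giG(lam, chi, psi), density proportional to
    x**(lam - 1) * exp(-(chi / x + psi * x) / 2), via scipy's geninvgauss."""
    b = np.sqrt(chi * psi)
    return geninvgauss.rvs(p=lam, b=b, scale=np.sqrt(chi / psi), random_state=rng)

def update_phi(theta, a, rng):
    """Block update of the simplex weights phi given the current theta."""
    # Assumed full conditional T_j ~ giG(a - 1, 2*|theta_j|, 1); verify against the paper.
    T = np.array([sample_gig(a - 1.0, 2.0 * abs(t), 1.0, rng) for t in theta])
    return T / T.sum()

rng = np.random.default_rng(2)
theta_current = rng.normal(size=10)
phi_new = update_phi(theta_current, a=0.5, rng=rng)
print(phi_new.sum())  # stays on the simplex: sums to 1
```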

5. Comparative Performance and Simulation Evidence

Finite-sample and simulation results in (Bhattacharya et al., 2012) show:

  • In settings with $n=100$ or $n=200$, DL priors yield squared errors roughly half (or less) that of the Bayesian Lasso.
  • DL priors are competitive with the horseshoe and point-mass mixtures, and may outperform the empirical Bayes median and the Lasso for certain signal strengths and sparsity levels.
  • For $n=1000$ with one or a few large signals, DL priors continue to improve over traditional shrinkage priors, especially when the hyperparameter $a$ is tuned (e.g., $a=1/2$ as a robust default).

This aligns with their theoretical concentration properties and supports their use in high-dimensional, sparse estimation.

6. Regular Variation, Default Prior Design, and Robustness

A central feature of effective global–local shrinkage priors, as elucidated in (Bhadra et al., 2015), is regular variation: the mixing distribution’s tails decay polynomially (not exponentially). This property (e.g., half–Cauchy for horseshoe priors) ensures:

  • An infinite spike at zero, which promotes strong shrinkage of noise.
  • Heavy, regularly varying tails, which avoid biasing the estimates of large signals.

Furthermore, the class of functions with regular variation is closed under many nonlinear transformations: e.g., the prior for a sum, maximum, product, or ratio of coefficients inherits regular variation from the base prior. This resolves the “noninformative prior paradox” for nonlinear functions of high-dimensional parameter vectors.
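To see the tail behavior concretely, the sketch below compares Monte Carlo tail mass under a regularly varying mixing law (half–Cauchy) and an exponentially decaying one; the unit scales are assumptions made only for illustration.

```python
# Illustrative sketch (assumed unit scales): Monte Carlo tail mass P(|theta| > t)
# under a regularly varying mixing law (half-Cauchy) versus an exponentially
# decaying one (Exp(1) on the variance, as in a Lasso-type prior).
import numpy as np

rng = np.random.default_rng(3)
m = 1_000_000

theta_hs = rng.normal(0.0, np.abs(rng.standard_cauchy(size=m)))          # half-Cauchy scale mixing
theta_la = rng.normal(0.0, np.sqrt(rng.exponential(scale=1.0, size=m)))  # exponential variance mixing

for t in (2.0, 5.0, 10.0):
    print(t, np.mean(np.abs(theta_hs) > t), np.mean(np.abs(theta_la) > t))
# The Cauchy-mixed draws retain far more mass beyond each threshold,
# which is what allows large signals to escape shrinkage.
```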

7. Extensions: Multivariate Structures, Group Priors, and Generalizations

Global–local shrinkage priors have been successfully adapted to:

  • Grouped predictors and hierarchical structures: Grouped global–local models extend the basic hierarchy by introducing group shrinkage factors, allowing for structured sparsity with overlapping or multilevel grouping (Xu et al., 2017).
  • Dynamic models: For time-series, dynamic shrinkage processes allow the local scale parameters to evolve temporally via latent autoregressive processes (Kowal et al., 2017).
  • Non-Gaussian observations: Adaptations for count data, spatial autoregression, and gamma observations require tailored prior construction (e.g., shape–scale inverse gamma mixtures (Hamura et al., 2022), global–local shrinkage for Poisson models (Hamura et al., 2019)).

In all cases, the core logic persists: a global scale strongly shrinks coefficients overall; local scales, chosen with heavy-tailed distributions or simplex constraints, enable essential flexibility and adaptivity for accurate signal recovery.
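As one hypothetical example of how a grouped extension can compose scales (a sketch of the general pattern, not the specific construction of Xu et al., 2017), each coefficient's standard deviation can be built as a product of a global, a group-level, and a coefficient-level scale:

```python
# Hypothetical grouped global-local hierarchy (illustrative pattern only, not
# the construction of Xu et al., 2017): one global scale, one half-Cauchy scale
# per group, and one half-Cauchy scale per coefficient.
import numpy as np

rng = np.random.default_rng(4)
groups = np.repeat(np.arange(10), 20)          # 10 groups of 20 coefficients each
n_groups, n = 10, groups.size

tau = np.abs(rng.standard_cauchy())            # global scale
gamma = np.abs(rng.standard_cauchy(n_groups))  # group-level scales
lam = np.abs(rng.standard_cauchy(n))           # coefficient-level local scales

# theta_j ~ N(0, (tau * gamma_g(j) * lambda_j)^2): an entire group can be shrunk
# jointly, while individual coefficients can still escape via lambda_j.
theta = rng.normal(0.0, tau * gamma[groups] * lam)
print(theta.reshape(n_groups, -1).std(axis=1))  # per-group spread of the draws
```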


Summary Table: Typical Hierarchical Forms of GL Shrinkage Priors

| Prior | Local scale prior | Global scale prior | Key features |
| --- | --- | --- | --- |
| Bayesian Lasso | Exp(λ) | Fixed or gamma | Computationally simple; suboptimal for very sparse problems |
| Horseshoe | Half–Cauchy | Half–Cauchy | Infinite spike at zero, heavy tails, regular variation; adapts to sparsity |
| Dirichlet–Laplace (DL) | Dirichlet (simplex-constrained) | Gamma | Optimal prior/posterior concentration; joint simplex updates |
| Grouped Global–Local | e.g., Cauchy / Dirichlet | Global τ plus group scales | Grouped/structured sparsity |
