High-Dimensional Bayesian Probit Regression

Updated 4 June 2026

High-Dimensional Bayesian Probit Regression is a framework for binary and categorical response modeling that leverages latent variable augmentation and sparsity-inducing priors to overcome high predictor-to-observation ratios.
It employs unified skew-normal conjugacy and tailored Gibbs sampling to achieve exact posterior characterizations and bounded mixing times even in ultra-high-dimensional settings.
Scalable alternatives such as variational Bayes and expectation propagation enable rapid convergence and accurate uncertainty quantification, making the approach practical for large-scale applications.

High-dimensional Bayesian probit regression addresses binary (and more generally categorical) regression problems where the number of predictors $p$ may be large relative to, or even exceed, the number of observations $n$ . The objective is to infer the posterior distribution of the regression coefficients $\beta\in\mathbb{R}^p$ under a probit link, typically with a Gaussian or sparsity-inducing prior, and to provide accurate uncertainty quantification and efficient computation despite high dimensionality. The model has served as a canonical testbed for scalable Bayesian computation, the study of MCMC convergence complexity, and the development of novel conjugate characterizations such as the unified skew-normal posterior.

1. Model Specification and Latent Variable Representation

High-dimensional Bayesian probit regression models data as $(x_i, y_i)$ for $i=1,...,n$ , with $y_i\in\{0,1\}$ and $x_i \in \mathbb{R}^{p}$ . The likelihood has the form:

$\mathbb{P}(y_i=1 \mid x_i, \beta) = \Phi(x_i^{\top}\beta)$

where $\Phi$ denotes the standard normal CDF and $\beta$ the regression coefficients. A Gaussian prior is often placed on $n$ 0:

$n$ 1

An alternative, particularly important for $n$ 2, is a sparsity-inducing prior such as the spike-and-slab or continuous shrinkage (e.g., horseshoe, Bayesian lasso). For computation and theoretical analysis, the latent-variable augmentation of Albert and Chib introduces $n$ 3 and $n$ 4, so that the joint likelihood is expressed as a truncated multivariate Gaussian in $n$ 5 conditioned on $n$ 6 (Qin et al., 2017, Durante, 2018, Chakraborty et al., 2016, Fasano et al., 2019, Anceschi et al., 2022, Banerjee et al., 2021).

2. Posterior Characterization and Unified Skew-Normal Conjugacy

A central advance for high-dimensional probit models is the recognition that, under a (possibly improper) Gaussian prior, the posterior of $n$ 7 given $n$ 8 is a unified skew-normal (SUN) distribution:

$n$ 9

with explicit expressions for $\beta\in\mathbb{R}^p$ 0 in terms of $\beta\in\mathbb{R}^p$ 1, $\beta\in\mathbb{R}^p$ 2, $\beta\in\mathbb{R}^p$ 3 (Durante, 2018, Anceschi et al., 2022, Fasano et al., 2020). This representation allows for exact closed-form expressions for the marginal likelihood, posterior mean and covariance, and predictive probabilities. The SUN characterization also enables exact i.i.d. sampling from the posterior, circumventing Markov chain Monte Carlo entirely in moderate $\beta\in\mathbb{R}^p$ 4 scenarios, and undergirds scalable approximate inference such as blockwise partially-factorized variational Bayes.

3. Data Augmentation Gibbs Sampler: Convergence and Complexity

The Albert–Chib two-block Gibbs sampler alternates:

Sampling $\beta\in\mathbb{R}^p$ 5 independently via one-sided truncated normals,
Sampling $\beta\in\mathbb{R}^p$ 6 via a multivariate Gaussian.

Crucially, the sampler is geometrically ergodic with a Gaussian prior for all $\beta\in\mathbb{R}^p$ 7, and the trace-class property (i.e., summable eigenvalues of the transition operator) holds under uniform spectral criteria on $\beta\in\mathbb{R}^p$ 8 and the prior precision $\beta\in\mathbb{R}^p$ 9 (Chakraborty et al., 2016). The inclusion of parameter expansion steps (PX-DA) strictly accelerates mixing in the sense of spectral radius reduction.

Explicit, non-asymptotic mixing time bounds have been established: for $(x_i, y_i)$ 0 and $(x_i, y_i)$ 1,

$(x_i, y_i)$ 2

uniformly over $(x_i, y_i)$ 3 if $(x_i, y_i)$ 4 is chosen appropriately (e.g., "g-prior", $(x_i, y_i)$ 5). The mixing times remain bounded as $(x_i, y_i)$ 6 unless an unpenalized intercept or unbalanced outcome scenario is present, in which case $(x_i, y_i)$ 7 grows linearly with $(x_i, y_i)$ 8 (Ascolani et al., 20 May 2025, Lee et al., 2024, Qin et al., 2017). These theoretical claims are corroborated by empirical coupling and total-variation calculations.

4. Variational Bayes and Expectation Propagation in High Dimensions

Standard mean-field variational Bayes (MFVB) approximates the joint posterior as $(x_i, y_i)$ 9 with tractable updates. However, MFVB is pathologically biased in high dimensions ( $i=1,...,n$ 0), systematically underestimating variances and shrinking coefficients towards zero. Predictive distributions collapse towards $i=1,...,n$ 1, and the KL divergence between MFVB and the true posterior remains bounded away from zero as $i=1,...,n$ 2 (Fasano et al., 2019).

Partially-factorized variational Bayes (PF-VB), which preserves the conditional of $i=1,...,n$ 3 exactly, admits closed-form updates for all factors and yields a variational posterior in the SUN family. Under standard random-design and prior scaling assumptions, PF-VB achieves vanishing KL divergence to the exact posterior as $i=1,...,n$ 4 (fixed $i=1,...,n$ 5). The coordinate ascent variational inference (CAVI) algorithm for PF-VB rapidly converges, usually in a single pass (Fasano et al., 2019, Anceschi et al., 2022, Fasano et al., 2022).

Expectation propagation (EP) offers another scalable deterministic alternative. By exploiting the mapping between the "tilted" site distributions in the probit model and extended skew-normal forms, EP admits efficient rank-one updates and per-iteration $i=1,...,n$ 6 computational complexity. Empirical studies have shown EP recovers posterior mean and variance as accurately as HMC (Stan) for moderate $i=1,...,n$ 7 and dramatically outpaces MCMC in runtime (Fasano et al., 2023, Aliverti, 10 Nov 2025).

5. Sparse Priors and High-dimensional Variable Selection

To address ultra-sparse, ultra-high-dimensional regimes, spike-and-slab priors (mixtures of a point mass at zero and a diffuse slab, often Gaussian or Laplace) or continuous shrinkage priors (horseshoe, Dirichlet-Laplace) are routinely imposed:

$i=1,...,n$ 8

For these priors, Gibbs and reversible-jump samplers target the extended joint posterior, alternating latent $i=1,...,n$ 9 updates, $y_i\in\{0,1\}$ 0 conditionals, and inclusion/scale parameter updates. Recent mean-field VB analogues for sparse probit regression yield coordinate ascent schedules providing posterior inclusion probabilities, support recovery, and predictive probabilities (Fasano et al., 29 Jan 2026, Banerjee et al., 2021).

Theoretical results include (i) minimax-optimal posterior contraction rates in $y_i\in\{0,1\}$ 1 and $y_i\in\{0,1\}$ 2 norm ( $y_i\in\{0,1\}$ 3), (ii) strong model selection consistency (posterior concentrates on the true support if signals exceed the threshold), and (iii) credible region validity under excessive-bias restriction (Banerjee et al., 2021).

6. Extensions: Multiparameter and Multivariate Probit, Marginal Inference

High-dimensional multivariate and multinomial probit generalize to multiple response variables per sample. Unified skew-normal conjugacy extends: the posterior on the joint coefficient vector is again SUN, enabling closed-form i.i.d. sampling, marginal likelihood computation, and Bayesian model averaging (Fasano et al., 2022, Fasano et al., 2020). For tasks such as high-dimensional ecology (species co-occurrence), two-stage marginal Laplace/variational approximations are used, which scale via parallelization (Chakraborty et al., 2021).

Approaches combining Laplace and variational approximations for marginal posteriors yield fast, accurate estimation for each regression subproblem, bypassing intractable high-dimensional integration in the full model (Chakraborty et al., 2021, Aliverti, 10 Nov 2025).

7. Computational Considerations and Practical Implementation

Efficient matrix algebra leveraging the Woodbury identity, low-rank plus diagonal structures, and blockwise or coordinate updates are essential for feasibility in $y_i\in\{0,1\}$ 4 or $y_i\in\{0,1\}$ 5 regimes. Exact SUN-based algorithms can leverage sparsity in $y_i\in\{0,1\}$ 6, Cholesky factorizations, and batch parallelism. For massive $y_i\in\{0,1\}$ 7 or $y_i\in\{0,1\}$ 8, stochastic data subsampling, mini-batch variational methods, and sparse prior/coordinate EB/EP are advocated. The observed practical runtime for PF-VB and efficient EP is orders of magnitude faster than HMC or block Gibbs, with negligible loss in precision provided the prior is correctly scaled and model regularization is suitable for the design (Fasano et al., 2019, Fasano et al., 2023, Fasano et al., 29 Jan 2026).

Empirical work confirms the theoretical predictions for MCMC and variational mixing times, convergence rates, and approximation accuracy, even in settings with $y_i\in\{0,1\}$ 9 or hierarchical, nonstandard designs.