HS-MoE: Bayesian Sparse Expert Routing

Updated 21 January 2026

HS-MoE models integrate horseshoe priors with input-dependent gating to enforce data-adaptive sparsity in expert utilization.
A dedicated particle learning algorithm ensures efficient sequential inference by propagating only sufficient statistics for streaming data.
Empirical evaluations demonstrate competitive predictive accuracy with minimal active experts while providing robust uncertainty quantification.

Horseshoe Mixtures-of-Experts (HS-MoE) models constitute a Bayesian approach for sparse expert selection within mixture-of-experts (MoE) architectures, integrating the adaptive global-local shrinkage properties of the horseshoe prior with input-dependent gating. The principal motivation is to achieve data-adaptive sparsity in expert utilization, thereby enabling efficient and uncertainty-aware routing across a potentially large pool of experts. A dedicated particle learning algorithm enables sequential inference with memory and computational efficiency by propagating only sufficient statistics forward in time. The HS-MoE formulation is closely related to modern sparse MoE layers used in LLMs under strict sparsity constraints, differing crucially in its Bayesian treatment of uncertainty and sparsity control (Polson et al., 14 Jan 2026).

1. Model Specification and Priors

Given an input $x \in \mathbb{R}^d$ and a response $y$, HS-MoE introduces $K$ experts, each parameterized by $\theta_j$, together with input-dependent gating functions $g_j(x; \phi)$. Writing $\Theta = (\theta_1, \ldots, \theta_K, \phi)$ for the full parameter set, the marginal predictive model is:

$p(y \mid x, \Theta) = \sum_{j=1}^K g_j(x; \phi) \, f_j(y \mid x; \theta_j)$

where $f_j$ is the likelihood of expert $j$. Introducing a latent assignment $z \in \{1, \ldots, K\}$, the generative process is:

$z \mid x, \phi \sim \mathrm{Categorical}(g_1(x), \ldots, g_K(x))$
$y \mid x, z=j, \theta_j \sim f_j(y \mid x; \theta_j)$

A canonical choice for $f_j$ is the Gaussian linear expert:

$f_j(y \mid x; \theta_j) = \mathcal{N}(y; x^\top \beta_j, \sigma_j^2), \quad \theta_j = (\beta_j, \sigma_j^2)$
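The mixture density and the two-stage generative process above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the gate probabilities are passed in as already-evaluated numbers, and all numeric values and function names (`normal_pdf`, `predictive_density`, `sample_response`) are assumptions introduced here.

```python
import math
import random

def normal_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def predictive_density(y, x, gates, betas, sigma2s):
    """p(y | x) = sum_j g_j(x) * N(y; x^T beta_j, sigma_j^2)."""
    total = 0.0
    for g, beta, s2 in zip(gates, betas, sigma2s):
        mean = sum(b * xi for b, xi in zip(beta, x))
        total += g * normal_pdf(y, mean, s2)
    return total

def sample_response(x, gates, betas, sigma2s, rng):
    """Draw z ~ Categorical(gates), then y ~ N(x^T beta_z, sigma_z^2)."""
    z = rng.choices(range(len(gates)), weights=gates, k=1)[0]
    mean = sum(b * xi for b, xi in zip(betas[z], x))
    return z, rng.gauss(mean, math.sqrt(sigma2s[z]))

# Illustrative values only (not taken from the source).
x = [1.0, 2.0]
gates = [0.7, 0.3]                  # g_j(x), assumed already evaluated at x
betas = [[1.0, 0.0], [0.0, -1.0]]
sigma2s = [1.0, 0.5]
z, y = sample_response(x, gates, betas, sigma2s, random.Random(0))
density = predictive_density(0.0, x, gates, betas, sigma2s)
```

Marginalizing the sampled $z$ recovers exactly the weighted sum computed by `predictive_density`, which is why the two views of the model are interchangeable.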

Sparsity induction, which lets only a subset of experts be meaningfully used for a given input, leverages the horseshoe prior:

$\beta_j \mid \sigma_j^2, \lambda_j, \tau \sim \mathcal{N}(0,\, \sigma_j^2 \tau^2 \lambda_j^2), \quad \lambda_j \sim \mathcal{C}^+(0,1), \quad \tau \sim \mathcal{C}^+(0, \tau_0)$

Here $\mathcal{C}^+(0,s)$ denotes a half-Cauchy distribution with scale $s$. An analogous horseshoe prior can be imposed on the gate weights $\phi_j$.
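To make the prior hierarchy concrete, the following sketch draws one set of expert coefficients from it. It assumes the covariance is isotropic, so every coordinate of $\beta_j$ shares the expert-level scale $\sigma_j \tau \lambda_j$, and it samples the half-Cauchy by inverse CDF; the helper names are hypothetical.

```python
import math
import random

def half_cauchy(scale, rng):
    """Draw from C+(0, scale) by inverse CDF: scale * tan(pi * U / 2)."""
    return scale * math.tan(0.5 * math.pi * rng.random())

def sample_horseshoe_betas(K, d, sigma2s, tau0, rng):
    """One prior draw of (tau, lambdas, betas) for K experts with d coefficients."""
    tau = half_cauchy(tau0, rng)                      # global scale
    lambdas = [half_cauchy(1.0, rng) for _ in range(K)]  # one local scale per expert
    betas = []
    for j in range(K):
        sd = math.sqrt(sigma2s[j]) * tau * lambdas[j]
        betas.append([rng.gauss(0.0, sd) for _ in range(d)])
    return tau, lambdas, betas

rng = random.Random(0)
tau, lambdas, betas = sample_horseshoe_betas(K=4, d=3, sigma2s=[1.0] * 4, tau0=0.7, rng=rng)
```

Because the half-Cauchy has heavy tails, repeated draws typically show a few experts with large $\lambda_j$ and coefficients far from zero, while the rest are shrunk toward zero.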

2. Input-Dependent Gating and Data-Adaptive Expert Sparsity

A common gate parameterization is the softmax function:

$g_j(x; \phi) = \frac{\exp(\phi_j^\top x)}{\sum_{k=1}^K \exp(\phi_k^\top x)}, \quad \text{with} \ \phi_K = 0$

The horseshoe prior on each gating vector $\phi_j$ incorporates a local scale $\lambda_{\phi,j}$ and a global scale $\tau_\phi$. The local-global structure ensures that while the global parameter $\tau_\phi$ sets the overall gate sparsity, local scales $\lambda_{\phi,j}$ allow certain experts' gates to "escape" shrinkage and become active in specific data regions. This mechanism yields data-adaptive sparsity, with most $g_j(x)$ near zero for any $x$, except for a small, dynamically determined set of active experts.
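The softmax gate with the identifiability constraint $\phi_K = 0$ can be computed as follows. This is a minimal sketch: the function name and the example inputs are assumptions, and the maximum logit is subtracted before exponentiation purely for numerical stability.

```python
import math

def softmax_gate(x, phi_free):
    """Gate probabilities g_1..g_K; phi_free holds phi_1..phi_{K-1}, phi_K is fixed to 0."""
    logits = [sum(p * xi for p, xi in zip(phi_j, x)) for phi_j in phi_free]
    logits.append(0.0)  # reference expert K has logit phi_K^T x = 0
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative values only: K = 3 experts, d = 2 inputs.
gates = softmax_gate([1.0, 0.5], [[2.0, 0.0], [-4.0, 0.0]])
```

In this example the second expert has a strongly negative logit, so its gate probability is close to zero while the remaining mass is shared between the first and the reference expert.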

3. Sequential Inference via Particle Learning

HS-MoE employs a particle learning (PL) algorithm, a sequential Monte Carlo method that propagates only sufficient statistics. For Gaussian experts, the sufficient statistics of expert $j$ at time $t$ are $(m_{j,t}, V_{j,t}, a_{j,t}, b_{j,t})$, the parameters of the Normal-inverse-gamma conjugate posterior for $(\beta_j, \sigma_j^2)$. With the gate expressed in a logistic stick-breaking parameterization, each gate stick $k$ has sufficient statistics $(\Lambda_{k,t}, h_{k,t})$, yielding $\phi_k \sim \mathcal{N}(\Lambda_{k,t}^{-1} h_{k,t}, \Lambda_{k,t}^{-1})$.
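For reference, the standard conjugate recursion for one expert's statistics after a single observation can be sketched as below. It assumes the convention $\beta_j \mid \sigma_j^2 \sim \mathcal{N}(m, \sigma_j^2 V)$ and $\sigma_j^2 \sim \mathrm{IG}(a, b)$, under which the horseshoe prior corresponds to the initial value $V_{j,0} = \tau^2 \lambda_j^2 I$. The covariance is updated with the Sherman-Morrison identity rather than a Cholesky factor, and the function name is hypothetical.

```python
def nig_update(m, V, a, b, x, y):
    """Normal-inverse-gamma update of (m, V, a, b) after observing one pair (x, y)."""
    d = len(x)
    Vx = [sum(V[r][c] * x[c] for c in range(d)) for r in range(d)]
    q = 1.0 + sum(x[r] * Vx[r] for r in range(d))    # 1 + x^T V x
    resid = y - sum(m[r] * x[r] for r in range(d))   # one-step-ahead residual
    m_new = [m[r] + Vx[r] * resid / q for r in range(d)]
    # Sherman-Morrison: (V^{-1} + x x^T)^{-1} = V - (V x)(V x)^T / q for symmetric V.
    V_new = [[V[r][c] - Vx[r] * Vx[c] / q for c in range(d)] for r in range(d)]
    return m_new, V_new, a + 0.5, b + 0.5 * resid * resid / q

# Illustrative values only: scalar input, unit prior scale.
m1, V1, a1, b1 = nig_update([0.0], [[1.0]], 1.0, 1.0, x=[1.0], y=2.0)
```

Only these four quantities need to be stored per expert and per particle; the raw observations can be discarded after each update.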

At each time step, the algorithm updates particle weights using predictive densities, resamples particles according to these weights, samples expert assignments $z_t^{(i)}$, and updates the corresponding sufficient statistics with the observation $(x_t, y_t)$. Pólya–Gamma augmentation facilitates the logistic gate updates. The horseshoe scales $(\tau, \lambda)$ can optionally be refreshed via Gibbs or slice sampling steps. This sequential inference framework suits streaming data and is memory-efficient, since it stores only particle-level sufficient statistics.
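The resample-then-propagate cycle can be sketched in a deliberately reduced setting. The code below is an illustration under strong simplifying assumptions, not the authors' algorithm: the gate probabilities are held fixed (no Pólya–Gamma gate update), the horseshoe scales are not refreshed, and the covariance is updated by Sherman-Morrison. Each particle is a list of per-expert tuples `(m, V, a, b)`; all names are hypothetical.

```python
import math
import random

def _q_and_loc(stats, x):
    """Return Vx, q = 1 + x^T V x, and the predictive location x^T m."""
    m, V, _, _ = stats
    d = len(x)
    Vx = [sum(V[r][c] * x[c] for c in range(d)) for r in range(d)]
    return Vx, 1.0 + sum(x[r] * Vx[r] for r in range(d)), sum(m[r] * x[r] for r in range(d))

def predictive_t(stats, x, y):
    """Student-t posterior predictive density of y under one expert."""
    _, q, loc = _q_and_loc(stats, x)
    a, b = stats[2], stats[3]
    nu, scale2 = 2.0 * a, b * q / a
    log_c = math.lgamma(0.5 * (nu + 1)) - math.lgamma(0.5 * nu) - 0.5 * math.log(nu * math.pi * scale2)
    return math.exp(log_c - 0.5 * (nu + 1) * math.log1p((y - loc) ** 2 / (nu * scale2)))

def nig_update(stats, x, y):
    """Conjugate update of (m, V, a, b); returns new objects, never mutates."""
    m, V, a, b = stats
    Vx, q, loc = _q_and_loc(stats, x)
    d, resid = len(x), y - loc
    m_new = [m[r] + Vx[r] * resid / q for r in range(d)]
    V_new = [[V[r][c] - Vx[r] * Vx[c] / q for c in range(d)] for r in range(d)]
    return (m_new, V_new, a + 0.5, b + 0.5 * resid * resid / q)

def pl_step(particles, gates, x, y, rng):
    """One particle learning step with fixed gate probabilities."""
    K = len(gates)
    # 1. Per-particle, per-expert terms g_j * p(y | x, stats_j); their sum is the weight.
    terms = [[gates[j] * predictive_t(p[j], x, y) for j in range(K)] for p in particles]
    weights = [sum(t) for t in terms]
    # 2. Resample particle indices in proportion to the predictive weights.
    idx = rng.choices(range(len(particles)), weights=weights, k=len(particles))
    out = []
    for i in idx:
        p = list(particles[i])  # shallow copy is enough: stats tuples are replaced, not mutated
        # 3. Sample the expert assignment from its conditional posterior.
        z = rng.choices(range(K), weights=terms[i], k=1)[0]
        # 4. Update only the selected expert's sufficient statistics.
        p[z] = nig_update(p[z], x, y)
        out.append(p)
    return out

rng = random.Random(1)
prior = ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 2.0, 1.0)
particles = [[prior for _ in range(3)] for _ in range(50)]
for _ in range(20):
    x = [1.0, rng.gauss(0.0, 1.0)]
    y = 2.0 * x[1] + rng.gauss(0.0, 0.1)
    particles = pl_step(particles, [0.5, 0.3, 0.2], x, y, rng)
```

Resampling before propagation is what distinguishes this scheme from a plain bootstrap filter: particles are selected by how well they predict the incoming observation, and only then are assignments drawn and statistics updated.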

4. Computational and Statistical Considerations

HS-MoE's global-local shrinkage allows adaptive expert selection. The global scale $\tau$ controls the overall sparsity: a smaller $\tau$ shrinks most experts toward zero, so that they effectively collapse. The local scales $\lambda_j$ modulate whether individual experts are "enabled": an expert with a large $\lambda_j$ escapes the global shrinkage and stays active, which facilitates expert sharing when several $\lambda_j$ are large.

The particle learning algorithm has per-timestep computational complexity $O(N K d^2)$ for $N$ particles, $K$ experts, and $d$-dimensional inputs, with rank-one Cholesky updates as the computational bottleneck. Unlike conventional batch MCMC, which has complexity $O(T n K d^2)$ (for $n$ data points and $T$ iterations), PL is streaming and maintains $O(N K d^2)$ memory. Particle approximations converge to the true posterior as $N \to \infty$ under classical results from interacting particle systems theory (Del Moral, Gordon et al.).
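The $d^2$ factor comes from updating a $d \times d$ factor once per expert and particle. A textbook rank-one Cholesky update, sketched below, shows the two nested loops behind that cost. The source does not state which matrix is factored; this example assumes it is a precision-type matrix $A$ that receives the term $x x^\top$, and the function name is hypothetical.

```python
import math

def chol_rank_one_update(L, x):
    """Return lower-triangular L' with L' L'^T = L L^T + x x^T, in O(d^2) operations."""
    n = len(x)
    L = [row[:] for row in L]  # work on copies so the inputs are left untouched
    x = list(x)
    for k in range(n):
        r = math.hypot(L[k][k], x[k])
        c, s = r / L[k][k], x[k] / L[k][k]
        L[k][k] = r
        for i in range(k + 1, n):
            L[i][k] = (L[i][k] + s * x[i]) / c
            x[i] = c * x[i] - s * L[i][k]
    return L

# Illustrative values only: start from the identity and add x x^T with x = (1, 1).
L_new = chol_rank_one_update([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
```

Refactoring from scratch would cost $O(d^3)$ per observation, so the rank-one form is what keeps the per-step cost quadratic in $d$.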

5. Relation to Modern Sparse MoE Layers in Neural Architectures

Sparse Transformer MoE architectures (e.g., Shazeer et al.; Switch Transformer) typically employ hard top-$k$ routing, selecting the $k$ largest logits $\phi_j^\top x$ for each token in a sequence, which results in a deterministic and fixed number of active experts per input. In contrast, HS-MoE replaces this procedure with a Bayesian router: horseshoe shrinkage on $\phi_j$ renders most logits negligible, yielding a soft top-$k$ effect. At deployment, the $k$ experts with the largest posterior mean logits can be selected, but these logits reflect model uncertainty and adapt to streaming data.
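The deployment-time selection step can be illustrated with a small routine that keeps the $k$ largest logits and renormalizes a softmax over them. This is one common top-$k$ variant, shown as a sketch; the function name is hypothetical, and the logits passed in are assumed to be posterior means computed elsewhere.

```python
import math

def top_k_route(logits, k):
    """Keep the k largest logits and renormalize a softmax over the kept experts."""
    kept = sorted(range(len(logits)), key=lambda j: logits[j], reverse=True)[:k]
    top = max(logits[j] for j in kept)
    weights = {j: math.exp(logits[j] - top) for j in kept}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

# Illustrative posterior mean logits for four experts.
routed = top_k_route([0.1, 2.0, -3.0, 1.5], k=2)
```

The returned dictionary maps each selected expert index to its routing weight; experts outside the top $k$ receive no computation at all.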

The effective $k$ in HS-MoE is data-driven, rather than fixed, and its Bayesian structure provides uncertainty quantification for expert assignment. This suggests safer expert routing in scenarios with domain shift or changing data regimes.
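The source does not spell out how the effective $k$ is measured. One plausible operationalization, offered here purely as an assumption, counts the experts whose gate probability exceeds a small threshold and averages that count over inputs. The function name and the threshold value are hypothetical.

```python
def effective_k(gate_rows, threshold=0.05):
    """Average number of experts per input whose gate probability exceeds threshold."""
    counts = [sum(1 for g in row if g > threshold) for row in gate_rows]
    return sum(counts) / len(counts)

# Illustrative gate probabilities for two inputs and three experts.
k_eff = effective_k([[0.60, 0.39, 0.01], [0.98, 0.01, 0.01]])
```

Under this definition the count varies from input to input, which is the sense in which the number of active experts is data-driven rather than fixed.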

6. Empirical Validation

Empirical evaluation is illustrated on a synthetic Gaussian-linear regression task with $K=10$ experts, of which only $s=3$ are truly active. Using $n=500$ samples, $d=5$ features, and $N=1000$ particles, the experts are assigned Gaussian linear models with Normal-inverse-gamma priors and a softmax gate under the horseshoe prior (inactive gate bias $-3.0$, global $\tau_0=0.7$).
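A data generator in the spirit of this setup might look as follows. Only $K$, $s$, $n$, $d$, and the inactive gate bias of $-3.0$ come from the description above; everything else is an assumption made for illustration, including standard normal inputs, coefficients, and active gate slopes, zero gate slopes for inactive experts, a shared noise standard deviation of 0.5, and the choice of the first $s$ experts as the active ones.

```python
import math
import random

def make_synthetic(n=500, d=5, K=10, s=3, inactive_bias=-3.0, noise_sd=0.5, seed=0):
    """Generate (x, y, z) triples from a sparse softmax-gated mixture of linear experts."""
    rng = random.Random(seed)
    # Assumption: the first s experts are active (bias 0); the rest get inactive_bias.
    biases = [0.0 if j < s else inactive_bias for j in range(K)]
    # Assumption: active experts get N(0, 1) gate slopes; inactive slopes are zero.
    phis = [[rng.gauss(0.0, 1.0) if j < s else 0.0 for _ in range(d)] for j in range(K)]
    betas = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(K)]
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        logits = [biases[j] + sum(p * xi for p, xi in zip(phis[j], x)) for j in range(K)]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]  # unnormalized softmax weights
        z = rng.choices(range(K), weights=weights, k=1)[0]
        y = sum(b * xi for b, xi in zip(betas[z], x)) + rng.gauss(0.0, noise_sd)
        data.append((x, y, z))
    return data

data = make_synthetic()
active_share = sum(1 for _, _, z in data if z < 3) / len(data)
```

With a bias of $-3.0$, each inactive expert contributes a softmax weight of about $e^{-3} \approx 0.05$ relative to a zero logit, so most observations are generated by the active experts.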

A summary of the empirical findings is presented below:

Metric	HS-MoE Value	Baseline Comparison
True positive expert identification	$\approx 98\%$	n/a
False activation rate on inactive experts	$<5\%$	n/a
Predictive log-likelihood gap	within $0.1$ nats/sample	vs. oracle (3 active experts)
Average experts used ("effective $k$")	$\approx 3.2$	Soft MoE: $k=5$ (fixed)

Estimated allocation frequencies accurately recover the three true active experts, with inactive experts’ frequencies shrunk to nearly zero. Predictive performance matches that of baselines using more experts on average, and HS-MoE yields well-calibrated routing probabilities. This suggests that HS-MoE provides both competitive predictive accuracy and a minimal effective number of experts, with the added benefit of uncertainty quantification (Polson et al., 14 Jan 2026).

References (1)

Horseshoe Mixtures-of-Experts (HS-MoE) (2026)
