
HS-MoE: Bayesian Sparse Expert Routing

Updated 21 January 2026
  • HS-MoE models integrate horseshoe priors with input-dependent gating to enforce data-adaptive sparsity in expert utilization.
  • A dedicated particle learning algorithm ensures efficient sequential inference by propagating only sufficient statistics for streaming data.
  • Empirical evaluations demonstrate competitive predictive accuracy with minimal active experts while providing robust uncertainty quantification.

Horseshoe Mixtures-of-Experts (HS-MoE) models constitute a Bayesian approach for sparse expert selection within mixture-of-experts (MoE) architectures, integrating the adaptive global-local shrinkage properties of the horseshoe prior with input-dependent gating. The principal motivation is to achieve data-adaptive sparsity in expert utilization, thereby enabling efficient and uncertainty-aware routing across a potentially large pool of experts. A dedicated particle learning algorithm enables sequential inference with memory and computational efficiency by propagating only sufficient statistics forward in time. The HS-MoE formulation is closely related to modern sparse MoE layers used in LLMs under strict sparsity constraints, differing crucially in its Bayesian treatment of uncertainty and sparsity control (Polson et al., 14 Jan 2026).

1. Model Specification and Priors

Given input $x \in \mathbb{R}^d$ and response $y$, HS-MoE introduces $K$ experts, each with parameters $\theta_j$ and an input-dependent gating function $g_j(x; \phi)$. The marginal predictive model is defined as:

$$p(y \mid x, \Theta) = \sum_{j=1}^K g_j(x; \phi) \, f_j(y \mid x; \theta_j)$$

where $f_j$ represents the expert likelihood. Introducing a latent assignment $z \in \{1, \ldots, K\}$, the generative process is:

  • $z \mid x, \phi \sim \mathrm{Categorical}(g_1(x), \ldots, g_K(x))$
  • $y \mid x, z=j, \theta_j \sim f_j(y \mid x; \theta_j)$

A canonical choice for $f_j$ is the Gaussian linear expert:

$$f_j(y \mid x; \theta_j) = \mathcal{N}(y; x^\top \beta_j, \sigma_j^2), \quad \theta_j = (\beta_j, \sigma_j^2)$$
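
As a rough illustration of the mixture density and generative process above, the following NumPy sketch evaluates $p(y \mid x, \Theta)$ and draws one response for Gaussian linear experts. The function names `moe_density` and `sample_response`, the array layout, and the choice to pass precomputed gate probabilities are assumptions made for illustration; they do not come from the source.

```python
import numpy as np

def moe_density(y, x, gates, beta, sigma2):
    """Mixture density sum_j g_j * N(y; x^T beta_j, sigma_j^2).

    gates:  (K,) gate probabilities g_j(x), assumed precomputed
    beta:   (K, d) expert coefficients
    sigma2: (K,) expert noise variances
    """
    gates, beta, sigma2 = map(np.asarray, (gates, beta, sigma2))
    means = beta @ np.asarray(x)  # x^T beta_j for every expert
    dens = np.exp(-0.5 * (y - means) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return float(gates @ dens)

def sample_response(x, gates, beta, sigma2, rng):
    """Draw z ~ Categorical(gates), then y ~ N(x^T beta_z, sigma_z^2)."""
    z = rng.choice(len(gates), p=gates)
    mean = np.asarray(beta)[z] @ np.asarray(x)
    y = rng.normal(mean, np.sqrt(np.asarray(sigma2)[z]))
    return z, y
```

With a single expert and gate probability one, `moe_density` reduces to an ordinary Gaussian density, which gives a simple sanity check.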

Sparsity induction, which enables only a subset of experts to be meaningfully used for a given input, leverages the horseshoe prior:

$$\beta_j \mid \sigma_j^2, \lambda_j, \tau \sim \mathcal{N}(0,\, \sigma_j^2 \tau^2 \lambda_j^2), \quad \lambda_j \sim \mathcal{C}^+(0,1), \quad \tau \sim \mathcal{C}^+(0, \tau_0)$$

Here $\mathcal{C}^+(0,s)$ denotes a half-Cauchy with scale $s$, and an analogous horseshoe prior can be imposed on gate weights $\phi_j$.
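
To make the global-local structure concrete, here is a minimal sketch that draws expert coefficients from the horseshoe prior. It assumes the stated variance $\sigma_j^2 \tau^2 \lambda_j^2$ applies independently to each of the $d$ coordinates of $\beta_j$; the function name `sample_horseshoe_prior` is hypothetical.

```python
import numpy as np

def sample_horseshoe_prior(K, d, sigma2, tau0, rng):
    """Draw (beta, lambda, tau) from the horseshoe prior.

    Assumption: each coordinate of beta_j is independently
    N(0, sigma_j^2 * tau^2 * lambda_j^2).
    """
    sigma2 = np.asarray(sigma2, dtype=float)
    # A half-Cauchy draw is the absolute value of a Cauchy draw.
    tau = tau0 * abs(rng.standard_cauchy())      # tau ~ C+(0, tau0)
    lam = np.abs(rng.standard_cauchy(size=K))    # lambda_j ~ C+(0, 1)
    sd = np.sqrt(sigma2) * tau * lam             # prior std per expert
    beta = rng.normal(size=(K, d)) * sd[:, None]
    return beta, lam, tau
```

The heavy Cauchy tails mean that a few $\lambda_j$ are typically much larger than the rest, so most experts' coefficients are drawn close to zero while a few are left essentially unshrunk.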

2. Input-Dependent Gating and Data-Adaptive Expert Sparsity

A common gate parameterization is the softmax function:

$$g_j(x; \phi) = \frac{\exp(\phi_j^\top x)}{\sum_{k=1}^K \exp(\phi_k^\top x)}, \quad \text{with } \phi_K = 0$$

The horseshoe prior on each gating vector $\phi_j$ incorporates a local scale $\lambda_{\phi,j}$ and a global scale $\tau_\phi$. The local-global structure ensures that while the global parameter $\tau_\phi$ sets the overall gate sparsity, local scales $\lambda_{\phi,j}$ allow certain experts' gates to "escape" shrinkage and become active in specific data regions. This mechanism yields data-adaptive sparsity, with most $g_j(x)$ near zero for any $x$, except for a small dynamically determined set of active experts.
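
The softmax gate with the identifiability constraint $\phi_K = 0$ can be sketched as follows. The function name `softmax_gate` and the convention of storing only the $K-1$ free gating vectors are illustrative assumptions.

```python
import numpy as np

def softmax_gate(x, phi_free):
    """Gate probabilities g_j(x) with the last expert as reference (phi_K = 0).

    phi_free: (K-1, d) free gating vectors; a zero row is appended for expert K.
    """
    phi_free = np.asarray(phi_free, dtype=float)
    phi = np.vstack([phi_free, np.zeros((1, phi_free.shape[1]))])
    logits = phi @ np.asarray(x, dtype=float)
    logits = logits - logits.max()   # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()
```

Subtracting the maximum logit leaves the probabilities unchanged and avoids overflow in the exponential.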

3. Sequential Inference via Particle Learning

HS-MoE employs a particle learning (PL) algorithm, a sequential Monte Carlo method that propagates only sufficient statistics. For the Gaussian expert scenario, these sufficient statistics per expert $j$ are $(m_{j,t}, V_{j,t}, a_{j,t}, b_{j,t})$, representing the Normal-inverse-gamma conjugate posterior parameters for $(\beta_j, \sigma_j^2)$. For each gate stick (in logistic stick-breaking parameterization), sufficient statistics are $(\Lambda_{k,t}, h_{k,t})$, yielding $\phi_k \sim \mathcal{N}(\Lambda_{k,t}^{-1} h_{k,t}, \Lambda_{k,t}^{-1})$.
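
For intuition, a one-observation Normal-inverse-gamma update of $(m, V, a, b)$ might look like the sketch below. It assumes the common parameterization $\beta \mid \sigma^2 \sim \mathcal{N}(m, \sigma^2 V)$, $\sigma^2 \sim \mathrm{IG}(a, b)$ with symmetric $V$, and it updates $V$ directly with the Sherman-Morrison identity rather than maintaining a Cholesky factor. The name `nig_update` is hypothetical, and the source's exact parameterization may differ.

```python
import numpy as np

def nig_update(m, V, a, b, x, y):
    """Conjugate update of Normal-inverse-gamma statistics with one (x, y) pair.

    Assumed prior: beta | sigma2 ~ N(m, sigma2 * V), sigma2 ~ IG(a, b).
    """
    Vx = V @ x
    s = 1.0 + x @ Vx                    # predictive variance factor
    resid = y - x @ m                   # one-step-ahead residual
    m_new = m + Vx * (resid / s)
    V_new = V - np.outer(Vx, Vx) / s    # Sherman-Morrison rank-one update
    a_new = a + 0.5
    b_new = b + 0.5 * resid ** 2 / s
    return m_new, V_new, a_new, b_new
```

Each call costs $O(d^2)$, which is consistent with the per-expert cost of a rank-one update.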

The algorithm iteratively updates particle weights using predictive densities, resamples according to these weights, samples expert assignments $z_t^{(i)}$, and updates the corresponding sufficient statistics with observations $(x_t, y_t)$. Pólya–Gamma augmentation facilitates logistic gate updates. The horseshoe scales $(\tau, \lambda)$ can optionally be refreshed via Gibbs or slice sampling steps. This sequential inference framework is amenable to streaming data and is memory-efficient, requiring only the storage of particle-level sufficient statistics.
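
The reweight, resample, and assignment steps described above can be sketched in isolation as follows. This is a simplified illustration, not the source's algorithm: it uses plain multinomial resampling, takes the per-particle predictive log-densities and per-expert densities as given inputs, and omits the Pólya–Gamma and horseshoe-scale updates. The function names are hypothetical.

```python
import numpy as np

def reweight_and_resample(log_pred, rng):
    """Turn per-particle predictive log-densities into weights, then resample.

    Returns resampled particle indices, normalized weights, and the
    effective sample size (ESS) of the weights.
    """
    log_pred = np.asarray(log_pred, dtype=float)
    w = np.exp(log_pred - log_pred.max())   # stabilized exponentiation
    w /= w.sum()
    ess = 1.0 / np.sum(w ** 2)
    idx = rng.choice(len(w), size=len(w), p=w)  # multinomial resampling
    return idx, w, ess

def sample_assignment(gates, expert_dens, rng):
    """Sample z with probability proportional to g_j(x_t) * f_j(y_t | x_t)."""
    post = np.asarray(gates, dtype=float) * np.asarray(expert_dens, dtype=float)
    post /= post.sum()
    return rng.choice(len(post), p=post), post
```

After resampling, each surviving particle would draw its own assignment and update only the sufficient statistics of the selected expert.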

4. Computational and Statistical Considerations

HS-MoE's global-local shrinkage effects allow adaptive expert selection. The global scale $\tau$ controls the overall sparsity: smaller $\tau$ causes most experts to collapse. The local scales $\lambda_j$ modulate whether individual experts are "enabled," facilitating expert sharing when $\lambda_j$ are large.

The particle learning algorithm offers per-timestep computational complexity of $O(N K d^2)$ for $N$ particles, $K$ experts, and $d$-dimensional inputs, with rank-one Cholesky updates as the computational bottleneck. Unlike conventional batch MCMC, which has complexity $O(T n K d^2)$ (for $n$ data points and $T$ iterations), PL is streaming and maintains $O(N K d^2)$ memory. Particle approximations converge to the true posterior as $N \to \infty$ under classical results from interacting particle systems theory (Del Moral, Gordon et al.).

5. Relation to Modern Sparse MoE Layers in Neural Architectures

Sparse Transformer-MoE architectures (e.g., Shazeer et al., Switch-Transformer) typically employ hard top-$k$ routing by selecting the largest $\phi_j^\top x$ for each token in a sequence, resulting in a deterministic and fixed number of active experts per input. In contrast, HS-MoE replaces this procedure with a Bayesian router: horseshoe shrinkage on $\phi_j$ renders most logits negligible, yielding a soft top-$k$ effect. At deployment, $k$ experts with the largest posterior mean logits can be selected, but these logits reflect model uncertainty and adapt to streaming data.
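
A deployment-time selection of the $k$ experts with the largest posterior mean logits could be sketched as below. It assumes the posterior over gating vectors is represented by $N$ particle draws stacked in an array of shape $(N, K, d)$; the function name `route_top_k` and the use of the per-expert standard deviation as an uncertainty summary are illustrative choices, not taken from the source.

```python
import numpy as np

def route_top_k(x, phi_particles, k):
    """Select the k experts with the largest posterior mean logits.

    phi_particles: (N, K, d) particle draws of the gating vectors (assumed layout).
    Returns selected expert indices plus the mean and std of each logit.
    """
    logits = np.asarray(phi_particles, dtype=float) @ np.asarray(x, dtype=float)  # (N, K)
    mean = logits.mean(axis=0)
    std = logits.std(axis=0)        # spread across particles
    top = np.argsort(-mean)[:k]     # indices sorted by descending mean logit
    return top, mean, std
```

Unlike a deterministic router, the spread across particles indicates how confident the model is in each logit, which could be used to flag inputs where the routing decision is uncertain.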

The effective $k$ in HS-MoE is data-driven, rather than fixed, and its Bayesian structure provides uncertainty quantification for expert assignment. This suggests safer expert routing in scenarios with domain shift or changing data regimes.

6. Empirical Validation

Empirical evaluation is illustrated on a synthetic Gaussian-linear regression task with $K=10$ experts, of which only $s=3$ are truly active. Using $n=500$ samples, $d=5$ features, and $N=1000$ particles, the experts are assigned Gaussian linear models with Normal-inverse-gamma priors and a softmax gate under the horseshoe prior (inactive gate bias $-3.0$, global $\tau_0=0.7$).

A summary of the empirical findings is presented below:

| Metric | HS-MoE Value | Baseline Comparison |
|---|---|---|
| True positive expert identification | $\approx 98\%$ | n/a |
| False activation rate on inactive experts | $<5\%$ | n/a |
| Predictive log-likelihood gap | within $0.1$ nats/sample | vs. oracle (oracle = 3 active) |
| Average experts used ("effective $k$") | $\approx 3.2$ | Soft MoE: $k=5$ (fixed) |

Estimated allocation frequencies accurately recover the three true active experts, with inactive experts’ frequencies shrunk to nearly zero. Predictive performance matches that of baselines using more experts on average, and HS-MoE yields well-calibrated routing probabilities. This suggests that HS-MoE provides both competitive predictive accuracy and a minimal effective number of experts, with the added benefit of uncertainty quantification (Polson et al., 14 Jan 2026).
