HS-MoE: Bayesian Sparse Expert Routing
- HS-MoE models integrate horseshoe priors with input-dependent gating to enforce data-adaptive sparsity in expert utilization.
- A dedicated particle learning algorithm ensures efficient sequential inference by propagating only sufficient statistics for streaming data.
- Empirical evaluations demonstrate competitive predictive accuracy with minimal active experts while providing robust uncertainty quantification.
Horseshoe Mixtures-of-Experts (HS-MoE) models constitute a Bayesian approach for sparse expert selection within mixture-of-experts (MoE) architectures, integrating the adaptive global-local shrinkage properties of the horseshoe prior with input-dependent gating. The principal motivation is to achieve data-adaptive sparsity in expert utilization, thereby enabling efficient and uncertainty-aware routing across a potentially large pool of experts. A dedicated particle learning algorithm enables sequential inference with memory and computational efficiency by propagating only sufficient statistics forward in time. The HS-MoE formulation is closely related to modern sparse MoE layers used in LLMs under strict sparsity constraints, differing crucially in its Bayesian treatment of uncertainty and sparsity control (Polson et al., 14 Jan 2026).
1. Model Specification and Priors
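Given input $x \in \mathbb{R}^d$ and response $y$, HS-MoE introduces $K$ experts, each parameterized by $\theta_k$ and an input-dependent gating function $g_k(x)$. The marginal predictive model is defined as:
$$p(y \mid x) = \sum_{k=1}^{K} g_k(x)\, f_k(y \mid x, \theta_k),$$
where $f_k(y \mid x, \theta_k)$ represents the expert likelihood. Introducing a latent assignment $z \in \{1, \dots, K\}$, the generative process is:
$$z \mid x \sim \mathrm{Categorical}\big(g_1(x), \dots, g_K(x)\big), \qquad y \mid z = k, x \sim f_k(y \mid x, \theta_k).$$
A canonical choice for $f_k$ is the Gaussian linear expert:
$$f_k(y \mid x, \theta_k) = \mathcal{N}\big(y \mid x^\top \beta_k, \sigma_k^2\big), \qquad \theta_k = (\beta_k, \sigma_k^2).$$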
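The mixture density and its latent-assignment form can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function names `moe_density` and `sample_moe` are hypothetical, the gate probabilities are passed in as a plain vector, and each expert is assumed to be a Gaussian linear model with coefficients `betas[k]` and noise variance `sigma2[k]`.

```python
import numpy as np

def gaussian_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y (vectorized over experts)."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def moe_density(y, x, gate_probs, betas, sigma2):
    """Marginal density p(y | x) = sum_k g_k * N(y | x^T beta_k, sigma2_k).

    gate_probs: (K,) gate probabilities for this x (assumed to sum to 1)
    betas:      (K, d) expert regression coefficients
    sigma2:     (K,) expert noise variances
    """
    means = betas @ x                                  # (K,) expert means
    return float(np.sum(gate_probs * gaussian_pdf(y, means, sigma2)))

def sample_moe(x, gate_probs, betas, sigma2, rng):
    """Generative process: draw the assignment z, then y from expert z."""
    z = int(rng.choice(len(gate_probs), p=gate_probs))
    y = float(rng.normal(betas[z] @ x, np.sqrt(sigma2[z])))
    return z, y
```

Summing over experts in `moe_density` is the same as marginalizing the latent assignment drawn in `sample_moe`, which is why the two forms describe one model.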
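Sparsity induction, which enables only a subset of experts to be meaningfully used for a given input, leverages the horseshoe prior on the expert coefficients $\beta_{kj}$:
$$\beta_{kj} \mid \lambda_{kj}, \tau \sim \mathcal{N}\big(0, \lambda_{kj}^2 \tau^2\big), \qquad \lambda_{kj} \sim C^+(0, 1), \qquad \tau \sim C^+(0, \tau_0),$$
where $C^+(0, s)$ denotes a half-Cauchy with scale $s$, and an analogous horseshoe prior can be imposed on gate weights $\gamma_k$.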
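To make the global-local structure concrete, the sketch below draws coefficients from a standard horseshoe prior. It assumes the textbook form (half-Cauchy local scales with unit scale, a half-Cauchy global scale with scale `tau0`), which may differ in detail from the paper's parameterization. The names `sample_horseshoe` and `shrinkage_factor` are hypothetical. A half-Cauchy draw is obtained as the absolute value of a Cauchy draw.

```python
import numpy as np

def sample_horseshoe(shape, tau0, rng):
    """Draw coefficients from a horseshoe prior (assumed standard form).

    lam ~ C+(0, 1)        local scales, one per coefficient
    tau ~ C+(0, tau0)     single global scale
    beta | lam, tau ~ N(0, lam^2 * tau^2)
    """
    tau = abs(tau0 * rng.standard_cauchy())
    lam = np.abs(rng.standard_cauchy(size=shape))
    beta = rng.normal(size=shape) * lam * tau
    return beta, lam, tau

def shrinkage_factor(lam, tau):
    """kappa = 1 / (1 + tau^2 lam^2), the shrinkage weight in the
    unit-variance normal-means setting: near 1 means shrunk to zero,
    near 0 means the coefficient escapes shrinkage."""
    return 1.0 / (1.0 + (tau * lam) ** 2)
```

Because the Cauchy tails are heavy, a few local scales are typically very large while most stay small, which is what lets individual coefficients avoid the global shrinkage.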
2. Input-Dependent Gating and Data-Adaptive Expert Sparsity
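A common gate parameterization is the softmax function:
$$g_k(x) = \frac{\exp(\gamma_k^\top x)}{\sum_{j=1}^{K} \exp(\gamma_j^\top x)},$$
where $\gamma_k$ is the gating vector of expert $k$.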
The horseshoe prior on each gating vector $\gamma_k$ incorporates a local scale $\lambda_k$ and a global scale $\tau$. The local-global structure ensures that while the global parameter $\tau$ sets the overall gate sparsity, the local scales $\lambda_k$ allow certain experts' gates to "escape" shrinkage and become active in specific data regions. This mechanism yields data-adaptive sparsity, with most $g_k(x)$ near zero for any $x$, except for a small, dynamically determined set of active experts.
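The following sketch shows how a softmax gate with mostly shrunk gating vectors produces an input-dependent number of active experts. The helper names `softmax_gate` and `effective_experts`, and the probability threshold of 0.01, are illustrative assumptions and not quantities from the source.

```python
import numpy as np

def softmax_gate(x, gate_weights):
    """Gate probabilities g_k(x) from a softmax over logits gamma_k^T x.

    gate_weights: (K, d) array, one gating vector per expert.
    """
    logits = gate_weights @ x
    logits = logits - logits.max()        # stabilize the exponentials
    e = np.exp(logits)
    return e / e.sum()

def effective_experts(gate_probs, threshold=0.01):
    """Count experts whose gate probability exceeds a (hypothetical) threshold."""
    return int(np.sum(gate_probs > threshold))

# Five experts; only the first two gating vectors escaped shrinkage.
gate_weights = np.array([[4.0, 0.0],
                         [0.0, 4.0],
                         [0.0, 0.0],
                         [0.0, 0.0],
                         [0.0, 0.0]])

n_active_a = effective_experts(softmax_gate(np.array([2.0, 0.0]), gate_weights))
n_active_b = effective_experts(softmax_gate(np.array([1.0, 1.0]), gate_weights))
```

Here the first input activates one expert and the second activates two, so the count varies with the input. A fully shrunk gating vector gives a logit of zero rather than negative infinity, so inactive experts keep a small nonzero probability, which is why the count uses a threshold.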
3. Sequential Inference via Particle Learning
HS-MoE employs a particle learning (PL) algorithm, a sequential Monte Carlo method that propagates only sufficient statistics. For the Gaussian expert scenario, the sufficient statistics per expert are the Normal-inverse-gamma conjugate posterior parameters for $(\beta_k, \sigma_k^2)$. For each gate stick (in the logistic stick-breaking parameterization), the sufficient statistics determine the conditional posterior of that stick's gate weights.
The algorithm iteratively updates particle weights using predictive densities, resamples according to these weights, samples expert assignments $z_t$, and updates the corresponding sufficient statistics with the observations $(x_t, y_t)$. Pólya–Gamma augmentation facilitates logistic gate updates. The horseshoe scales can optionally be refreshed via Gibbs or slice sampling steps. This sequential inference framework is amenable to streaming data and is memory-efficient, requiring only the storage of particle-level sufficient statistics.
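The resample-then-propagate cycle described above can be sketched for the Gaussian expert case. This is a deliberately simplified illustration, not the paper's algorithm: the gate probabilities are passed in and held fixed (no Pólya–Gamma gate update and no horseshoe scale refresh), and each particle stores Normal-inverse-gamma statistics under hypothetical names (`Lam` for the precision factor, `eta` for precision times mean, `a` and `b` for the inverse-gamma parameters).

```python
import numpy as np
from math import lgamma

def student_t_logpdf(y, loc, scale2, dof):
    """Log density of a Student-t with given location, squared scale, and dof."""
    r2 = (y - loc) ** 2 / scale2
    return (lgamma((dof + 1) / 2) - lgamma(dof / 2)
            - 0.5 * np.log(dof * np.pi * scale2)
            - 0.5 * (dof + 1) * np.log1p(r2 / dof))

def init_particle(K, d, a0=2.0, b0=1.0):
    """One particle: NIG sufficient statistics for each of K experts."""
    return {"Lam": np.stack([np.eye(d)] * K), "eta": np.zeros((K, d)),
            "a": np.full(K, a0), "b": np.full(K, b0)}

def expert_log_pred(p, k, x, y):
    """Log posterior-predictive density of y under expert k (Student-t)."""
    V = np.linalg.inv(p["Lam"][k])
    m = V @ p["eta"][k]
    scale2 = p["b"][k] / p["a"][k] * (1.0 + x @ V @ x)
    return student_t_logpdf(y, x @ m, scale2, 2.0 * p["a"][k])

def nig_update(p, k, x, y):
    """Conjugate NIG update of expert k with one observation (x, y)."""
    Lam_old, eta_old = p["Lam"][k].copy(), p["eta"][k].copy()
    m_old = np.linalg.solve(Lam_old, eta_old)
    p["Lam"][k] = Lam_old + np.outer(x, x)
    p["eta"][k] = eta_old + x * y
    m_new = np.linalg.solve(p["Lam"][k], p["eta"][k])
    p["a"][k] += 0.5
    p["b"][k] += 0.5 * (y * y + m_old @ eta_old - m_new @ p["eta"][k])

def pl_step(particles, x, y, gate_probs, rng):
    """One PL step: weight, resample, sample assignment, update statistics."""
    K, N = len(gate_probs), len(particles)
    logp = np.array([[np.log(gate_probs[k]) + expert_log_pred(p, k, x, y)
                      for k in range(K)] for p in particles])      # (N, K)
    joint = np.exp(logp - logp.max())
    w = joint.sum(axis=1)                 # particle predictive densities
    idx = rng.choice(N, size=N, p=w / w.sum())
    out = []
    for i in idx:
        p = {key: val.copy() for key, val in particles[i].items()}
        z = rng.choice(K, p=joint[i] / joint[i].sum())  # posterior assignment
        nig_update(p, z, x, y)
        out.append(p)
    return out
```

Each step touches only the statistics of the sampled expert, so per-particle storage stays fixed regardless of how many observations have been processed. The explicit matrix inverse keeps the sketch short; a practical implementation would maintain a Cholesky factor instead.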
4. Computational and Statistical Considerations
HS-MoE’s global-local shrinkage effects allow adaptive expert selection. The global scale $\tau$ controls the overall sparsity: a smaller $\tau$ causes most experts to collapse. The local scales $\lambda_k$ modulate whether individual experts are "enabled," facilitating expert sharing when the $\lambda_k$ are large.
The particle learning algorithm has a per-timestep computational cost that scales with the number of particles $N$, the number of experts $K$, and the input dimension $d$, with rank-one Cholesky updates (each costing $O(d^2)$) as the computational bottleneck. Unlike conventional batch MCMC, whose cost grows with both the number of data points and the number of iterations, PL is streaming and maintains memory that does not grow with the number of observations. Particle approximations converge to the true posterior as $N \to \infty$ under classical results from interacting particle systems theory (Del Moral, Gordon et al.).
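For reference, a rank-one Cholesky update can be written as below. This is the standard textbook routine, shown as a sketch under the assumption that each particle keeps a lower-triangular factor `L` of an expert's precision matrix; the name `chol_update` is hypothetical. It refreshes the factor in $O(d^2)$ operations after adding $x x^\top$, instead of refactorizing in $O(d^3)$.

```python
import numpy as np

def chol_update(L, x):
    """Return L' with L' L'^T = L L^T + x x^T, for lower-triangular L.

    Works on copies, so the inputs are left unchanged.
    """
    L = np.array(L, dtype=float)
    x = np.array(x, dtype=float)
    n = x.size
    for k in range(n):
        r = np.hypot(L[k, k], x[k])       # new diagonal entry
        c = r / L[k, k]
        s = x[k] / L[k, k]
        L[k, k] = r
        # Update the rest of column k, then rotate the remainder of x.
        L[k + 1:, k] = (L[k + 1:, k] + s * x[k + 1:]) / c
        x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L
```

A quick way to check it is to compare `chol_update(L, x) @ chol_update(L, x).T` against `L @ L.T + np.outer(x, x)` for a random positive-definite matrix.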
5. Relation to Modern Sparse MoE Layers in Neural Architectures
Sparse Transformer-MoE architectures (e.g., Shazeer et al., Switch-Transformer) typically employ hard top-$k$ routing by selecting the $k$ largest gate values for each token in a sequence, resulting in a deterministic and fixed number of active experts per input. In contrast, HS-MoE replaces this procedure with a Bayesian router: horseshoe shrinkage on the gate weights $\gamma$ renders most logits negligible, yielding a soft top-$k$ effect. At deployment, the $k$ experts with the largest posterior mean logits can be selected, but these logits reflect model uncertainty and adapt to streaming data.
The effective $k$ in HS-MoE is data-driven, rather than fixed, and its Bayesian structure provides uncertainty quantification for expert assignment. This suggests safer expert routing in scenarios with domain shift or changing data regimes.
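One way to picture deployment-time routing with uncertainty is sketched below. It assumes a set of equally weighted posterior draws of the gate weights (for example, particles after resampling) stored as an array of shape `(N, K, d)`. The function name `route_with_uncertainty` and the inclusion-frequency summary are illustrative choices, not part of the source.

```python
import numpy as np

def route_with_uncertainty(gate_samples, x, k):
    """Select top-k experts by posterior mean logit and report uncertainty.

    gate_samples: (N, K, d) equally weighted posterior draws of gate weights
    x:            (d,) input
    Returns (chosen, inclusion) where `chosen` holds the k expert indices
    with the largest posterior mean logits and `inclusion[j]` is the
    fraction of draws in which expert j falls in that draw's own top-k.
    """
    logits = gate_samples @ x                              # (N, K)
    mean_logits = logits.mean(axis=0)
    chosen = np.argsort(-mean_logits)[:k]
    topk_per_draw = np.argsort(-logits, axis=1)[:, :k]     # (N, k)
    n_experts = logits.shape[1]
    inclusion = np.array([(topk_per_draw == j).any(axis=1).mean()
                          for j in range(n_experts)])
    return chosen, inclusion
```

An inclusion frequency near 1 indicates that the router is confident about an expert, while intermediate values flag inputs where the routing decision is uncertain, which a deterministic top-$k$ router cannot express.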
6. Empirical Validation
Empirical evaluation is illustrated on a synthetic Gaussian-linear regression task with $K$ experts, of which only three are truly active. Given $n$ samples, $d$ features, and $N$ particles, the experts are assigned Gaussian linear models with Normal-inverse-gamma priors and a softmax gate under the horseshoe prior, with a fixed bias on the inactive gates and a fixed global scale $\tau$.
A summary of the empirical findings is presented below:
| Metric | HS-MoE Value | Baseline Comparison |
|---|---|---|
| True positive expert identification | n/a | |
| False activation rate on inactive experts | n/a | |
| Predictive log-likelihood gap | within $0.1$ nats/sample | vs. oracle (oracle = 3 active) |
| Average experts used ("effective $k$") | | Soft MoE: (fixed) |
Estimated allocation frequencies accurately recover the three true active experts, with inactive experts’ frequencies shrunk to nearly zero. Predictive performance matches that of baselines using more experts on average, and HS-MoE yields well-calibrated routing probabilities. This suggests that HS-MoE provides both competitive predictive accuracy and a minimal effective number of experts, with the added benefit of uncertainty quantification (Polson et al., 14 Jan 2026).