SMEBU: Soft-Clamped Momentum Expert Bias Updates

Updated 25 February 2026

The paper introduces SMEBU, a novel approach that replaces discontinuous, sign-based bias updates with smooth, tanh-clamped, momentum-damped adjustments to reduce oscillatory behavior.
SMEBU computes a normalized token load deviation and applies a mean-centered update, ensuring balanced expert utilization without auxiliary loss terms.
In Trinity Large training, introducing SMEBU together with other stabilizing changes restored balanced routing and avoided expert collapse under extreme sparsity; SMEBU was not ablated in isolation at scale.

Soft-Clamped Momentum Expert Bias Updates (SMEBU) is a bias adaptation algorithm introduced for balancing expert utilization in sparse Mixture-of-Experts (MoE) architectures, specifically within the Arcee Trinity Large model comprising 400 billion parameters with 256 experts per layer and 13 billion activated per token. SMEBU addresses limitations of prior aux-loss-free expert balancing methods by enforcing smooth, bounded, and momentum-damped updates to each expert’s router bias, thereby promoting stable and equitable token assignment without explicit auxiliary loss terms (Singh et al., 19 Feb 2026).

1. Motivation and Background

In extremely sparse MoE models, balanced routing—meaning that each expert is assigned a roughly equal number of tokens—is critical for ensuring that all experts actively participate in learning and that model capacity is efficiently utilized. Prior approaches, particularly “aux-loss-free” balancing, updated each expert’s routing bias via a sign-based rule:

$\Delta b_i = \gamma \cdot \text{sign}(\bar n - n_i)$

where $\gamma$ is a step size, $n_i$ is the number of tokens routed to expert $i$, and $\bar n$ is the average number of tokens per expert. Because every step has magnitude $\gamma$ no matter how small the imbalance is, these $\pm\gamma$ updates (followed by mean-centering) oscillate around the equilibrium instead of settling, especially as the number of experts ($N_r$) increases. The result is “router drift,” elevated MaxVio (violation of target load balance), and ultimately expert collapse.

SMEBU provides an alternative: rather than relying on discontinuous and fixed-magnitude steps, it introduces a smooth, magnitude-sensitive tanh clamp and supplements it with a simple momentum buffer, yielding bounded, gradual adaptations that effectively reduce oscillations and routing instability (Singh et al., 19 Feb 2026).
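The difference between the two rules is easiest to see on nearly balanced loads. The sketch below is illustrative only: the function names are hypothetical, the sign-rule step size `gamma=5e-4` is an assumed value chosen to match the SMEBU $\lambda$ (the source does not give $\gamma$), and the tanh step is shown before centering and momentum.

```python
import numpy as np

def sign_step(n, gamma):
    """Sign-based rule: each expert moves by +/- gamma (0 only if exactly balanced)."""
    n = np.asarray(n, dtype=np.float64)
    return gamma * np.sign(n.mean() - n)

def tanh_step(n, lam, kappa):
    """Soft-clamped rule (before centering/momentum): step scales with the deviation."""
    n = np.asarray(n, dtype=np.float64)
    n_bar = n.mean()
    return lam * np.tanh(kappa * (n_bar - n) / n_bar)

# Nearly balanced loads: expert 0 is 1% over the mean, expert 1 is 1% under.
loads = [101.0, 99.0, 100.0, 100.0]
step_sign = sign_step(loads, gamma=5e-4)            # gamma is an assumed value
step_tanh = tanh_step(loads, lam=5e-4, kappa=2.0)   # lambda, kappa from the text

print("sign-based steps:  ", step_sign)
print("soft-clamped steps:", step_tanh)
```

For a 1% imbalance the sign rule still takes a full-size step, while the soft-clamped step is roughly fifty times smaller, which is the mechanism by which overshoot near equilibrium is reduced.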

2. Algorithmic Formulation

The SMEBU update at each MoE layer and training step comprises the following elements:

Notation

$N_r$ : number of routed experts (256 in Trinity Large).
$n_i$ : tokens assigned to expert $i$ at the current training step.
$\bar n = \frac{1}{N_r}\sum_{i=1}^{N_r} n_i$ : mean expert load.
$b_i$ : maintained router bias per expert.
$m_i$ : momentum buffer per expert.

Stepwise Computation

Load Deviation (Violation)

$v_i = \frac{\bar n - n_i}{\bar n}$

This quantifies the relative token shortfall per expert.

Soft-Clamped Transformation

$\tilde v_i = \tanh(\kappa v_i)$

The hyperparameter $\kappa > 0$ (Trinity Large: $\kappa=2$) determines how quickly the tanh saturates; the output is bounded, $\tilde v_i \in (-1, 1)$.

Raw Bias Increment

$\Delta b_i^{raw} = \lambda \tilde v_i$

$\lambda$ is the per-step bias learning rate (Trinity Large: $\lambda=5\times 10^{-4}$). Because $|\tilde v_i| < 1$, each raw step is strictly bounded: $|\Delta b_i^{raw}| < \lambda$.

Zero-Mean Centering

$\Delta b_i = \Delta b_i^{raw} - \frac{1}{N_r}\sum_{j=1}^{N_r} \Delta b_j^{raw}$

Subtracting the mean makes the increments sum to zero across experts, so the update shifts biases relative to one another without changing their overall mean.

Momentum Damping

$m_i \leftarrow \beta m_i + (1-\beta)\Delta b_i$

The coefficient $\beta$ (Trinity Large: $\beta=0.5$ ) controls the temporal smoothing of bias adjustments.

Bias Update

$b_i \leftarrow b_i + m_i$
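
A short check, assuming every $m_i$ starts at zero (as the initialization described in Section 3 states), shows that the momentum step does not undo the centering. The centered increments sum to zero by construction:

$\sum_{i=1}^{N_r} \Delta b_i = \sum_{i=1}^{N_r} \Delta b_i^{raw} - N_r \cdot \frac{1}{N_r}\sum_{j=1}^{N_r} \Delta b_j^{raw} = 0$

so the momentum buffers keep a zero sum at every step:

$\sum_{i=1}^{N_r} m_i \leftarrow \beta \sum_{i=1}^{N_r} m_i + (1-\beta)\sum_{i=1}^{N_r} \Delta b_i = \beta \cdot 0 + (1-\beta) \cdot 0 = 0$

and the mean of the biases is therefore unchanged by $b_i \leftarrow b_i + m_i$. One consequence of centering is that an individual step can slightly exceed $\lambda$: since both the raw increment and the subtracted mean are smaller than $\lambda$ in magnitude, $|\Delta b_i| < 2\lambda$, and the momentum average inherits the same bound, $|m_i| < 2\lambda$.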

Pseudocode

Inputs:
  N_r     ← number of experts
  n[1..N_r]     ← token counts per expert
  b[1..N_r]     ← expert biases
  m[1..N_r]     ← momentum buffers
  λ, β, κ       ← hyperparameters

1. n_bar = (1/N_r) * sum_{i=1..N_r} n[i]
2. For i = 1..N_r:
     v[i] = (n_bar - n[i]) / n_bar
     v_tilde[i] = tanh(κ * v[i])
     delta_raw[i] = λ * v_tilde[i]
3. mean_delta_raw = (1/N_r) * sum_{i=1..N_r} delta_raw[i]
4. For i = 1..N_r:
     delta[i] = delta_raw[i] - mean_delta_raw
     m[i] = β * m[i] + (1-β) * delta[i]
     b[i] = b[i] + m[i]
5. Write back updated b[·], m[·].
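
The pseudocode above maps directly onto vectorized array operations. The following is a minimal NumPy sketch, not the authors' implementation: the function name `smebu_step` and the toy token counts are assumptions for illustration, the defaults are the Trinity Large values quoted in the text, and the code assumes at least one token was routed in the step (so that the mean load is nonzero).

```python
import numpy as np

def smebu_step(n, b, m, lam=5e-4, beta=0.5, kappa=2.0):
    """One SMEBU update for a single MoE layer.

    n: token counts per expert, b: router biases, m: momentum buffers.
    Returns the updated (b, m) without modifying the inputs.
    Assumes n.mean() > 0, i.e. at least one token was routed this step.
    """
    n = np.asarray(n, dtype=np.float64)
    n_bar = n.mean()                                  # mean expert load
    v = (n_bar - n) / n_bar                           # relative shortfall (> 0 if under-loaded)
    delta_raw = lam * np.tanh(kappa * v)              # soft-clamped raw increment
    delta = delta_raw - delta_raw.mean()              # zero-mean centering
    m_new = beta * np.asarray(m) + (1.0 - beta) * delta   # momentum damping
    b_new = np.asarray(b) + m_new                     # bias update
    return b_new, m_new

# Toy example with 4 experts (hypothetical counts); mean load is 200.
counts = np.array([400.0, 100.0, 50.0, 250.0])
b = np.zeros(4)   # biases start at zero
m = np.zeros(4)   # momentum buffers start at zero
b, m = smebu_step(counts, b, m)
print("biases:  ", b)
print("momentum:", m)
```

In this toy run the overloaded expert 0 receives a negative bias and the most under-loaded expert 2 receives the largest positive one, while the biases still sum to zero.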

3. Hyperparameters and Initialization

SMEBU exposes three principal hyperparameters:

Parameter	Trinity Large value (typical range)	Function
$\lambda$	$5\times10^{-4}$	Bounds per-step bias shift
$\kappa$	$2$ ($\approx$ 1–5)	Controls clamp nonlinearity
$\beta$	$0.5$ (0.3–0.9)	Momentum coefficient

Initialization sets all $b_i$ and $m_i$ to zero. No additional auxiliary loss or bias decay is applied beyond SMEBU itself (although a small sequence-wise loss may also be present, handled separately).

Adjustment heuristics:

Increasing $\lambda$ or decreasing $\beta$ accelerates load equalization.
Increasing $\beta$ or decreasing $\lambda$ damps persistent oscillations.
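Increasing $\kappa$ makes the tanh saturate at smaller deviations, so the update behaves more like the sign-based rule; reducing $\kappa$ keeps the update closer to linear in $v_i$. In both cases the raw step remains bounded by $\lambda$.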
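
To make the role of $\kappa$ concrete, the sketch below evaluates the clamp $\tanh(\kappa v)$ at a small deviation ($v = 0.1$) and a large one ($v = 1.0$, an expert that received no tokens) for several values of $\kappa$. The helper name `clamped` and the chosen sample points are illustrative assumptions, not values from the source.

```python
import math

def clamped(v, kappa):
    """Soft-clamped violation: tanh(kappa * v), always inside (-1, 1)."""
    return math.tanh(kappa * v)

# Compare the response to a small and a large relative shortfall.
for kappa in (1.0, 2.0, 5.0):
    small = clamped(0.1, kappa)   # 10% under the mean load
    large = clamped(1.0, kappa)   # expert received no tokens
    print(f"kappa={kappa}: small deviation -> {small:.4f}, large deviation -> {large:.4f}")
```

Larger $\kappa$ raises the response to small deviations and pushes large deviations closer to the saturation limit of 1; it never changes the limit itself, which is set by $\lambda$ once the clamp output is scaled.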

4. Integration with Sigmoid Routing

In Trinity’s routing scheme, each token–expert pair’s unadjusted router score $s_{i,t} = \sigma(u_t^\top e_i)$ is offset by $b_i$ for top-$K$ selection. The forward pass for expert selection operates as:

Compute $s_{i,t}$ for each token $t$ and expert $i$ .
For each $t$ , select $K$ experts with the largest $(s_{i,t} + b_i)$ .
Prepare intermediate gates:

$g'_{i,t} = \begin{cases} s_{i,t} & \text{if } (s_{i,t} + b_i) \in \mathrm{Top}\text{-}K \\ 0 & \text{otherwise} \end{cases}$

Final gate weights $g_{i,t} = g'_{i,t} / \sum_j g'_{j,t}$ .
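
The four steps above can be sketched in NumPy as follows. This is an illustrative sketch rather than Trinity's actual router: the function name `route_tokens`, the array shapes, and the random toy inputs are assumptions. It shows the key property that the bias enters only the selection, while gate weights are computed from the unbiased sigmoid scores.

```python
import numpy as np

def route_tokens(u, e, b, k):
    """Biased top-k selection with sigmoid scores.

    u: (T, d) token representations, e: (N, d) expert embeddings,
    b: (N,) router biases, k: experts selected per token.
    Returns gates (T, N) and per-expert token counts (N,).
    """
    s = 1.0 / (1.0 + np.exp(-(u @ e.T)))            # s[t, i] = sigmoid(u_t . e_i)
    biased = s + b                                   # bias is used for selection only
    topk = np.argpartition(-biased, k - 1, axis=1)[:, :k]   # k largest per token
    mask = np.zeros(s.shape, dtype=bool)
    np.put_along_axis(mask, topk, True, axis=1)
    g_prime = np.where(mask, s, 0.0)                 # intermediate gates use unbiased s
    g = g_prime / g_prime.sum(axis=1, keepdims=True) # normalize over selected experts
    counts = mask.sum(axis=0)                        # n_i, the input to the bias update
    return g, counts

rng = np.random.default_rng(0)
u = rng.normal(size=(8, 16))    # 8 toy tokens
e = rng.normal(size=(6, 16))    # 6 toy experts
g, counts = route_tokens(u, e, b=np.zeros(6), k=2)
print("tokens per expert:", counts)
```

The returned `counts` correspond to the $n_i$ consumed by the bias update, so a training loop would call the router, then the bias update, once per step.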

SMEBU’s adaptive $b_i$ steers this process, nudging under-loaded experts toward selection, but does not interfere with the base sigmoid scores used for mixture weighting. Thus, it achieves bias correction for load balancing in a minimally invasive manner (Singh et al., 19 Feb 2026).

5. Empirical Observations and Practical Effects

During initial Trinity Large experiments using sign-only aux-loss-free bias updates, expert collapse and routing instability (MaxVio spikes, loss plateau) were acute. After SMEBU was introduced together with five other stabilizing modifications, routing balance was maintained and the loss resumed smooth convergence. Controlled ablation of SMEBU in isolation was not performed at scale; however, in small-scale tests, SMEBU alone produced less volatile MaxVio than sign-based or unclamped linear methods.

Unclamped linear bias updates (i.e., $\lambda v_i$ without tanh or centering) led to late-training instabilities, providing further empirical rationale for inclusion of both the tanh clamp and mean-centering. The tightly bounded, momentum-damped steps permit stable adaptation in regimes of extreme expert sparsity (e.g., 256 experts, only $K_r=4$ active per token for Trinity Large), where classical approaches falter (Singh et al., 19 Feb 2026).

6. Significance and Implications

SMEBU constitutes a drop-in replacement for expert bias adaptation in aux-loss-free MoE balancing regimes, obviating the need for auxiliary load balancing losses. Through normalized violation measures, soft nonlinearity, centered and bounded delta steps, and temporal smoothing, SMEBU attains empirically robust expert utilization across large-scale, ultra-sparse MoE layers. A plausible implication is extensibility to other sparse expert-router architectures where stepwise bias management is critical and hard updates or auxiliary losses have proven ineffective or destabilizing. The absence of explicit auxiliary loss terms simplifies both tuning and computational overhead, representing a practically impactful methodological advance within scalable mixture-of-experts frameworks (Singh et al., 19 Feb 2026).


References (1)

Arcee Trinity Large Technical Report (2026)
