Token Similarity-Aware Routing
- Token similarity-aware routing is a neural routing strategy that assigns tokens to experts or other computational pathways based on measured inter-token similarities, integrating relational structure for improved accuracy.
- It employs methods like Similarity-Aware MoE, Attention-Aware MoE, and prototypical routing to reduce routing entropy and ensure stable expert assignment.
- Empirical results demonstrate lower perplexity, enhanced computational efficiency, and more robust performance in language, vision, and retrieval tasks.
Token similarity-aware routing refers to a family of neural routing strategies in which the assignment of tokens (or token representations) to computational resources—typically experts or memory slots—explicitly depends on the measured similarity or relational structure between tokens. These methods have emerged as a critical advancement in sparse mixture-of-experts (MoE) architectures, multimodal diffusion systems, retrieval models, and hierarchical LLM routing, where they are employed to improve stability, utilization, robustness, and efficiency by breaking conditional-independence assumptions inherent in classic softmax-based gating. Leading frameworks include Similarity-Aware MoE, Attention-Aware MoE, similarity-preserving load balancing, prototypical routing, and token-level data-driven neural routers.
1. Motivation and Conceptual Foundations
The core motivation for token similarity-aware routing stems from limitations in standard routing schemes, especially in MoE architectures. In the canonical SMoE, each token is independently routed to experts based purely on its embedding via a softmax gate, resulting in conditional independence of expert assignments across tokens. This independence leads to late-stage routing fluctuations, non-robustness, and suboptimal load balancing (Nguyen et al., 1 May 2025). Similarly, in multi-vector retrieval and multi-model fusion, all-to-all or static mapping-based routing can be computationally prohibitive and semantically inflexible (Li et al., 2022, Liu et al., 15 Nov 2025).
Token similarity-aware routing seeks to exploit inter-token relational structure. Instead of treating every token as an independent routing decision, these methods couple routing choices for similar tokens, either by explicitly using pairwise similarities, attention matrices, or learnable association schemes. This framework introduces global or groupwise constraints that reduce entropy in expert selection, stabilize routing decisions, prune redundant computation, and improve efficiency and robustness across tasks and domains (Nguyen et al., 1 May 2025, Omi et al., 16 Jun 2025).
2. Core Methodologies
Several instantiations of similarity-aware token routing have been developed, each tailored to a specific architectural or application context.
2.1 Similarity-Aware MoE (S-MoE)
S-MoE explicitly incorporates a latent similarity variable $z_i$ for each token $x_i$. The router samples $z_i$ from a distribution over the token batch, weighted by a softmax of the (possibly learned) similarity matrix $S$:

$$P(z_i = j) = \frac{\exp(S_{ij}/\tau)}{\sum_{k} \exp(S_{ik}/\tau)}.$$

Token $x_i$'s expert assignment is then derived from the router output $r(x_{z_i})$ of its selected similar token $x_{z_i}$:

$$g(x_i) = \mathrm{softmax}\big(r(x_{z_i})\big), \qquad z_i \sim P(z_i = \cdot).$$

The corresponding sparse version applies a TopK operation to the expert scores aggregated over the similarity distribution:

$$g(x_i) = \mathrm{softmax}\Big(\mathrm{TopK}\big(\textstyle\sum_{j} P(z_i = j)\, r(x_j)\big)\Big).$$
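A minimal sketch of this mechanism in PyTorch, using the aggregated (expectation) form rather than sampling and a plain dot-product similarity; the function and variable names are illustrative, not the reference implementation:

```python
import torch
import torch.nn.functional as F

def smoe_gates(x, w_router, tau=1.0, top_k=2):
    """Similarity-aware gating: mix router scores of similar tokens.

    x        : (T, d) token embeddings
    w_router : (d, E) router weight for E experts
    tau      : temperature of the similarity softmax
    """
    logits = x @ w_router                     # (T, E) per-token scores r(x_j)
    sim = x @ x.t()                           # (T, T) dot-product similarity S
    coupling = F.softmax(sim / tau, dim=-1)   # (T, T) rows give P(z_i = j)
    mixed = coupling @ logits                 # (T, E) aggregated expert scores

    # Sparse version: keep only each token's top-k experts.
    topv, topi = mixed.topk(top_k, dim=-1)
    gates = torch.full_like(mixed, float("-inf")).scatter(-1, topi, topv)
    return F.softmax(gates, dim=-1)           # (T, E), zero off the top-k

gates = smoe_gates(torch.randn(8, 16), torch.randn(16, 4))
```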
2.2 Attention-Aware MoE (A-MoE)
In attention-aware routing, the transformer's attention matrix provides the coupling between tokens:

$$C = A^{(h^{*})}, \qquad h^{*} = \arg\min_{h} \mathcal{H}\big(A^{(h)}\big),$$

where $h^{*}$ is the attention head with minimum entropy. The expert aggregation uses this attention-derived coupling in place of the global similarity:

$$g(x_i) = \mathrm{softmax}\Big(\mathrm{TopK}\big(\textstyle\sum_{j} C_{ij}\, r(x_j)\big)\Big).$$
These mechanisms reduce the routing entropy, resulting in more stable expert assignment and lower routing fluctuation (Nguyen et al., 1 May 2025).
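A sketch of the attention-derived variant under the same illustrative conventions, assuming the per-head post-softmax attention maps are available:

```python
import torch
import torch.nn.functional as F

def min_entropy_head(attn):
    """Select the sharpest attention head as the token coupling.

    attn : (H, T, T) post-softmax attention maps, one per head.
    """
    row_ent = -(attn * (attn + 1e-9).log()).sum(-1)   # (H, T) row entropies
    return attn[row_ent.mean(-1).argmin()]            # (T, T) coupling C

def amoe_gates(x, w_router, attn, top_k=2):
    logits = x @ w_router                       # (T, E) per-token router scores
    mixed = min_entropy_head(attn) @ logits     # (T, E) attention-mixed scores
    topv, topi = mixed.topk(top_k, dim=-1)
    gates = torch.full_like(mixed, float("-inf")).scatter(-1, topi, topv)
    return F.softmax(gates, dim=-1)
```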
2.3 Similarity-Preserving Load Balancing
To address over-collapsed routers, SimBal penalizes deviation of the router's parameter Gram matrix from the identity via an $\ell_1$ loss:

$$\mathcal{L}_{\text{SimBal}} = \big\lVert W_r W_r^{\top} - I \big\rVert_{1},$$

where $W_r$ is the router weight matrix. This softly preserves input similarity in the expert-logit space: similar tokens are consistently mapped to similar router outputs, preserving semantic consistency and accelerating convergence (Omi et al., 16 Jun 2025).
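A sketch of the penalty as reconstructed above, assuming the router is a single linear layer whose rows are per-expert directions:

```python
import torch

def simbal_loss(w_router):
    """Penalize deviation of the router's Gram matrix from the identity.

    w_router : (E, d) router weight; rows are per-expert directions.
    """
    gram = w_router @ w_router.t()                    # (E, E) Gram matrix
    eye = torch.eye(gram.size(0), device=gram.device)
    return (gram - eye).abs().sum()                   # l1 deviation from I

# Typically added to the task loss with a small coefficient, e.g.
# loss = task_loss + 0.01 * simbal_loss(router.weight)
```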
2.4 Prototypical Routing
In vision-centric MoEs, prototypical routing employs learned prototype vectors $\{p_e\}_{e=1}^{E}$, one per expert, and routes conditional tokens by cosine similarity:

$$s_e(x_i) = \frac{x_i^{\top} p_e}{\lVert x_i \rVert\, \lVert p_e \rVert}.$$
Top-K selection and a routing contrastive loss jointly enforce intra-expert coherence (pulling token representations towards their assigned prototype) and inter-expert diversity (repelling them from others) (Wei et al., 28 Oct 2025).
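A sketch of the similarity computation and Top-K selection, with illustrative names (the contrastive loss is omitted):

```python
import torch
import torch.nn.functional as F

def prototype_route(x, prototypes, top_k=1):
    """Route tokens by cosine similarity to per-expert prototypes.

    x          : (T, d) token representations
    prototypes : (E, d) one learned prototype per expert
    """
    sims = F.normalize(x, dim=-1) @ F.normalize(prototypes, dim=-1).t()
    topv, topi = sims.topk(top_k, dim=-1)   # (T, top_k) experts and scores
    return topi, topv
```

A routing contrastive loss would then pull each token's representation toward its selected prototype and push it away from the remaining ones.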
2.5 Token-Wise Neural Routing
In multimodal and hierarchical settings, routers are constructed as lightweight neural networks that, at each token, process contextual features (embeddings, noisy latents, timestep signals, or SLM logits) to predict assignment to hidden states, layers, or model pathways. Sparsity is enforced by TopK or $\epsilon$-greedy exploration, and the process can be made token- or timestep-dependent (Liu et al., 15 Nov 2025, Fu et al., 27 May 2025).
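A sketch of such a router, with hypothetical feature and pathway dimensions; the $\epsilon$-greedy branch applies only during training:

```python
import torch
import torch.nn as nn

class TokenRouter(nn.Module):
    """Lightweight per-token router over n_paths pathways (layers, models, ...)."""

    def __init__(self, feat_dim, n_paths, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.GELU(), nn.Linear(hidden, n_paths))

    def forward(self, feats, eps=0.0):
        # feats: (T, feat_dim) concatenated context (embedding, timestep, logits, ...)
        logits = self.net(feats)            # (T, n_paths)
        choice = logits.argmax(-1)          # greedy pathway per token
        if self.training and eps > 0.0:     # epsilon-greedy exploration
            explore = torch.rand(choice.shape, device=feats.device) < eps
            random_path = torch.randint_like(choice, logits.size(-1))
            choice = torch.where(explore, random_path, choice)
        return choice, logits
```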
2.6 Dynamic Lexical Routing for Retrieval
CITADEL routes token representations to a dynamically learned set of lexical keys (learned via a SPLADE activation over the MLM head), enabling efficient late-interaction retrieval only among tokens sharing the same key. Routing is sparse (Top-K/Top-L per token), learns semantic as well as lexical correspondences, and includes router-level contrastive and regularization losses (Li et al., 2022).
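A sketch of the key-routing step only, approximating the SPLADE activation as $\log(1 + \mathrm{ReLU}(\cdot))$ over a generic projection; `key_proj` and the shapes are assumptions, not CITADEL's actual MLM head:

```python
import torch
import torch.nn.functional as F

def route_to_lexical_keys(token_reps, key_proj, top_k=1):
    """Route each token to its top-scoring lexical keys.

    token_reps : (T, d) contextual token representations
    key_proj   : (d, V) projection onto a vocabulary-sized key space
    """
    scores = torch.log1p(F.relu(token_reps @ key_proj))   # SPLADE-style activation
    topv, topi = scores.topk(top_k, dim=-1)               # keys kept per token
    return topi, topv

# At search time a query token interacts only with document tokens that share
# at least one routed key, instead of all-to-all late interaction.
```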
3. Theoretical Guarantees and Stability
A central theoretical result is the entropy-reduction property of similarity-aware routing. Let $r(x_i)$ be the baseline router score distribution for token $x_i$, and let $\bar{r}(x_i)$ be the similarity- or attention-smoothed version:

$$\bar{r}(x_i) = \sum_{j} C_{ij}\, r(x_j),$$

where $C$ is the similarity or attention coupling defined above. The Shannon entropy is provably bounded:

$$\mathcal{H}\big(\bar{r}(x_i)\big) \le \mathcal{H}\big(r(x_i)\big).$$
Thus, as the similarity kernel is sharpened, the router’s uncertainty is strictly decreased, which reduces late-stage routing flips and increases training stability. This property has been empirically validated in both language and vision SMoE systems (Nguyen et al., 1 May 2025).
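One way to see the sharpening effect, offered here as supporting intuition rather than the cited proof, is the standard mixture-entropy decomposition: since $\bar{r}(x_i)$ is a $C_{i\cdot}$-weighted mixture of the distributions $r(x_j)$,

$$\mathcal{H}\big(\bar{r}(x_i)\big) \;\le\; \mathcal{H}(C_{i\cdot}) + \sum_{j} C_{ij}\,\mathcal{H}\big(r(x_j)\big),$$

and sharpening the kernel ($\tau \to 0$) drives the mixing entropy $\mathcal{H}(C_{i\cdot})$ toward zero, leaving the smoothed score no more uncertain than the router outputs it mixes.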
4. Applications and Empirical Results
Token similarity-aware routing has been instantiated across a range of tasks and models, with demonstrated empirical benefits.
| Domain | Approach(es) | Key Gains |
|---|---|---|
| Language Modeling | S-MoE, A-MoE, SimBal, R2R | ↓ Perplexity by 2–4 pts, ↑ routing stability, ↓ redundancy |
| Vision | Prototypical Routing (ProMoE) | FID ↓21–29%, IS ↑20–35, surpasses dense and generic MoE baselines |
| Multimodal Generation | MoS | Matches/↑ SOTA at 4× ↓ params, negligible overhead |
| Retrieval | CITADEL | 40× speedup (vs ColBERT-v2), similar nDCG@10, index shrinks >8× |
| Inference Routing | R2R | 1.4–2.4× accuracy at same param budget, 2.8× speedup |
In language modeling (WikiText-103), S-MoE and A-MoE reduce perplexity by 2–3 points on clean data and by 3–4 points under adversarial token swaps, with expert flips dropping from 33% (SMoE) to ≤15%. In vision, ProMoE delivers 21–29% lower FID and higher IS at equivalent or sublinear compute (Nguyen et al., 1 May 2025, Wei et al., 28 Oct 2025).
For retrieval, CITADEL achieves ∼40× GPU latency speedup and a similar nDCG/MRR to ColBERT-v2, with index size and interaction counts sharply reduced (Li et al., 2022).
In test-time routing, R2R leverages similarity in reasoning paths to route only 5–10% of tokens to the LLM, surpassing larger models in accuracy given equal compute (Fu et al., 27 May 2025).
5. Algorithmic Procedures
Although implementations vary, the general algorithmic structure of similarity-aware routing can be summarized as follows (a minimal end-to-end sketch appears after the list):
- Similarity Matrix Construction: For each token, compute similarities with other tokens (using embedding dot products, attention scores, or learned keys).
- Aggregation and Routing: Aggregate expert or pathway scores by mixing router logits/gates of similar tokens weighted by similarity or attention.
- Sparse Selection: Enforce sparsity via TopK or $\epsilon$-greedy selection.
- Output and Backpropagation: Compute token outputs as expert-weighted sums, and optimize the full loss, which may include auxiliary similarity or routing contrastive terms.
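A minimal end-to-end sketch of these four steps, assuming a precomputed row-stochastic coupling matrix and a list of expert modules (auxiliary losses omitted; all names illustrative):

```python
import torch
import torch.nn.functional as F

def similarity_aware_forward(x, w_router, experts, coupling, top_k=2):
    """Steps 1-4 above; coupling is the (T, T) similarity/attention matrix.

    x        : (T, d) tokens
    w_router : (d, E) router weight
    experts  : list of E modules, each mapping (n, d) -> (n, d)
    """
    mixed = coupling @ (x @ w_router)          # step 2: aggregate router scores
    topv, topi = mixed.topk(top_k, dim=-1)     # step 3: sparse expert selection
    gates = F.softmax(topv, dim=-1)            # normalize the kept scores

    out = torch.zeros_like(x)                  # step 4: expert-weighted output
    for e, expert in enumerate(experts):
        for slot in range(top_k):
            mask = topi[:, slot] == e
            if mask.any():
                out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out
```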
Specialized variants include:
- Dynamic routing conditioned on time-step and input for diffusion models (Liu et al., 15 Nov 2025)
- Feed-forward classifier-based token-level routing in hierarchical decoding (Fu et al., 27 May 2025)
- SPLADE-based key gating for retrieval (Li et al., 2022)
- Gram-matrix regularization for orthogonality (Omi et al., 16 Jun 2025)
- Contrastive losses in prototypical routing (Wei et al., 28 Oct 2025)
6. Challenges, Limitations, and Future Directions
Current limitations include scalability to very large token batches (similarity computation is quadratic in batch length), dependence on the quality of similarity metrics or semantic prototypes, and the need for careful hyperparameter selection (e.g., the temperature $\tau$ and auxiliary-loss coefficients). Architecturally, many variants have been evaluated only on specific backbone families or domain-limited pretraining (Omi et al., 16 Jun 2025, Wei et al., 28 Oct 2025).
Open research directions include hybrids of explicit pairwise-similarity routing and orthogonality regularization, efficient GPU implementations of global or attention-based similarity, and generalization to multi-step or hierarchical expert selection. Combining loss-free balancing with similarity preservation, and exploring new norms for relational regularization, are identified as promising avenues (Omi et al., 16 Jun 2025).
7. Broader Impact and Significance
Token similarity-aware routing has advanced the state of the art across language modeling, vision, retrieval, and efficient inference. By breaking conditional-independence assumptions, these approaches yield more stable, robust, and efficient models that better leverage parameter scale without sacrificing utilization or incurring catastrophic collapse. In settings from data-intensive pretraining to multimodal fusion and serving-time model cascades, they represent a convergence point where architectural efficiency is matched with representational alignment, opening the door to further scaling and unification of sparse neural computation (Nguyen et al., 1 May 2025, Fu et al., 27 May 2025, Liu et al., 15 Nov 2025, Li et al., 2022, Omi et al., 16 Jun 2025, Wei et al., 28 Oct 2025).