Token-level Adaptive Routing (TARo)

Updated 4 July 2026

Token-level Adaptive Routing (TARo) is a conditional computation mechanism that makes per-token routing decisions to allocate heterogeneous model resources based on token complexity.
It employs various routing strategies—including hard selection, soft aggregation, and Bayesian approaches—to dynamically choose among experts, adapters, or attention modules.
Empirical studies show that TARo improves speed–accuracy trade-offs and reduces computational cost, yielding significant latency and performance gains in diverse applications.


Token-level Adaptive Routing (TARo) designates a class of conditional-computation mechanisms in which the routing decision is made for individual tokens or generation steps rather than for an entire sequence, a fixed expert budget, or a uniform network path. In the recent literature, TARo appears in several distinct but structurally related forms: token-level dynamic routing to a linear-time State Space Model expert or a quadratic-time Transformer expert in clinical question answering, token-dependent variation of the number of active experts in Mixture-of-Experts layers, adaptive interpolation between frozen base-model and reward-model logits during test-time alignment, token-wise selection among multiple attention mechanisms, per-token recursion-depth assignment, per-token collaboration between small and large language models, and per-token switching between autoregressive and diffusion drafters or between discrete and latent reasoning modes (Khan et al., 3 Jan 2026, Gülmez, 2 Mar 2026, Rai et al., 19 Mar 2026, Ferrari, 27 May 2026, Zheng et al., 4 Feb 2025, Kwon et al., 5 Jun 2026, Zhang et al., 4 Jun 2026).

1. Scope and problem setting

Across these works, TARo addresses a common limitation of static computation allocation: different tokens within the same input may have different computational requirements, different preferred inductive biases, or different risk profiles. In MoE settings, the motivating restriction is fixed Top-$K$ routing, where exactly $K$ experts are activated per token; DynaMoE explicitly relaxes this assumption and allows the number of active experts per token to vary based on input complexity (Gülmez, 2 Mar 2026). In collaborative inference, CITER routes “non-critical tokens” to an SLM for efficiency and “critical tokens” to an LLM for generalization quality (Zheng et al., 4 Feb 2025). In test-time alignment, TARo replaces a fixed mixing coefficient $\alpha$ between a base model and a reward model with a learnable token-level router because the “optimal $\alpha$ varies across tasks, domains, and even decoding steps” (Rai et al., 19 Mar 2026).

A closely related line of work uses token routing even when the term TARo is not the sole organizing label. MEMatte inserts a router immediately before each global-attention layer and sends informative tokens to global attention while routing other tokens to a Lightweight Token Refinement Module, thereby reducing the quadratic burden of global self-attention on high-resolution image matting (Lin et al., 2024). AdaMoE permits a variable number of “true” experts per token by augmenting the expert set with “null experts,” which consume zero FLOPs and fill otherwise fixed top-$k$ slots (Zeng et al., 2024). MoLoRA makes a routing decision for every token over domain-specific LoRA adapters, while FusionRoute selects an expert at each decoding step and also adds a complementary logit from the router’s base LLM (Shah et al., 16 Mar 2026, Xiong et al., 8 Jan 2026).

Representative setting	Routed object	Representative decision
Clinical hybrid MoE	EMamba vs ET5	hard top-1 token routing (Khan et al., 3 Jan 2026)
Dynamic MoE	number of active experts	percentile-threshold token routing (Gülmez, 2 Mar 2026)
Test-time alignment	base vs reward guidance	adaptive $\alpha_t$ per token (Rai et al., 19 Mar 2026)
Efficient vision transformers	global attention vs LTRM	token-wise branch split (Lin et al., 2024)
Collaborative decoding	SLM vs LLM	token-level model choice (Zheng et al., 4 Feb 2025)
Speculative decoding	AR vs diffusion drafter	per-step paradigm selection (Kwon et al., 5 Jun 2026)

This suggests that TARo is better understood as a routing principle than as a single architecture: the routed entities may be experts, model heads, adapters, attention mechanisms, drafters, or reasoning modes, but the granularity of control is consistently token-wise.

2. Routing functions and selection rules

A defining feature of TARo is that the routing function is evaluated from token-local or step-local state. In MambaFormer, the router input for token $x_i$ is the fused feature

$R_i = [x_i; l_i; d] \in \mathbb{R}^{(d+2)},$

where $x_i$ is the contextual embedding, $l_i\in[0,1]$ is normalized sequence length, and $d\in\{0,1\}$ is a binary domain flag. A 2-layer MLP computes

[…]

followed at inference by the hard routing decision

[…]

The selected expert then processes the token, and the paper states explicitly that this is “equivalent to a ‘top-1’ selection per token” with “No soft mixture or […] selection” (Khan et al., 3 Jan 2026).
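The routing path described above (fused feature, 2-layer MLP, hard top-1 decision) can be illustrated with a minimal sketch. It is not the MambaFormer implementation: the helper names `make_router` and `route_token`, the ReLU activation, the absence of bias terms, the hidden width, and the random untrained weights are all assumptions made for illustration.

```python
import random

def make_router(in_dim, hidden_dim, num_experts=2, seed=0):
    """Build untrained 2-layer MLP weights (hypothetical helper, random init)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden_dim)] for _ in range(num_experts)]
    return w1, w2

def route_token(x_i, l_i, d, router):
    """Return the index of the single expert chosen for one token."""
    w1, w2 = router
    r_i = list(x_i) + [float(l_i), float(d)]  # fused feature [x_i; l_i; d]
    # Hidden layer with ReLU (the activation is an assumption).
    hidden = [max(0.0, sum(w * v for w, v in zip(row, r_i))) for row in w1]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    # Hard top-1 decision: argmax over expert scores, no soft mixture.
    return max(range(len(scores)), key=lambda k: scores[k])

EXPERTS = ("EMamba", "ET5")
router = make_router(in_dim=4 + 2, hidden_dim=8)  # 4-dim toy embedding + 2 extras
choice = route_token([0.1, -0.2, 0.3, 0.4], l_i=0.25, d=1, router=router)
print(EXPERTS[choice])
```

Because the decision is an argmax, exactly one expert runs per token, which is what makes the per-token cost either linear-time or quadratic-time rather than a blend of both.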

DynaMoE generalizes the selection rule by replacing fixed Top-$K$ with a percentile threshold. For each token representation, the gate computes per-expert routing scores, defines a percentile threshold over them, and selects

[…]

The layer output is then a soft aggregation over the selected experts,

[…]

and a minimum-activation step guarantees a minimum number of active experts per token (Gülmez, 2 Mar 2026).
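A small sketch may help make percentile-threshold selection concrete. It is an illustration under stated assumptions, not DynaMoE's code: the softmax gate, the linear-interpolation percentile, the `>=` comparison, the renormalization of weights over the selected set, and the names `select_experts` and `aggregate` are all hypothetical choices.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

def select_experts(gate_logits, q=50.0, min_active=1):
    """Return {expert_index: weight} for the experts that clear the threshold."""
    probs = softmax(gate_logits)
    tau = percentile(probs, q)  # token-specific threshold
    selected = [i for i, p in enumerate(probs) if p >= tau]
    if len(selected) < min_active:  # minimum-activation fallback
        ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
        selected = ranked[:min_active]
    norm = sum(probs[i] for i in selected)
    return {i: probs[i] / norm for i in selected}

def aggregate(expert_outputs, weights):
    """Weighted sum of the selected experts' output vectors."""
    dim = len(expert_outputs[0])
    return [sum(w * expert_outputs[i][j] for i, w in weights.items())
            for j in range(dim)]

peaked = select_experts([2.0, 1.0, 0.0, -1.0])  # one dominant expert
flat = select_experts([1.0, 1.0, 1.0, 0.0])     # three equally good experts
print(len(peaked), len(flat))
```

The two toy tokens activate different numbers of experts under the same percentile setting, which is the behaviour that distinguishes this rule from fixed Top-$K$.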

In LLM test-time alignment, TARo does not route a token to one expert or another; instead it routes the amount of guidance. The router consumes either full-logit concatenations or top-$k$ logits with index embeddings and produces a per-token mixing weight $\alpha_t$:

[…]

This adaptive mixing weight defines

[…]

or equivalently

[…]

The route is therefore continuous rather than categorical, but it remains token-level and inference-time (Rai et al., 19 Mar 2026).
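The idea of routing "the amount of guidance" can be sketched as a per-step interpolation of logits. The convex-combination form used here, $(1-\alpha_t)\,z^{\text{base}} + \alpha_t\,z^{\text{reward}}$, is an assumption chosen for illustration and may differ from the paper's exact parameterization; the router that would produce $\alpha_t$ is not modelled, and the function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_logits(base_logits, reward_logits, alpha_t):
    """Convex combination of base and reward logits (assumed form)."""
    if not 0.0 <= alpha_t <= 1.0:
        raise ValueError("alpha_t must lie in [0, 1]")
    return [(1.0 - alpha_t) * b + alpha_t * r
            for b, r in zip(base_logits, reward_logits)]

def greedy_token(base_logits, reward_logits, alpha_t):
    """Pick the most likely token under the mixed distribution."""
    probs = softmax(mix_logits(base_logits, reward_logits, alpha_t))
    return max(range(len(probs)), key=lambda v: probs[v])

base = [2.0, 0.0, 0.0]    # base model prefers token 0
reward = [0.0, 3.0, 0.0]  # reward model prefers token 1
low_guidance = greedy_token(base, reward, alpha_t=0.1)
high_guidance = greedy_token(base, reward, alpha_t=0.9)
print(low_guidance, high_guidance)
```

With a fixed $\alpha$ the same trade-off would apply at every step; a token-level router instead lets the weight be small on steps where the base model suffices and large on steps where reward guidance matters.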

Other instantiations preserve the same logic with different routed objects. Meta-Attention forms a per-token feature, uses a 2-layer MLP to parameterize a Dirichlet posterior over routing weights, and can route each token to full softmax attention, linear attention, or sliding-window local attention (Ferrari, 27 May 2026). MoLoRA uses a two-layer MLP router, followed by TopK and temperature-scaled softmax over selected adapters (Shah et al., 16 Mar 2026). Informed Routing replaces execute-or-skip with execute-or-approximate: the router predicts whether a transformer unit should run in full or whether the token should go through a Lightweight Feature Forecaster (LFF) instead (Han et al., 10 Oct 2025).

A plausible implication is that TARo should not be identified with a single routing algebra. The literature contains hard top-1 selection, variable-cardinality selection, weighted soft aggregation, binary action routing, and Bayesian posterior routing, all at token granularity.

3. Optimization objectives and training regimes

The optimization of TARo varies substantially across applications. In MambaFormer, the router parameters are trained while the experts are frozen. The loss is

[…]

with

[…]

[…]

and

[…]

using […], […], and […]. The paper states that this enforces that only […] of tokens choose ET5 on average, yielding the 3.8% observed in practice (Khan et al., 3 Jan 2026).

CITER formulates token routing as policy optimization in a finite-horizon MDP whose two actions send the current token to the SLM or to the LLM. Under the simplifying choices […] and […], the router is trained by minimizing token-wise cross-entropy on binary preference labels,

[…]

The paper also introduces a three-case shortcut for reward estimation, and reports that “~80–90% of tokens fall into Cases 1–2,” reducing the cost of preference collection by […] (Zheng et al., 4 Feb 2025).
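Token-wise cross-entropy on binary preference labels has a standard form, sketched below. The label convention (1 means the LLM is preferred for that token), the averaging over tokens, and the function name `router_bce_loss` are assumptions; the paper's exact notation is not reproduced here.

```python
import math

def router_bce_loss(p_llm, prefer_llm):
    """Mean binary cross-entropy over tokens.

    p_llm[i]      : router probability of sending token i to the LLM
    prefer_llm[i] : 1 if the LLM is preferred for token i, else 0 (assumed convention)
    """
    eps = 1e-12  # clamp to avoid log(0)
    total = 0.0
    for p, y in zip(p_llm, prefer_llm):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(p_llm)

labels = [1, 0, 0, 1]
confident_and_right = router_bce_loss([0.9, 0.1, 0.1, 0.9], labels)
confident_and_wrong = router_bce_loss([0.1, 0.9, 0.9, 0.1], labels)
print(confident_and_right, confident_and_wrong)
```

A router that agrees with the preference labels incurs a low loss, so minimizing this quantity pushes per-token routing probabilities toward the preferred model.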

TARPO instead uses pure RL. A lightweight action head with its own parameters computes

[…]

where […]. The LLM backbone and router are jointly optimized end-to-end with a shared group-relative advantage signal, and the minibatch objective combines token loss, action loss, and KL regularization (Zhang et al., 4 Jun 2026).

In Bayesian TARo, Meta-Attention places a compute-aware Dirichlet prior over routing weights and trains the amortized posterior using an ELBO objective. The KL term against the prior supplies a principled alternative to ad hoc balancing losses and also yields a posterior-entropy signal for soft-to-hard routing (Ferrari, 27 May 2026). By contrast, DynaMoE states that “No auxiliary balancing losses are required, but large-scale deployment may need capacity factors” (Gülmez, 2 Mar 2026).

In execute-or-approximate routing, training is two-stage. Informed Routing first freezes the base LLM and fits each Lightweight Feature Forecaster using

[…]

then freezes the LFFs and trains routers with

[…]

using Gumbel-Softmax to sample hard masks (Han et al., 10 Oct 2025). MoLoRA, in turn, trains LoRA and router parameters jointly with a standard language-modeling loss plus a Switch-style load-balancing auxiliary term (Shah et al., 16 Mar 2026).

These differences matter conceptually. Some TARo systems train only the router, some train router and backbone jointly, some freeze the routed experts, and some incorporate explicit compute terms in the loss while others rely on priors, schedules, or threshold calibration.

4. Architectural realizations

The architectural role of TARo depends on what part of the model is being conditionally activated. In hybrid MoE clinical assistance, MambaFormer places TARo over two experts: EMamba, “a linear-time State Space Model (SSM) expert,” and ET5, “a quadratic-time Transformer (T5-Large) expert.” The router uses contextual embeddings, normalized sequence length, and a domain-aware flag to decide which tokens should incur Transformer cost (Khan et al., 3 Jan 2026).

In efficient vision transformers, MEMatte inserts a Router plus a routing decision immediately before each global-attention layer. Tokens are split into a Global-Attention branch and a Lightweight Token Refinement Module branch, and after each block the two branches’ outputs are “simply re-concatenated (in the original token order)” before the next router. Training uses Batch-constrained Adaptive Token Routing (BATR), with a batch-level average routed ratio

[…]

and a compression loss (Lin et al., 2024).
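The split-and-re-concatenate step can be sketched as follows. This is a toy illustration rather than MEMatte's code: the 0.5 threshold, the stand-in branch functions, and the helper name `routed_block` are assumptions, and the returned ratio is simply the fraction of tokens sent to the global branch for one sample.

```python
def routed_block(tokens, router_scores, global_branch, light_branch, threshold=0.5):
    """Send each token to one of two branches, then restore the original order."""
    global_idx = [i for i, s in enumerate(router_scores) if s >= threshold]
    light_idx = [i for i, s in enumerate(router_scores) if s < threshold]
    out = [None] * len(tokens)
    for idx, branch in ((global_idx, global_branch), (light_idx, light_branch)):
        if idx:
            # Each branch processes its own subset of tokens jointly.
            results = branch([tokens[i] for i in idx])
            for i, y in zip(idx, results):
                out[i] = y  # write back at the token's original position
    routed_ratio = len(global_idx) / len(tokens)
    return out, routed_ratio

# Stand-ins for the two branches (purely illustrative).
global_attention = lambda xs: [x * 10 for x in xs]
lightweight_refine = lambda xs: [x + 1 for x in xs]

out, ratio = routed_block([1, 2, 3, 4], [0.9, 0.1, 0.8, 0.2],
                          global_attention, lightweight_refine)
print(out, ratio)
```

No token is dropped: every position in the output is filled by one branch or the other, and only the subset sent to the global branch pays the quadratic attention cost.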

In attention-mechanism routing, Meta-Attention maintains three experts: full softmax attention, linear attention, and sliding-window local attention, with normalized costs […], […], and […]. The Bayesian Meta-Controller can run all experts under soft routing or perform uncertainty-gated hard routing based on posterior entropy (Ferrari, 27 May 2026). In recursive transformers, MoR’s token-choice router assigns each token an up-front recursion depth, so that only tokens whose assigned depth has not yet been exhausted remain active at a given recursion step; attention at that step is restricted to this active subset, and the framework supports both recursion-wise KV caching and recursive KV sharing (Bae et al., 14 Jul 2025).
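Per-token recursion depth can be pictured as a schedule of shrinking active sets. The sketch below assumes that a token with assigned depth `d` participates in recursion steps 1 through `d`; this activity condition and the helper names are assumptions for illustration, not MoR's stated rule.

```python
def active_tokens(depths, step):
    """Indices of tokens still active at a 1-based recursion step (assumed rule)."""
    return [i for i, depth in enumerate(depths) if depth >= step]

def recursion_schedule(depths):
    """Map each recursion step to the subset of tokens processed at that step."""
    return {step: active_tokens(depths, step)
            for step in range(1, max(depths) + 1)}

depths = [1, 3, 2, 1]  # hypothetical per-token depths from a router
schedule = recursion_schedule(depths)
routed_token_steps = sum(len(active) for active in schedule.values())
uniform_token_steps = len(depths) * max(depths)
print(schedule, routed_token_steps, uniform_token_steps)
```

Under this assumption the total number of token-steps equals the sum of assigned depths, which is smaller than running every token to the maximum depth whenever depths differ.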

In collaborative LLM systems, the routed object is usually the model invocation itself. CITER switches between SLM and LLM at each timestep while maintaining separate KV caches so that switching back does not require recomputing history (Zheng et al., 4 Feb 2025). The edge-device inference system based on the CITER router keeps the SLM on-device, deploys the token router on-device, and defers low-confidence tokens to a cloud LLM served under SGLang (She et al., 10 Apr 2025). FusionRoute uses a small linear routing head on a base LLM hidden state to select among experts at each token step, but it also adds the router LLM’s own logit vector to the selected expert’s logits,

[…]

so the router acts simultaneously as selector and complementary generator (Xiong et al., 8 Jan 2026).
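The "selector plus complementary generator" role can be sketched in a few lines. The plain element-wise sum follows the prose description above; any scaling or normalization the paper may apply is not modelled, and the names `fuse_step` and `routing_scores` are hypothetical.

```python
def fuse_step(routing_scores, expert_logits, router_llm_logits):
    """Pick one expert for this step and add the router LLM's own logits."""
    # Selection: argmax over the routing head's scores.
    k = max(range(len(routing_scores)), key=lambda i: routing_scores[i])
    # Complementary generation: element-wise sum with the router LLM's logits.
    fused = [e + r for e, r in zip(expert_logits[k], router_llm_logits)]
    return k, fused

expert_logits = [
    [1.0, 0.0, 0.0],  # expert 0
    [0.0, 2.0, 0.0],  # expert 1
]
k, fused = fuse_step([0.2, 0.8], expert_logits, router_llm_logits=[0.5, -0.5, 1.0])
print(k, fused)
```

The additive term lets the router LLM correct the selected expert on tokens where no fixed expert is adequate, which is the motivation the section attributes to the complementary logits.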

TARo also appears in multimodal and diffusion settings. MoS routes hidden states from an understanding tower into a generation tower, producing denoising-timestep- and token-dependent interactions. For each generation block, it uses $\epsilon$-greedy Top-$k$ selection over context states and constructs a fused state before projection into the generation block (Liu et al., 15 Nov 2025). WhiFlash routes each decoding step between an autoregressive drafter and a diffusion drafter, using either an entropy threshold on the target model’s next-token distribution or a learned MLP regressor that predicts the difference in acceptance lengths (Kwon et al., 5 Jun 2026).
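An entropy-threshold rule of the kind described for WhiFlash can be sketched as below. Which drafter sits on the low-entropy side, the threshold value, and the use of natural-log entropy are not stated in this section, so they are left as explicit arguments or assumptions; the function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def choose_drafter(target_probs, threshold, low_entropy_drafter, high_entropy_drafter):
    """Route one decoding step by comparing target-model entropy to a threshold."""
    if entropy(target_probs) < threshold:
        return low_entropy_drafter
    return high_entropy_drafter

confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
# The drafter labels below are placeholders; the assignment is an assumption.
step_a = choose_drafter(confident, 1.0, "drafter_A", "drafter_B")
step_b = choose_drafter(uncertain, 1.0, "drafter_A", "drafter_B")
print(step_a, step_b)
```

The rule needs only the target model's distribution at the current step, so it adds no trained parameters, in contrast to the learned regressor variant.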

A plausible implication is that TARo is orthogonal to the backbone family. The same routing granularity has been applied to transformers, SSM–Transformer hybrids, recursive transformers, multimodal diffusion models, and speculative decoding systems.

5. Empirical trade-offs and observed routing behavior

The central empirical claim of TARo papers is that token-wise conditional computation yields better speed–accuracy or cost–quality trade-offs than static allocation. On PubMedQA, MambaFormer reports BERTScore F1 […] at latency […], compared with BioBERT at […], Mamba at […], and hybrid static baselines at BERTScore […]. The paper states that MambaFormer is “24.4× faster than T5-Large” and that the token assignment distribution is 96.2% to EMamba and 3.8% to ET5, with ET5 handling “short, complex queries” and EMamba handling “long contexts” (Khan et al., 3 Jan 2026).

DynaMoE reports that, for a Small model on MNIST, the Descending schedule reaches 92.68% accuracy, compared with 91.35% for Uniform and 89.42% for the MLP Baseline. Cross-dataset gains reported in the paper are 89.42 $\to$ 92.68 on MNIST, 84.15 $\to$ 88.34 on Fashion-MNIST, and 62.38 $\to$ 67.85 on CIFAR-10; the paper also reports expert-usage entropy […] bits versus fixed Top-2 routing (Gülmez, 2 Mar 2026). MEMatte reports approximately 88% memory reduction and about 50% latency reduction on Composition-1K while keeping SAD close to ViTMatte: for ViT-S, 6.20 GB and 186.0 ms for ViTMatte-S versus 0.71 GB and 84.99 ms for MEMatte-S; for ViT-B, 12.53 GB and 340.2 ms versus 1.49 GB and 178.9 ms (Lin et al., 2024).
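The "approximately 88%" and "about 50%" figures can be checked directly against the per-model numbers quoted above. The short calculation below uses only those quoted values; the helper name `percent_reduction` is illustrative.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

# (ViTMatte, MEMatte) pairs as quoted in the text.
reported = {
    "ViT-S": {"memory_gb": (6.20, 0.71), "latency_ms": (186.0, 84.99)},
    "ViT-B": {"memory_gb": (12.53, 1.49), "latency_ms": (340.2, 178.9)},
}

reductions = {
    model: {metric: percent_reduction(*pair) for metric, pair in metrics.items()}
    for model, metrics in reported.items()
}
for model, metrics in reductions.items():
    print(model, {name: round(value, 1) for name, value in metrics.items()})
```

The memory reductions come out at roughly 88.5% (ViT-S) and 88.1% (ViT-B), and the latency reductions at roughly 54.3% and 47.4%, which is consistent with the rounded summary figures.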

In reasoning-time alignment, TARo improves MATH500 accuracy for Llama-3.1-8B plus a distilled reward model from 32.0% for the base model to 54.4%, compared with 49.2% for GenARM with fixed $\alpha$. The prompt-level routing ablation reaches only 33.2% on MATH500, versus 49.6% for token-level routing, and the qualitative analysis reports that high-$\alpha_t$ tokens are often “mathematical operators, ‘Step,’ ‘Compute,’ etc.” while low-$\alpha_t$ tokens are ordinary context words (Rai et al., 19 Mar 2026). TARPO reports improvements over GRPO across Qwen2.5 model sizes and on Llama-3.1-8B, with additional evidence that generated token count on OOD evaluation fell from […] to […] (Zhang et al., 4 Jun 2026).

In inference systems, WhiFlash reports, for Qwen3-8B overall average, acceptance length 5.12 for EAGLE-3, 5.29 for DFlash, 6.81 for Oracle-Token, 6.08 for WhiFlash-Entropy, and 6.26 for WhiFlash-Neural. Throughput speedups over AR decoding are 3.55× for DFlash, 3.82× for WhiFlash-Entropy, and 3.87× for WhiFlash-Neural; category-specific peak gains on Qwen3-8B reach +69.6% TPS over AR for Math and +37.3% TPS over diffusion for Chat (Kwon et al., 5 Jun 2026). The edge-device routing system reports a “60% performance gain on CommonsenseQA” using a 0.5B model on an M1 MacBook, with “under 7% of tokens generation uploaded to the large model in the cloud” (She et al., 10 Apr 2025).

Adapter and specialization routing also shows strong empirical effects. MoLoRA reports that Qwen3-1.7B + MoLoRA surpasses Qwen3-8B on GSM8K, MATH, BBH, and GPQA while being 4.7× smaller, and that on a mixed-modality workload with […] adapters, per-sequence routing takes 5.88 ms whereas per-token TARo takes 1.43 ms, a 4.1× speedup, with 5.5× end-to-end improvement under full CUDA graph capture (Shah et al., 16 Mar 2026). FusionRoute reports average cross-domain accuracy 0.566 on Llama-3, compared with 0.536 for fine-tuning, 0.502 for Collab, and 0.466 for sequence-level selection, while ablations show that removing complementary logits loses approximately 4–5 points of average accuracy (Xiong et al., 8 Jan 2026).

These results do not establish a single Pareto frontier for all TARo systems, but they do show that token-wise routing repeatedly outperforms fixed per-sequence or fixed per-layer choices when the routed resource is expensive and heterogeneous.

6. Misconceptions, limitations, and open directions

A common misconception is that TARo is synonymous with hard expert selection in MoE. The literature is broader. MambaFormer uses hard top-1 routing (Khan et al., 3 Jan 2026), but DynaMoE uses a token-dependent number of active experts with soft aggregation (Gülmez, 2 Mar 2026), TARo alignment uses a continuous $\alpha_t$ for logit interpolation (Rai et al., 19 Mar 2026), Meta-Attention uses Bayesian routing weights and uncertainty-gated hardening (Ferrari, 27 May 2026), and Informed Routing uses a binary execute-or-approximate policy rather than execute-or-skip (Han et al., 10 Oct 2025).

Another misconception is that token routing is always equivalent to permanent token pruning. MEMatte explicitly avoids permanent token loss: routed tokens go either to global attention or to LTRM, and the outputs are re-concatenated in original token order (Lin et al., 2024). MoR similarly routes tokens to different recursion depths rather than deleting them from computation entirely (Bae et al., 14 Jul 2025). AdaMoE’s “null experts” make the number of true experts per token adaptive, but the top-$k$ routing interface itself remains intact (Zeng et al., 2024).
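The null-expert idea can be shown with a toy top-$k$ router: the interface always fills $k$ slots, but slots won by null experts trigger no computation. The logits, the function name `route_with_null_experts`, and the tie-breaking by index order are assumptions for illustration only.

```python
def route_with_null_experts(true_logits, null_logits, k):
    """Fixed top-k over true + null experts; return the true experts to execute."""
    logits = list(true_logits) + list(null_logits)
    # Standard top-k by score (stable sort, so ties keep index order).
    top_k = sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    num_true = len(true_logits)
    # Slots filled by null experts consume zero FLOPs, so only true experts run.
    return [i for i in top_k if i < num_true]

easy_token = route_with_null_experts([2.0, 0.5, -1.0], null_logits=[1.0, 0.8], k=2)
hard_token = route_with_null_experts([2.0, 1.5, -1.0], null_logits=[0.0, 0.0], k=2)
print(easy_token, hard_token)
```

Both tokens pass through the same top-2 selection, yet the first executes one true expert and the second executes two, so the effective expert count adapts without changing the routing interface.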

The limitations reported in the papers are also heterogeneous. TARo for test-time alignment notes that current reward models are trained only on math stepwise preferences and suggests richer, domain-specific rewards for stronger OOD performance (Rai et al., 19 Mar 2026). WhiFlash depends on cache-management optimizations such as Lazy Catch-up and KV-only Prefill to keep high-frequency switching overhead below 7% of per-round latency (Kwon et al., 5 Jun 2026). The edge-device collaboration system reports that current ONNX-based SLM runtimes do not support incremental KV-cache injection, forcing full re-prefill on each deferred token (She et al., 10 Apr 2025). Meta-Attention identifies tuning burdens for the prior strength, the floor, and the hard-routing threshold, and reports a +6.3% relative PPL overhead for the Bayesian controller in its Phase 1 Tiny LM benchmark (Ferrari, 27 May 2026).

Informed Routing states an upper bound on sparsity for simple forecasters: beyond […]–50%, the LFF cannot capture complex transformations well, and at 70% sparsity both PPL and reasoning accuracy degrade sharply (Han et al., 10 Oct 2025). FusionRoute argues more fundamentally that pure expert-only routing is limited: unless strong global coverage assumptions hold, token-level routing based solely on fixed expert outputs cannot in general realize the optimal decoding policy, which motivates the addition of a trainable complementary generator (Xiong et al., 8 Jan 2026).

Open directions are stated explicitly across several papers. TARo alignment suggests extending routing to more than two experts and incorporating backtracking or global rollout checks (Rai et al., 19 Mar 2026). WhiFlash proposes extension to any heterogeneous pool of draft experts (Kwon et al., 5 Jun 2026). MoS points to dual-way routing between modalities and adaptation to audio, video, and early-fusion architectures (Liu et al., 15 Nov 2025). Informed Routing suggests richer forecasters, adaptive forecaster capacity, and multimodal or long-context extensions (Han et al., 10 Oct 2025). Taken together, these proposals indicate that TARo is evolving from a narrow MoE routing tactic into a general mechanism for token-wise allocation of heterogeneous computation, model specialization, and reasoning control.