
Token-level Adaptive Routing (TARo)

Updated 4 July 2026
  • Token-level Adaptive Routing (TARo) is a conditional computation mechanism that makes per-token routing decisions to allocate heterogeneous model resources based on token complexity.
  • It employs various routing strategies—including hard selection, soft aggregation, and Bayesian approaches—to dynamically choose among experts, adapters, or attention modules.
  • Empirical studies show that TARo improves speed–accuracy trade-offs and reduces computational cost, yielding significant latency and performance gains in diverse applications.


Token-level Adaptive Routing (TARo) designates a class of conditional-computation mechanisms in which the routing decision is made for individual tokens or generation steps rather than for an entire sequence, a fixed expert budget, or a uniform network path. In the recent literature, TARo appears in several distinct but structurally related forms: token-level dynamic routing to a linear-time State Space Model expert or a quadratic-time Transformer expert in clinical question answering, token-dependent variation of the number of active experts in Mixture-of-Experts layers, adaptive interpolation between frozen base-model and reward-model logits during test-time alignment, token-wise selection among multiple attention mechanisms, per-token recursion-depth assignment, per-token collaboration between small and large language models, and per-token switching between autoregressive and diffusion drafters or between discrete and latent reasoning modes (Khan et al., 3 Jan 2026, Gülmez, 2 Mar 2026, Rai et al., 19 Mar 2026, Ferrari, 27 May 2026, Zheng et al., 4 Feb 2025, Kwon et al., 5 Jun 2026, Zhang et al., 4 Jun 2026).

1. Scope and problem setting

Across these works, TARo addresses a common limitation of static computation allocation: different tokens within the same input may have different computational requirements, different preferred inductive biases, or different risk profiles. In MoE settings, the motivating restriction is fixed Top-$K$ routing, where exactly $K$ experts are activated per token; DynaMoE explicitly relaxes this assumption and allows the number of active experts per token to vary based on input complexity (Gülmez, 2 Mar 2026). In collaborative inference, CITER routes “non-critical tokens” to an SLM for efficiency and “critical tokens” to an LLM for generalization quality (Zheng et al., 4 Feb 2025). In test-time alignment, TARo replaces a fixed mixing coefficient $\alpha$ between a base model and a reward model with a learnable token-level router because the “optimal $\alpha$ varies across tasks, domains, and even decoding steps” (Rai et al., 19 Mar 2026).

A closely related line of work uses token routing even when the term TARo is not the sole organizing label. MEMatte inserts a router immediately before each global-attention layer and sends informative tokens to global attention while routing other tokens to a Lightweight Token Refinement Module, thereby reducing the quadratic burden of global self-attention on high-resolution image matting (Lin et al., 2024). AdaMoE permits a variable number of “true” experts per token by augmenting the expert set with “null experts,” which consume zero FLOPs and fill otherwise fixed top-$k$ slots (Zeng et al., 2024). MoLoRA makes a routing decision for every token over domain-specific LoRA adapters, while FusionRoute selects an expert at each decoding step and also adds a complementary logit from the router’s base LLM (Shah et al., 16 Mar 2026, Xiong et al., 8 Jan 2026).

| Representative setting | Routed object | Representative decision |
| Clinical hybrid MoE | EMamba vs ET5 | hard top-1 token routing (Khan et al., 3 Jan 2026) |
| Dynamic MoE | number of active experts | percentile-threshold token routing (Gülmez, 2 Mar 2026) |
| Test-time alignment | base vs reward guidance | adaptive $\alpha_t$ per token (Rai et al., 19 Mar 2026) |
| Efficient vision transformers | global attention vs LTRM | token-wise branch split (Lin et al., 2024) |
| Collaborative decoding | SLM vs LLM | token-level model choice (Zheng et al., 4 Feb 2025) |
| Speculative decoding | AR vs diffusion drafter | per-step paradigm selection (Kwon et al., 5 Jun 2026) |

This suggests that TARo is better understood as a routing principle than as a single architecture: the routed entities may be experts, model heads, adapters, attention mechanisms, drafters, or reasoning modes, but the granularity of control is consistently token-wise.

2. Routing functions and selection rules

A defining feature of TARo is that the routing function is evaluated from token-local or step-local state. In MambaFormer, the router input for token $x_i$ is the fused feature

$$R_i = [x_i; l_i; d] \in \mathbb{R}^{(d+2)},$$

where $x_i$ is the contextual embedding, $l_i \in [0,1]$ is normalized sequence length, and $d$ is a binary domain flag. A 2-layer MLP computes

[equation lost in extraction]

followed at inference by the hard routing decision

[equation lost in extraction]

The selected expert then processes the token, and the paper states explicitly that this is “equivalent to a ‘top-1’ selection per token” with “No soft mixture or […] selection” (Khan et al., 3 Jan 2026).
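The hard top-1 decision can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the ReLU activation, the hidden width, the random weights, and the helper names `init_router`, `router_logits`, and `route_top1` are hypothetical and are not taken from the MambaFormer paper. Only the fused input $[x_i; l_i; d]$, the two-layer MLP, and the single-expert selection follow the description above.

```python
import random

EXPERTS = ("EMamba", "ET5")

def init_router(in_dim, hidden_dim, n_experts, seed=0):
    """Random weights for a two-layer MLP (illustrative values only)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(n_experts)]
    return w1, [0.0] * hidden_dim, w2, [0.0] * n_experts

def router_logits(features, params):
    """Hidden ReLU layer (assumed activation), then one logit per expert."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, sum(w * f for w, f in zip(row, features)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def route_top1(x_i, l_i, domain_flag, params):
    """Fuse [x_i; l_i; d] and return the index of the single selected expert."""
    fused = list(x_i) + [l_i, float(domain_flag)]
    logits = router_logits(fused, params)
    return max(range(len(logits)), key=logits.__getitem__)

# A 4-dimensional toy embedding plus the two scalar features.
params = init_router(in_dim=4 + 2, hidden_dim=8, n_experts=len(EXPERTS))
choice = route_top1([0.1, -0.3, 0.7, 0.2], l_i=0.25, domain_flag=1, params=params)
print(EXPERTS[choice])
```

Because the decision is an argmax, exactly one expert runs per token, which is what makes the cost of the quadratic-time expert proportional to the fraction of tokens routed to it.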

DynaMoE generalizes the selection rule by replacing fixed Top-$K$ with a percentile threshold. For each token representation, the gate computes its expert scores, defines a percentile threshold over them, and selects

[equation lost in extraction]

with […]. The layer output is then a soft aggregation over the selected experts,

[equation lost in extraction]

and a minimum-activation step guarantees a minimum number of active experts (Gülmez, 2 Mar 2026).
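A percentile-threshold selection with soft aggregation might look like the following sketch. It assumes that the threshold is a percentile of the token's own softmax gate probabilities, that experts at or above the threshold are kept, and that the kept weights are renormalized; these details, and the names `select_experts` and `moe_output`, are illustrative assumptions rather than DynaMoE's published formulas.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def percentile(values, q):
    """q-th percentile (0..100) with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def select_experts(gate_logits, q, min_active=1):
    """Return {expert_index: renormalized weight} for experts at or above the threshold."""
    probs = softmax(gate_logits)
    tau = percentile(probs, q)
    chosen = [j for j, p in enumerate(probs) if p >= tau]
    if len(chosen) < min_active:  # minimum-activation fallback: take the top scorers
        ranked = sorted(range(len(probs)), key=lambda j: -probs[j])
        chosen = sorted(ranked[:min_active])
    norm = sum(probs[j] for j in chosen)
    return {j: probs[j] / norm for j in chosen}

def moe_output(x, experts, weights):
    """Soft aggregation: weighted sum of the selected experts' outputs."""
    return sum(w * experts[j](x) for j, w in weights.items())

experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
peaked = select_experts([2.0, 1.0, 0.0, -1.0], q=75)  # one dominant expert
flat = select_experts([0.1, 0.1, 0.0, 0.0], q=75)     # two tied leaders
print(len(peaked), len(flat))
```

The two toy tokens activate different numbers of experts under the same percentile, which is the property that distinguishes this rule from fixed Top-$K$.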

In LLM test-time alignment, TARo does not route a token to one expert or another; instead it routes the amount of guidance. The router consumes either full-logit concatenations or top-$k$ logits with index embeddings and produces the per-token mixing weight $\alpha_t$:

[equation lost in extraction]

This adaptive mixing weight defines

[equation lost in extraction]

or equivalently

[equation lost in extraction]

The route is therefore continuous rather than categorical, but it remains token-level and inference-time (Rai et al., 19 Mar 2026).
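The idea of a continuous, per-step route can be sketched as follows. The exact mixing formula is not reproduced here, so the sketch assumes a convex combination of base and reward logits and a logistic router over the concatenated logits; both are assumptions for illustration, and `router_alpha`, `mix_logits`, and `greedy_step` are hypothetical names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def router_alpha(base_logits, reward_logits, weights, bias):
    """Hypothetical router: logistic regression on the concatenated logits."""
    feats = list(base_logits) + list(reward_logits)
    return sigmoid(sum(w * f for w, f in zip(weights, feats)) + bias)

def mix_logits(base_logits, reward_logits, alpha_t):
    """Assumed mixing rule: (1 - alpha_t) * base + alpha_t * reward."""
    return [(1.0 - alpha_t) * b + alpha_t * r
            for b, r in zip(base_logits, reward_logits)]

def greedy_step(base_logits, reward_logits, weights, bias):
    """One decoding step: compute alpha_t, mix, and pick the argmax token."""
    alpha_t = router_alpha(base_logits, reward_logits, weights, bias)
    mixed = mix_logits(base_logits, reward_logits, alpha_t)
    return alpha_t, max(range(len(mixed)), key=mixed.__getitem__)

# With zero router weights alpha_t is 0.5, so both models contribute equally.
alpha_t, token = greedy_step([2.0, 0.0], [0.0, 3.0], [0.0] * 4, 0.0)
print(alpha_t, token)
```

Setting `alpha_t` to 0 recovers the base model and setting it to 1 recovers the reward-guided logits, so a fixed coefficient is the special case in which the router ignores its input.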

Other instantiations preserve the same logic with different routed objects. Meta-Attention forms a token feature, uses a 2-layer MLP to parameterize a Dirichlet posterior over routing weights, and can route each token to full softmax attention, linear attention, or sliding-window local attention (Ferrari, 27 May 2026). MoLoRA uses a two-layer MLP router, followed by TopK and temperature-scaled softmax over selected adapters (Shah et al., 16 Mar 2026). Informed Routing replaces execute-or-skip with execute-or-approximate: the router predicts whether a transformer unit should run in full or whether the token should go through a Lightweight Feature Forecaster (Han et al., 10 Oct 2025).

A plausible implication is that TARo should not be identified with a single routing algebra. The literature contains hard top-1 selection, variable-cardinality selection, weighted soft aggregation, binary action routing, and Bayesian posterior routing, all at token granularity.

3. Optimization objectives and training regimes

The optimization of TARo varies substantially across applications. In MambaFormer, the router parameters are trained while the experts are frozen. The loss is

[equation lost in extraction]

with

[equation lost in extraction]

[equation lost in extraction]

and

[equation lost in extraction]

using […], […], and […]. The paper states that this enforces that only […] of tokens choose ET5 on average, yielding the 3.8% observed in practice (Khan et al., 3 Jan 2026).

CITER formulates token routing as policy optimization in a finite-horizon MDP with actions […] and […]. Under the simplifying choices […] and […], the router is trained by minimizing token-wise cross-entropy on binary preference labels,

[equation lost in extraction]

The paper also introduces a three-case shortcut for reward estimation, and reports that “~80–90% of tokens fall into Cases 1–2,” reducing the cost of preference collection by […] (Zheng et al., 4 Feb 2025).
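Token-wise cross-entropy on binary labels has a standard form, sketched below. The label convention (1 means the token should go to the LLM) and the function name `token_bce` are assumptions for illustration; the sketch shows the generic binary cross-entropy, not necessarily CITER's exact objective.

```python
import math

def token_bce(router_probs, labels, eps=1e-12):
    """Mean binary cross-entropy over tokens.

    router_probs[i] is the router's probability of choosing the LLM for token i;
    labels[i] is 1 if the LLM is preferred and 0 if the SLM is preferred
    (assumed convention).
    """
    total = 0.0
    for p, y in zip(router_probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

good = token_bce([0.9, 0.1], [1, 0])  # router agrees with the labels
bad = token_bce([0.1, 0.9], [1, 0])   # router disagrees with the labels
print(good < bad)
```

An uninformative router that outputs 0.5 everywhere incurs a loss of $\ln 2$ per token, which is a convenient baseline when checking whether training has started to work.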

TARPO instead uses pure RL. A lightweight action head computes

[equation lost in extraction]

where […]. The LLM backbone and router are jointly optimized end-to-end with a shared group-relative advantage signal, and the minibatch objective combines token loss, action loss, and KL regularization (Zhang et al., 4 Jun 2026).

In Bayesian TARo, Meta-Attention places a compute-aware Dirichlet prior over routing weights and trains the amortized posterior using an ELBO objective. The KL term against the prior supplies a principled alternative to ad hoc balancing losses and also yields a posterior-entropy signal for soft-to-hard routing (Ferrari, 27 May 2026). By contrast, DynaMoE states that “No auxiliary balancing losses are required, but large-scale deployment may need capacity factors” (Gülmez, 2 Mar 2026).

In execute-or-approximate routing, training is two-stage. Informed Routing first freezes the base LLM and fits each Lightweight Feature Forecaster using

[equation lost in extraction]

then freezes the LFFs and trains routers with

[equation lost in extraction]

using Gumbel-Softmax to sample hard masks (Han et al., 10 Oct 2025). MoLoRA, in turn, trains LoRA and router parameters jointly with a standard language-modeling loss plus a Switch-style load-balancing auxiliary term (Shah et al., 16 Mar 2026).
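Sampling a hard mask with Gumbel-Softmax can be sketched in plain Python as below. The sketch covers only the forward sampling step; in a real training setup an autodiff framework would pass gradients through the soft sample (the straight-through trick), which plain Python cannot show. The two-way ordering (index 0 = execute, index 1 = approximate) and the name `gumbel_softmax` are assumptions.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Return (soft sample, hard one-hot mask) for one categorical decision."""
    noisy = []
    for z in logits:
        u = max(rng.random(), 1e-12)   # avoid log(0)
        g = -math.log(-math.log(u))    # Gumbel(0, 1) noise
        noisy.append((z + g) / tau)    # temperature-scaled perturbed logit
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1 if j == k else 0 for j in range(len(soft))]
    return soft, hard

rng = random.Random(0)
# Index 0 = execute the unit in full, index 1 = use the forecaster (assumed order).
soft, hard = gumbel_softmax([2.0, 0.0], tau=0.5, rng=rng)
print(hard)
```

Lower temperatures push the soft sample toward the one-hot mask, so the temperature controls how closely training-time behavior matches the hard decisions used at inference.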

These differences matter conceptually. Some TARo systems train only the router, some train router and backbone jointly, some freeze the routed experts, and some incorporate explicit compute terms in the loss while others rely on priors, schedules, or threshold calibration.

4. Architectural realizations

The architectural role of TARo depends on what part of the model is being conditionally activated. In hybrid MoE clinical assistance, MambaFormer places TARo over two experts: EMamba, “a linear-time State Space Model (SSM) expert,” and ET5, “a quadratic-time Transformer (T5-Large) expert.” The router uses contextual embeddings, normalized sequence length, and a domain-aware flag to decide which tokens should incur Transformer cost (Khan et al., 3 Jan 2026).

In efficient vision transformers, MEMatte inserts a Router plus a routing decision immediately before each global-attention layer. Tokens are split into a Global-Attention branch and a Lightweight Token Refinement Module branch, and after each block the two branches’ outputs are “simply re-concatenated (in the original token order)” before the next router. Training uses Batch-constrained Adaptive Token Routing (BATR), with a batch-level average routed ratio

[equation lost in extraction]

and a compression loss […] (Lin et al., 2024).
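The split-and-re-concatenate pattern can be sketched as follows. The score threshold, the toy branch functions, and the names `route_and_merge` and `batch_routed_ratio` are illustrative assumptions; the sketch only shows how two branches can process disjoint token subsets while the original token order is restored, and how a batch-level average routed ratio could be measured.

```python
def route_and_merge(tokens, scores, threshold, global_branch, light_branch):
    """Send high-score tokens to the global branch, the rest to the light branch,
    then write both outputs back at their original positions."""
    to_global = [i for i, s in enumerate(scores) if s >= threshold]
    to_light = [i for i, s in enumerate(scores) if s < threshold]
    out = [None] * len(tokens)
    for i, y in zip(to_global, global_branch([tokens[i] for i in to_global])):
        out[i] = y
    for i, y in zip(to_light, light_branch([tokens[i] for i in to_light])):
        out[i] = y
    return out, len(to_global) / len(tokens)

def batch_routed_ratio(per_sample_ratios):
    """Average fraction of tokens sent to the global branch across a batch."""
    return sum(per_sample_ratios) / len(per_sample_ratios)

# Toy branches: uppercase stands in for global attention, a mark for the light module.
merged, ratio = route_and_merge(
    ["a", "b", "c", "d"], [0.9, 0.2, 0.7, 0.1], 0.5,
    global_branch=lambda xs: [x.upper() for x in xs],
    light_branch=lambda xs: [x + "'" for x in xs],
)
print(merged, ratio)
```

No token is dropped: every position in the output is filled by exactly one branch, which is the difference between this style of routing and token pruning.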

In attention-mechanism routing, Meta-Attention maintains three experts: full softmax attention, linear attention, and sliding-window local attention, with normalized costs […], […], and […]. The Bayesian Meta-Controller can run all experts under soft routing or perform uncertainty-gated hard routing based on posterior entropy (Ferrari, 27 May 2026). In recursive transformers, MoR’s token-choice router assigns each token an up-front recursion depth, so that only tokens whose assigned depth has not yet been exhausted remain active at a given recursion; attention at that step is restricted to this active subset, and the framework supports both recursion-wise KV caching and recursive KV sharing (Bae et al., 14 Jul 2025).
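Per-token recursion depth can be illustrated with a small sketch. It assumes that a token with assigned depth `d` is active at recursion steps 1 through `d`, and it uses a toy shared block; the function name `run_recursions` and these conventions are assumptions for illustration, not MoR's implementation.

```python
def run_recursions(states, depths, shared_block, max_depth):
    """Apply one shared block repeatedly; token i participates in steps 1..depths[i]."""
    states = list(states)
    for step in range(1, max_depth + 1):
        active = [i for i, d in enumerate(depths) if d >= step]
        if not active:
            break
        # The block only sees the active subset, mirroring restricted attention.
        updated = shared_block([states[i] for i in active])
        for i, s in zip(active, updated):
            states[i] = s
    return states

# Toy block: add 1 to every active state, so the result counts applied recursions.
final = run_recursions([0, 0, 0], depths=[1, 3, 2],
                       shared_block=lambda xs: [x + 1 for x in xs], max_depth=3)
print(final)
```

Tokens that exit early keep their last state rather than being removed, so later layers still see a full-length sequence.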

In collaborative LLM systems, the routed object is usually the model invocation itself. CITER switches between SLM and LLM at each timestep while maintaining separate KV caches so that switching back does not require recomputing history (Zheng et al., 4 Feb 2025). The edge-device inference system based on the CITER router keeps the SLM on-device, deploys the token router on-device, and defers low-confidence tokens to a cloud LLM served under SGLang (She et al., 10 Apr 2025). FusionRoute uses a small linear routing head on a base LLM hidden state to select among experts at each token step, but it also adds the router LLM’s own logit vector to the selected expert’s logits,

[equation lost in extraction]

so the router acts simultaneously as selector and complementary generator (Xiong et al., 8 Jan 2026).
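The selector-plus-generator behavior can be sketched as below. The unweighted sum of logits, the toy shapes, and the name `fusion_step` are assumptions made for illustration; the sketch shows only that the router picks one expert through a linear head and then contributes its own logits to the final distribution.

```python
def fusion_step(hidden, head_weights, router_logits, expert_logits):
    """Pick one expert with a linear head, then add the router LLM's own logits."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in head_weights]
    k = max(range(len(scores)), key=scores.__getitem__)
    fused = [e + r for e, r in zip(expert_logits[k], router_logits)]
    token = max(range(len(fused)), key=fused.__getitem__)
    return k, fused, token

k, fused, token = fusion_step(
    hidden=[1.0, 0.0],
    head_weights=[[0.2, 0.0], [0.9, 0.0]],            # one row per expert
    router_logits=[0.0, 0.0, 1.0],                    # router LLM's own next-token logits
    expert_logits=[[3.0, 0.0, 0.0], [0.0, 1.0, 0.5]],
)
print(k, fused, token)
```

In this toy case the selected expert alone would emit token 1, while the fused logits emit token 2, which shows how the complementary term can change the decoded token.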

TARo also appears in multimodal and diffusion settings. MoS routes hidden states from an understanding tower into a generation tower, producing denoising-timestep- and token-dependent interactions. For each generation block, it uses $\epsilon$-greedy Top-$k$ selection over context states and constructs a fused state before projection into the generation block (Liu et al., 15 Nov 2025). WhiFlash routes each decoding step between an autoregressive drafter and a diffusion drafter, using either an entropy threshold on the target model’s next-token distribution or a learned MLP regressor that predicts the difference in acceptance lengths (Kwon et al., 5 Jun 2026).
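An entropy-threshold router of the kind described for speculative decoding can be sketched as follows. The text does not say which drafter is used on which side of the threshold, so the sketch leaves that mapping as parameters; the names `entropy` and `choose_drafter`, the placeholder labels, and the threshold value are all hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def choose_drafter(target_probs, threshold, below="drafter_A", above="drafter_B"):
    """Pick a drafter from the target model's next-token entropy.

    Which real drafter corresponds to `below` and `above` is an assumption the
    caller must supply; it is not specified in the text.
    """
    return below if entropy(target_probs) < threshold else above

confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(choose_drafter(confident, 0.5), choose_drafter(uncertain, 0.5))
```

The rule needs no training and only one pass over the distribution per step, which is the main appeal of the entropy variant compared with a learned regressor.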

A plausible implication is that TARo is orthogonal to the backbone family. The same routing granularity has been applied to transformers, SSM–Transformer hybrids, recursive transformers, multimodal diffusion models, and speculative decoding systems.

5. Empirical trade-offs and observed routing behavior

The central empirical claim of TARo papers is that token-wise conditional computation yields better speed–accuracy or cost–quality trade-offs than static allocation. On PubMedQA, MambaFormer reports BERTScore F1 […] at latency […], compared with BioBERT at […], Mamba at […], and hybrid static baselines at BERTScore […]. The paper states that MambaFormer is “24.4× faster than T5-Large” and that the token assignment distribution is 96.2% to EMamba and 3.8% to ET5, with ET5 handling “short, complex queries” and EMamba handling “long contexts” (Khan et al., 3 Jan 2026).

DynaMoE reports that, for a Small model on MNIST, the Descending schedule reaches 92.68% accuracy, compared with 91.35% for Uniform and 89.42% for the MLP Baseline. Cross-dataset gains reported in the paper are 89.42% to 92.68% on MNIST, 84.15% to 88.34% on Fashion-MNIST, and 62.38% to 67.85% on CIFAR-10; the paper also reports expert-usage entropy […] bits versus fixed Top-2 routing (Gülmez, 2 Mar 2026). MEMatte reports approximately 88% memory reduction and about 50% latency reduction on Composition-1K while keeping SAD close to ViTMatte: for ViT-S, 6.20 GB and 186.0 ms for ViTMatte-S versus 0.71 GB and 84.99 ms for MEMatte-S; for ViT-B, 12.53 GB and 340.2 ms versus 1.49 GB and 178.9 ms (Lin et al., 2024).
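The MEMatte reductions can be checked directly from the quoted memory and latency figures. The arithmetic below uses only the numbers stated above and computes the relative reduction as one minus the ratio of the MEMatte value to the ViTMatte value.

```latex
\begin{align*}
\text{ViT-S memory:}\quad  & 1 - \frac{0.71}{6.20}  \approx 0.885 \\
\text{ViT-S latency:}\quad & 1 - \frac{84.99}{186.0} \approx 0.543 \\
\text{ViT-B memory:}\quad  & 1 - \frac{1.49}{12.53} \approx 0.881 \\
\text{ViT-B latency:}\quad & 1 - \frac{178.9}{340.2} \approx 0.474
\end{align*}
```

Both memory reductions are close to 88%, and the two latency reductions (about 54% and 47%) bracket the stated figure of about 50%.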

In reasoning-time alignment, TARo improves MATH500 accuracy for Llama-3.1-8B plus a distilled reward model from 32.0% for the base model to 54.4%, compared with 49.2% for GenARM with fixed $\alpha$. The prompt-level routing ablation reaches only 33.2% on MATH500, versus 49.6% for token-level routing, and the qualitative analysis reports that high-$\alpha_t$ tokens are often “mathematical operators, ‘Step,’ ‘Compute,’ etc.” while low-$\alpha_t$ tokens are ordinary context words (Rai et al., 19 Mar 2026). TARPO reports improvements over GRPO across Qwen2.5 model sizes and on Llama-3.1-8B, with additional evidence that generated token count on OOD evaluation fell from […] to […] (Zhang et al., 4 Jun 2026).

In inference systems, WhiFlash reports, for Qwen3-8B overall average, acceptance length 5.12 for EAGLE-3, 5.29 for DFlash, 6.81 for Oracle-Token, 6.08 for WhiFlash-Entropy, and 6.26 for WhiFlash-Neural. Throughput speedups over AR decoding are 3.55× for DFlash, 3.82× for WhiFlash-Entropy, and 3.87× for WhiFlash-Neural; category-specific peak gains on Qwen3-8B reach +69.6% TPS over AR for Math and +37.3% TPS over diffusion for Chat (Kwon et al., 5 Jun 2026). The edge-device routing system reports a “60% performance gain on CommonsenseQA” using a 0.5B model on an M1 MacBook, with “under 7% of tokens generation uploaded to the large model in the cloud” (She et al., 10 Apr 2025).

Adapter and specialization routing also shows strong empirical effects. MoLoRA reports that Qwen3-1.7B + MoLoRA surpasses Qwen3-8B on GSM8K, MATH, BBH, and GPQA while being 4.7× smaller, and that on a mixed-modality workload with […] adapters, per-sequence routing takes 5.88 ms whereas per-token TARo takes 1.43 ms, a 4.1× speedup, with 5.5× end-to-end improvement under full CUDA graph capture (Shah et al., 16 Mar 2026). FusionRoute reports average cross-domain accuracy 0.566 on Llama-3, compared with 0.536 for fine-tuning, 0.502 for Collab, and 0.466 for sequence-level selection, while ablations show that removing complementary logits loses approximately 4–5 points of average accuracy (Xiong et al., 8 Jan 2026).

These results do not establish a single Pareto frontier for all TARo systems, but they do show that token-wise routing repeatedly outperforms fixed per-sequence or fixed per-layer choices when the routed resource is expensive and heterogeneous.

6. Misconceptions, limitations, and open directions

A common misconception is that TARo is synonymous with hard expert selection in MoE. The literature is broader. MambaFormer uses hard top-1 routing (Khan et al., 3 Jan 2026), but DynaMoE uses a token-dependent number of active experts with soft aggregation (Gülmez, 2 Mar 2026), TARo alignment uses a continuous $\alpha_t$ for logit interpolation (Rai et al., 19 Mar 2026), Meta-Attention uses Bayesian routing weights and uncertainty-gated hardening (Ferrari, 27 May 2026), and Informed Routing uses a binary execute-or-approximate policy rather than execute-or-skip (Han et al., 10 Oct 2025).

Another misconception is that token routing is always equivalent to permanent token pruning. MEMatte explicitly avoids permanent token loss: routed tokens go either to global attention or to LTRM, and the outputs are re-concatenated in original token order (Lin et al., 2024). MoR similarly routes tokens to different recursion depths rather than deleting them from computation entirely (Bae et al., 14 Jul 2025). AdaMoE’s “null experts” make the number of true experts per token adaptive, but the top-$k$ routing interface itself remains intact (Zeng et al., 2024).
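The null-expert idea can be sketched in a few lines. The sketch assumes that true experts occupy indices below `n_true`, that null experts occupy the remaining indices, and that selection is a plain top-$k$ over all scores; the name `route_with_null_experts` and the toy scores are hypothetical.

```python
def route_with_null_experts(scores, k, n_true):
    """Top-k over true and null experts; return only the true experts that were picked.

    Indices 0..n_true-1 are true experts; indices >= n_true are null experts
    that cost nothing, so picking one simply leaves a slot unused.
    """
    top = sorted(range(len(scores)), key=lambda j: -scores[j])[:k]
    return sorted(j for j in top if j < n_true)

# Three true experts followed by two null experts, with a fixed k of 2.
easy_token = route_with_null_experts([0.9, 0.1, 0.2, 0.8, 0.7], k=2, n_true=3)
hard_token = route_with_null_experts([0.9, 0.8, 0.1, 0.2, 0.3], k=2, n_true=3)
print(easy_token, hard_token)
```

The interface still selects exactly `k` slots for every token, yet the easy token ends up running one true expert and the hard token two.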

The limitations reported in the papers are also heterogeneous. TARo for test-time alignment notes that current reward models are trained only on math stepwise preferences and suggests richer, domain-specific rewards for stronger OOD performance (Rai et al., 19 Mar 2026). WhiFlash depends on cache-management optimizations such as Lazy Catch-up and KV-only Prefill to keep high-frequency switching overhead below 7% of per-round latency (Kwon et al., 5 Jun 2026). The edge-device collaboration system reports that current ONNX-based SLM runtimes do not support incremental KV-cache injection, forcing full re-prefill on each deferred token (She et al., 10 Apr 2025). Meta-Attention identifies tuning burdens for the prior strength, floor, and hard-routing threshold, and reports a +6.3% relative PPL overhead for the Bayesian controller in its Phase 1 Tiny LM benchmark (Ferrari, 27 May 2026).

Informed Routing states an upper bound on sparsity for simple forecasters: beyond […]–50%, the LFF cannot capture complex transformations well, and at 70% sparsity both PPL and reasoning accuracy degrade sharply (Han et al., 10 Oct 2025). FusionRoute argues more fundamentally that pure expert-only routing is limited: unless strong global coverage assumptions hold, token-level routing based solely on fixed expert outputs cannot in general realize the optimal decoding policy, which motivates the addition of a trainable complementary generator (Xiong et al., 8 Jan 2026).

Open directions are stated explicitly across several papers. TARo alignment suggests extending routing to more than two experts and incorporating backtracking or global rollout checks (Rai et al., 19 Mar 2026). WhiFlash proposes extension to any heterogeneous pool of draft experts (Kwon et al., 5 Jun 2026). MoS points to dual-way routing between modalities and adaptation to audio, video, and early-fusion architectures (Liu et al., 15 Nov 2025). Informed Routing suggests richer forecasters, adaptive forecaster capacity, and multimodal or long-context extensions (Han et al., 10 Oct 2025). Taken together, these proposals indicate that TARo is evolving from a narrow MoE routing tactic into a general mechanism for token-wise allocation of heterogeneous computation, model specialization, and reasoning control.
