MaxScore Routing: Optimal Wireless & MoE

Updated 18 April 2026
  • MaxScore routing is a framework that optimizes relay selection in wireless networks and token assignments in MoE, ensuring near-optimal performance under partial CSI and capacity constraints.
  • In wireless ad hoc networks, it leverages partial channel state information and Monte Carlo techniques to maximize the Asymptotic Density of Rate-Progress, outperforming classic heuristics by up to 180%.
  • For sparse Mixture-of-Experts, the method casts token-to-expert assignment as a min-cost maximum-flow problem using a SoftTopk operator to achieve balanced, differentiable routing without token dropping.

Maximum Score Routing (MaxScore) refers to a family of methods for optimal assignment or relay selection developed independently for two distinct research domains: (1) distributed routing in random wireless ad hoc networks under partial channel state information (CSI), and (2) sparse Mixture-of-Experts (MoE) deep learning architectures, where routing tokens to experts is cast as a flow optimization problem. Despite the divergent contexts, both approaches share the defining principle of maximizing a rigorously defined “score” function per candidate, subject to structural constraints, yielding provably optimal or near-optimal results compared to classical heuristics.

1. MaxScore in Wireless Ad Hoc Networks: System Model and Score Definition

In the context of wireless ad hoc networks, MaxScore is formalized as the statistically optimal (SO) one-hop routing mechanism based strictly on partial CSI, specifically the instantaneous locations and channel gains to all neighbor nodes within a routing zone of fixed radius. The model assumes nodes are distributed per a homogeneous planar Poisson point process (PPP) of density $\lambda$, operating under a slotted ALOHA MAC with per-slot transmission probability $p_{\mathrm{tx}}$ and Rayleigh fading. The instantaneous received rate over the link $j \rightarrow i$ is modeled as $R_{i,j}=B\log_2(1+S_{i,j}/(J_{i,j}+\sigma_v^2))$, where $S_{i,j} = \rho r_{i,j}^{-\alpha} W_{i,j}$ is the instantaneous signal power, and $J_{i,j}$ models aggregate interference from concurrently transmitting PPP nodes.
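As a small worked instance of the rate model above, the following sketch evaluates the formula for one link. The parameter names and numeric values are illustrative assumptions, as is the reading of $B$ as bandwidth, $\rho$ as a transmit-power constant, $\alpha$ as the path-loss exponent, and $\sigma_v^2$ as noise power.

```python
import math

def link_rate(bandwidth, rho, distance, alpha, fading_gain, interference, noise_power):
    """Evaluate R = B * log2(1 + S / (J + sigma_v^2)) with S = rho * r^(-alpha) * W."""
    signal = rho * distance ** (-alpha) * fading_gain
    sinr = signal / (interference + noise_power)
    return bandwidth * math.log2(1.0 + sinr)

# Illustrative numbers only: S = 2^(-4) * 1.6 = 0.1 and J + noise = 0.1,
# so the SINR is 1 and the rate is B * log2(2) = B.
rate = link_rate(bandwidth=1.0, rho=1.0, distance=2.0, alpha=4.0,
                 fading_gain=1.6, interference=0.05, noise_power=0.05)
print(rate)
```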

The optimal routing decision at each transmitting node is governed by the maximization of the Asymptotic Density of Rate-Progress (ADORP), which acts as a rigorous proxy for aggregate throughput-distance. Formally, for known local CSI $\mathcal{M}_0$, the MaxScore (SO) scheme selects the neighbor $i^*$ maximizing

$$m_{\mathrm{SO}}(i,\mathcal{M}_0) = r_{i,0}\, \mathbb{E}_{J_{i,0}|\mathcal{M}_0}\left[\log_2\left(1 + \frac{\rho\,r_{i,0}^{-\alpha}W_{i,0}}{J_{i,0} + \sigma_v^2}\right)\right],$$

where $r_{i,0}$ is the distance and $W_{i,0}$ the fading gain. Each decision is independent and optimal because routing does not influence the spatial statistics of future interferences (Richter et al., 2018).
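The expectation in the SO score generally has to be estimated numerically. The sketch below is a simplified Monte Carlo estimate under stated assumptions, not the procedure of Richter et al.: active interferers are drawn from a thinned PPP of density $\lambda p_{\mathrm{tx}}$ in an annulus around the receiver, fading power gains are exponential with unit mean (Rayleigh fading), and conditioning on $\mathcal{M}_0$ is ignored beyond the known $r_{i,0}$ and $W_{i,0}$. The names `so_score`, `sample_poisson`, `r_min`, and `r_max` are hypothetical.

```python
import math
import random

def sample_poisson(mean, rng):
    """Poisson sample via Knuth's multiplication method (fine for small means)."""
    limit, count, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count

def so_score(r, w, lam, p_tx, alpha=4.0, rho=1.0, noise=1e-3,
             r_min=0.1, r_max=20.0, n_samples=500, rng=None):
    """Estimate r * E[log2(1 + S / (J + noise))] by sampling the interference J."""
    rng = rng or random.Random()
    # Under slotted ALOHA, active transmitters form a PPP of density lam * p_tx.
    mean_count = lam * p_tx * math.pi * (r_max ** 2 - r_min ** 2)
    signal = rho * r ** (-alpha) * w
    total = 0.0
    for _ in range(n_samples):
        interference = 0.0
        for _ in range(sample_poisson(mean_count, rng)):
            # Uniform point in the annulus: sample the squared radius uniformly.
            d = math.sqrt(rng.uniform(r_min ** 2, r_max ** 2))
            interference += rho * d ** (-alpha) * rng.expovariate(1.0)
        total += math.log2(1.0 + signal / (interference + noise))
    return r * total / n_samples

rng = random.Random(0)
print(so_score(r=2.0, w=1.0, lam=0.05, p_tx=0.2, rng=rng))
```

With `lam=0` the estimate collapses to the interference-free value `r * log2(1 + S / noise)`, which gives a quick sanity check on the sampler.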

2. MaxScore for Sparse Mixture-of-Experts: Flow Modeling and SoftTopk

In large-scale MoE models, MaxScore Routing reinterprets the token-to-expert assignment as a minimum-cost maximum-flow problem on a bipartite graph. Each batch consists of a set of tokens and a set of experts, with a fixed token capacity per expert. For each token, affinity scores over the experts are computed using a differentiable SoftTopk operator, which modifies the top-$k$ scoring mechanism to yield smoothly assignable, balanced gradients.

The routing problem is formulated as a min-cost max-flow program over binary assignment variables, each indicating whether a given token is routed to a given expert. The cost of each assignment is defined in terms of the corresponding token-expert affinity, and the constraints enforce per-token and per-expert assignment restrictions (Dong et al., 18 Aug 2025).
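One plausible way to write such a program is sketched below. This is an illustrative form, not necessarily the exact formulation of Dong et al.; all notation is assumed here: $N$ tokens, $E$ experts, per-expert capacity $C$, $k$ experts per token, affinity $s_{ij}$ of token $i$ for expert $j$, and binary variable $x_{ij}$. Taking the edge cost as $c_{ij} = -s_{ij}$ turns the maximization into a min-cost flow.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{i=1}^{N}\sum_{j=1}^{E} s_{ij}\,x_{ij} \\
\text{s.t.}\quad & \sum_{j=1}^{E} x_{ij} \le k \quad \text{for every token } i, \\
& \sum_{i=1}^{N} x_{ij} \le C \quad \text{for every expert } j, \\
& x_{ij} \in \{0,1\}.
\end{aligned}
```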

SoftTopk produces soft affinity distributions that are differentiable, enabling efficient training by propagating gradients to the gating mechanism.

3. Algorithms and Implementation

(a) Wireless Routing: Optimal and Suboptimal Schemes

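The SO (MaxScore) routing performs, for each neighbor $i$ in the routing zone:

  • Compute the signal term from the known distance $r_{i,0}$ and fading gain $W_{i,0}$,
  • Numerically estimate the conditional expectation $\mathbb{E}_{J_{i,0}|\mathcal{M}_0}[\log_2(\cdot)]$, typically via Monte Carlo over feasible interferences,
  • Score $m_{\mathrm{SO}}(i,\mathcal{M}_0)$ as the product of $r_{i,0}$ and this expectation,
  • Select $i^* = \arg\max_i m_{\mathrm{SO}}(i,\mathcal{M}_0)$ (Richter et al., 2018).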
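The per-neighbor loop above reduces to an argmax over scores. The sketch below keeps the scoring rule pluggable; `select_relay` and `fixed_interference_score` are hypothetical names, and the fixed-interference score is only a deterministic stand-in for the Monte Carlo expectation so that the example stays short.

```python
import math

def select_relay(neighbors, score_fn):
    """Return (best_id, best_score); neighbors maps id -> (distance, fading_gain)."""
    best_id, best_score = None, -math.inf
    for node_id, (r, w) in neighbors.items():
        score = score_fn(r, w)
        if score > best_score:
            best_id, best_score = node_id, score
    return best_id, best_score

def fixed_interference_score(r, w, alpha=4.0, rho=1.0, interference=0.01, noise=0.001):
    """Stand-in score: progress r times the rate at one assumed interference level."""
    return r * math.log2(1.0 + rho * r ** (-alpha) * w / (interference + noise))

# Hypothetical routing zone: the nearest node has the best rate, the farthest
# offers the most progress, and the score trades the two off.
neighbors = {"a": (1.0, 0.5), "b": (2.0, 1.2), "c": (4.0, 0.9)}
best, score = select_relay(neighbors, fixed_interference_score)
print(best, round(score, 3))
```

In this example the middle neighbor wins: its rate is lower than that of the nearest node, but the extra progress more than compensates.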

Low-complexity variants, namely Bound-Optimal (BO), Narrow-Knowledge SO (NSO), and Narrow-Knowledge BO (NBO), replace the numerically estimated expectation with increasingly coarse deterministic bounds or lookups on interference, trading a marginal loss (≤4%) in aggregate performance for reduced computational requirements.
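To illustrate how a deterministic bound can stand in for the expectation, the sketch below uses Jensen's inequality. Because $\log_2(1 + S/(J+\sigma_v^2))$ is convex in $J$, evaluating it at the mean interference gives a lower bound on the expected rate. This is a generic bound chosen for illustration and is not claimed to be the BO, NSO, or NBO bound of Richter et al. The mean interference follows from Campbell's theorem for interferers in an assumed annulus $[r_{\min}, r_{\max}]$ with unit-mean fading and $\alpha > 2$. Function and parameter names are hypothetical.

```python
import math

def mean_interference(lam, p_tx, alpha=4.0, rho=1.0, r_min=0.1, r_max=20.0):
    """E[J] = 2*pi*lam*p_tx*rho * (r_min^(2-alpha) - r_max^(2-alpha)) / (alpha - 2)."""
    return (2.0 * math.pi * lam * p_tx * rho
            * (r_min ** (2.0 - alpha) - r_max ** (2.0 - alpha)) / (alpha - 2.0))

def jensen_score(r, w, lam, p_tx, alpha=4.0, rho=1.0, noise=1e-3,
                 r_min=0.1, r_max=20.0):
    """Deterministic lower bound on r * E[log2(1 + S / (J + noise))]."""
    signal = rho * r ** (-alpha) * w
    j_bar = mean_interference(lam, p_tx, alpha, rho, r_min, r_max)
    return r * math.log2(1.0 + signal / (j_bar + noise))

bound = jensen_score(r=2.0, w=1.0, lam=0.05, p_tx=0.2)
print(bound)
```

The bound costs one logarithm per neighbor instead of a sampling loop, which is the kind of trade the low-complexity variants make, at the price of some looseness when nearby interferers dominate the mean.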

(b) MaxScore MoE Routing: Two-Stage Flow Assignment

MaxScore for MoE adopts a two-pass routing process:

  • Compute token-expert affinities using SoftTopk,
  • First-stage assignment selects the top-1 expert for each token greedily,
  • For residual capacity and unassigned slots, apply a Sinkhorn approximation to assign the remaining top-scoring token-expert pairs, maintaining exact or near-exact expert fill without dropping tokens,
  • The process is fully differentiable and optimized for tensorized GPU computation (Dong et al., 18 Aug 2025).
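The two-pass procedure above can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the authors' implementation: the input scores are treated as already computed affinities, the greedy pass visits tokens in order of their strongest preference, the second pass is standard Sinkhorn scaling of a Gibbs kernel `exp(score / eps)`, and the result of the second pass is a soft plan rather than a rounded assignment. The names `two_stage_route`, `capacity`, `k`, and `eps` are hypothetical, and the sketch assumes that the number of tokens times `k` equals the number of experts times `capacity`.

```python
import math

def two_stage_route(scores, capacity, k, n_iters=200, eps=1.0):
    """Greedy top-1 pass, then Sinkhorn scaling for the remaining k-1 picks.

    Returns (top1, plan): top1[i] is a hard expert index for token i and
    plan[i][j] is the soft weight of token i on expert j from the second pass.
    """
    n_tok, n_exp = len(scores), len(scores[0])
    load, top1 = [0] * n_exp, [None] * n_tok
    # Stage 1: each token takes its best expert that still has free capacity.
    for i in sorted(range(n_tok), key=lambda t: -max(scores[t])):
        for j in sorted(range(n_exp), key=lambda e: -scores[i][e]):
            if load[j] < capacity:
                top1[i] = j
                load[j] += 1
                break
    # Stage 2: kernel entries are zeroed where stage 1 already made a pick.
    kernel = [[0.0 if j == top1[i] else math.exp(scores[i][j] / eps)
               for j in range(n_exp)] for i in range(n_tok)]
    row_target = [k - 1] * n_tok                          # picks still owed per token
    col_target = [capacity - load[j] for j in range(n_exp)]  # free slots per expert
    u, v = [1.0] * n_tok, [1.0] * n_exp
    for _ in range(n_iters):
        for i in range(n_tok):
            d = sum(kernel[i][j] * v[j] for j in range(n_exp))
            u[i] = row_target[i] / d if d > 0 else 0.0
        for j in range(n_exp):
            d = sum(kernel[i][j] * u[i] for i in range(n_tok))
            v[j] = col_target[j] / d if d > 0 else 0.0
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(n_exp)] for i in range(n_tok)]
    return top1, plan

# Hypothetical affinities: 4 tokens, 4 experts, k = 2, capacity = 2.
scores = [[0.9, 0.1, 0.3, 0.2],
          [0.8, 0.4, 0.1, 0.3],
          [0.7, 0.2, 0.6, 0.1],
          [0.6, 0.5, 0.2, 0.4]]
top1, plan = two_stage_route(scores, capacity=2, k=2)
print(top1)
```

Here every token prefers expert 0, so the greedy pass fills that expert with the two strongest tokens and redirects the others. The Sinkhorn pass then spreads the remaining picks so that each expert's free slots are filled. For `k > 2` individual plan entries are not bounded by 1, so a practical version would need an additional rounding or capping step.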

This approach contrasts with prior MoE routing (e.g., GShard, DropLess), which may require expert padding or token dropping, compromising either computational or model efficiency.

4. Theoretical Optimality and Analysis

In wireless networks, MaxScore routing is proven optimal for maximizing ADORP under partial CSI, given the spatial independence properties of the PPP and local-only CSI. The optimality is preserved under the specific interference-statistics-invariant property of the setting, as routing choices do not alter the global distribution of interfering transmitters. Suboptimal schemes are analytically demonstrated to remain within 4% of SO’s performance and consistently outperform traditional geographic or threshold-based greedy heuristics by 30–180% in simulated throughput (Richter et al., 2018).

In MoE networks, formulating routing as a maximum-flow optimization with SoftTopk ensures that all tokens are assigned, expert loads are balanced, and computational efficiency is maximized without recourse to token dropping or padding. These properties are achieved while maintaining hardware throughput and memory usage comparable to, or slightly below, existing approaches (Dong et al., 18 Aug 2025).

5. Empirical Results

Simulation and training results in both domains support the MaxScore methodologies:

  • Wireless Ad Hoc Networks: Low-complexity BO, NSO, and NBO routing schemes trail SO by ≤4% in ADORP, with BO nearly matching SO (≈1% loss). All MaxScore variants surpass classical geographic or nearest-neighbor routing by a substantial margin. The optimal transmission probability $p_{\mathrm{tx}}$ is empirically identified near 0.2 for typical parameter regimes, and small routing zones degrade performance across all methods while preserving MaxScore's lead (Richter et al., 2018).
  • Sparse MoE: On LLaMA-style Transformer architectures trained on C4 (65B tokens), MaxScore consistently achieves higher evaluation scores and faster convergence than GShard, DropLess, and DeepSeek baselines, including higher average accuracy than GShard. In one reported sparsity setting, MaxScore reaches 44.21% vs. 42.81% for GShard. MaxScore fully eliminates token dropping and achieves exact or near-exact expert utilization, with identical model FLOPs and negligible extra routing overhead (Dong et al., 18 Aug 2025).
  • Ablation Analysis: The combined use of SoftTopk and flow-based allocation in MaxScore demonstrates superadditive performance gains over their isolated use, confirming the necessity of both innovations for optimality (Dong et al., 18 Aug 2025).

6. Comparative Summary of MaxScore Methodologies

| Domain | Routing Objective | Score Definition | Complexity | Performance Gap (vs. optimal) |
|---|---|---|---|---|
| Wireless Ad Hoc (SO) | Maximize ADORP | $m_{\mathrm{SO}}(i,\mathcal{M}_0)$ | High | Optimal |
| Wireless (BO, NSO, NBO) | Maximize lower-bounded ADORP | Deterministic/integral bounds or lookup | Low–Medium | ≤4% |
| MoE MaxScore | Maximize affinity allocation | SoftTopk affinities (min-cost flow) | Medium | Near-optimal |

7. Broader Impact and Insights

MaxScore reframes the routing problem in both wireless networking and neural architectures as an explicit optimization of a performance metric under information and resource constraints. In wireless networks, it leverages statistical independence and partial CSI to realize a distributed protocol with guaranteed throughput benefits. In MoE systems, MaxScore resolves inefficiencies due to expert capacity constraints and gradient imbalance, implementing a scalable, differentiable, and token-efficient mechanism for deep learning architectures. A plausible implication is that analogous maximum-score formulations may be transferrable to other constrained resource allocation tasks where local information or differentiable structure is essential for scalable optimization.
