
Remoe: Differentiable & Serverless MoE

Updated 28 December 2025
  • Remoe is a dual concept incorporating both a fully differentiable ReMoE architecture and a serverless MoE inference system for elastic LLM deployment.
  • ReMoE employs ReLU-based routing with adaptive L1 regularization to enable continuous gradient flow and improved specialization across experts.
  • The Remoe system optimizes resource usage through SPS, MMP, and joint memory-replica techniques, achieving significant cost and latency reductions.

Remoe

Remoe refers to two distinct concepts in recent academic literature: (1) a fully differentiable Mixture-of-Experts (MoE) architecture termed “ReMoE” for efficient neural network inference, and (2) a heterogeneous, serverless-enabled MoE inference system termed “Remoe” tailored for LLMs deployed in elastic computing environments. Both center on MoE neural network architectures, but their technical contributions differ: the former targets differentiability and scalability via ReLU-based routing, while the latter introduces cost-efficient, algorithmically optimized MoE serving in distributed, serverless contexts.

1. Mixture-of-Experts in Neural Networks

A Mixture-of-Experts (MoE) neural network contains $E$ parallel “expert” subnets in selected layers. For each input token $x$, only a subset (typically $k \ll E$) of experts receives $x$’s representation, with a gating/router function determining which experts participate. This architecture enables aggressive sparsification and scale-out of compute and parameters with sub-linear runtime growth relative to model size.

Conventional MoE designs route by a TopK+Softmax mechanism: the router $G(x)$ computes logits for each expert, selects the $k$ largest, normalizes them via Softmax, and gates those experts. However, TopK gating introduces sharp non-differentiabilities and unpredictable token-to-expert assignments as $E$ increases, destabilizing training and impairing scalability (Wang et al., 2024).
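A minimal NumPy sketch of the TopK+Softmax gating just described (an illustrative reimplementation, not code from either paper):

```python
import numpy as np

def topk_softmax_route(logits, k):
    """Conventional TopK+Softmax gating: keep each token's k largest
    router logits, renormalize them with a softmax, and zero out the
    rest. `logits` has shape (tokens, E)."""
    gates = np.zeros_like(logits)
    # column indices of the k largest logits for each token (row)
    topk = np.argpartition(logits, -k, axis=1)[:, -k:]
    for t in range(logits.shape[0]):
        sel = logits[t, topk[t]]
        w = np.exp(sel - sel.max())       # numerically stable softmax over the kept k
        gates[t, topk[t]] = w / w.sum()
    return gates

logits = np.array([[2.0, -1.0, 0.5, 1.5]])
gates = topk_softmax_route(logits, k=2)   # only experts 0 and 3 get nonzero weight
```

The hard cut at the $k$-th largest logit is exactly where the non-differentiability lives: an infinitesimal change in the logits can flip which experts are selected.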

2. ReMoE: ReLU-Routing Differentiable Mixture-of-Experts

ReMoE (Fully Differentiable Mixture-of-Experts with ReLU Routing) replaces the non-differentiable TopK router with a ReLU-based linear router. For each token $t$ in layer $l$, with hidden state $x^l_t \in \mathbb{R}^d$, routing weights are computed as:

$$R(x^l_t) = \operatorname{ReLU}(x^l_t W_l + b_l) \in \mathbb{R}^E$$

where $W_l \in \mathbb{R}^{d \times E}$ and $b_l \in \mathbb{R}^E$. Each expert $e$ is activated only if its pre-activation is positive, providing sparsity without discrete selection (Wang et al., 2024).
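The router above fits in a few lines of NumPy (symbol names follow the equation; this is an illustration, not the paper's implementation):

```python
import numpy as np

def relu_route(x, W, b):
    """ReMoE-style router: routing weights are ReLU(x W + b).
    An expert is active for a token iff its pre-activation is
    positive, so sparsity emerges without discrete TopK selection.
    x: (tokens, d), W: (d, E), b: (E,)."""
    return np.maximum(x @ W + b, 0.0)

rng = np.random.default_rng(0)
d, E = 8, 4
x = rng.normal(size=(3, d))
W = rng.normal(size=(d, E))
b = np.zeros(E)
weights = relu_route(x, W, b)
active_per_token = (weights > 0).sum(axis=1)  # may differ from token to token
```

Because ReLU is continuous (and differentiable away from zero), gradients flow through the whole gating path, and the number of active experts per token is free to vary.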

To keep the computational budget under control, an adaptive $L_1$ regularizer penalizes the sum of router activations, with a feedback-driven multiplier $\lambda_i$ adjusted so the empirical sparsity matches the target:

$$\lambda_{i+1} = \lambda_i \times \alpha^{\operatorname{sign}\left(\left(1 - \tfrac{k}{E}\right) - S_i\right)}, \quad S_i = 1 - \frac{1}{L\,T\,E} \sum_{l,t,e} \mathbf{1}\{R(x^l_t)_e > 0\}$$

A load-balancing term reweights expert activation penalties, avoiding expert collapse. Crucially, this architecture supports continuous gradient flow throughout the gating path—unlike TopK’s hard routing, which introduces training discontinuities at token-expert assignment boundaries.
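The multiplier update can be transcribed directly (variable names are ours; the load-balancing reweighting is omitted):

```python
import numpy as np

def update_lambda(lmbda, alpha, k, E, activation_mask):
    """Feedback update for the adaptive L1 multiplier: with alpha > 1,
    raise lambda when the network is denser than the target sparsity
    1 - k/E, and lower it when it is sparser. `activation_mask` is a
    boolean (L, T, E) array of the indicators {R(x^l_t)_e > 0}."""
    S_i = 1.0 - activation_mask.mean()   # empirical sparsity
    target = 1.0 - k / E
    return lmbda * alpha ** np.sign(target - S_i)
```

With `alpha` slightly above 1, this acts like a bang-bang controller that keeps average expert usage near $k$.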

Dynamic compute allocation arises because each token can route to a variable number of experts: common (semantically simple) tokens often activate fewer experts, rare tokens more. The $L_1$ penalty holds the average expert usage close to $k$. The result is (statistically) similar FLOPs per token as TopK, but with improved domain specialization and throughput (Wang et al., 2024).

Empirical validation covers models with $N = 182$M–978M parameters and $E = 4$–128 experts. ReMoE consistently beats TopK in zero-shot accuracy (+0.36%), scales to higher expert counts (validation loss keeps declining with $E$ up to 128, where TopK saturates), and shows stronger specialization (experts cluster by data domain, such as ArXiv or GitHub tokens). ReMoE’s throughput is within $\pm 3\%$ of TopK-MoE, with no practical speed loss (Wang et al., 2024).

3. Remoe: Serverless-Efficient MoE Inference System

Remoe is a system-level solution targeting high-efficiency, low-cost MoE inference in serverless computing. In this context, MoE LLMs incur prohibitive serverless costs if naively deployed: charged memory-time products scale with the total expert parameter footprint, even when only a tiny subset of experts are used by any input (Liu et al., 21 Dec 2025).

Remoe introduces a heterogeneous deployment: non-expert modules (e.g., tokenizers, attention, gating) execute on GPU; frequently used (“local”) experts are hosted in the same CPU process; and low-frequency (“remote”) experts are split into minimal serverless functions, each on CPU. Remote functions are invoked on demand and in parallel. Incoming tokens arrive at the main function, undergo gating, and are split between local and remote experts as needed per layer. The system merges outputs and continues decoding per-token (Liu et al., 21 Dec 2025).
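As a toy illustration of the per-token local/remote split (a hypothetical helper, not the system's actual API):

```python
def dispatch(gated_experts, local_set):
    """Split the experts selected by the gate for one token into those
    served in the main CPU/GPU process ('local') and those that would
    trigger on-demand serverless function calls ('remote')."""
    local = [e for e in gated_experts if e in local_set]
    remote = [e for e in gated_experts if e not in local_set]
    return local, remote

# experts 0-3 are hot and co-located; 7 and 11 would be remote invocations
local, remote = dispatch([2, 7, 11], local_set={0, 1, 2, 3})
```

Remote invocations run in parallel, and the main function merges their outputs with the local experts' before decoding continues.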

Cost/latency optimization is achieved by three core techniques:

3.1 Similar Prompts Searching (SPS)

SPS predicts expert activation for new requests by matching prompt embeddings against a tree of historical prompts using soft-cosine similarity. The predicted per-layer, per-expert activation matrix $\widetilde{S}$ drives proactive memory/resource allocation, avoiding unnecessary cold starts and memory overhead (Liu et al., 21 Dec 2025).
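The matching step can be sketched as follows (the tree index is simplified to a linear scan, and all names are illustrative):

```python
import numpy as np

def soft_cosine(a, b, M):
    """Soft cosine similarity under a feature-similarity matrix M
    (M = identity recovers ordinary cosine similarity)."""
    num = a @ M @ b
    den = np.sqrt((a @ M @ a) * (b @ M @ b))
    return num / den

def predict_activation(query_emb, history_embs, history_acts, M):
    """SPS-style lookup sketch: return the stored per-layer, per-expert
    activation matrix of the most similar historical prompt."""
    sims = [soft_cosine(query_emb, h, M) for h in history_embs]
    return history_acts[int(np.argmax(sims))]
```

Soft cosine generalizes cosine similarity by letting distinct but related embedding features count as partially matching via $M$.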

3.2 Main Model Pre-Allocation (MMP)

MMP uses probabilistic upper-bounds (Hoeffding’s) to determine the minimal main model memory capacity that meets Time-to-First-Token (TTFT) and Time-per-Output-Token (TPOT) service-level objectives (SLOs) under adversarial expert usage (Liu et al., 21 Dec 2025).
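To illustrate the kind of bound involved (a generic one-sided Hoeffding bound, not the paper's exact formulation): if per-request expert activations are modeled as $n$ independent Bernoulli draws with predicted rate $p$, then with probability at least $1-\delta$ the active-expert count stays below the value computed here, and memory can be provisioned to that level rather than to the worst case.

```python
import math

def hoeffding_expert_bound(n, p_hat, delta):
    """With probability >= 1 - delta, the number of active experts among
    n independent Bernoulli(p_hat) activations does not exceed
    n*p_hat + sqrt(n * ln(1/delta) / 2)  (one-sided Hoeffding bound)."""
    return n * p_hat + math.sqrt(n * math.log(1.0 / delta) / 2.0)

# e.g. 64 experts, 12.5% predicted activation rate, 1% SLO-violation budget
cap = hoeffding_expert_bound(64, 0.125, 0.01)
```

Tightening the violation budget `delta` raises the provisioned capacity, which is the cost/SLO trade-off MMP navigates.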

3.3 Joint Memory and Replica Optimization

To minimize cost while meeting SLOs, memory allocation to remote expert functions is solved via Lagrangian duality and KKT conditions, and replicas are assigned using the Longest Processing Time (LPT) heuristic for balanced load (Liu et al., 21 Dec 2025). All memory and replica assignments are re-optimized for each batch.
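The LPT step is standard and easy to sketch (the Lagrangian/KKT memory solve is not reproduced here):

```python
import heapq

def lpt_assign(loads, m):
    """Longest Processing Time heuristic: process items in decreasing
    load order, always placing the next item on the currently
    least-loaded of m replicas. Returns a replica index per item."""
    heap = [(0.0, r) for r in range(m)]          # (total load, replica id)
    heapq.heapify(heap)
    assignment = [None] * len(loads)
    for i in sorted(range(len(loads)), key=lambda j: -loads[j]):
        total, r = heapq.heappop(heap)           # least-loaded replica
        assignment[i] = r
        heapq.heappush(heap, (total + loads[i], r))
    return assignment

loads = [7.0, 5.0, 4.0, 3.0, 1.0]
assignment = lpt_assign(loads, 2)   # balances to 10.0 on each replica
```

LPT gives a provably good approximation to the NP-hard balanced-assignment problem at negligible per-batch cost.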

Implementation uses Kubernetes pods, with careful placement to minimize inter-container latency, and gRPC for token payload transfer. The system supports large LLMs (e.g., Deepseek-v2-lite, GPT2-moe) and avoids external storage for intermediate results.

4. Empirical Performance and Scalability

In controlled benchmarks, Remoe achieves up to 57% cost reduction and 47% lower cold-start latency compared to state-of-the-art CPU, GPU, and naïve mixed baselines. SPS prediction outperforms traditional clustering by 10–20% in Jensen–Shannon divergence. Scaling analysis shows stable costs and negligible optimization delay relative to inference time (<0.2 s for SPS tree building, <0.1 s for joint resource solve) (Liu et al., 21 Dec 2025).

This architecture allows real-world serverless deployment of MoE LLMs without incurring full model memory or cold-start overhead, enabling bursty, multi-tenant LLM inference.

5. Comparative Table: ReMoE vs. Remoe

| Aspect | ReMoE (ReLU-MoE) (Wang et al., 2024) | Remoe (Serverless MoE) (Liu et al., 21 Dec 2025) |
|---|---|---|
| Routing mechanism | ReLU-based, continuous, fully differentiable | Gating as in model; SPS for activation prediction |
| Deployment context | Neural network training and inference | Large-scale serverless inference environment |
| Key novelty | No TopK/Softmax; $L_1$-regulated sparsity | Main/remote memory partitioning; SPS/MMP/joint opt |
| Scalability | Proven for $E$ up to 128, stable training | Serverless scale-out, parallel remote experts |
| Gains vs. baselines | +0.3–0.6% accuracy, superior capacity scaling | 57% lower cost, 47% lower cold-start latency |

6. Context and Impact

The ReMoE architecture provides a simple, effective means of making MoE routers fully differentiable, thereby stabilizing expert assignments and opening the path to large expert panels and fine-grained task/domain specialization without flip-induced training instability. The Remoe system operationalizes MoEs for serverless LLM inference at scale, resolving the hitherto unsolved challenge of cost, cold-start, and resource balancing for sparse activation models in elastic environments.

A plausible implication is that these techniques, by addressing core challenges in both model design and distributed inference, will underpin future deployment standards for large-scale, cost-sensitive and dynamically specialized MoE-based AI systems across cloud-native and resource-constrained platforms.
