Remoe: Differentiable & Serverless MoE
- Remoe covers two related contributions: the fully differentiable ReMoE architecture and a serverless Remoe inference system for elastic LLM deployment.
- ReMoE employs ReLU-based routing with adaptive L1 regularization to enable continuous gradient flow and improved specialization across experts.
- The Remoe system optimizes resource usage through Similar Prompts Searching (SPS), Main Model Pre-Allocation (MMP), and joint memory–replica optimization, achieving substantial cost and latency reductions.
Remoe
Remoe refers to two distinct but related concepts in recent academic literature: (1) a fully differentiable Mixture-of-Experts (MoE) architecture, termed “ReMoE”, for efficient neural network training and inference, and (2) a heterogeneous, serverless-enabled MoE inference system, termed “Remoe”, tailored to LLMs deployed in elastic computing environments. Both center on MoE neural network architectures but differ in their technical contributions: the former targets differentiability and scalability via ReLU-based routing, while the latter introduces cost-efficient, algorithmically optimized MoE serving in distributed, serverless contexts.
1. Mixture-of-Experts in Neural Networks
A Mixture-of-Experts (MoE) neural network contains parallel “expert” subnetworks in selected layers. For each input token $x$, only a small subset (typically $k = 1$ or $2$) of the experts receives $x$'s representation, with a gating/router function determining which experts participate. This architecture enables aggressive sparsification and scale-out of compute and parameters, with sub-linear runtime growth relative to model size.
Conventional MoE designs route via a TopK+Softmax mechanism: the router computes logits for each expert, selects the $k$ largest, normalizes them via Softmax, and gates those experts. However, TopK gating introduces sharp non-differentiabilities, and token-to-expert assignments become unpredictable as the number of experts increases, destabilizing training and impairing scalability (Wang et al., 2024).
2. ReMoE: ReLU-Routing Differentiable Mixture-of-Experts
ReMoE (Fully Differentiable Mixture-of-Experts with ReLU Routing) replaces the non-differentiable TopK router with a ReLU-based linear router. For each token $x$ in a given layer, the routing weight of expert $i$ is computed as

$$R_i(x) = \mathrm{ReLU}(x^{\top} w_i), \qquad i = 1, \dots, E,$$

where $w_i$ is the router weight vector of expert $i$ and $E$ is the number of experts. Each expert is activated only if its pre-activation $x^{\top} w_i$ is positive, providing sparsity without discrete selection (Wang et al., 2024).
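A minimal NumPy sketch of this routing step (illustrative only: the dimensions, initialization, and linear toy experts are assumptions, not the authors' implementation):

```python
import numpy as np

def relu_router(x, W_r):
    """ReLU routing: one linear score per expert, zeroed when non-positive."""
    return np.maximum(x @ W_r, 0.0)          # (E,) continuous, sparse gates

def remoe_layer(x, W_r, experts):
    """Sum expert outputs weighted by ReLU gates; inactive experts are skipped."""
    gates = relu_router(x, W_r)
    out = np.zeros_like(x)
    for i, g in enumerate(gates):
        if g > 0:                            # sparsity without TopK selection
            out += g * experts[i](x)
    return out, gates

rng = np.random.default_rng(0)
d, E = 8, 4
W_r = rng.normal(size=(d, E))
# toy linear experts; real experts are MLPs
experts = [(lambda W: (lambda t: t @ W))(rng.normal(size=(d, d))) for _ in range(E)]
y, gates = remoe_layer(rng.normal(size=d), W_r, experts)
```

Because the gate values are plain ReLU outputs, gradients flow through every routing decision, which is the property TopK selection lacks.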
To keep the computational budget under control, an adaptive $L_1$ regularizer penalizes the sum of router activations,

$$\mathcal{L}_{\mathrm{reg}} = \lambda \sum_{x} \sum_{i=1}^{E} R_i(x),$$

where the feedback-driven multiplier $\lambda$ is raised when too many experts are active and lowered when too few are, so that the empirical sparsity matches the target of $k$ active experts per token on average.
A load-balancing term reweights expert activation penalties, avoiding expert collapse. Crucially, this architecture supports continuous gradient flow throughout the gating path—unlike TopK’s hard routing, which introduces training discontinuities at token-expert assignment boundaries.
Dynamic compute allocation arises because each token can route to a variable number of experts: common (semantically simple) tokens often activate fewer experts, rare tokens more. The $L_1$ penalty holds the average expert usage close to the target $k$. The result is (statistically) similar FLOPs per token as TopK routing, but with improved domain specialization and throughput (Wang et al., 2024).
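The feedback loop on the regularization multiplier can be sketched as follows; the multiplicative update rule, the constant `alpha`, and the toy sparsity response are illustrative assumptions, not the paper's exact schedule:

```python
def update_lambda(lam, sparsity, target, alpha=1.2):
    """Multiplicative feedback: too dense -> raise the L1 coefficient,
    too sparse -> lower it."""
    return lam * alpha if sparsity < target else lam / alpha

def simulate_sparsity(lam):
    """Toy stand-in for training: a larger coefficient yields sparser routing."""
    return lam / (lam + 1.0)

lam, target = 0.5, 0.875        # e.g. aim for 1 of 8 experts active per token
for _ in range(200):
    lam = update_lambda(lam, simulate_sparsity(lam), target)
# lam settles where the observed sparsity oscillates tightly around the target
```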
Empirical validation covers models up to 978M parameters with up to 128 experts. ReMoE consistently beats TopK routing in zero-shot accuracy (+0.36%), in scaling to higher expert counts (validation loss keeps declining up to 128 experts, where TopK saturates), and in specialization (experts cluster by data domain, such as ArXiv or GitHub tokens). ReMoE's throughput is within 3% of TopK-MoE, with no practical speed loss (Wang et al., 2024).
3. Remoe: Serverless-Efficient MoE Inference System
Remoe is a system-level solution targeting high-efficiency, low-cost MoE inference in serverless computing. In this setting, MoE LLMs incur prohibitive serverless costs if naively deployed: the charged memory–time product scales with the total expert parameter footprint, even when only a tiny subset of experts is used by any given input (Liu et al., 21 Dec 2025).
Remoe introduces a heterogeneous deployment: non-expert modules (e.g., tokenizers, attention, gating) execute on GPU; frequently used (“local”) experts are hosted in the same CPU process; and low-frequency (“remote”) experts are split into minimal serverless functions, each on CPU. Remote functions are invoked on demand and in parallel. Incoming tokens arrive at the main function, undergo gating, and are split between local and remote experts as needed per layer. The system merges outputs and continues decoding per-token (Liu et al., 21 Dec 2025).
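The per-token split between local and remote experts described above can be sketched in schematic Python; `compute_local` and `invoke_remote` are hypothetical stand-ins for the in-process expert and the serverless function call:

```python
def moe_dispatch(x, gates, local, compute_local, invoke_remote):
    """Merge expert outputs for one token: active experts run in-process
    if 'local', otherwise via a remote (serverless) invocation.

    gates         : {expert_id: weight}; zero or absent means inactive
    local         : set of expert ids hosted in the main process
    compute_local : fn(expert_id, x) -> output vector
    invoke_remote : fn(expert_id, x) -> output vector
    """
    out = [0.0] * len(x)
    for eid, w in gates.items():
        if w <= 0:
            continue
        y = compute_local(eid, x) if eid in local else invoke_remote(eid, x)
        out = [o + w * yi for o, yi in zip(out, y)]   # weighted merge
    return out

# toy experts: the local one doubles the input, the remote one negates it
compute_local = lambda eid, x: [2 * v for v in x]
invoke_remote = lambda eid, x: [-v for v in x]
merged = moe_dispatch([1.0, 2.0], {0: 0.5, 2: 0.5}, {0},
                      compute_local, invoke_remote)
```

In the real system the remote calls go over gRPC and run in parallel; here they are ordinary function calls for clarity.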
Cost/latency optimization is achieved by three core techniques:
3.1 Similar Prompts Searching (SPS)
SPS predicts expert activation for new requests by matching prompt embeddings against a tree of historical prompts using soft-cosine similarity. The predicted per-layer, per-expert activation matrix drives proactive memory and resource allocation, avoiding unnecessary cold starts and memory overhead (Liu et al., 21 Dec 2025).
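A deliberately simplified sketch of the lookup: a flat history with plain cosine similarity stands in for the paper's prompt tree and soft-cosine measure.

```python
import math

def cosine(a, b):
    """Plain cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def predict_activation(query_emb, history):
    """history: list of (embedding, activation_matrix) from past prompts.
    Returns the per-layer, per-expert activation matrix of the most
    similar historical prompt."""
    best = max(history, key=lambda h: cosine(query_emb, h[0]))
    return best[1]

history = [
    ([1.0, 0.0], [[1, 0], [0, 1]]),   # layer x expert activation pattern
    ([0.0, 1.0], [[0, 1], [1, 0]]),
]
pred = predict_activation([0.9, 0.1], history)
```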
3.2 Main Model Pre-Allocation (MMP)
MMP uses probabilistic upper bounds (via Hoeffding's inequality) to determine the minimal main-model memory capacity that meets Time-to-First-Token (TTFT) and Time-per-Output-Token (TPOT) service-level objectives (SLOs) under adversarial expert usage (Liu et al., 21 Dec 2025).
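As an illustration of how a Hoeffding-style bound yields a capacity estimate (the mapping from usage samples to memory capacity here is a simplification of the paper's SLO model, not its exact formulation):

```python
import math

def hoeffding_capacity(samples, value_range, delta=0.01):
    """Smallest capacity c such that, by Hoeffding's inequality,
    P(mean usage > c) <= delta for i.i.d. usage bounded in [0, value_range]."""
    n = len(samples)
    mean = sum(samples) / n
    slack = value_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return mean + slack

# toy: number of experts activated per request, bounded by E = 8
usage = [2, 3, 2, 4, 3, 2, 3, 3, 2, 4] * 10   # 100 observations
cap = hoeffding_capacity(usage, value_range=8, delta=0.01)
```

Provisioning for `cap` rather than the worst case (all 8 experts) is what lets the main function stay small while still meeting the latency SLOs with high probability.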
3.3 Joint Memory and Replica Optimization
To minimize cost while meeting SLOs, the optimal memory allocation to remote expert functions is solved via Lagrangian duality and KKT conditions; replicas are assigned using the Longest Processing Time (LPT) heuristic for balanced load (Liu et al., 21 Dec 2025). All memory and replica assignments are re-optimized for each batch.
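LPT itself is a standard scheduling heuristic; a sketch with toy per-expert loads (the load values are illustrative):

```python
import heapq

def lpt_assign(loads, n_replicas):
    """Longest Processing Time: take jobs in descending load order and
    always give the next job to the currently least-loaded replica."""
    heap = [(0.0, r) for r in range(n_replicas)]   # (total load, replica id)
    heapq.heapify(heap)
    assignment = {r: [] for r in range(n_replicas)}
    for job, load in sorted(loads.items(), key=lambda kv: -kv[1]):
        total, r = heapq.heappop(heap)             # least-loaded replica
        assignment[r].append(job)
        heapq.heappush(heap, (total + load, r))
    return assignment

# toy: per-expert invocation load balanced over 2 replicas
loads = {"e0": 7.0, "e1": 5.0, "e2": 4.0, "e3": 3.0}
plan = lpt_assign(loads, 2)
```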
Implementation uses Kubernetes pods, with careful placement to minimize inter-container latency, and gRPC for token payload transfer. The system supports large LLMs (e.g., Deepseek-v2-lite, GPT2-moe) and avoids external storage for intermediate results.
4. Empirical Performance and Scalability
In controlled benchmarks, Remoe achieves up to 57% cost reduction and 47% lower cold-start latency compared to state-of-the-art CPU, GPU, and naïve mixed baselines. SPS prediction outperforms traditional clustering by 10–20% in Jensen–Shannon divergence. Scaling analysis shows stable costs and negligible optimization delay relative to inference time (<0.2 s for SPS tree building, <0.1 s for joint resource solve) (Liu et al., 21 Dec 2025).
This architecture allows real-world serverless deployment of MoE LLMs without incurring full model memory or cold-start overhead, enabling bursty, multi-tenant LLM inference.
5. Comparative Table: ReMoE vs. Remoe
| Aspect | ReMoE (ReLU-MoE) (Wang et al., 2024) | Remoe (Serverless MoE) (Liu et al., 21 Dec 2025) |
|---|---|---|
| Routing mechanism | ReLU-based, continuous, fully differentiable | Gating as in model; SPS for activation prediction |
| Deployment context | Neural network training and inference | Large-scale serverless inference environment |
| Key novelty | No TopK/Softmax, $L_1$-regulated sparsity | Main/remote memory partitioning, SPS/MMP/joint opt |
| Scalability | Proven for up to 128 experts, stable training | Serverless scale-out, parallel remote experts |
| Gains vs. baselines | +0.3–0.6% accuracy, superior capacity scaling | 57% lower cost, 47% lower cold-start latency |
6. Context and Impact
The ReMoE architecture provides a simple, effective means of making MoE routers fully differentiable, thereby stabilizing expert assignments and opening the path to large expert panels and fine-grained task/domain specialization without flip-induced training instability. The Remoe system operationalizes MoEs for serverless LLM inference at scale, resolving the hitherto unsolved challenge of cost, cold-start, and resource balancing for sparse activation models in elastic environments.
A plausible implication is that these techniques, by addressing core challenges in both model design and distributed inference, will underpin future deployment standards for large-scale, cost-sensitive and dynamically specialized MoE-based AI systems across cloud-native and resource-constrained platforms.