SliceMoE: Fine-Grained MoE Approaches
- SliceMoE is a suite of techniques that decomposes token embeddings into slices for targeted routing and specialization in mixture-of-experts models.
- It applies slice-level regularization and cross-slice dropout to promote balanced expert utilization and robust training compared to conventional token-level methods.
- SliceMoE integrates bit-sliced caching and quantization strategies to accelerate on-device inference while reducing energy usage and maintaining high accuracy.
SliceMoE encompasses a family of architectural and systems techniques in Mixture-of-Experts (MoE) models characterized by fine-grained “slice”-level decomposition for improved routing, balanced computation, targeted specialization, and efficient deployment. This article surveys the principal variants: (1) slice-routing for sub-token feature partitioning in Transformers (Vejendla, 5 Oct 2025), (2) slice-aware mixture of attentions for supervised subset targeting (Wang et al., 2021), and (3) bit-sliced expert caching for efficient inference under hardware constraints (Choi et al., 15 Dec 2025). All focus on overcoming the inefficiencies and bottlenecks that arise with conventional token-level MoE at model, optimization, and runtime levels.
1. Fine-Grained Routing via Embedding Slices
Traditional token-level MoE assigns whole token embeddings to experts, leading to capacity bottlenecks, pathologies in expert load, and constrained specialization. SliceMoE (Vejendla, 5 Oct 2025) instead introduces feature-wise routing, representing each token embedding $x \in \mathbb{R}^{d}$ as a concatenation of $S$ contiguous, equal-length slices,
$$x = \big[\, x^{(1)} \,\|\, x^{(2)} \,\|\, \dots \,\|\, x^{(S)} \,\big], \qquad x^{(s)} \in \mathbb{R}^{d/S}.$$
A lightweight shared router MLP computes per-slice logits over the experts, from which the top-$k$ experts for each slice are selected via masking and re-normalization of the gates. Each expert processes a gate-weighted version of its assigned slices,
$$y^{(s)} = \sum_{e \in \mathrm{TopK}(s)} g_{s,e}\, E_e\big(x^{(s)}\big),$$
and the per-slice outputs are reassembled into the full token output $y = \big[\, y^{(1)} \,\|\, \dots \,\|\, y^{(S)} \,\big]$. This design delocalizes routing: within a batch of $B$ tokens, up to $B \cdot S$ independent routing decisions are made (one per slice), mitigating the expert underutilization and bottlenecks typical at the token level.
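The following is a minimal PyTorch-style sketch of slice-level routing consistent with the description above; the module layout, dimensions, hyperparameters, and the dense masked dispatch are illustrative assumptions, not the reference implementation of (Vejendla, 5 Oct 2025).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceMoELayer(nn.Module):
    """Illustrative slice-level MoE layer: each of the S slices of a token
    embedding is routed independently to its top-k experts."""

    def __init__(self, d_model=512, num_slices=8, num_experts=16, top_k=2):
        super().__init__()
        assert d_model % num_slices == 0
        self.S, self.k = num_slices, top_k
        self.d_slice = d_model // num_slices
        # Shared lightweight router applied to every slice.
        self.router = nn.Linear(self.d_slice, num_experts)
        # Experts operate on slice-sized inputs.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(self.d_slice, 4 * self.d_slice),
                          nn.GELU(),
                          nn.Linear(4 * self.d_slice, self.d_slice))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                  # x: (B, d_model)
        B = x.shape[0]
        slices = x.view(B, self.S, self.d_slice)           # (B, S, d_slice)
        logits = self.router(slices)                       # (B, S, E)
        topv, topi = logits.topk(self.k, dim=-1)           # top-k experts per slice
        gates = F.softmax(topv, dim=-1)                    # re-normalized gates
        out = torch.zeros_like(slices)
        # Dense masked dispatch for clarity: each expert is applied to every
        # slice and masked; production kernels instead gather only the
        # assigned slices into batched/fused GEMMs.
        for e, expert in enumerate(self.experts):
            mask = (topi == e)                             # (B, S, k)
            if not mask.any():
                continue
            weight = (gates * mask).sum(dim=-1, keepdim=True)  # (B, S, 1)
            out = out + weight * expert(slices)
        return out.view(B, -1)                             # reassembled tokens
```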
2. Slice-Level Regularization and Training Protocols
To promote balanced expert utilization and robust specialization, SliceMoE introduces two key regularizers:
- Slice-level capacity loss penalizes variance in per-expert slice assignments using a squared coefficient of variation, $\mathcal{L}_{\text{cap}} = \mathrm{CV}(f_1, \dots, f_E)^2 = \mathrm{Var}(f_e)\,/\,\mathrm{Mean}(f_e)^2$, where $f_e$ is the fraction of slices routed to expert $e$ in a batch, promoting balanced load across experts at the slice level.
- Cross-slice dropout randomly zeros a fraction of the selected expert-slice routing weights during training, then re-normalizes the remaining gates, encouraging the router to explore alternative expert-slice pairings and preventing premature specialization (a sketch of both regularizers follows this list).
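The sketch below assumes routing tensors with the shapes produced by the layer above; the exact normalization and dropout mechanics are assumptions, not the published formulation.

```python
import torch
import torch.nn.functional as F

def slice_capacity_loss(gates, top_idx, num_experts):
    """Squared coefficient of variation of per-expert slice load.
    gates:   (B, S, k) re-normalized routing weights
    top_idx: (B, S, k) indices of the selected experts"""
    # Soft per-expert load: total gate mass routed to each expert.
    one_hot = F.one_hot(top_idx, num_experts).float()            # (B, S, k, E)
    load = (gates.unsqueeze(-1) * one_hot).sum(dim=(0, 1, 2))    # (E,)
    frac = load / load.sum().clamp_min(1e-9)
    return frac.var(unbiased=False) / frac.mean().pow(2).clamp_min(1e-9)

def cross_slice_dropout(gates, p=0.1, training=True):
    """Randomly zero a fraction p of the selected expert-slice gates,
    then re-normalize so each slice's surviving gates still sum to one."""
    if not training or p == 0.0:
        return gates
    keep = (torch.rand_like(gates) > p).float()
    dropped = gates * keep
    return dropped / dropped.sum(dim=-1, keepdim=True).clamp_min(1e-9)
```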
Inference is kept efficient via batched/fused GEMM kernels on expert-slice tensors, which improve memory locality and computational throughput even as granularity increases from token to slice.
3. Performance and Empirical Results
SliceMoE demonstrates consistent advances over both dense and token-level MoE baselines:
| Model | Perplexity (WikiText-103) | BLEU (WMT En-De) | AG News Accuracy | Load Entropy (ELE) | Inference Speed-up |
|---|---|---|---|---|---|
| Dense | ~31.0 | ~27.6 | 0.918 | 1.0 (no MoE) | 1.0× |
| TokenMoE | ~29.1 | ~28.2 | 0.912 | ~0.88 | — |
| SliceMoE | ~25.4 | ~29.8 | 0.925 | ~0.97 | up to 1.7× |
Peak performance occurs at an intermediate slice count $S$ and a small number of experts per slice $k$. Contiguous slicing outperforms shuffled random partitions, suggesting that locality in feature space is beneficial for specialization (Vejendla, 5 Oct 2025).
4. SliceMoE for Data Subset Specialization
A distinct “SliceMoE” arises in the context of slice-aware mixture of attentions (MoA), targeting data-centric “slices” (subsets defined by indicator functions) (Wang et al., 2021). Given predefined or weakly supervised slices, the architecture comprises:
- Slice indicators: small feed-forward networks estimating slice membership probabilities for each input;
- Slice experts: generating slice-specific representations;
- Mixture of attentions: dual-channel attention over expert outputs, one channel driven by the estimated slice-membership probabilities and one data-driven channel computed via dot products with learned slice prototypes, with the two attention weightings fused via element-wise addition or multiplication.
Training minimizes the sum of indicator, expert, and final-prediction cross-entropy losses. Empirically, slice-aware MoA improves critical-slice accuracy (e.g., up to +12% F1 on critical slices of CoLA) without degrading global performance. Multiplicative fusion and hard sampling yield higher slice lift but increased variance.
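A minimal sketch of the slice-aware mixture-of-attentions head operating on a shared backbone representation; module names, the prototype-based data-driven channel, and the loss combination are illustrative assumptions rather than the implementation of (Wang et al., 2021).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SliceAwareMoA(nn.Module):
    """Slice indicators + slice experts + dual-channel attention fusion."""

    def __init__(self, d_model=256, num_slices=4, num_classes=2, fusion="mul"):
        super().__init__()
        self.indicators = nn.ModuleList([nn.Linear(d_model, 1) for _ in range(num_slices)])
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_slices)])
        self.prototypes = nn.Parameter(torch.randn(num_slices, d_model))
        self.classifier = nn.Linear(d_model, num_classes)
        self.fusion = fusion

    def forward(self, h):                                              # h: (B, d_model)
        # Membership channel: per-slice membership probabilities.
        member = torch.cat([torch.sigmoid(ind(h)) for ind in self.indicators], dim=-1)  # (B, S)
        expert_out = torch.stack([exp(h) for exp in self.experts], dim=1)               # (B, S, d)
        # Data-driven channel: dot products of expert outputs with slice prototypes.
        data_att = F.softmax((expert_out * self.prototypes.unsqueeze(0)).sum(-1), dim=-1)  # (B, S)
        # Fuse the two attention weightings (element-wise mul or add), then re-normalize.
        att = member * data_att if self.fusion == "mul" else member + data_att
        att = att / att.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        fused = (att.unsqueeze(-1) * expert_out).sum(dim=1)                              # (B, d)
        return self.classifier(fused), member

def moa_loss(logits, member, labels, slice_labels):
    """Final-prediction CE plus slice-indicator BCE; the per-expert
    prediction loss of the original formulation is omitted for brevity."""
    return (F.cross_entropy(logits, labels)
            + F.binary_cross_entropy(member, slice_labels.float()))
```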
5. Bit-Sliced Caching for Efficient MoE Inference
The deployment variant of SliceMoE (Choi et al., 15 Dec 2025) addresses the challenges of on-device MoE inference under strict memory and energy budgets. The core contributions are:
- Dynamic Bit-Sliced Caching (DBSC): Each expert weight matrix is partitioned into MSB (higher precision) and LSB (lower precision) “slices.” At runtime, critical experts receive both slices (full precision), while non-critical ones receive only MSB (reduced precision), conforming to a total DRAM slice budget.
- Calibration-Free Asymmetric Matryoshka Quantization (AMAT): Enables seamless packing of high-bit and low-bit versions into a single array with zero-point/scale compatibility, allowing granular bit-slice reuse without memory duplication (see the bit-slicing sketch after this list).
- Predictive Cache Warmup (PCW): Utilizes expert access statistics during the prefill (prompt) phase to optimally pre-load hot slices, reducing early-stage cache misses and associated energy/latency costs.
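The NumPy sketch below illustrates the bit-slicing idea behind DBSC/AMAT under simplifying assumptions: symmetric 8-bit quantization split into a 4-bit MSB slice and a 4-bit LSB slice. The actual zero-point/scale handling, slice budgeting, and cache management of (Choi et al., 15 Dec 2025) are more involved.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor 8-bit quantization (illustrative, not AMAT itself)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def bit_slice(q):
    """Split an int8 tensor into a 4-bit MSB slice and a 4-bit LSB slice."""
    u = q.astype(np.int16) + 128                  # shift to unsigned [0, 255]
    msb = (u >> 4).astype(np.uint8)               # high 4 bits
    lsb = (u & 0x0F).astype(np.uint8)             # low 4 bits
    return msb, lsb

def dequantize(msb, scale, lsb=None):
    """Reconstruct weights from MSB only (low precision) or MSB+LSB (full)."""
    if lsb is None:
        u = (msb.astype(np.int16) << 4) + 8       # mid-point of the dropped LSB range
    else:
        u = (msb.astype(np.int16) << 4) | lsb
    return (u - 128).astype(np.float32) * scale

# DBSC-style allocation: critical ("hot") experts keep both slices in cache,
# non-critical experts keep only the MSB slice, within a total DRAM slice budget.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)    # toy expert weight matrix
q, s = quantize_int8(w)
msb, lsb = bit_slice(q)
full = dequantize(msb, s, lsb)                        # critical expert: both slices
low = dequantize(msb, s)                              # non-critical: MSB slice only
print(np.abs(w - full).max(), np.abs(w - low).max())  # full path error == int8 quantization error
```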
Evaluated on DeepSeek-V2-Lite and Qwen1.5-MoE-A2.7B, the system achieves up to ~2.8× reductions in decode-stage energy and ~1.8× speed-up compared to high-bit-only cache baselines, with <1% relative accuracy loss.
6. Ablations and Practical Trade-offs
Key empirical observations:
- Increasing the slice count enhances expert utilization up to an optimum, after which diminishing returns or routing overhead set in.
- A small number of experts per slice (top-$k$) best balances expert utilization and computational cost.
- Contiguous feature slicing is superior to non-contiguous/random, indicating aligned subspaces are critical for expert specialization (Vejendla, 5 Oct 2025).
- In hardware deployment, DBSC’s granularity introduces some LRU/cache-management overhead; AMAT’s low-precision (MSB-only) path can incur roughly 10% accuracy loss when used in isolation, which hybrid precision mitigates; PCW is effective when prefill expert distributions correlate with decode distributions (Choi et al., 15 Dec 2025).
7. Interpretability, Limitations, and Future Directions
SliceMoE architectures yield interpretable expert specializations, often differentiating syntactic from semantic subspaces at the slice level (Vejendla, 5 Oct 2025). Inspection of routing and attention distributions reveals complementarity between the membership-based and data-driven mechanisms (Wang et al., 2021).
Limitations include dependence on batch-level routing statistics, the need for pre-defined slice functions in slice-aware MoA, and cache-management overhead in system-level deployment. Potential extensions include finer bit granularity (e.g., more than two bit slices per expert), per-layer adaptive budgets, hardware acceleration for slice-level caching, and co-optimization of routers with caching strategies (Choi et al., 15 Dec 2025).
SliceMoE represents a paradigm characterized by decomposing model, data, or storage at the “slice” level to achieve balanced scaling, robust specialization, and hardware efficiency across both architectural and deployment axes.