Minimind-MoE LLM: Mobile Inference Optimization
- Minimind-MoE LLM is a suite of strategies that optimizes Mixture-of-Experts models for on-device inference using cache-aware expert selection.
- It employs innovative re-ranking algorithms to promote cached experts, thereby significantly boosting token throughput on resource-constrained devices.
- Combined with 4-/8-bit quantization and selective DRAM locking, these strategies enable sequential, batch-size-1 inference on mobile hardware, roughly doubling token throughput compared to standard LRU caching.
Minimind-MoE LLM refers to a suite of strategies and empirical results for efficiently running Mixture-of-Experts (MoE) LLMs on resource-constrained mobile devices. The approach centers explicitly on cache-aware expert selection and routing, designed to maximize inference throughput when only a fraction of expert weights can reside in fast-access DRAM. The design enables on-device, sequential (batch size 1) inference using off-the-shelf MoE architectures, such as DeepSeek-V2-Lite, Qwen1.5-MoE-A2.7B, Phi-3.5-MoE, and Mixtral-8×7B, without requiring model retraining (Skliar et al., 27 Nov 2024).
1. Model Structure and MoE Expert Selection
The method builds on standard Transformer-MoE architectures: at each MoE layer, a router network produces routing logits $z_i$ for the $N$ candidate experts, and the softmax-transformed weights $p_i = \mathrm{softmax}(z)_i$ represent the likelihood of selecting each expert. Only the top-$K$ experts with the highest weights are activated per input token:

$$y = \sum_{i \in \mathcal{T}_K} p_i \, E_i(x),$$

where the $E_i$ are expert networks and $\mathcal{T}_K$ is the index set of the top-$K$ experts (a minimal sketch of this routing step follows the table). The models used for empirical evaluation are summarized below:
| Model | Total Params (B) | Active Params per Token (B) | # Experts | Params/Expert (M) | Routing (top-K + shared) |
|---|---|---|---|---|---|
| DeepSeek-V2-Lite | 15.9 | 2.8 | 64 | 8.6 | Top-6(+2) |
| Qwen1.5-MoE-A2.7B | 14.3 | 2.7 | 60 | 8.6 | Top-4(+4) |
| Phi-3.5-MoE | 41.9 | 6.6 | 16 | 79 | Top-2 |
| Mixtral-8×7B | 46.7 | 13.0 | 8 | 176 | Top-2 |
This structure enables high parameter-count models with low per-token compute, yet poses unique challenges when expert weights cannot all fit in DRAM (Skliar et al., 27 Nov 2024).
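For concreteness, the following is a minimal NumPy sketch of the top-$K$ selection and mixing step described above (all function names are illustrative; the renormalization over the selected experts is a common convention, not a detail taken from the source):

```python
import numpy as np

def route_topk(logits: np.ndarray, k: int):
    """Select the top-k experts for one token from the router logits.

    Returns the chosen expert indices and their softmax weights,
    renormalized over the selected experts (a common convention;
    the exact normalization is model-specific).
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax over all candidate experts
    topk = np.argsort(probs)[::-1][:k]          # indices of the k largest weights
    weights = probs[topk] / probs[topk].sum()   # renormalize over the selected experts
    return topk, weights

def moe_layer_output(x, experts, logits, k):
    """y = sum_i p_i * E_i(x) over the top-k selected experts."""
    topk, weights = route_topk(logits, k)
    return sum(w * experts[i](x) for i, w in zip(topk, weights))
```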
2. Cache-Aware Routing and Expert Reuse
Experts too large for full DRAM residency are paged into memory on demand from significantly slower flash storage. Each MoE layer maintains an LRU-managed cache for expert weights. After router selection, a cache miss requires loading the corresponding expert from flash, triggering eviction if the cache is full. With $\mathcal{C}$ denoting the set of currently cached experts, the cache hit rate for a token is

$$\text{hit rate} = \frac{|\mathcal{T}_K \cap \mathcal{C}|}{K}.$$

For a typical LRU policy with 50% of experts cached, observed miss rates remain high (roughly 20–40%), degrading throughput.
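A minimal sketch of such a per-layer LRU expert cache (the `ExpertCache` class and the `load_from_flash` callable are illustrative names, not from the source):

```python
from collections import OrderedDict

class ExpertCache:
    """Per-layer LRU cache holding up to `capacity` expert weight tensors in DRAM."""

    def __init__(self, capacity: int, load_from_flash):
        self.capacity = capacity
        self.load_from_flash = load_from_flash  # callable: expert_id -> weights
        self.cache = OrderedDict()              # expert_id -> weights, in LRU order

    def __contains__(self, expert_id: int) -> bool:
        return expert_id in self.cache

    def get(self, expert_id: int):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)    # cache hit: mark as most recently used
            return self.cache[expert_id]
        weights = self.load_from_flash(expert_id)  # cache miss: slow page-in from flash
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)         # evict the least recently used expert
        self.cache[expert_id] = weights
        return weights
```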
The Minimind-MoE method introduces cache-aware router re-ranking algorithms that bias expert selection toward cached experts without altering the final output computation. Three procedures are defined:
- Max-Rank promotion: Moves currently cached experts ranked within a maximum-rank cutoff to the front of the selection list, while always preserving the original top-$J$ choices.
- Cumulative-Sum thresholding: Promotes cached experts within the smallest prefix accumulating at least a probability mass $\tau$, while upholding the original top-$J$ picks.
- Cache Prior reranking: Adjusts the router logits with an additive prior for cached experts, $\tilde{z}_i = z_i + \alpha\,\hat{\sigma}\,\mathbb{1}[i \in \mathcal{C}]$, where $\hat{\sigma}$ estimates the prevalent logit range and $\alpha$ tunes the bias-accuracy tradeoff. Selection is performed on the biased logits $\tilde{z}$, but the softmax of the original logits $z$ is used for the forward pass (a sketch follows below).
These strategies increase cache hit rates and exploit temporal reuse in sequential token generation (Skliar et al., 27 Nov 2024).
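A minimal sketch of the Cache Prior re-ranking under the notation above (passing $\hat{\sigma}$ in explicitly and renormalizing the weights over the selected experts are assumptions of this sketch, not confirmed implementation details):

```python
import numpy as np

def cache_prior_select(logits: np.ndarray, cached_ids: set, k: int,
                       sigma_hat: float, alpha: float = 1.0):
    """Bias expert *selection* toward cached experts while keeping the
    forward-pass mixing weights computed from the unmodified logits."""
    n = len(logits)
    prior = np.array([alpha * sigma_hat if i in cached_ids else 0.0 for i in range(n)])
    biased = logits + prior                      # z~_i = z_i + alpha * sigma_hat * 1[i in cache]
    topk = np.argsort(biased)[::-1][:k]          # selection uses the biased logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax of the ORIGINAL logits
    weights = probs[topk] / probs[topk].sum()    # renormalized over the selected experts
    return topk, weights
```

Because only the ranking is biased while the mixing weights come from the original softmax, the outputs of the selected experts are combined exactly as in the unmodified router, consistent with the accuracy-preservation claim above.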
3. Memory Management and Quantization
MoE expert weights are large: e.g., Qwen1.5-MoE's 60 experts at 8.6M parameters each total approximately 258 MB per MoE layer in 4-bit quantization. Summed across all MoE layers, this precludes holding every expert in DRAM on consumer mobile devices (e.g., 10–16 GB RAM, of which only a portion is available for the model).
Quantization (4-bit and 8-bit weights, via llama-cpp), combined with selective DRAM locking (mlock), allows each per-layer cache to hold a subset of quantized expert weights in DRAM while the remainder stays on flash. This ensures that the active working set fits the device constraints and that cache residency guarantees are maintained throughout inference. Paging and memory-lock optimizations further enhance throughput (Skliar et al., 27 Nov 2024).
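As a rough worked example (illustrative only; the layer count and DRAM budget below are placeholders, not figures from the source), the per-layer cache capacity follows directly from the budget and the quantization bit-width:

```python
def experts_per_layer_in_budget(dram_budget_gb: float, num_moe_layers: int,
                                params_per_expert_m: float, bits_per_weight: int) -> int:
    """How many experts per MoE layer fit in a given DRAM budget for expert weights."""
    bytes_per_expert = params_per_expert_m * 1e6 * bits_per_weight / 8
    budget_per_layer = dram_budget_gb * 1e9 / num_moe_layers
    return int(budget_per_layer // bytes_per_expert)

# One layer of 60 Qwen1.5-MoE-style experts (8.6M params each) at 4 bits:
print(60 * 8.6e6 * 4 / 8 / 1e6)                      # ~258 MB, matching the figure above
# Hypothetical 4.5 GB expert budget spread over an assumed 24 MoE layers:
print(experts_per_layer_in_budget(4.5, 24, 8.6, 4))  # -> 43 experts cacheable per layer
```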
4. On-Device Implementation
Empirical evaluations are performed on two Android devices equipped with Qualcomm Snapdragon SoCs, running Android 14. Device A has 12 GB total RAM, of which 10 GB is allocated to the model with a cache size of 45 experts per layer (4-bit quantization). Device B uses 16 GB RAM with 30 experts per layer cached (8-bit quantization). Inference runs entirely on CPU using llama-cpp, with modifications to support expert LRU caching and the cache-aware Prior algorithm.
Parallelization of flash→DRAM transfers with matrix multiplications (via SIMD/NEON) is suggested as an implementation optimization, though not detailed in the source. The technique explicitly targets batch size 1, sequential inference—typical for real-world mobile application use cases (Skliar et al., 27 Nov 2024).
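Since the source does not detail this overlap, the following is only a hypothetical sketch of how flash-to-DRAM expert loads could be issued on a background thread while the current layer's matrix multiplications proceed (all names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

class ExpertPrefetcher:
    """Hypothetical sketch: overlap flash->DRAM expert loads with ongoing compute.
    Not thread-safe as written; a real implementation would guard the cache with a lock."""

    def __init__(self, cache):
        self.cache = cache                     # e.g. the ExpertCache sketched earlier
        self.pool = ThreadPoolExecutor(max_workers=1)
        self.pending = {}                      # expert_id -> Future for in-flight loads

    def prefetch(self, expert_ids):
        """Issue asynchronous loads for experts predicted to be needed soon."""
        for eid in expert_ids:
            if eid not in self.cache and eid not in self.pending:
                self.pending[eid] = self.pool.submit(self.cache.get, eid)

    def get(self, expert_id):
        """Return expert weights, blocking only if the load is still in flight."""
        fut = self.pending.pop(expert_id, None)
        if fut is not None:
            fut.result()                       # wait for the background load to finish
        return self.cache.get(expert_id)
```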
5. Experimental Results and Ablation Findings
Performance is quantified via perplexity (WikiText-2), accuracy (MMLU, GSM8K), cache miss rate, and token throughput (measured over 10 runs). The Cache Prior reranking consistently achieves approximately a 2× speedup in token rate relative to an optimal LRU cache-management baseline. Table 4 in the paper reports substantial reductions in cache miss rate and increases in expert lifetime across all tested models:
| Model | LRU Lifetime (tokens) | LRU Miss-rate (%) | Prior Lifetime (tokens) | Prior Miss-rate (%) |
|---|---|---|---|---|
| Qwen1.5-MoE | 22 | 35 | 111 | 7 |
| DeepSeek | 20 | 28 | 188 | 3 |
| Phi-3.5 | 22 | 22 | 55 | 9 |
| Mixtral | 5 | 40 | 10 | 21 |
Ablation studies indicate:
- Top-$J$ sensitivity: Baseline re-rankings degrade if the always-preserved expert set is too small; the Prior method remains robust across the tested values of $J$.
- $\hat{\sigma}$ estimation: A running average of the logit-scale statistic matches full-dataset estimates, providing out-of-domain robustness (see the sketch after this list).
- Insertion policy: The order in which the LRU cache inserts newly loaded experts (top-first vs. bottom-first) has minimal effect on performance.
- Learned vs. fixed prior: Small learned MLP priors offer negligible improvement over the simple additive bias (Skliar et al., 27 Nov 2024).
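A minimal sketch of the running-average estimate of $\hat{\sigma}$ referenced above (the use of the per-token logit standard deviation as the scale statistic and the momentum value are assumptions of this sketch, not details confirmed by the source):

```python
import numpy as np

class RunningSigma:
    """Exponential moving average of a per-layer logit-scale statistic (sigma-hat)."""

    def __init__(self, momentum: float = 0.99):
        self.momentum = momentum
        self.sigma_hat = None

    def update(self, logits: np.ndarray) -> float:
        """Update with one token's router logits and return the current estimate."""
        scale = float(np.std(logits))          # assumed scale statistic: std of the logits
        if self.sigma_hat is None:
            self.sigma_hat = scale
        else:
            self.sigma_hat = self.momentum * self.sigma_hat + (1 - self.momentum) * scale
        return self.sigma_hat
```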
6. Practical Guidelines, Limitations, and Extensions
Cache size should be calibrated such that the total memory footprint (experts, activations, and kv-cache) fits within the device's physical memory limits; empirically, caching 50–75% of experts is a reasonable starting point. The "guaranteed top-$J$" trick allows the original router's top picks to pass through unmodified, preserving accuracy even while increasing cache hits. The single hyperparameter $\alpha$ enables smooth adjustment of accuracy versus cache preference using a small in-domain calibration set (a calibration sketch follows).
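A minimal sketch of such a calibration sweep for $\alpha$ (the sweep range, tolerance, and the injected evaluation callables are hypothetical, not from the source):

```python
def calibrate_alpha(eval_perplexity, eval_hit_rate,
                    alphas=(0.5, 1.0, 2.0, 4.0), max_ppl_increase=0.02):
    """Pick the largest alpha whose calibration-set perplexity stays within a
    relative tolerance of the unbiased router (alpha = 0).

    `eval_perplexity(alpha)` and `eval_hit_rate(alpha)` are caller-supplied
    callables that run the model with the given cache prior (hypothetical)."""
    base_ppl = eval_perplexity(0.0)
    best_alpha, best_hit = 0.0, eval_hit_rate(0.0)
    for alpha in alphas:
        ppl, hit = eval_perplexity(alpha), eval_hit_rate(alpha)
        if ppl <= base_ppl * (1 + max_ppl_increase) and hit >= best_hit:
            best_alpha, best_hit = alpha, hit   # stronger cache preference, accuracy intact
    return best_alpha
```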
The approach is model-agnostic and training-free, but it does require maintaining per-layer running averages for $\hat{\sigma}$. Latency improvements are significant primarily in memory-bound, low-batch scenarios; larger batches or GPU settings reduce the relative benefit. Potential future work includes speculative look-ahead routing, integration with NPU/DSP hardware, and adaptation to extreme domain shifts (which may necessitate recalibration of the $\hat{\sigma}$ statistics).
In summary, Minimind-MoE LLM introduces a principled, implementation-oriented methodology for re-ranking MoE router outputs to maximize DRAM cache locality, thereby doubling effective throughput on mobile devices with negligible or positive effects on final accuracy (Skliar et al., 27 Nov 2024).