Minimind-MoE LLM: Mobile Inference Optimization
- Minimind-MoE LLM is a suite of strategies that optimizes Mixture-of-Experts models for on-device inference using cache-aware expert selection.
- It employs innovative re-ranking algorithms to promote cached experts, thereby significantly boosting token throughput on resource-constrained devices.
- Combined with 4-/8-bit quantization and selective DRAM locking, these strategies enable sequential, batch-size-1 inference on mobile hardware, roughly doubling token throughput compared to standard LRU caching.
Minimind-MoE LLM refers to a suite of strategies and empirical results for efficiently running Mixture-of-Experts (MoE) LLMs on resource-constrained mobile devices. The approach centers explicitly on cache-aware expert selection and routing, designed to maximize inference throughput when only a fraction of expert weights can reside in fast-access DRAM. The design enables on-device, sequential (batch size 1) inference using off-the-shelf MoE architectures, such as DeepSeek-V2-Lite, Qwen1.5-MoE-A2.7B, Phi-3.5-MoE, and Mixtral-8×7B, without requiring model retraining (Skliar et al., 27 Nov 2024).
1. Model Structure and MoE Expert Selection
The method builds on standard Transformer-MoE architectures: at each MoE layer, a router network produces routing logits $z_i$ for the $N$ candidate experts, and the softmax-transformed weights $p_i = \mathrm{softmax}(z)_i$ represent the likelihood of selecting each expert. Only the top-$K$ experts with the highest weights are activated per input token:

$$y = \sum_{i \in \mathcal{T}_K} p_i \, E_i(x),$$

where the $E_i$ are expert networks and $\mathcal{T}_K$ is the index set of the top-$K$ experts (a minimal sketch of this routing step follows the table). The models used for empirical evaluation are summarized below:
| Model | Total Params (B) | Active Params per Token (B) | # Experts | Params/Expert (M) | Routing (top-K + shared) |
|---|---|---|---|---|---|
| DeepSeek-V2-Lite | 15.9 | 2.8 | 64 | 8.6 | Top-6(+2) |
| Qwen1.5-MoE-A2.7B | 14.3 | 2.7 | 60 | 8.6 | Top-4(+4) |
| Phi-3.5-MoE | 41.9 | 6.6 | 16 | 79 | Top-2 |
| Mixtral-8×7B | 46.7 | 13.0 | 8 | 176 | Top-2 |
This structure enables high parameter-count models with low per-token compute, yet poses unique challenges when expert weights cannot all fit in DRAM (Skliar et al., 27 Nov 2024).
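For concreteness, the following is a minimal NumPy sketch of the top-$K$ selection and mixing step described above (all function names are illustrative; the renormalization over the selected experts is a common convention, not a detail taken from the source):

```python
import numpy as np

def route_topk(logits: np.ndarray, k: int):
    """Select the top-k experts for one token from the router logits.

    Returns the chosen expert indices and their softmax weights,
    renormalized over the selected experts (a common convention;
    the exact normalization is model-specific).
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                        # softmax over all candidate experts
    topk = np.argsort(probs)[::-1][:k]          # indices of the k largest weights
    weights = probs[topk] / probs[topk].sum()   # renormalize over the selected experts
    return topk, weights

def moe_layer_output(x, experts, logits, k):
    """y = sum_i p_i * E_i(x) over the top-k selected experts."""
    topk, weights = route_topk(logits, k)
    return sum(w * experts[i](x) for i, w in zip(topk, weights))
```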
2. Cache-Aware Routing and Expert Reuse
Experts too large for full DRAM residency are paged into memory on demand from significantly slower flash storage. Each MoE layer maintains an LRU-managed cache for expert weights. After router selection, a cache miss requires loading the corresponding expert from flash, triggering eviction if the cache is full. With $\mathcal{C}$ denoting the set of currently cached experts, the cache hit rate for a token is

$$\text{hit rate} = \frac{|\mathcal{T}_K \cap \mathcal{C}|}{K}.$$

For a typical LRU policy with 50% of experts cached, observed miss rates remain high (roughly 20–40%), degrading throughput.
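A minimal sketch of such a per-layer LRU expert cache (the `ExpertCache` class and the `load_from_flash` callable are illustrative names, not from the source):

```python
from collections import OrderedDict

class ExpertCache:
    """Per-layer LRU cache holding up to `capacity` expert weight tensors in DRAM."""

    def __init__(self, capacity: int, load_from_flash):
        self.capacity = capacity
        self.load_from_flash = load_from_flash  # callable: expert_id -> weights
        self.cache = OrderedDict()              # expert_id -> weights, in LRU order

    def __contains__(self, expert_id: int) -> bool:
        return expert_id in self.cache

    def get(self, expert_id: int):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)    # cache hit: mark as most recently used
            return self.cache[expert_id]
        weights = self.load_from_flash(expert_id)  # cache miss: slow page-in from flash
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)         # evict the least recently used expert
        self.cache[expert_id] = weights
        return weights
```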
The Minimind-MoE method introduces cache-aware router re-ranking algorithms that bias expert selection toward cached experts without altering the final output computation. Three procedures are defined:
- Max-Rank promotion: Moves currently cached experts ranked within a maximum-rank cutoff to the front of the selection list, while always preserving the original top-$J$ choices.
- Cumulative-Sum thresholding: Promotes cached experts within the smallest prefix accumulating at least a probability mass $\tau$, while upholding the original top-$J$ picks.
- Cache Prior reranking: Adjusts the router logits with an additive prior for cached experts, $\tilde{z}_i = z_i + \alpha\,\hat{\sigma}\,\mathbb{1}[i \in \mathcal{C}]$, where $\hat{\sigma}$ estimates the prevalent logit range and $\alpha$ tunes the bias-accuracy tradeoff. Selection is performed on the biased logits $\tilde{z}$, but the softmax of the original logits $z$ is used for the forward pass (a sketch follows below).
These strategies increase cache hit rates and exploit temporal reuse in sequential token generation (Skliar et al., 27 Nov 2024).
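A minimal sketch of the Cache Prior re-ranking under the notation above (passing $\hat{\sigma}$ in explicitly and renormalizing the weights over the selected experts are assumptions of this sketch, not confirmed implementation details):

```python
import numpy as np

def cache_prior_select(logits: np.ndarray, cached_ids: set, k: int,
                       sigma_hat: float, alpha: float = 1.0):
    """Bias expert *selection* toward cached experts while keeping the
    forward-pass mixing weights computed from the unmodified logits."""
    n = len(logits)
    prior = np.array([alpha * sigma_hat if i in cached_ids else 0.0 for i in range(n)])
    biased = logits + prior                      # z~_i = z_i + alpha * sigma_hat * 1[i in cache]
    topk = np.argsort(biased)[::-1][:k]          # selection uses the biased logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax of the ORIGINAL logits
    weights = probs[topk] / probs[topk].sum()    # renormalized over the selected experts
    return topk, weights
```

Because only the ranking is biased while the mixing weights come from the original softmax, the outputs of the selected experts are combined exactly as in the unmodified router, consistent with the accuracy-preservation claim above.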
3. Memory Management and Quantization
MoE expert weights are large: e.g., Qwen1.5-MoE's 60 experts at 8.6M parameters each total approximately 258 MB per MoE layer in 4-bit quantization. Summed across all MoE layers, this precludes holding every expert in DRAM on consumer mobile devices (e.g., 10–16 GB RAM, of which only a portion is available for the model).
Quantization (4-bit and 8-bit weights, via llama-cpp), combined with selective DRAM locking (mlock), allows each per-layer cache to hold a subset of quantized expert weights in DRAM while the remainder stays on flash. This ensures that the active working set fits the device constraints and that cache residency guarantees are maintained throughout inference. Paging and memory-lock optimizations further enhance throughput (Skliar et al., 27 Nov 2024).
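As a rough worked example (illustrative only; the layer count and DRAM budget below are placeholders, not figures from the source), the per-layer cache capacity follows directly from the budget and the quantization bit-width:

```python
def experts_per_layer_in_budget(dram_budget_gb: float, num_moe_layers: int,
                                params_per_expert_m: float, bits_per_weight: int) -> int:
    """How many experts per MoE layer fit in a given DRAM budget for expert weights."""
    bytes_per_expert = params_per_expert_m * 1e6 * bits_per_weight / 8
    budget_per_layer = dram_budget_gb * 1e9 / num_moe_layers
    return int(budget_per_layer // bytes_per_expert)

# One layer of 60 Qwen1.5-MoE-style experts (8.6M params each) at 4 bits:
print(60 * 8.6e6 * 4 / 8 / 1e6)                      # ~258 MB, matching the figure above
# Hypothetical 4.5 GB expert budget spread over an assumed 24 MoE layers:
print(experts_per_layer_in_budget(4.5, 24, 8.6, 4))  # -> 43 experts cacheable per layer
```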
4. On-Device Implementation
Empirical evaluations are performed on two Android devices equipped with Qualcomm Snapdragon SoCs, running Android 14. Device A has 12 GB total RAM, of which 10 GB is allocated to the model with a cache size of 45 experts per layer (4-bit quantization). Device B uses 16 GB RAM with 30 experts per layer cached (8-bit quantization). Inference runs entirely on CPU using llama-cpp, with modifications to support expert LRU caching and the cache-aware Prior algorithm.
Parallelization of flash→DRAM transfers with matrix multiplications (via SIMD/NEON) is suggested as an implementation optimization, though not detailed in the source. The technique explicitly targets batch size 1, sequential inference—typical for real-world mobile application use cases (Skliar et al., 27 Nov 2024).
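Since the source does not detail this overlap, the following is only a hypothetical sketch of how flash-to-DRAM expert loads could be issued on a background thread while the current layer's matrix multiplications proceed (all names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

class ExpertPrefetcher:
    """Hypothetical sketch: overlap flash->DRAM expert loads with ongoing compute.
    Not thread-safe as written; a real implementation would guard the cache with a lock."""

    def __init__(self, cache):
        self.cache = cache                     # e.g. the ExpertCache sketched earlier
        self.pool = ThreadPoolExecutor(max_workers=1)
        self.pending = {}                      # expert_id -> Future for in-flight loads

    def prefetch(self, expert_ids):
        """Issue asynchronous loads for experts predicted to be needed soon."""
        for eid in expert_ids:
            if eid not in self.cache and eid not in self.pending:
                self.pending[eid] = self.pool.submit(self.cache.get, eid)

    def get(self, expert_id):
        """Return expert weights, blocking only if the load is still in flight."""
        fut = self.pending.pop(expert_id, None)
        if fut is not None:
            fut.result()                       # wait for the background load to finish
        return self.cache.get(expert_id)
```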
5. Experimental Results and Ablation Findings
Performance is quantified via perplexity (WikiText-2), accuracy (MMLU, GSM8K), cache miss rate, and token throughput (measured over 10 runs). The Cache Prior reranking consistently achieves approximately a 2× speedup in token rate relative to an optimal LRU cache-management baseline. Table 4 in the paper reports substantial reductions in cache miss rate and increases in expert lifetime across all tested models:
| Model | LRU Lifetime (tokens) | LRU Miss-rate (%) | Prior Lifetime (tokens) | Prior Miss-rate (%) |
|---|---|---|---|---|
| Qwen1.5-MoE | 22 | 35 | 111 | 7 |
| DeepSeek | 20 | 28 | 188 | 3 |
| Phi-3.5 | 22 | 22 | 55 | 9 |
| Mixtral | 5 | 40 | 10 | 21 |
Ablation studies indicate:
- Top-$J$ sensitivity: Baseline re-rankings degrade if the always-preserved expert set is too small; the Prior method remains robust across the tested values of $J$.
- $\hat{\sigma}$ estimation: A running average of the logit-scale statistic matches full-dataset estimates, providing out-of-domain robustness (see the sketch after this list).
- Insertion policy: The order in which the LRU cache inserts newly loaded experts (top-first vs. bottom-first) has minimal effect on performance.
- Learned vs. fixed prior: Small learned MLP priors offer negligible improvement over the simple additive bias (Skliar et al., 27 Nov 2024).
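A minimal sketch of the running-average estimate of $\hat{\sigma}$ referenced above (the use of the per-token logit standard deviation as the scale statistic and the momentum value are assumptions of this sketch, not details confirmed by the source):

```python
import numpy as np

class RunningSigma:
    """Exponential moving average of a per-layer logit-scale statistic (sigma-hat)."""

    def __init__(self, momentum: float = 0.99):
        self.momentum = momentum
        self.sigma_hat = None

    def update(self, logits: np.ndarray) -> float:
        """Update with one token's router logits and return the current estimate."""
        scale = float(np.std(logits))          # assumed scale statistic: std of the logits
        if self.sigma_hat is None:
            self.sigma_hat = scale
        else:
            self.sigma_hat = self.momentum * self.sigma_hat + (1 - self.momentum) * scale
        return self.sigma_hat
```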
6. Practical Guidelines, Limitations, and Extensions
Cache size should be calibrated such that the total memory footprint (experts, activations, and kv-cache) fits within the device's physical memory limits; empirically, caching 50–75% of experts is a reasonable starting point. The "guaranteed top-$J$" trick allows the original router's top picks to pass through unmodified, preserving accuracy even while increasing cache hits. The single hyperparameter $\alpha$ enables smooth adjustment of accuracy versus cache preference using a small in-domain calibration set (a calibration sketch follows).
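A minimal sketch of such a calibration sweep for $\alpha$ (the sweep range, tolerance, and the injected evaluation callables are hypothetical, not from the source):

```python
def calibrate_alpha(eval_perplexity, eval_hit_rate,
                    alphas=(0.5, 1.0, 2.0, 4.0), max_ppl_increase=0.02):
    """Pick the largest alpha whose calibration-set perplexity stays within a
    relative tolerance of the unbiased router (alpha = 0).

    `eval_perplexity(alpha)` and `eval_hit_rate(alpha)` are caller-supplied
    callables that run the model with the given cache prior (hypothetical)."""
    base_ppl = eval_perplexity(0.0)
    best_alpha, best_hit = 0.0, eval_hit_rate(0.0)
    for alpha in alphas:
        ppl, hit = eval_perplexity(alpha), eval_hit_rate(alpha)
        if ppl <= base_ppl * (1 + max_ppl_increase) and hit >= best_hit:
            best_alpha, best_hit = alpha, hit   # stronger cache preference, accuracy intact
    return best_alpha
```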
The approach is model-agnostic and training-free, but it does require maintaining per-layer running averages for $\hat{\sigma}$. Latency improvements are significant primarily in memory-bound, low-batch scenarios; larger batches or GPU settings reduce the relative benefit. Potential future work includes speculative look-ahead routing, integration with NPU/DSP hardware, and adaptation to extreme domain shifts (which may necessitate recalibration of the $\hat{\sigma}$ statistics).
In summary, Minimind-MoE LLM introduces a principled, implementation-oriented methodology for re-ranking MoE router outputs to maximize DRAM cache locality, thereby doubling effective throughput on mobile devices with negligible or positive effects on final accuracy (Skliar et al., 27 Nov 2024).