Dynamic Bit-Sliced Caching for MoE Models
- Dynamic Bit-Sliced Caching (DBSC) is a cache management scheme that partitions expert weight tensors into MSB and LSB slices for efficient on-device MoE inference.
- It employs a precision-on-demand mechanism and predictive cache warmup to dynamically allocate high precision only when needed, maintaining low miss rates under tight DRAM constraints.
- Empirical evaluations show that DBSC significantly reduces decode energy and latency, achieving near-maximal accuracy while minimizing costly Flash memory accesses.
Dynamic Bit-Sliced Caching (DBSC) is a cache management scheme for on-device inference with large Mixture-of-Experts (MoE) models under stringent miss-rate constraints. DBSC operates at the granularity of quantized bit-slices of expert weight tensors, caching the most critical precision slices to maximize effective expert capacity and reduce cache miss penalties. Integrated with a precision-on-demand mechanism and specialized quantization, DBSC enables energy- and latency-efficient deployment of MoE models within limited DRAM budgets, dramatically reducing high-latency Flash accesses while preserving near-maximal inference accuracy (Choi et al., 15 Dec 2025).
1. Problem Setting and Motivation
Large-scale MoE LLMs feature tens of billions of expert parameters, often exceeding the few gigabytes available in on-device DRAM. Standard deployments partition experts between DRAM (for fast, low-energy access) and Flash storage (10–100× slower, 50–100× more energy per bit). Even moderate cache miss rates (e.g., 10–30%) incur significant energy and latency penalties, rapidly dominating inference costs and rendering on-device serving impractical without more sophisticated cache control. For practical inference, the instantaneous cache miss rate must typically remain below 5%.
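A back-of-the-envelope expected-access model shows why misses dominate so quickly (the 50× Flash-to-DRAM cost ratio below is an illustrative assumption within the ranges quoted above, not a figure from the paper):

```python
def expected_cost(miss_rate, dram_cost=1.0, flash_ratio=50.0):
    """Expected per-access cost, normalized so one DRAM access costs 1.0."""
    return (1.0 - miss_rate) * dram_cost + miss_rate * dram_cost * flash_ratio

# At a 10% miss rate, Flash misses already account for ~85% of total cost:
cost = expected_cost(0.10)           # 0.9 + 0.1 * 50 = 5.9
flash_share = (0.10 * 50.0) / cost   # ≈ 0.847
```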
2. Bit-Slice Caching Principles and Workflow
Each expert's quantized weight tensor is partitioned into two bit-slices: a most significant bits (MSB) slice of $b_L$ bits and a least significant bits (LSB) slice of $b_H - b_L$ bits, where $b_H$ is the full quantization bitwidth. Caching only the MSB slice permits a low-bit approximation of expert weights, sufficient for non-critical experts, while a subset of experts can be recombined with the LSB slice for full precision. This slice-level approach enables more fine-grained use of the cache, increasing the number of distinct expert representations resident in DRAM and boosting the cache hit probability within a strict memory budget.
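Concretely, with an 8-bit quantized code split into two 4-bit slices, the MSB slice alone yields a coarse approximation and the two slices recombine exactly (the bitwidths here are illustrative):

```python
B_H, B_L = 8, 4            # full bitwidth and MSB-slice bitwidth (illustrative)
SHIFT = B_H - B_L          # number of bits held by the LSB slice

def split(q):
    """Split a B_H-bit quantized code into (MSB slice, LSB slice)."""
    return q >> SHIFT, q & ((1 << SHIFT) - 1)

def approx_from_msb(msb):
    """Low-bit approximation using only the cached MSB slice."""
    return msb << SHIFT

q = 0b10110111             # 183
msb, lsb = split(q)        # 0b1011 = 11, 0b0111 = 7
assert (msb << SHIFT) | lsb == q   # slices recombine losslessly
```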
Memory Footprint Formulation
Given $E$ experts with $P$ parameters each (determined by the hidden dimension $d$), the DRAM footprint of a single $b$-bit slice is

$$M_{\text{slice}} = \frac{P\, b}{8}\ \text{bytes}.$$

If a fraction $\alpha$ of cached experts retain high precision $b_H$ and $1-\alpha$ retain low precision $b_L$, the average bits per expert $\bar{b}$ and effective expert capacity $E_{\text{eff}}$ with DRAM budget $B$ are:

$$\bar{b} = \alpha\, b_H + (1-\alpha)\, b_L, \qquad E_{\text{eff}} = \left\lfloor \frac{8B}{P\,\bar{b}} \right\rfloor.$$
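These capacity relations can be evaluated numerically; the per-expert parameter count, budget, and bitwidths below are illustrative assumptions, not values from the paper:

```python
def avg_bits(alpha, b_h=8, b_l=4):
    """Average bits per weight when a fraction alpha of cached experts keep full precision."""
    return alpha * b_h + (1 - alpha) * b_l

def effective_capacity(budget_bytes, params_per_expert, alpha, b_h=8, b_l=4):
    """Number of experts resident in DRAM under the given budget."""
    bytes_per_expert = params_per_expert * avg_bits(alpha, b_h, b_l) / 8
    return int(budget_bytes // bytes_per_expert)

# 2 GB budget, 50M params/expert: MSB-only caching roughly doubles the
# number of resident experts relative to all-full-precision caching.
full = effective_capacity(2 * 2**30, 50_000_000, alpha=1.0)   # avg 8 bits
msb  = effective_capacity(2 * 2**30, 50_000_000, alpha=0.0)   # avg 4 bits
```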
Slice-Level Cache Management Algorithm
The DBSC eviction and admission protocol operates as follows:
- MSB slices are managed via standard LRU.
- LSB slices have lowest priority and are evicted first when capacity pressure arises.
- After each batch, if the global miss rate exceeds target, low-priority slices are evicted until the miss rate is satisfactory.
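The slice-level protocol above can be sketched as follows (a minimal Python illustration, not the paper's pseudocode; the class and method names are assumptions):

```python
from collections import OrderedDict

class SliceCache:
    """LRU over MSB slices; LSB slices form a lowest-priority victim pool."""

    def __init__(self, capacity_slices):
        self.capacity = capacity_slices
        self.msb = OrderedDict()   # expert_id -> slice data, LRU order
        self.lsb = OrderedDict()   # expert_id -> slice data, evicted first

    def _evict_one(self):
        if self.lsb:                       # LSB slices are evicted first
            self.lsb.popitem(last=False)
        elif self.msb:                     # then the least-recently-used MSB
            self.msb.popitem(last=False)

    def admit(self, expert_id, slice_kind, data):
        pool = self.msb if slice_kind == "msb" else self.lsb
        while len(self.msb) + len(self.lsb) >= self.capacity:
            self._evict_one()
        pool[expert_id] = data

    def hit(self, expert_id, slice_kind):
        pool = self.msb if slice_kind == "msb" else self.lsb
        if expert_id in pool:
            pool.move_to_end(expert_id)    # refresh LRU position
            return True
        return False
```

Under capacity pressure the LSB pool drains first, so experts degrade gracefully from full to low precision before being evicted entirely.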
3. Precision-On-Demand Mechanism
DBSC leverages the typically steep distribution of gating scores by assigning precision dynamically on a per-token basis:
- All selected experts fetch their MSB slice.
- Only experts whose gating score $g_i$ surpasses a token-specific threshold $\tau$ fetch the LSB slice, enabling full precision.

The optimization objective is to maximize expected accuracy subject to the DRAM budget $B$ and the target miss rate $m_{\max}$:

$$\max_{\tau}\ \mathbb{E}\big[\mathrm{Acc}(\tau)\big] \quad \text{s.t.} \quad M(\tau) \le B, \quad m(\tau) \le m_{\max}.$$

This approach ensures that the bulk of experts operate at lower precision, conserving memory and energy, while critical experts maintain full expressiveness.
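Per-token precision assignment then reduces to thresholding over gating scores; the top-k and threshold values below are illustrative assumptions:

```python
def assign_precision(gate_scores, top_k=6, tau=0.15):
    """Select top-k experts; mark only those above threshold tau for LSB fetch.

    Returns (expert_id, needs_lsb) pairs: needs_lsb experts run at full
    precision, the rest use their cached MSB slice only.
    """
    ranked = sorted(enumerate(gate_scores), key=lambda kv: kv[1], reverse=True)
    return [(eid, score >= tau) for eid, score in ranked[:top_k]]

# Steep gating distribution: only the two dominant experts get full precision.
scores = [0.40, 0.22, 0.12, 0.09, 0.07, 0.05, 0.03, 0.02]
plan = assign_precision(scores, top_k=6, tau=0.15)
```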
4. Calibration-Free Asymmetric Matryoshka Quantization (AMAT)
To permit seamless mixed-precision expert caching, DBSC employs Calibration-Free Asymmetric Matryoshka Quantization (AMAT). AMAT enables truncation-based extraction of low- and high-bit slices from a single quantized tensor and its zero-point without duplicate storage or additional calibration overhead.
For high-bitwidth $b_H$ and low-bitwidth $b_L$, the weight tensor is quantized once asymmetrically at $b_H$ bits with scale $s$ and zero-point $z$:

$$W_q = \mathrm{clamp}\!\left(\left\lfloor \frac{W}{s} \right\rceil + z,\; 0,\; 2^{b_H}-1\right).$$

The $b_L$-bit MSB slice and its zero-point are obtained by truncating the lowest $b_H - b_L$ bits:

$$W_q^{\mathrm{MSB}} = \left\lfloor \frac{W_q}{2^{\,b_H-b_L}} \right\rfloor, \qquad z^{\mathrm{MSB}} = \left\lfloor \frac{z}{2^{\,b_H-b_L}} \right\rfloor.$$

Bit-slice composition and value dequantization proceed as:

$$W_q = 2^{\,b_H-b_L}\, W_q^{\mathrm{MSB}} + W_q^{\mathrm{LSB}}, \qquad \hat{W} = s\,(W_q - z).$$
This construction ensures exact compatibility between low- and high-bit slices, simplifying both cache management and hardware implementation.
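The truncation-and-recombination scheme can be exercised end to end; the bitwidths, scale, and zero-point below are illustrative assumptions, not values from the paper:

```python
import numpy as np

B_H, B_L = 8, 4                          # illustrative high/low bitwidths
SHIFT = B_H - B_L
s, z = 0.02, 128                         # illustrative scale and zero-point

w = np.array([-1.2, -0.3, 0.0, 0.7, 1.1])
q = np.clip(np.round(w / s) + z, 0, 2**B_H - 1).astype(np.int32)

msb = q >> SHIFT                         # truncation-based MSB slice
lsb = q & ((1 << SHIFT) - 1)             # remaining LSB slice
z_msb = z >> SHIFT                       # truncated zero-point

w_low  = (s * 2**SHIFT) * (msb - z_msb)      # MSB-only (low-bit) dequantization
w_full = s * (((msb << SHIFT) | lsb) - z)    # exact recombined dequantization
```

Because both slices and the zero-point come from the same $b_H$-bit code by pure bit truncation, no second quantization pass or calibration set is needed.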
5. Predictive Cache Warmup
Single-batch inference epochs include a prefill phase (broad expert access via parallelism) and a decode phase (narrow, frequent reuse of a small expert subset). The Predictive Cache Warmup (PCW) mechanism exploits the empirical correlation of "hot" experts between prefill and early decode. PCW records per-slice access counts $c_s$ during prefill, then, at the prefill-to-decode transition:
- Evicts LSB slices with the smallest $c_s$.
- Evicts MSB slices in ascending order of $c_s$ until DRAM constraints are met.
This preemption creates a cache population closely matched to anticipated decode-stage access patterns, reducing cold misses and access costs.
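The warmup step at the prefill-to-decode boundary can be sketched as follows (a minimal illustration, not the paper's pseudocode; names and signatures are assumptions):

```python
def warmup_evict(msb_counts, lsb_counts, resident_slices, budget_slices):
    """Evict coldest slices (LSB pool first, then MSB) until the budget holds.

    msb_counts / lsb_counts map slice id -> prefill access count c_s.
    Returns the ordered list of evicted slice ids.
    """
    evicted = []
    # LSB slices go first, coldest first; then MSB slices in ascending count.
    victims = sorted(lsb_counts, key=lsb_counts.get) + \
              sorted(msb_counts, key=msb_counts.get)
    for slice_id in victims:
        if resident_slices <= budget_slices:
            break
        evicted.append(slice_id)
        resident_slices -= 1
    return evicted

# All LSB slices are sacrificed before any MSB slice, however hot they were
# during prefill, matching the slice-priority ordering above.
order = warmup_evict({"m0": 5, "m1": 1}, {"l0": 9, "l1": 2},
                     resident_slices=4, budget_slices=1)
```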
6. Empirical Evaluation and Performance
DBSC and AMAT, as part of SliceMoE, were evaluated on DeepSeek-V2-Lite (160 experts) and Qwen1.5-MoE-A2.7B (240 experts) using the GSM8K 5-shot benchmark. The hardware configuration comprised an XPU (1 GHz, 8192 8-bit PEs), 8 GB of LPDDR4 DRAM, and 128 GB of UFS 3.1 Flash. Key quantitative results:
- Decode energy reduction by up to 2.37× (DeepSeek-V2-Lite) and 2.85× (Qwen1.5-MoE-A2.7B).
- Decode latency improvement up to 1.81× and 1.64×, respectively.
- DBSC+AMAT achieves accuracy near the high-bit reference along the Pareto frontier for miss rate and energy.
- Predictive Cache Warmup provides an additional up to 2.31× energy reduction and 1.96× speed-up over a cold (empty) cache (Choi et al., 15 Dec 2025).
7. Deployment Considerations and Future Directions
DBSC requires hardware capable of fetching and combining bit-slices at inference time and introduces more intricate per-slice metadata and LRU tracking. Deployments in extremely tight memory regimes or with low-bitwidth (<2 bits) slices may see accuracy degradation. Gate threshold and prefill–decode correlation parameters may require model-specific tuning.
Prospective research avenues include extending DBSC to multi-slice (beyond MSB/LSB) hierarchies, joint optimization of expert routing and slice allocation, native hardware support for bit-slice streaming, and adaptive on-device learning of cache hotness patterns to suppress cold misses even further (Choi et al., 15 Dec 2025).