Dynamic Bit-Sliced Caching for MoE Models
- Dynamic Bit-Sliced Caching (DBSC) is a cache management scheme that partitions expert weight tensors into MSB and LSB slices for efficient on-device MoE inference.
- It employs a precision-on-demand mechanism and predictive cache warmup to dynamically allocate high precision only when needed, maintaining low miss rates under tight DRAM constraints.
- Empirical evaluations show that DBSC significantly reduces decode energy and latency, achieving near-maximal accuracy while minimizing costly Flash memory accesses.
Dynamic Bit-Sliced Caching (DBSC) is a cache management scheme for on-device inference with large Mixture-of-Experts (MoE) models under stringent miss-rate constraints. DBSC operates at the granularity of quantized bit-slices of expert weight tensors, caching the most critical precision slices to maximize effective expert capacity and reduce cache miss penalties. Integrated with a precision-on-demand mechanism and specialized quantization, DBSC enables energy- and latency-efficient deployment of MoE models within limited DRAM budgets, dramatically reducing high-latency Flash accesses while preserving near-maximal inference accuracy (Choi et al., 15 Dec 2025).
1. Problem Setting and Motivation
Large-scale MoE LLMs feature tens of billions of expert parameters, often exceeding the few gigabytes available in on-device DRAM. Standard deployments partition experts between DRAM (for fast, low-energy access) and Flash storage (10–100× slower, 50–100× more energy per bit). Even moderate cache miss rates (e.g., 10–30%) incur significant energy and latency penalties, rapidly dominating inference costs and rendering on-device serving impractical without more sophisticated cache control. For practical inference, the instantaneous cache miss rate $M$ must typically remain below 5%.
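To make the penalty concrete, the following back-of-the-envelope sketch (normalized, illustrative numbers; Flash taken at the upper end of the quoted 50–100× per-bit energy range) shows how quickly misses come to dominate:

```python
# Expected energy per bit under miss rate M (illustrative numbers only):
#   E[energy] = (1 - M) * E_dram + M * E_flash
E_DRAM = 1.0      # normalized DRAM energy per bit
E_FLASH = 100.0   # Flash at the upper end of the quoted 50-100x range

for miss_rate in (0.30, 0.10, 0.05):
    expected = (1 - miss_rate) * E_DRAM + miss_rate * E_FLASH
    print(f"M = {miss_rate:.0%}: expected energy = {expected:.2f}x DRAM")
# M = 30%: expected energy = 30.70x DRAM
# M = 10%: expected energy = 10.90x DRAM
# M = 5%:  expected energy = 5.95x DRAM
```

Even a 10% miss rate makes Flash traffic, not DRAM, the dominant energy term.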
2. Bit-Slice Caching Principles and Workflow
Each expert's quantized weight tensor is partitioned into two bit-slices: a most significant bits (MSB) slice of $b_M$ bits and a least significant bits (LSB) slice of $b_L$ bits, with full precision $b_H = b_M + b_L$. Caching only the MSB slice permits a low-bit approximation of expert weights, sufficient for non-critical experts, while a subset of experts can be recombined with the LSB slice for full precision. This slice-level approach enables more fine-grained use of cache, increasing the number of distinct expert representations resident in DRAM and boosting the cache hit probability within a strict memory budget.
Memory Footprint Formulation
Given hidden dimension $d$ and $E$ experts, with $N_e$ parameters per expert, the DRAM footprint of a single $b$-bit slice is

$$S(b) = \frac{N_e \, b}{8} \ \text{bytes}.$$

If a fraction $\alpha$ of cached experts retain high precision ($b_M + b_L$ bits) and $1-\alpha$ retain low precision ($b_M$ bits), the average bits per expert and the effective expert capacity with DRAM budget $C_{\max}$ are

$$\bar{b} = \alpha\,(b_M + b_L) + (1-\alpha)\,b_M, \qquad E_{\mathrm{eff}} = \left\lfloor \frac{8\,C_{\max}}{N_e\,\bar{b}} \right\rfloor.$$
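Plugging in example numbers (hypothetical, not from the paper) illustrates the capacity gain from MSB-only caching:

```python
# Illustrative capacity math for the formulas above (example numbers only).
N_e = 50_000_000      # parameters per expert (hypothetical)
b_M, b_L = 4, 2       # MSB / LSB slice bitwidths (hypothetical 4+2 split)
C_max = 4 * 2**30     # 4 GiB DRAM budget for expert weights

def effective_capacity(alpha: float) -> int:
    """E_eff: experts resident in DRAM when fraction alpha keeps its LSB slice."""
    avg_bits = alpha * (b_M + b_L) + (1 - alpha) * b_M
    return int(8 * C_max // (N_e * avg_bits))

for alpha in (1.0, 0.25, 0.0):
    print(f"alpha = {alpha:.2f}: E_eff = {effective_capacity(alpha)}")
# alpha = 1.00: E_eff = 114   (all experts at full 6-bit precision)
# alpha = 0.25: E_eff = 152
# alpha = 0.00: E_eff = 171   (MSB-only caching fits ~50% more experts)
```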
Slice-Level Cache Management Algorithm
The DBSC eviction and admission protocol operates as follows:
- MSB slices are managed via standard LRU.
- LSB slices have lowest priority and are evicted first when capacity pressure arises.
- After each batch, if the global miss rate $M$ exceeds the target $M_{\text{target}}$, low-priority slices are evicted until the constraint is satisfied.
Sample pseudocode:
```
procedure DBSC_STEP(requested_experts, cache, C_max, miss_rate_target):
    for e in requested_experts:
        if cache.contains(e, slice='MSB'):
            record_hit(e, 'MSB')
        else:
            cache.load(e, 'MSB')            # Flash -> DRAM fetch
            record_miss(e, 'MSB')
            evict_if_needed(cache, C_max)
        if requires_high_precision(e):
            if cache.contains(e, slice='LSB'):
                record_hit(e, 'LSB')
            else:
                cache.load(e, 'LSB')
                record_miss(e, 'LSB')
                evict_if_needed(cache, C_max)
    current_M = compute_miss_rate()
    if current_M > miss_rate_target:
        evict_least_valuable(cache, until=miss_rate_target)
```
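A minimal runnable sketch of this slice-level policy follows; the `SliceCache` class and its API are illustrative stand-ins, not the paper's implementation:

```python
from collections import OrderedDict

class SliceCache:
    """LRU cache over (expert_id, slice) entries; LSB slices evict first."""
    def __init__(self, capacity_slices):
        self.capacity = capacity_slices
        self.entries = OrderedDict()           # (expert_id, slice) -> None, LRU-first
        self.hits = self.misses = 0

    def _evict_one(self):
        # Prefer the least-recently-used LSB slice; fall back to the LRU MSB slice.
        for key in self.entries:               # OrderedDict iterates oldest-first
            if key[1] == 'LSB':
                del self.entries[key]
                return
        self.entries.popitem(last=False)       # oldest MSB slice

    def access(self, expert_id, slc):
        key = (expert_id, slc)
        if key in self.entries:
            self.entries.move_to_end(key)      # refresh recency
            self.hits += 1
        else:
            self.misses += 1                   # models a Flash -> DRAM fetch
            while len(self.entries) >= self.capacity:
                self._evict_one()
            self.entries[key] = None

def dbsc_step(cache, routed):
    """routed: list of (expert_id, needs_full_precision) pairs for one token."""
    for expert_id, needs_lsb in routed:
        cache.access(expert_id, 'MSB')         # every selected expert needs its MSB slice
        if needs_lsb:
            cache.access(expert_id, 'LSB')     # precision-on-demand (Section 3)

cache = SliceCache(capacity_slices=8)
dbsc_step(cache, [(3, True), (7, False), (11, False), (14, False)])
print(cache.hits, cache.misses)                # 0 5 on a cold cache
```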
3. Precision-On-Demand Mechanism
DBSC leverages the typically steep distribution of gating scores by assigning precision dynamically on a per-token basis:
- All selected experts fetch their MSB slice.
- Only experts whose gating score $g_e$ surpasses a token-specific threshold $\tau$ fetch the LSB slice, enabling full precision.

The optimization objective is to maximize expected accuracy subject to the memory budget $C_{\max}$ and the miss-rate constraint $M \le M_{\text{target}}$:

$$\max_{\tau} \; \mathbb{E}\!\left[\mathrm{Acc}(\tau)\right] \quad \text{s.t.} \quad \frac{N_e \, \bar{b}(\tau)}{8} \, E_{\mathrm{eff}} \le C_{\max}, \qquad M \le M_{\text{target}}.$$

This approach ensures that the bulk of experts operate at lower precision, conserving memory and energy, while critical experts maintain full expressiveness.
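A hedged sketch of per-token precision assignment follows; the top-mass thresholding rule and the `tau` parameter are illustrative assumptions, since the paper's exact token-specific criterion is not reproduced here:

```python
def precision_on_demand(gating_scores: dict[int, float], tau: float = 0.5):
    """Return (msb_experts, lsb_experts): all selected experts fetch MSB;
    only the highest-scoring experts covering a tau fraction of the total
    gate mass additionally fetch LSB (full precision)."""
    msb = list(gating_scores)
    total = sum(gating_scores.values())
    lsb, covered = [], 0.0
    for e, g in sorted(gating_scores.items(), key=lambda kv: -kv[1]):
        if covered >= tau * total:
            break
        lsb.append(e)
        covered += g
    return msb, lsb

scores = {3: 0.61, 7: 0.22, 11: 0.09, 14: 0.08}
print(precision_on_demand(scores, tau=0.5))
# ([3, 7, 11, 14], [3])  -- only expert 3 is promoted to full precision
```

With a steep gating distribution, one or two experts typically capture most of the gate mass, so most experts stay MSB-only.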
4. Calibration-Free Asymmetric Matryoshka Quantization (AMAT)
To permit seamless mixed-precision expert caching, DBSC employs Calibration-Free Asymmetric Matryoshka Quantization (AMAT). AMAT enables truncation-based extraction of low- and high-bit slices from a single quantized tensor and its zero-point without duplicate storage or additional calibration overhead.
For high bitwidth $b_H$ and low bitwidth $b_M = b_H - b_L$, each weight is stored once as an asymmetric $b_H$-bit code $q \in \{0, \dots, 2^{b_H}-1\}$ with scale $s$ and zero-point $z$; the slices and the sliced zero-point follow by truncation:

$$q_{\mathrm{MSB}} = \lfloor q / 2^{b_L} \rfloor, \qquad q_{\mathrm{LSB}} = q \bmod 2^{b_L}, \qquad z_{\mathrm{MSB}} = \lfloor z / 2^{b_L} \rfloor.$$

Bit-slice composition and value dequantization proceed as:

$$q = q_{\mathrm{MSB}} \cdot 2^{b_L} + q_{\mathrm{LSB}}, \qquad \hat{w}_{\mathrm{high}} = s\,(q - z), \qquad \hat{w}_{\mathrm{low}} = s \cdot 2^{b_L}\,(q_{\mathrm{MSB}} - z_{\mathrm{MSB}}).$$
This construction ensures exact compatibility between low- and high-bit slices, simplifying both cache management and hardware implementation.
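The construction can be sanity-checked with a short round-trip; the 4+2 bit split, scale, and zero-point below are example values, not the paper's configuration:

```python
import numpy as np

b_H, b_L = 6, 2                    # 6-bit codes split into 4 MSB + 2 LSB bits
rng = np.random.default_rng(0)
q = rng.integers(0, 2**b_H, size=8)            # asymmetric b_H-bit codes
z = 2**(b_H - 1)                               # example zero-point
s = 0.05                                       # example scale

q_msb, q_lsb = q >> b_L, q & ((1 << b_L) - 1)  # slice by truncation
z_msb = z >> b_L                               # zero-point slices consistently

w_high = s * (q - z)                           # full-precision dequantization
w_low  = s * (2**b_L) * (q_msb - z_msb)        # MSB-only approximation

assert np.array_equal(q, (q_msb << b_L) + q_lsb)   # exact recomposition
print(np.max(np.abs(w_high - w_low)))              # error bounded by s * 2**b_L
```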
5. Predictive Cache Warmup
Single-batch inference epochs include a prefill phase (broad expert access via parallelism) and a decode phase (narrow, frequent reuse of a small expert subset). The Predictive Cache Warmup (PCW) mechanism exploits the empirical correlation of “hot” experts between prefill and early decode. PCW records per-slice access counts $c$ during prefill, then, at the prefill-to-decode transition:
- Evicts LSB slices with the smallest $c$ first.
- Then evicts MSB slices in ascending order of $c$ until the decode DRAM budget is met.
Pseudocode:
```
procedure PCW(cache, prefill_counts, C_decode):
    sort all LSB slices by prefill_counts ascending
    evict LSB slices in that order until cache.size ≤ C_decode
    sort all MSB slices by prefill_counts ascending
    evict MSB slices in that order until cache.size ≤ C_decode
```
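A runnable sketch under the same assumptions (per-slice byte sizes and access counts are illustrative):

```python
def pcw(cache: dict, prefill_counts: dict, c_decode: int) -> dict:
    """Shrink `cache` ({(expert, slice): size_bytes}) to the decode budget,
    dropping cold LSB slices first, then cold MSB slices."""
    def evict_pass(slice_kind: str):
        victims = sorted(
            (k for k in cache if k[1] == slice_kind),
            key=lambda k: prefill_counts.get(k, 0),    # coldest first
        )
        for k in victims:
            if sum(cache.values()) <= c_decode:
                break
            del cache[k]
    evict_pass('LSB')
    evict_pass('MSB')
    return cache

cache  = {(0,'MSB'): 4, (0,'LSB'): 2, (1,'MSB'): 4, (1,'LSB'): 2, (2,'MSB'): 4}
counts = {(0,'MSB'): 90, (0,'LSB'): 40, (1,'MSB'): 5, (1,'LSB'): 1, (2,'MSB'): 60}
print(pcw(cache, counts, c_decode=10))
# {(0, 'MSB'): 4, (2, 'MSB'): 4}  -- cold slices go first, LSB before MSB
```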
6. Empirical Evaluation and Performance
DBSC and AMAT, as part of SliceMoE, were evaluated on DeepSeek-V2-Lite (160 experts) and Qwen1.5-MoE-A2.7B (240 experts) using the GSM8K 5-shot benchmark. The evaluation platform comprised an XPU (1 GHz, 8192 8-bit PEs), 8 GB of LPDDR4 DRAM, and 128 GB of UFS 3.1 Flash. Key quantitative results:
- Decode energy reduction by up to 2.37× (DeepSeek-V2-Lite) and 2.85× (Qwen1.5-MoE-A2.7B).
- Decode latency improvement up to 1.81× and 1.64×, respectively.
- DBSC+AMAT achieves accuracy near the high-bit reference along the Pareto frontier for miss rate and energy.
- Predictive Cache Warmup provides up to an additional 2.31× energy reduction and a 1.96× speed-up over a cold (empty) cache (Choi et al., 15 Dec 2025).
7. Deployment Considerations and Future Directions
DBSC requires hardware capable of fetching and combining bit-slices at inference time and introduces more intricate per-slice metadata and LRU tracking. Deployments in extremely tight memory regimes or with low-bitwidth (<2 bits) slices may see accuracy degradation. Gate threshold and prefill–decode correlation parameters may require model-specific tuning.
Prospective research avenues include extending DBSC to multi-slice (beyond MSB/LSB) hierarchies, joint optimization of expert routing and slice allocation, native hardware support for bit-slice streaming, and adaptive on-device learning of cache hotness patterns to suppress cold misses even further (Choi et al., 15 Dec 2025).