
Dynamic Bit-Sliced Caching for MoE Models

Updated 22 December 2025
  • Dynamic Bit-Sliced Caching (DBSC) is a cache management scheme that partitions expert weight tensors into MSB and LSB slices for efficient on-device MoE inference.
  • It employs a precision-on-demand mechanism and predictive cache warmup to dynamically allocate high precision only when needed, maintaining low miss rates under tight DRAM constraints.
  • Empirical evaluations show that DBSC significantly reduces decode energy and latency, achieving near-maximal accuracy while minimizing costly Flash memory accesses.

Dynamic Bit-Sliced Caching (DBSC) is a cache management scheme for on-device inference with large Mixture-of-Experts (MoE) models under stringent miss-rate constraints. DBSC operates at the granularity of quantized bit-slices of expert weight tensors, caching the most critical precision slices to maximize effective expert capacity and reduce cache miss penalties. Integrated with a precision-on-demand mechanism and specialized quantization, DBSC enables energy- and latency-efficient deployment of MoE models within limited DRAM budgets, dramatically reducing high-latency Flash accesses while preserving near-maximal inference accuracy (Choi et al., 15 Dec 2025).

1. Problem Setting and Motivation

Large-scale MoE LLMs feature tens of billions of expert parameters, often exceeding the few gigabytes available in on-device DRAM. Standard deployments partition experts between DRAM (for fast, low-energy access) and Flash storage (10–100× slower, 50–100× more energy per bit). Even moderate cache miss rates (e.g., 10–30%) incur significant energy and latency penalties, rapidly dominating inference costs and rendering on-device serving impractical without more sophisticated cache control. For practical inference, the instantaneous cache miss rate

$$M = \frac{N_\text{miss}}{N_\text{hit} + N_\text{miss}}$$

(where $N_\text{hit}$ and $N_\text{miss}$ count cache hits and misses) must typically remain below 5%.
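A simple expected-cost model makes the sensitivity to miss rate concrete. The numbers below are illustrative only, using a mid-range value for the 10–100× Flash slowdown cited above:

```python
def expected_access_cost(miss_rate, dram_cost=1.0, flash_penalty=50.0):
    """Average cost per expert access, in units of one DRAM access.

    flash_penalty is an illustrative mid-range value for the 10-100x
    Flash-vs-DRAM slowdown mentioned in the text; it is not a measured
    figure from the paper.
    """
    return (1 - miss_rate) * dram_cost + miss_rate * flash_penalty

# Even a 10% miss rate makes Flash traffic the dominant cost term,
# multiplying the average access cost several times over:
print(expected_access_cost(0.10))  # ~5.9x the cost of a pure DRAM hit
print(expected_access_cost(0.05))  # ~3.45x
```

Under these assumed costs, halving the miss rate from 10% to 5% nearly halves the average access cost, which is why DBSC targets miss rates below 5%.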

2. Bit-Slice Caching Principles and Workflow

Each expert's quantized weight tensor is partitioned into two bit-slices: a most significant bits (MSB) slice of $b_h$ bits and a least significant bits (LSB) slice of $b_l$ bits. Caching only the MSB slice permits a low-bit approximation of expert weights, sufficient for non-critical experts, while a subset of experts can be recombined with the LSB slice for full precision. This slice-level approach enables finer-grained use of the cache, increasing the number of distinct expert representations resident in DRAM and boosting the cache hit probability within a strict memory budget.

Memory Footprint Formulation

Given hidden dimension $d$ and $n_e$ experts, the DRAM footprint of a single $b_s$-bit slice is

$$F_s = b_s \times d \times n_e \quad (\text{bits}).$$

If a fraction $p_h$ of cached experts retain high precision (MSB + LSB) and $1 - p_h$ retain low precision (MSB only), the average bits per expert are

$$\bar b = p_h (b_h + b_l) + (1 - p_h)\, b_h = b_h + p_h b_l,$$

and the effective expert capacity under a DRAM budget of $C$ bits is

$$n_\text{eff} = \left\lfloor \frac{C}{\bar b \times d} \right\rfloor.$$
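The footprint arithmetic can be sketched directly. All numbers below are hypothetical, chosen only to illustrate how the fraction of full-precision experts trades off against cache capacity:

```python
def avg_bits_per_expert(b_h, b_l, p_h):
    # Every cached expert holds the b_h-bit MSB slice; a fraction p_h
    # additionally holds the b_l-bit LSB slice for full precision.
    return b_h + p_h * b_l

def effective_expert_capacity(budget_bits, d, b_h, b_l, p_h):
    # Number of experts whose slices fit in the DRAM budget, assuming
    # each expert stores d weights at the average bitwidth.
    bits_per_expert = int(avg_bits_per_expert(b_h, b_l, p_h) * d)
    return budget_bits // bits_per_expert

# Hypothetical configuration: 4-bit MSB slice, 4-bit LSB slice, 25% of
# cached experts at full precision, 1e6 weights per expert, 1 GiB budget.
budget = 2 ** 30 * 8  # 1 GiB in bits
print(effective_expert_capacity(budget, 10**6, 4, 4, 0.25))  # 1717 experts
```

Raising `p_h` toward 1.0 shrinks the number of resident experts, which is exactly the capacity-versus-precision trade-off the precision-on-demand mechanism manages.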

Slice-Level Cache Management Algorithm

The DBSC eviction and admission protocol operates as follows:

  • MSB slices are managed via standard LRU.
  • LSB slices have lowest priority and are evicted first when capacity pressure arises.
  • After each batch, if the global miss rate exceeds the target, low-priority slices are evicted until the miss rate is satisfactory.

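The eviction and admission protocol above can be sketched as follows. This is a minimal illustration of the described priorities (LSB slices evicted first, MSB slices under LRU), not the paper's actual pseudocode, and the class structure is hypothetical:

```python
from collections import OrderedDict

class BitSliceCache:
    """Two-tier slice cache: MSB slices in LRU order, LSB slices lowest priority."""

    def __init__(self, capacity_bits, msb_bits, lsb_bits):
        self.capacity = capacity_bits
        self.used = 0
        self.msb = OrderedDict()  # expert_id -> slice size, LRU order
        self.lsb = OrderedDict()  # expert_id -> slice size, evicted first
        self.msb_bits = msb_bits
        self.lsb_bits = lsb_bits

    def _evict(self, needed):
        # LSB slices go first under capacity pressure; then MSB slices by LRU.
        while self.used + needed > self.capacity and self.lsb:
            _, size = self.lsb.popitem(last=False)
            self.used -= size
        while self.used + needed > self.capacity and self.msb:
            _, size = self.msb.popitem(last=False)
            self.used -= size

    def admit(self, expert_id, want_lsb=False):
        if expert_id not in self.msb:
            self._evict(self.msb_bits)
            self.msb[expert_id] = self.msb_bits
            self.used += self.msb_bits
        self.msb.move_to_end(expert_id)  # refresh LRU position on access
        if want_lsb and expert_id not in self.lsb:
            self._evict(self.lsb_bits)
            self.lsb[expert_id] = self.lsb_bits
            self.used += self.lsb_bits
```

For example, with room for three equal-size MSB slices, admitting a fourth expert evicts the least recently used one; requesting an LSB slice when the cache is full evicts an MSB slice only because no lower-priority LSB slice is available.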

3. Precision-On-Demand Mechanism

DBSC leverages the typically steep distribution of gating scores by assigning precision dynamically on a per-token basis:

  • All selected experts fetch their MSB slice.
  • Only experts whose gating score surpasses a token-specific threshold $\tau$ also fetch the LSB slice, enabling full precision.

The optimization objective is to maximize expected accuracy subject to the DRAM budget $C$ and the miss-rate constraint:

$$\max_{\tau} \; \mathbb{E}[\text{Acc}(\tau)] \quad \text{s.t.} \quad F(\tau) \le C, \quad M(\tau) \le M_\text{max}.$$

This approach ensures that the bulk of experts operate at lower precision, conserving memory and energy, while critical experts maintain full expressiveness.
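The per-token fetch decision can be sketched as below. The function name, threshold value, and gating scores are all hypothetical; the paper specifies only the thresholding behavior itself:

```python
def plan_fetches(gate_scores, top_k, tau=0.2):
    """Decide which slices each routed expert fetches for one token.

    gate_scores: {expert_id: gating score} for all experts.
    Returns (msb_ids, lsb_ids): every routed expert fetches its MSB
    slice; only those with score >= tau (a token-specific threshold,
    value here illustrative) also fetch the LSB slice.
    """
    selected = sorted(gate_scores, key=gate_scores.get, reverse=True)[:top_k]
    msb = list(selected)                                  # all routed experts
    lsb = [e for e in selected if gate_scores[e] >= tau]  # critical experts only
    return msb, lsb

# A steep gating distribution: two experts dominate, so only they
# pay the memory cost of full precision.
scores = {0: 0.45, 1: 0.25, 2: 0.15, 3: 0.10, 4: 0.05}
print(plan_fetches(scores, top_k=4, tau=0.2))  # ([0, 1, 2, 3], [0, 1])
```

Because gating distributions are typically steep, the LSB list stays short, which is what keeps the average bits per expert near $b_h$.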

4. Calibration-Free Asymmetric Matryoshka Quantization (AMAT)

To permit seamless mixed-precision expert caching, DBSC employs Calibration-Free Asymmetric Matryoshka Quantization (AMAT). AMAT enables truncation-based extraction of low- and high-bit slices from a single quantized tensor and its zero-point without duplicate storage or additional calibration overhead.

For MSB bitwidth $b_h$ and LSB bitwidth $b_l$, a weight $w$ is quantized asymmetrically at the full bitwidth $b = b_h + b_l$ with scale $s$ and zero-point $z$:

$$q = \mathrm{clamp}\!\left(\left\lfloor \frac{w}{s} \right\rceil + z,\; 0,\; 2^{b} - 1\right).$$

The two slices are extracted by truncation:

$$q_\text{MSB} = \left\lfloor q / 2^{b_l} \right\rfloor, \qquad q_\text{LSB} = q \bmod 2^{b_l}.$$

Bit-slice composition and value dequantization proceed as:

$$q = q_\text{MSB} \cdot 2^{b_l} + q_\text{LSB}, \qquad \hat w = s\,(q - z),$$

with the MSB-only (low-precision) approximation

$$\hat w_\text{low} = s\left(q_\text{MSB} \cdot 2^{b_l} - z\right).$$

This construction ensures exact compatibility between low- and high-bit slices, simplifying both cache management and hardware implementation.
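The truncation-based slicing and exact recomposition can be demonstrated numerically. This is a sketch following the notation above, not the paper's reference implementation; the scale and zero-point values are arbitrary:

```python
import numpy as np

b_h, b_l = 4, 4  # MSB and LSB slice bitwidths; full code is b_h + b_l bits
rng = np.random.default_rng(0)
q = rng.integers(0, 2 ** (b_h + b_l), size=8, dtype=np.uint16)

q_msb = q >> b_l               # b_h-bit MSB slice (cached for all experts)
q_lsb = q & ((1 << b_l) - 1)   # b_l-bit LSB slice (fetched on demand)

# Exact recomposition of the full-precision code from the two slices:
assert np.array_equal((q_msb << b_l) | q_lsb, q)

# Dequantization with a shared scale s and zero-point z (values arbitrary):
s, z = 0.01, 128
w_full = s * (q.astype(np.int32) - z)
w_low = s * ((q_msb.astype(np.int32) << b_l) - z)

# The MSB-only error is exactly s * q_LSB, so it is bounded by
# s * (2**b_l - 1) -- here 0.15.
print(np.max(np.abs(w_full - w_low)))
```

Because both precisions share one tensor, one scale, and one zero-point, no duplicate storage or recalibration is needed when an expert is promoted from low to full precision.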

5. Predictive Cache Warmup

Single-batch inference epochs include a prefill phase (broad expert access via parallelism) and a decode phase (narrow, frequent reuse of a small expert subset). The Predictive Cache Warmup (PCW) mechanism exploits the empirical correlation of “hot” experts between prefill and early decode. PCW records per-slice access counts $c_i$ during prefill, then, at the prefill-to-decode transition:

  1. Evicts LSB slices with the smallest $c_i$.
  2. Evicts MSB slices in ascending order of $c_i$ until DRAM constraints are met.

This preemption creates a cache population closely matched to anticipated decode-stage access patterns, reducing cold misses and access costs.
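The two-step eviction at the prefill-to-decode transition can be sketched as follows. The function signature and numbers are hypothetical illustrations of the ordering described above:

```python
def warmup_evict(msb_counts, lsb_counts, used_bits, budget_bits,
                 msb_bits, lsb_bits):
    """Evict slices in PCW order until the decode DRAM budget is met.

    msb_counts / lsb_counts: {expert_id: prefill access count c_i}.
    Returns (eviction order, remaining DRAM usage in bits).
    """
    evicted = []
    # Step 1: LSB slices with the smallest prefill counts go first.
    for eid in sorted(lsb_counts, key=lsb_counts.get):
        if used_bits <= budget_bits:
            break
        used_bits -= lsb_bits
        evicted.append(("lsb", eid))
    # Step 2: MSB slices in ascending count order, until the budget is met.
    for eid in sorted(msb_counts, key=msb_counts.get):
        if used_bits <= budget_bits:
            break
        used_bits -= msb_bits
        evicted.append(("msb", eid))
    return evicted, used_bits

# Hypothetical prefill statistics: expert 2's LSB slice was barely used,
# so it is the first to go when shrinking to the decode budget.
msb = {0: 50, 1: 5, 2: 30}
lsb = {0: 40, 2: 2}
print(warmup_evict(msb, lsb, used_bits=500, budget_bits=300,
                   msb_bits=100, lsb_bits=100))
```

Note that no MSB slice is touched here: shedding the two low-count LSB slices already satisfies the budget, which mirrors the intended priority order.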

6. Empirical Evaluation and Performance

DBSC and AMAT, as part of SliceMoE, were evaluated on DeepSeek-V2-Lite (160 experts) and Qwen1.5-MoE-A2.7B (240 experts) on the GSM8K 5-shot benchmark. The hardware platform comprised an XPU (1 GHz, 8192 8-bit PEs), 8 GB LPDDR4 DRAM, and 128 GB UFS 3.1 Flash. Key quantitative results:

  • Decode energy reduction by up to 2.37× (DeepSeek-V2-Lite) and 2.85× (Qwen1.5-MoE-A2.7B).
  • Decode latency improvement up to 1.81× and 1.64×, respectively.
  • DBSC+AMAT achieves accuracy near the high-bit reference along the Pareto frontier for miss rate and energy.
  • Predictive Cache Warmup provides an additional up to 2.31× energy reduction and 1.96× speed-up over a cold (empty) cache (Choi et al., 15 Dec 2025).

7. Deployment Considerations and Future Directions

DBSC requires hardware capable of fetching and combining bit-slices at inference time and introduces more intricate per-slice metadata and LRU tracking. Deployments in extremely tight memory regimes or with low-bitwidth (<2 bits) slices may see accuracy degradation. Gate threshold and prefill–decode correlation parameters may require model-specific tuning.

Prospective research avenues include extending DBSC to multi-slice (beyond MSB/LSB) hierarchies, joint optimization of expert routing and slice allocation, native hardware support for bit-slice streaming, and adaptive on-device learning of cache hotness patterns to suppress cold misses even further (Choi et al., 15 Dec 2025).

References (1)
