
Dynamic Bit-Sliced Caching for MoE Models

Updated 22 December 2025
  • Dynamic Bit-Sliced Caching (DBSC) is a cache management scheme that partitions expert weight tensors into MSB and LSB slices for efficient on-device MoE inference.
  • It employs a precision-on-demand mechanism and predictive cache warmup to dynamically allocate high precision only when needed, maintaining low miss rates under tight DRAM constraints.
  • Empirical evaluations show that DBSC significantly reduces decode energy and latency, achieving near-maximal accuracy while minimizing costly Flash memory accesses.

Dynamic Bit-Sliced Caching (DBSC) is a cache management scheme for on-device inference with large Mixture-of-Experts (MoE) models under stringent miss-rate constraints. DBSC operates at the granularity of quantized bit-slices of expert weight tensors, caching the most critical precision slices to maximize effective expert capacity and reduce cache miss penalties. Integrated with a precision-on-demand mechanism and specialized quantization, DBSC enables energy- and latency-efficient deployment of MoE models within limited DRAM budgets, dramatically reducing high-latency Flash accesses while preserving near-maximal inference accuracy (Choi et al., 15 Dec 2025).

1. Problem Setting and Motivation

Large-scale MoE LLMs feature tens of billions of expert parameters, often exceeding the few gigabytes available in on-device DRAM. Standard deployments partition experts between DRAM (for fast, low-energy access) and Flash storage (10–100× slower, 50–100× more energy per bit). Even moderate cache miss rates (e.g., 10–30%) incur significant energy and latency penalties, rapidly dominating inference costs and rendering on-device serving impractical without more sophisticated cache control. For practical inference, the instantaneous cache miss rate

$$M = \frac{\#\text{misses}}{\#\text{hits} + \#\text{misses}}$$

must typically remain below 5%.
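
To make the penalty concrete, the short sketch below models average per-bit fetch cost as a function of the miss rate. The DRAM and Flash energy figures are illustrative assumptions chosen within the ranges quoted above, not measurements from the paper.

DRAM_ENERGY_PJ_PER_BIT = 4.0      # assumed DRAM access energy
FLASH_ENERGY_PJ_PER_BIT = 300.0   # assumed Flash access energy (~75x DRAM)

def expected_energy_per_bit(miss_rate):
    """Average energy per fetched bit given cache miss rate M."""
    return (1 - miss_rate) * DRAM_ENERGY_PJ_PER_BIT + miss_rate * FLASH_ENERGY_PJ_PER_BIT

for m in (0.01, 0.05, 0.10, 0.30):
    print(f"M = {m:.0%}: {expected_energy_per_bit(m):6.1f} pJ/bit")
# Under these assumptions, M = 10% already raises per-bit energy ~8x over an
# all-hit path, which is why DBSC targets miss rates below ~5%.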

2. Bit-Slice Caching Principles and Workflow

Each expert's quantized weight tensor is partitioned into two bit-slices: a most significant bits (MSB) slice of $b_h$ bits and a least significant bits (LSB) slice of $b_l$ bits. Caching only the MSB slice permits a low-bit approximation of expert weights, sufficient for non-critical experts, while a subset of experts can be recombined with the LSB slice for full precision. This slice-level approach enables more fine-grained use of the cache, increasing the number of distinct expert representations resident in DRAM and boosting the cache hit probability within a strict memory budget.

Memory Footprint Formulation

Given hidden dimension $d$ and $n_e$ experts, the DRAM footprint of a single $b_s$-bit slice is

$$F_s = b_s \times d \times n_e \quad (\text{bits}).$$

If a fraction $p_h$ of cached experts retain high precision and $1 - p_h$ retain low precision, the average bits per expert $\bar b$ and the effective expert capacity $N_{\rm eff}$ under DRAM budget $C$ are

$$\bar b = p_h b_h + (1 - p_h)\, b_l, \qquad N_{\rm eff} = \frac{C}{d\,\bar b}.$$
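
A minimal sketch of this arithmetic, treating $d$ as the per-expert weight count implied by the footprint formula above; the function name and all numeric values are illustrative, not from the paper:

def effective_capacity(C_bits, d, bits_high, bits_low, p_h):
    """b_bar = p_h*b_h + (1-p_h)*b_l ; N_eff = C / (d * b_bar)."""
    b_bar = p_h * bits_high + (1 - p_h) * bits_low
    return b_bar, C_bits / (d * b_bar)

# Example: 2 GiB DRAM budget, assumed 50M weights per expert,
# 8-bit high precision, 4-bit low precision, 20% of experts at high precision.
C_bits = 2 * (2 ** 30) * 8
b_bar, n_eff = effective_capacity(C_bits, d=50_000_000, bits_high=8, bits_low=4, p_h=0.2)
print(f"average bits/expert = {b_bar:.1f}, effective capacity ~ {n_eff:.0f} experts")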

Slice-Level Cache Management Algorithm

The DBSC eviction and admission protocol operates as follows:

  • MSB slices are managed via standard LRU.
  • LSB slices have lowest priority and are evicted first when capacity pressure arises.
  • After each batch, if the global miss rate exceeds the target, low-priority slices are evicted until the miss rate is satisfactory.

Sample pseudocode:

procedure DBSC_STEP(requested_experts, cache, C_max, miss_rate_target):
    for e in requested_experts:
        # Every routed expert needs at least its MSB slice.
        if cache.contains(e, slice='MSB'):
            record_hit(e, 'MSB')
        else:
            cache.load(e, 'MSB')
            record_miss(e, 'MSB')
            evict_if_needed(cache, C_max)
        # Experts flagged by the gating scores also fetch their LSB slice.
        if requires_high_precision(e):
            if cache.contains(e, slice='LSB'):
                record_hit(e, 'LSB')
            else:
                cache.load(e, 'LSB')
                record_miss(e, 'LSB')
                evict_if_needed(cache, C_max)
    # Post-batch pressure relief: shed low-priority slices while over target.
    current_M = compute_miss_rate()
    if current_M > miss_rate_target:
        evict_least_valuable(cache, until M ≤ miss_rate_target)
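
For concreteness, here is a minimal runnable Python sketch of this slice-level policy. The SliceCache class, its bit-cost model, and all field names are illustrative assumptions, not the paper's implementation; it keeps (expert, slice) entries in LRU order and evicts LSB slices before MSB slices under capacity pressure.

from collections import OrderedDict

class SliceCache:
    """Illustrative slice-level LRU cache (a sketch, not the paper's code)."""

    def __init__(self, capacity_bits, slice_bits):
        self.capacity = capacity_bits
        self.slice_bits = slice_bits      # assumed per-slice footprint, e.g. {'MSB': F_h, 'LSB': F_l}
        self.entries = OrderedDict()      # (expert_id, slice) -> size in bits, LRU-ordered
        self.size = 0

    def contains(self, expert, slice_):
        key = (expert, slice_)
        if key in self.entries:
            self.entries.move_to_end(key)   # refresh recency on a hit
            return True
        return False

    def load(self, expert, slice_):
        size = self.slice_bits[slice_]
        self._evict_until(self.capacity - size)   # make room before admitting
        self.entries[(expert, slice_)] = size
        self.size += size

    def _evict_until(self, budget_bits):
        # LSB slices have lowest priority: evict them (LRU-first) before any MSB slice.
        for victim_kind in ('LSB', 'MSB'):
            for key in list(self.entries):        # OrderedDict iterates least-recent first
                if self.size <= budget_bits:
                    return
                if key[1] == victim_kind:
                    self.size -= self.entries.pop(key)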

3. Precision-On-Demand Mechanism

DBSC leverages the typically steep distribution of gating scores by assigning precision dynamically on a per-token basis:

  • All selected experts fetch their MSB slice.
  • Only experts whose gating score surpasses a token-specific threshold fetch the LSB slice, enabling full precision.

The optimization objective is to maximize expected accuracy $A$ subject to the memory budget $C$ and the miss-rate constraint $M \le M_0$:

$$\max_{p_h}\; A\bigl(p_h\,b_h + (1 - p_h)\,b_l\bigr) \quad \text{s.t.} \quad p_h\,F_h + (1 - p_h)\,F_l \le C, \qquad M \le M_0$$

This approach ensures that the bulk of experts operate at lower precision, conserving memory and energy, while critical experts maintain full expressiveness.
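
A small sketch of the per-token assignment, assuming top-k routing and a threshold hyperparameter tau (both the function name and the numbers are illustrative):

import numpy as np

def select_precision(gating_scores, top_k, tau):
    """Route top-k experts; those with gating score > tau also get the LSB slice."""
    selected = np.argsort(gating_scores)[-top_k:]             # routed experts
    full_precision = [e for e in selected if gating_scores[e] > tau]
    return list(selected), full_precision

# Example: 8 experts, top-2 routing, assumed threshold tau = 0.3.
scores = np.array([0.05, 0.40, 0.10, 0.02, 0.25, 0.08, 0.06, 0.04])
routed, full = select_precision(scores, top_k=2, tau=0.3)
# routed -> experts {4, 1} fetch MSB slices; only expert 1 (score 0.40) fetches its LSB slice.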

4. Calibration-Free Asymmetric Matryoshka Quantization (AMAT)

To permit seamless mixed-precision expert caching, DBSC employs Calibration-Free Asymmetric Matryoshka Quantization (AMAT). AMAT enables truncation-based extraction of low- and high-bit slices from a single quantized tensor and its zero-point without duplicate storage or additional calibration overhead.

For high bitwidth $b_{high}$ and low bitwidth $b_{low}$, define

$$\mathrm{shift} = b_{high} - b_{low},$$

so that the low-bit slice and its zero-point are obtained by simple truncation:

$$q_{low} = \bigl\lfloor q_{high} / 2^{\mathrm{shift}} \bigr\rfloor, \qquad zp_{low} = \bigl\lfloor zp_{high} / 2^{\mathrm{shift}} \bigr\rfloor$$

Bit-slice composition and value dequantization proceed as

$$q_{full} = \bigl(q_{MSB} \ll b_{LSB}\bigr) + q_{LSB}, \qquad w = \mathrm{scale}_h\,\bigl(q_{full} - zp_h\bigr)$$

This construction ensures exact compatibility between low- and high-bit slices, simplifying both cache management and hardware implementation.
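
These rules map directly onto integer shifts and masks. The sketch below is a minimal NumPy rendering of the equations above with a round-trip check; it assumes unsigned quantization codes and is not the paper's kernel.

import numpy as np

def amat_slices(q_high, zp_high, b_high, b_low):
    """Truncation-based slice extraction: MSB slice = top b_low bits,
    LSB slice = remaining shift = b_high - b_low bits."""
    shift = b_high - b_low
    q_msb = q_high >> shift                 # floor(q_high / 2**shift)
    zp_low = zp_high >> shift               # floor(zp_high / 2**shift)
    q_lsb = q_high & ((1 << shift) - 1)     # low-order bits
    return q_msb, q_lsb, zp_low

def amat_dequantize(q_msb, q_lsb, b_lsb, scale_h, zp_high):
    """q_full = (q_MSB << b_LSB) + q_LSB ; w = scale_h * (q_full - zp_h)."""
    q_full = (q_msb << b_lsb) + q_lsb
    return scale_h * (q_full - zp_high)

# Round-trip check: random 8-bit codes split into a 4-bit MSB and 4-bit LSB slice.
q = np.random.randint(0, 256, size=16)
msb, lsb, _ = amat_slices(q, zp_high=128, b_high=8, b_low=4)
assert np.array_equal((msb << 4) + lsb, q)
w = amat_dequantize(msb, lsb, b_lsb=4, scale_h=0.02, zp_high=128)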

5. Predictive Cache Warmup

Single-batch inference epochs include a prefill phase (broad expert access via parallelism) and a decode phase (narrow, frequent reuse of a small expert subset). The Predictive Cache Warmup (PCW) mechanism exploits the empirical correlation of “hot” experts between prefill and early decode. PCW records per-slice access counts $c_s$ during prefill, then, at the prefill-to-decode transition:

  1. Evicts LSB slices with the smallest $c_s$.
  2. Evicts MSB slices in ascending order of $c_s$ until DRAM constraints are met.

Pseudocode:

procedure PCW(cache, prefill_counts, C_decode):
    # Coldest slices are evicted first; LSB slices before MSB slices.
    sort all LSB slices by prefill_counts ascending
    evict the coldest LSB slices until cache.size ≤ C_decode
    sort all MSB slices by prefill_counts ascending
    evict the coldest MSB slices until cache.size ≤ C_decode

This preemption creates a cache population closely matched to anticipated decode-stage access patterns, reducing cold misses and access costs.
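
Reusing the illustrative SliceCache sketch from Section 2, a minimal rendering of PCW might look as follows; prefill_counts is assumed to map (expert, slice) keys to prefill access counts.

def predictive_cache_warmup(cache, prefill_counts, decode_budget_bits):
    """Shrink the cache to the decode budget, evicting the coldest LSB
    slices first and the coldest MSB slices second (a sketch)."""
    for victim_kind in ('LSB', 'MSB'):
        # Slices of this kind, coldest (lowest prefill count) first.
        victims = sorted(
            (k for k in cache.entries if k[1] == victim_kind),
            key=lambda k: prefill_counts.get(k, 0),
        )
        for key in victims:
            if cache.size <= decode_budget_bits:
                return
            cache.size -= cache.entries.pop(key)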

6. Empirical Evaluation and Performance

DBSC and AMAT, as part of SliceMoE, were evaluated on DeepSeek-V2-Lite (160 experts) and Qwen1.5-MoE-A2.7B (240 experts) on the 5-shot GSM8K benchmark. The hardware platform comprised an XPU (1 GHz, 8192 8-bit PEs), 8 GB of LPDDR4 DRAM, and 128 GB of UFS 3.1 Flash. Key quantitative results:

  • Decode energy reduction by up to 2.37× (DeepSeek-V2-Lite) and 2.85× (Qwen1.5-MoE-A2.7B).
  • Decode latency improvement up to 1.81× and 1.64×, respectively.
  • DBSC+AMAT achieves accuracy near the high-bit reference along the Pareto frontier for miss rate and energy.
  • Predictive Cache Warmup provides up to an additional 2.31× energy reduction and 1.96× speed-up over a cold (empty) cache (Choi et al., 15 Dec 2025).

7. Deployment Considerations and Future Directions

DBSC requires hardware capable of fetching and combining bit-slices at inference time and introduces more intricate per-slice metadata and LRU tracking. Deployments in extremely tight memory regimes or with low-bitwidth (<2 bits) slices may see accuracy degradation. Gate threshold and prefill–decode correlation parameters may require model-specific tuning.

Prospective research avenues include extending DBSC to multi-slice (beyond MSB/LSB) hierarchies, joint optimization of expert routing and slice allocation, native hardware support for bit-slice streaming, and adaptive on-device learning of cache hotness patterns to suppress cold misses even further (Choi et al., 15 Dec 2025).
