Dynamic Bit-Sliced Caching for MoE Models
- Dynamic Bit-Sliced Caching (DBSC) is a cache management scheme that partitions expert weight tensors into MSB and LSB slices for efficient on-device MoE inference.
- It employs a precision-on-demand mechanism and predictive cache warmup to dynamically allocate high precision only when needed, maintaining low miss rates under tight DRAM constraints.
- Empirical evaluations show that DBSC significantly reduces decode energy and latency, achieving near-maximal accuracy while minimizing costly Flash memory accesses.
Dynamic Bit-Sliced Caching (DBSC) is a cache management scheme for on-device inference with large Mixture-of-Experts (MoE) models under stringent miss-rate constraints. DBSC operates at the granularity of quantized bit-slices of expert weight tensors, caching the most critical precision slices to maximize effective expert capacity and reduce cache miss penalties. Integrated with a precision-on-demand mechanism and specialized quantization, DBSC enables energy- and latency-efficient deployment of MoE models within limited DRAM budgets, dramatically reducing high-latency Flash accesses while preserving near-maximal inference accuracy (Choi et al., 15 Dec 2025).
1. Problem Setting and Motivation
Large-scale MoE LLMs feature tens of billions of expert parameters, often exceeding the few gigabytes available in on-device DRAM. Standard deployments partition experts between DRAM (for fast, low-energy access) and Flash storage (10–100× slower, 50–100× more energy per bit). Even moderate cache miss rates (e.g., 10–30%) incur significant energy and latency penalties, rapidly dominating inference costs and rendering on-device serving impractical without more sophisticated cache control. For practical inference, the instantaneous cache miss rate $M$ must typically remain below 5%.
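To make the penalty concrete, the following back-of-the-envelope sketch (normalized, illustrative numbers; Flash taken at the upper end of the quoted 50–100× per-bit energy range) shows how quickly misses come to dominate:

```python
# Expected energy per bit under miss rate M (illustrative numbers only):
#   E[energy] = (1 - M) * E_dram + M * E_flash
E_DRAM = 1.0      # normalized DRAM energy per bit
E_FLASH = 100.0   # Flash at the upper end of the quoted 50-100x range

for miss_rate in (0.30, 0.10, 0.05):
    expected = (1 - miss_rate) * E_DRAM + miss_rate * E_FLASH
    print(f"M = {miss_rate:.0%}: expected energy = {expected:.2f}x DRAM")
# M = 30%: expected energy = 30.70x DRAM
# M = 10%: expected energy = 10.90x DRAM
# M = 5%:  expected energy = 5.95x DRAM
```

Even a 10% miss rate makes Flash traffic, not DRAM, the dominant energy term.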
2. Bit-Slice Caching Principles and Workflow
Each expert's quantized weight tensor is partitioned into two bit-slices: a most significant bits (MSB) slice of $b_M$ bits and a least significant bits (LSB) slice of $b_L$ bits, with full precision $b_H = b_M + b_L$. Caching only the MSB slice permits a low-bit approximation of expert weights, sufficient for non-critical experts, while a subset of experts can be recombined with the LSB slice for full precision. This slice-level approach enables more fine-grained use of cache, increasing the number of distinct expert representations resident in DRAM and boosting the cache hit probability within a strict memory budget.
Memory Footprint Formulation
Given hidden dimension $d$ and $E$ experts, with $N_e$ parameters per expert, the DRAM footprint of a single $b$-bit slice is

$$S(b) = \frac{N_e \, b}{8} \ \text{bytes}.$$

If a fraction $\alpha$ of cached experts retain high precision ($b_M + b_L$ bits) and $1-\alpha$ retain low precision ($b_M$ bits), the average bits per expert and the effective expert capacity with DRAM budget $C_{\max}$ are

$$\bar{b} = \alpha\,(b_M + b_L) + (1-\alpha)\,b_M, \qquad E_{\mathrm{eff}} = \left\lfloor \frac{8\,C_{\max}}{N_e\,\bar{b}} \right\rfloor.$$
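Plugging in example numbers (hypothetical, not from the paper) illustrates the capacity gain from MSB-only caching:

```python
# Illustrative capacity math for the formulas above (example numbers only).
N_e = 50_000_000      # parameters per expert (hypothetical)
b_M, b_L = 4, 2       # MSB / LSB slice bitwidths (hypothetical 4+2 split)
C_max = 4 * 2**30     # 4 GiB DRAM budget for expert weights

def effective_capacity(alpha: float) -> int:
    """E_eff: experts resident in DRAM when fraction alpha keeps its LSB slice."""
    avg_bits = alpha * (b_M + b_L) + (1 - alpha) * b_M
    return int(8 * C_max // (N_e * avg_bits))

for alpha in (1.0, 0.25, 0.0):
    print(f"alpha = {alpha:.2f}: E_eff = {effective_capacity(alpha)}")
# alpha = 1.00: E_eff = 114   (all experts at full 6-bit precision)
# alpha = 0.25: E_eff = 152
# alpha = 0.00: E_eff = 171   (MSB-only caching fits ~50% more experts)
```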
Slice-Level Cache Management Algorithm
The DBSC eviction and admission protocol operates as follows:
- MSB slices are managed via standard LRU.
- LSB slices have lowest priority and are evicted first when capacity pressure arises.
- After each batch, if the global miss rate $M$ exceeds the target $M_{\text{target}}$, low-priority slices are evicted until the constraint is satisfied.
Sample pseudocode:
```
procedure DBSC_STEP(requested_experts, cache, C_max, miss_rate_target):
    for e in requested_experts:
        if cache.contains(e, slice='MSB'):
            record_hit(e, 'MSB')
        else:
            cache.load(e, 'MSB')            # Flash -> DRAM fetch
            record_miss(e, 'MSB')
            evict_if_needed(cache, C_max)
        if requires_high_precision(e):
            if cache.contains(e, slice='LSB'):
                record_hit(e, 'LSB')
            else:
                cache.load(e, 'LSB')
                record_miss(e, 'LSB')
                evict_if_needed(cache, C_max)
    current_M = compute_miss_rate()
    if current_M > miss_rate_target:
        evict_least_valuable(cache, until=miss_rate_target)
```
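A minimal runnable sketch of this slice-level policy follows; the `SliceCache` class and its API are illustrative stand-ins, not the paper's implementation:

```python
from collections import OrderedDict

class SliceCache:
    """LRU cache over (expert_id, slice) entries; LSB slices evict first."""
    def __init__(self, capacity_slices):
        self.capacity = capacity_slices
        self.entries = OrderedDict()           # (expert_id, slice) -> None, LRU-first
        self.hits = self.misses = 0

    def _evict_one(self):
        # Prefer the least-recently-used LSB slice; fall back to the LRU MSB slice.
        for key in self.entries:               # OrderedDict iterates oldest-first
            if key[1] == 'LSB':
                del self.entries[key]
                return
        self.entries.popitem(last=False)       # oldest MSB slice

    def access(self, expert_id, slc):
        key = (expert_id, slc)
        if key in self.entries:
            self.entries.move_to_end(key)      # refresh recency
            self.hits += 1
        else:
            self.misses += 1                   # models a Flash -> DRAM fetch
            while len(self.entries) >= self.capacity:
                self._evict_one()
            self.entries[key] = None

def dbsc_step(cache, routed):
    """routed: list of (expert_id, needs_full_precision) pairs for one token."""
    for expert_id, needs_lsb in routed:
        cache.access(expert_id, 'MSB')         # every selected expert needs its MSB slice
        if needs_lsb:
            cache.access(expert_id, 'LSB')     # precision-on-demand (Section 3)

cache = SliceCache(capacity_slices=8)
dbsc_step(cache, [(3, True), (7, False), (11, False), (14, False)])
print(cache.hits, cache.misses)                # 0 5 on a cold cache
```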
3. Precision-On-Demand Mechanism
DBSC leverages the typically steep distribution of gating scores by assigning precision dynamically on a per-token basis:
- All selected experts fetch their MSB slice.
- Only experts whose gating score $g_e$ surpasses a token-specific threshold $\tau$ fetch the LSB slice, enabling full precision.

The optimization objective is to maximize expected accuracy subject to the memory budget $C_{\max}$ and the miss-rate constraint $M \le M_{\text{target}}$:

$$\max_{\tau} \; \mathbb{E}\!\left[\mathrm{Acc}(\tau)\right] \quad \text{s.t.} \quad \frac{N_e \, \bar{b}(\tau)}{8} \, E_{\mathrm{eff}} \le C_{\max}, \qquad M \le M_{\text{target}}.$$

This approach ensures that the bulk of experts operate at lower precision, conserving memory and energy, while critical experts maintain full expressiveness.
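A hedged sketch of per-token precision assignment follows; the top-mass thresholding rule and the `tau` parameter are illustrative assumptions, since the paper's exact token-specific criterion is not reproduced here:

```python
def precision_on_demand(gating_scores: dict[int, float], tau: float = 0.5):
    """Return (msb_experts, lsb_experts): all selected experts fetch MSB;
    only the highest-scoring experts covering a tau fraction of the total
    gate mass additionally fetch LSB (full precision)."""
    msb = list(gating_scores)
    total = sum(gating_scores.values())
    lsb, covered = [], 0.0
    for e, g in sorted(gating_scores.items(), key=lambda kv: -kv[1]):
        if covered >= tau * total:
            break
        lsb.append(e)
        covered += g
    return msb, lsb

scores = {3: 0.61, 7: 0.22, 11: 0.09, 14: 0.08}
print(precision_on_demand(scores, tau=0.5))
# ([3, 7, 11, 14], [3])  -- only expert 3 is promoted to full precision
```

With a steep gating distribution, one or two experts typically capture most of the gate mass, so most experts stay MSB-only.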
4. Calibration-Free Asymmetric Matryoshka Quantization (AMAT)
To permit seamless mixed-precision expert caching, DBSC employs Calibration-Free Asymmetric Matryoshka Quantization (AMAT). AMAT enables truncation-based extraction of low- and high-bit slices from a single quantized tensor and its zero-point without duplicate storage or additional calibration overhead.
For high bitwidth $b_H$ and low bitwidth $b_M = b_H - b_L$, each weight is stored once as an asymmetric $b_H$-bit code $q \in \{0, \dots, 2^{b_H}-1\}$ with scale $s$ and zero-point $z$; the slices and the sliced zero-point follow by truncation:

$$q_{\mathrm{MSB}} = \lfloor q / 2^{b_L} \rfloor, \qquad q_{\mathrm{LSB}} = q \bmod 2^{b_L}, \qquad z_{\mathrm{MSB}} = \lfloor z / 2^{b_L} \rfloor.$$

Bit-slice composition and value dequantization proceed as:

$$q = q_{\mathrm{MSB}} \cdot 2^{b_L} + q_{\mathrm{LSB}}, \qquad \hat{w}_{\mathrm{high}} = s\,(q - z), \qquad \hat{w}_{\mathrm{low}} = s \cdot 2^{b_L}\,(q_{\mathrm{MSB}} - z_{\mathrm{MSB}}).$$
This construction ensures exact compatibility between low- and high-bit slices, simplifying both cache management and hardware implementation.
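The construction can be sanity-checked with a short round-trip; the 4+2 bit split, scale, and zero-point below are example values, not the paper's configuration:

```python
import numpy as np

b_H, b_L = 6, 2                    # 6-bit codes split into 4 MSB + 2 LSB bits
rng = np.random.default_rng(0)
q = rng.integers(0, 2**b_H, size=8)            # asymmetric b_H-bit codes
z = 2**(b_H - 1)                               # example zero-point
s = 0.05                                       # example scale

q_msb, q_lsb = q >> b_L, q & ((1 << b_L) - 1)  # slice by truncation
z_msb = z >> b_L                               # zero-point slices consistently

w_high = s * (q - z)                           # full-precision dequantization
w_low  = s * (2**b_L) * (q_msb - z_msb)        # MSB-only approximation

assert np.array_equal(q, (q_msb << b_L) + q_lsb)   # exact recomposition
print(np.max(np.abs(w_high - w_low)))              # error bounded by s * 2**b_L
```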
5. Predictive Cache Warmup
Single-batch inference epochs include a prefill phase (broad expert access via parallelism) and a decode phase (narrow, frequent reuse of a small expert subset). The Predictive Cache Warmup (PCW) mechanism exploits the empirical correlation of “hot” experts between prefill and early decode. PCW records per-slice access counts $c$ during prefill, then, at the prefill-to-decode transition:
- Evicts LSB slices with the smallest $c$ first.
- Then evicts MSB slices in ascending order of $c$ until the decode DRAM budget is met.
Pseudocode:
```
procedure PCW(cache, prefill_counts, C_decode):
    sort all LSB slices by prefill_counts ascending
    evict LSB slices in that order until cache.size ≤ C_decode
    sort all MSB slices by prefill_counts ascending
    evict MSB slices in that order until cache.size ≤ C_decode
```
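A runnable sketch under the same assumptions (per-slice byte sizes and access counts are illustrative):

```python
def pcw(cache: dict, prefill_counts: dict, c_decode: int) -> dict:
    """Shrink `cache` ({(expert, slice): size_bytes}) to the decode budget,
    dropping cold LSB slices first, then cold MSB slices."""
    def evict_pass(slice_kind: str):
        victims = sorted(
            (k for k in cache if k[1] == slice_kind),
            key=lambda k: prefill_counts.get(k, 0),    # coldest first
        )
        for k in victims:
            if sum(cache.values()) <= c_decode:
                break
            del cache[k]
    evict_pass('LSB')
    evict_pass('MSB')
    return cache

cache  = {(0,'MSB'): 4, (0,'LSB'): 2, (1,'MSB'): 4, (1,'LSB'): 2, (2,'MSB'): 4}
counts = {(0,'MSB'): 90, (0,'LSB'): 40, (1,'MSB'): 5, (1,'LSB'): 1, (2,'MSB'): 60}
print(pcw(cache, counts, c_decode=10))
# {(0, 'MSB'): 4, (2, 'MSB'): 4}  -- cold slices go first, LSB before MSB
```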
6. Empirical Evaluation and Performance
DBSC and AMAT, as part of SliceMoE, were evaluated on DeepSeek-V2-Lite (160 experts) and Qwen1.5-MoE-A2.7B (240 experts) using the GSM8K 5-shot benchmark. The evaluation platform comprised an XPU (1 GHz, 8192 8-bit PEs), 8 GB of LPDDR4 DRAM, and 128 GB of UFS 3.1 Flash. Key quantitative results:
- Decode energy reduction by up to 2.37× (DeepSeek-V2-Lite) and 2.85× (Qwen1.5-MoE-A2.7B).
- Decode latency improvement up to 1.81× and 1.64×, respectively.
- DBSC+AMAT achieves accuracy near the high-bit reference along the Pareto frontier for miss rate and energy.
- Predictive Cache Warmup provides up to an additional 2.31× energy reduction and a 1.96× speed-up over a cold (empty) cache (Choi et al., 15 Dec 2025).
7. Deployment Considerations and Future Directions
DBSC requires hardware capable of fetching and combining bit-slices at inference time and introduces more intricate per-slice metadata and LRU tracking. Deployments in extremely tight memory regimes or with low-bitwidth (<2 bits) slices may see accuracy degradation. Gate threshold and prefill–decode correlation parameters may require model-specific tuning.
Prospective research avenues include extending DBSC to multi-slice (beyond MSB/LSB) hierarchies, joint optimization of expert routing and slice allocation, native hardware support for bit-slice streaming, and adaptive on-device learning of cache hotness patterns to suppress cold misses even further (Choi et al., 15 Dec 2025).