
Dynamic Token/Tile Adaptive Quantization

Updated 14 April 2026
  • Dynamic Token/Tile Adaptive Quantization is a method that adjusts bit precision for each token or tile based on local information content and saliency.
  • Techniques such as adaptive bit allocation, routing with small neural experts, and recursive residual quantization optimize accuracy while saving memory and computation.
  • Empirical studies in language and vision models show significant memory savings and speedups with minimal accuracy loss by dynamically managing quantization error.

Dynamic Token/Tile Adaptive Quantization refers to a class of methods that allocate quantization precision or bit-rate in a spatially or semantically adaptive manner at the granularity of tokens (elements in a sequence, typically in LLMs) or tiles (spatial blocks, as in images or hidden activations). Rather than assigning a fixed number of quantization bits uniformly, these methods dynamically adjust bit-width or allocate quantization resources according to the local information content, token- or tile-wise sensitivity, or saliency, enabling improved trade-offs between model accuracy, computational/memory efficiency, and communication overhead.

1. Motivation and Overview of Dynamic Quantization

Conventional quantization schemes apply a static bit-width uniformly across all data elements (tokens or tiles) within a model component, disregarding heterogeneity in signal sensitivity or statistical distribution. However, empirical studies show that in both large language models (LLMs) and vision-language models (VLMs), quantization error is highly concentrated in specific locations: either on rare "outlier" tokens or in spatial/image regions with high local complexity. This motivates adaptive allocation of bit-width or quantization parameters, tailored to token-level or tile-level statistics, to minimize overall error for a given resource budget (Wang et al., 21 Feb 2026, He et al., 2 Jun 2025, Chen et al., 2024, Minnen et al., 2018).

Token-adaptive quantization targets the dynamic, context-sensitive distribution of activations in LLMs or VLMs, while tile-adaptive quantization is particularly valuable in imaging and vision, where spatial complexity or saliency can vary at fine granularity.

2. Algorithmic Approaches and Mathematical Formulation

Adaptive Bit Allocation and Quantization Operators

Token- or Tile-wise Bit-width Selection: A central mechanism is to determine, for each token or tile $t$, the required number of quantization bits $b_t$ or, equivalently, the number of quantization "slices" or blocks to employ. Allocation is typically driven by measurements of local information content, such as activation entropy (He et al., 2 Jun 2025), token-level sensitivity (Wang et al., 21 Feb 2026), or local distortion/residual error (Minnen et al., 2018).

Quantization Operator: Consider a data vector $x_t \in \mathbb{R}^d$ (e.g., a token embedding or image tile), quantized using a per-token or per-tile operator:

$$Q(x_t; b_t) = \mathrm{clip}\left(\mathrm{round}\left(\frac{x_t}{s_t}\right) + z_t,\ 0,\ 2^{b_t}-1\right),$$

where $s_t$ and $z_t$ are local scale and zero-point parameters, which may be dynamically determined per token/tile. Some methods hierarchically decompose quantization into multiple residual slices, each capturing finer quantization detail, to facilitate elastic allocation (Wang et al., 21 Feb 2026).
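The operator above can be sketched in a few lines of plain Python. This is a minimal illustration, not any cited method's implementation: the min/max rule used to pick the scale and zero-point is an assumption (the text only says they are local and may be set dynamically), and the function names are hypothetical.

```python
def quantize(x, bits):
    """Asymmetric uniform quantization of one token/tile vector to `bits` bits.

    Returns (codes, scale, zero_point) with every code in [0, 2**bits - 1].
    """
    qmax = 2 ** bits - 1
    lo, hi = min(x), max(x)
    # Assumption: s_t and z_t come from this vector's own min/max range.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    # clip(round(x / s) + z, 0, 2^b - 1), applied element-wise.
    codes = [min(max(round(v / scale) + zero_point, 0), qmax) for v in x]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate real values."""
    return [(q - zero_point) * scale for q in codes]


def max_abs_error(x, bits):
    """Largest reconstruction error when x is quantized to `bits` bits."""
    codes, scale, zero_point = quantize(x, bits)
    recon = dequantize(codes, scale, zero_point)
    return max(abs(a - b) for a, b in zip(x, recon))
```

Calling `max_abs_error` on the same vector with different `bits` values shows the trade-off that adaptive allocation exploits: each extra bit roughly halves the step size for that token or tile.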

Token/Tile Selection Criteria:

  • Entropy-based: Assign higher bit-widths to tokens/tiles with larger activation entropy,

$$H(x_t) = -\sum_k p_{t,k} \log p_{t,k}$$

where $p_{t,k}$ are normalized component magnitudes (He et al., 2 Jun 2025).

  • Saliency-based: For tokens in attention mechanisms, assign bit-widths based on normalized attention scores, reflecting their downstream impact (He et al., 2024).
  • Residual/distortion-based: Dynamically increase bits for a tile until its local distortion (e.g., MSE) falls below a threshold (Minnen et al., 2018).
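As an illustration of the entropy-based criterion, the sketch below computes the entropy of normalized magnitudes for each token and gives the higher bit-width to the highest-entropy fraction. The specific bit-widths, the 50% split, and the function names are assumptions made for the example; the cited work may allocate differently.

```python
import math


def activation_entropy(x):
    """H(x_t) = -sum_k p_k log p_k with p_k = |x_k| / sum_j |x_j|."""
    total = sum(abs(v) for v in x)
    if total == 0.0:
        return 0.0
    probs = [abs(v) / total for v in x]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def allocate_bits(tokens, low_bits=3, high_bits=4, high_fraction=0.5):
    """Assign `high_bits` to the highest-entropy tokens and `low_bits` to the rest.

    `high_fraction` is a hypothetical budget knob: the share of tokens
    allowed to use the higher precision.
    """
    entropies = [activation_entropy(x) for x in tokens]
    n_high = round(high_fraction * len(tokens))
    ranked = sorted(range(len(tokens)), key=lambda i: entropies[i], reverse=True)
    high = set(ranked[:n_high])
    return [high_bits if i in high else low_bits for i in range(len(tokens))]
```

A token whose energy sits in a single component has entropy near zero and lands in the low-bit group, while a token with evenly spread magnitudes reaches the maximum entropy of log d.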

Routing and MoE-style Quantization

Several frameworks employ token- or tile-aware routers: small neural networks that predict, per data element, the necessary quantization configuration—such as number of bit slices, selection of specialized experts for error compensation, or block-wise allocation (Wang et al., 21 Feb 2026, Jia et al., 27 Feb 2026).

A common structure is a gating function

$$G_{t,e} = \mathbf{1}(S_{t,e} > \delta)$$

where $S_{t,e}$ are router logits indicating the need for slice $e$ for token $t$, and $\delta$ is the gating threshold. The bit allocation $b_t$ then follows from the slices that the gate activates for that token.
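A minimal sketch of such a gate is shown below. It thresholds the router logits exactly as in the indicator function, then derives a per-token bit-width under the assumption that every active slice contributes the same fixed number of bits. That mapping from gates to bits, the parameter `bits_per_slice`, and the function name are illustrative assumptions, not a formula taken from the cited frameworks.

```python
def route_bits(router_logits, delta=0.0, bits_per_slice=2):
    """Threshold router logits into binary gates and sum them into bit-widths.

    router_logits[t][e] plays the role of S_{t,e}; gates[t][e] is G_{t,e}.
    Assumption: each active slice adds `bits_per_slice` bits for that token.
    """
    gates = [[1 if s > delta else 0 for s in row] for row in router_logits]
    bits = [bits_per_slice * sum(row) for row in gates]
    return gates, bits
```

In practice a router would usually be constrained so that at least one slice stays active per token; that safeguard is omitted here to keep the gate itself visible.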

3. Key Techniques and Methodological Innovations

Recursive Residual Quantization

MoBiQuant introduces a "many-in-one" recursive residual quantization scheme (MoBiSlice) that decomposes weights into multiple quantized residual slices. At inference, the cumulative sum of a selected subset of slices yields the quantized weights at the requested bit-width. This enables smooth, calibration-free precision switching and token-adaptive bit-width allocation, mitigating "precision-dependent outlier migration" (Wang et al., 21 Feb 2026).
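The general idea of recursive residual quantization can be sketched as follows: quantize the weights coarsely, quantize what is left over, and repeat, so that summing the first k slices gives a progressively finer approximation. This is a generic residual quantizer written for illustration, with a symmetric per-slice scale chosen as an assumption; it is not the MoBiSlice algorithm itself, and all names are hypothetical.

```python
def quantize_slice(residual, bits):
    """Symmetric uniform quantization of a residual; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in residual)
    if peak == 0.0:
        return [0.0] * len(residual)
    scale = peak / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in residual]


def residual_slices(weights, num_slices=3, bits_per_slice=2):
    """Decompose weights into slices, each quantizing the previous residual."""
    slices, residual = [], list(weights)
    for _ in range(num_slices):
        current = quantize_slice(residual, bits_per_slice)
        slices.append(current)
        residual = [r - q for r, q in zip(residual, current)]
    return slices


def reconstruct(slices, k):
    """Approximate the weights by summing the first k slices (k >= 1)."""
    return [sum(values) for values in zip(*slices[:k])]
```

Because every slice is computed once and stored, switching precision at inference only changes how many slices are summed, which is what makes per-token selection cheap in schemes of this kind.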

Tile-wise Adaptive Quantization in Vision

TAH-Quant partitions activations into tiles, and allocates bit-width per token based on a tile's entropy, assigning higher bits to "information-rich" tokens/tiles and lower bits where activations are sparse or dominated by outliers. A pivot-based Hadamard transform further suppresses the effect of outliers within a tile (He et al., 2 Jun 2025).
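To see why a Hadamard rotation helps with outliers inside a tile, the sketch below applies a plain orthonormal fast Walsh-Hadamard transform. It is not the pivot-based variant used by TAH-Quant, whose details are not given here; it only shows the underlying effect that a single large value is spread evenly across the tile, shrinking the range a quantizer must cover.

```python
import math


def hadamard_transform(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine elements h apart into sum and difference.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in y]
```

With this normalization the transform is its own inverse, so applying it a second time after dequantization recovers the original layout of the tile.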

In image compression, spatial tiles are processed by a recurrent autoencoder, with early stopping based on local distortion thresholds. Bit-rate is thus spatially adapted without an explicit per-tile scale or step size (Minnen et al., 2018).
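The early-stopping logic can be illustrated without a neural codec. In the sketch below a simple uniform quantizer stands in for the recurrent autoencoder, which is a deliberate substitution, and bits are added to a tile until its mean squared error drops below a threshold. The threshold value, bit range, and function names are assumptions for the example.

```python
def mse_after_quantization(tile, bits):
    """Mean squared error of a tile after uniform quantization to `bits` bits."""
    qmax = 2 ** bits - 1
    lo, hi = min(tile), max(tile)
    if hi == lo:
        return 0.0  # A flat tile is represented exactly at any bit-width.
    scale = (hi - lo) / qmax
    recon = [lo + round((v - lo) / scale) * scale for v in tile]
    return sum((a - b) ** 2 for a, b in zip(tile, recon)) / len(tile)


def bits_for_tile(tile, max_mse, min_bits=1, max_bits=8):
    """Return the smallest bit-width whose distortion is at most `max_mse`."""
    for bits in range(min_bits, max_bits + 1):
        if mse_after_quantization(tile, bits) <= max_mse:
            return bits
    return max_bits
```

Flat tiles stop at the minimum bit-width while textured tiles keep refining, so the total bit budget concentrates where local distortion is hardest to reduce.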

Token/Tile-aware Error Compensation

Quant Experts applies a mixture-of-experts approach: channels are partitioned into globally important (token-independent) vs. locally important (token-dependent) sets using frequency statistics over token activations. A shared expert compensates quantization error on token-independent channels, while routed experts adaptively compensate errors for input-dependent (token/tile) channel patterns, with a lightweight routing matrix selecting the best expert per input (Jia et al., 27 Feb 2026).

4. Salient Token/Tile Identification and Outlier Handling

Dynamic token/tile quantization often relies on identification of critical elements:

  • Saliency metrics (ZipCache): Normalized attention scores provide an unbiased estimate of each token's importance, mitigating position bias from causal attention masking. Tokens are ranked and assigned either high or low bit-width quantization accordingly (He et al., 2024).
  • Outlier isolation (PrefixQuant): PrefixQuant isolates rare token-wise outliers by identifying, via a fixed calibration pass, which tokens exhibit activation maxima exceeding a factor threshold over the median. These tokens are prefixed in the KV cache, separating them from the quantization scale used for the rest, thereby enabling static low-bit quantization for non-outlier tokens without per-token dynamic scaling (Chen et al., 2024).
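Both identification steps above reduce to short computations, sketched here under stated assumptions. The saliency function divides each key's column sum in a causal attention matrix by the number of queries that can attend to it, which is one reading of "normalized attention scores". The outlier function flags tokens whose peak magnitude exceeds a multiple of the median peak; the factor of 64 is a hypothetical default, and neither function is taken from the ZipCache or PrefixQuant code.

```python
from statistics import median


def normalized_saliency(attn):
    """Per-key saliency from a causal (lower-triangular) attention matrix.

    attn[i][j] is the weight query i places on key j. Key j is visible to
    queries j..n-1, so its column sum is divided by (n - j).
    """
    n = len(attn)
    return [sum(attn[i][j] for i in range(j, n)) / (n - j) for j in range(n)]


def outlier_tokens(token_activations, factor=64.0):
    """Indices of tokens whose peak |activation| exceeds factor * median peak."""
    peaks = [max(abs(v) for v in x) for x in token_activations]
    typical = median(peaks)
    return [i for i, p in enumerate(peaks) if p > factor * typical]
```

Without the division by the visible-query count, early tokens would accumulate larger sums simply because more queries can see them, which is the position bias the normalization is meant to reduce.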

5. Memory, Computational, and Implementation Considerations

Adaptive quantization methods must carefully balance overhead:

  • Parameter storage: Fine-grained groupwise quantization incurs high storage for per-token/group scalars. Channel-separable schemes (ZipCache) apply per-channel normalization before quantization, reducing the parameter count to orders of magnitude less than groupwise per-token scalars (He et al., 2024).
  • Computation: Online routers and reconstruction experts contribute little overhead compared to matrix-matrix multiplications when configured with a modest adapter rank and a small number of experts, keeping their total compute below 1% of the baseline (Jia et al., 27 Feb 2026, Wang et al., 21 Feb 2026).
  • Memory: Methods such as TAH-Quant avoid extra activation or gradient storage by working on current tiles only, in contrast to methods requiring activation error compensation buffers (He et al., 2 Jun 2025).

Implementations for large-scale or bandwidth-limited settings are hardware-aware. Examples include tensor packing for coalesced GPU memory access (ZipCache), bit-major kernel optimization (MoBiQuant), and fusion of quantization normalization into kernel launches.

6. Empirical Performance and Trade-offs

Dynamic token/tile adaptive quantization methods consistently demonstrate that careful, input-adaptive allocation achieves superior accuracy-compression trade-offs over static quantization:

| Method | Key Metric | Speedup/Compression | Accuracy Degradation |
|---|---|---|---|
| MoBiQuant | LLaMA3-8B, PPL=7.31 | 2.7× vs FP16 | None at matched bpp |
| ZipCache | KV cache, 4.98× compress | 19.8% memory save | 0.38% drop |
| PrefixQuant | LLaMA3-8B, 2.81× speedup | >2.7× GEMM speedup | ≤1% (with finetune) |
| QuantExperts | Qwen2VL-72B, W4A6 | 3.5–4.5× vs full-16 | +5.44% over MBQ |
| TAH-Quant | GPT-2XL pipeline, INT3/4 | Up to 4.3× | Matches FP16, AQ-SGD |

A common finding is that the memory and throughput savings scale nearly linearly with the fraction of elements assigned low bit-width, while empirical accuracy displays an "elbow": beyond a threshold reduction in high-precision assignments, error begins to rise sharply (He et al., 2024).
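The near-linear scaling follows from simple arithmetic on the average bit-width, as the sketch below shows. It is a back-of-the-envelope model only: it ignores the storage for scales, zero-points, and routing metadata, and the default bit-widths are illustrative assumptions rather than settings from any cited method.

```python
def average_bits(low_fraction, low_bits=2, high_bits=4):
    """Mean bits per element when `low_fraction` of elements use `low_bits`."""
    return low_fraction * low_bits + (1.0 - low_fraction) * high_bits


def compression_vs_fp16(low_fraction, low_bits=2, high_bits=4):
    """Idealized compression ratio relative to 16-bit storage."""
    return 16.0 / average_bits(low_fraction, low_bits, high_bits)
```

The average bit-width falls linearly as more elements move to low precision, so memory does too. Accuracy has no such linear model, which is why the elbow has to be located empirically.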

7. Limitations, Applicability, and Extensions

Applicability is greatest in domains with high variability in token/tile information content, notably LLM inference/generation (where token-wise activation distribution can vary widely with prompt structure, language, etc.) and image/vision tasks exhibiting spatial heterogeneity.

Limitations include:

  • Dependence on precomputed or stable outlier/token saliency statistics (PrefixQuant): if outlier identities are highly dynamic, offline isolation may not capture all problematic elements (Chen et al., 2024).
  • Overhead: For models with massive input spaces, the per-token routing logic and scale/bias updates may impose nontrivial inference cost.
  • Calibration: While calibration-free switching is enabled in some frameworks (MoBiQuant), others require per-configuration grid search or block-wise finetuning to optimally set static quantization parameters.

Generalizations include extension to other axes (time, batch, etc.), and across modalities (tile-adaptive quantization of vision-layer patches; channel-adaptive quantization for multi-head attention). Several methods directly support both token-adaptive (sequence) and tile-adaptive (spatial/image) regimes (Jia et al., 27 Feb 2026, He et al., 2 Jun 2025).

In summary, dynamic token/tile adaptive quantization represents a confluence of input-aware resource allocation, hierarchical or MoE compensation, and context-sensitive error control, supporting robust low-bit inference and training with minimal accuracy loss across a spectrum of neural architectures and modalities (Wang et al., 21 Feb 2026, He et al., 2 Jun 2025, Chen et al., 2024, Minnen et al., 2018, Jia et al., 27 Feb 2026, He et al., 2024).
