Dynamic Token/Tile Adaptive Quantization
- Dynamic Token/Tile Adaptive Quantization is a method that adjusts bit precision for each token or tile based on local information content and saliency.
- Techniques such as adaptive bit allocation, routing with small neural experts, and recursive residual quantization optimize accuracy while saving memory and computation.
- Empirical studies in language and vision models show significant memory savings and speedups with minimal accuracy loss by dynamically managing quantization error.
Dynamic Token/Tile Adaptive Quantization refers to a class of methods that allocate quantization precision or bit-rate in a spatially or semantically adaptive manner at the granularity of tokens (elements in a sequence, typically in LLMs) or tiles (spatial blocks, as in images or hidden activations). Rather than assigning a fixed number of quantization bits uniformly, these methods dynamically adjust bit-width or allocate quantization resources according to the local information content, token- or tile-wise sensitivity, or saliency, enabling improved trade-offs between model accuracy, computational/memory efficiency, and communication overhead.
1. Motivation and Overview of Dynamic Quantization
Conventional quantization schemes apply a static bit-width uniformly across all data elements (tokens or tiles) within a model component, disregarding heterogeneity in signal sensitivity or statistical distribution. However, empirical studies show that in both LLMs and vision-language models (VLMs), quantization error is highly concentrated in specific locations: either on rare "outlier" tokens or on spatial/image regions with increased local complexity. This motivates adaptive allocation of bit-width or quantization parameters, tailored to token-level or tile-level statistics, to minimize overall error for a given resource budget (Wang et al., 21 Feb 2026, He et al., 2 Jun 2025, Chen et al., 2024, Minnen et al., 2018).
Token-adaptive quantization targets the dynamic, context-sensitive distribution of activations in LLMs or VLMs, while tile-adaptive quantization is particularly valuable in imaging and vision, where spatial complexity or saliency can vary at fine granularity.
2. Algorithmic Approaches and Mathematical Formulation
Adaptive Bit Allocation and Quantization Operators
Token- or Tile-wise Bit-width Selection: A central mechanism is to determine, for each token or tile $i$, the required number of quantization bits $b_i$ or, equivalently, the number of quantization "slices" or blocks to employ. Allocation is typically driven by measurements of local information content, such as activation entropy (He et al., 2 Jun 2025), token-level sensitivity (Wang et al., 21 Feb 2026), or local distortion/residual error (Minnen et al., 2018).
Quantization Operator: Consider a data vector $x$ (e.g., token embedding or image tile), quantized using a per-token or per-tile operator, written here in the standard affine form: $Q(x) = \mathrm{clamp}\left(\lfloor x / s \rceil + z,\ 0,\ 2^{b} - 1\right)$, where $b$ is the bit-width and $s$ and $z$ are local scale and zero-point parameters, which may be dynamically determined per token/tile. Some methods hierarchically decompose quantization into multiple residual slices, each capturing finer quantization detail, to facilitate elastic allocation (Wang et al., 21 Feb 2026).
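As a concrete illustration of a per-token or per-tile operator with a local scale and zero-point, the following is a minimal sketch of standard asymmetric (affine) quantization in plain Python. The function names `quantize_token` and `dequantize_token` are hypothetical, and min/max range estimation is an assumption; the cited methods may choose their parameters differently.

```python
from typing import List, Tuple


def quantize_token(x: List[float], bits: int) -> Tuple[List[int], float, int]:
    """Affine quantization of one token/tile using its own scale and zero-point."""
    qmax = (1 << bits) - 1
    lo, hi = min(x), max(x)
    # Assumption: the range comes from the token's own min/max.
    # A constant vector has zero range, so fall back to a unit scale.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    # Round to the integer grid, shift by the zero-point, clamp to [0, qmax].
    codes = [min(qmax, max(0, round(v / scale) + zero_point)) for v in x]
    return codes, scale, zero_point


def dequantize_token(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [scale * (q - zero_point) for q in codes]


token = [-1.0, 0.0, 0.5, 3.0]
codes, scale, zero_point = quantize_token(token, bits=4)
restored = dequantize_token(codes, scale, zero_point)
```

Because `scale` and `zero_point` are computed from the token itself, a token with a narrow value range gets a finer grid than one with a wide range at the same bit-width.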
Token/Tile Selection Criteria:
- Entropy-based: Assign higher bit-widths to tokens/tiles with larger activation entropy, $H(x) = -\sum_j p_j \log p_j$, where $p_j = |x_j| / \sum_k |x_k|$ are normalized component magnitudes (He et al., 2 Jun 2025).
- Saliency-based: For tokens in attention mechanisms, assign bit-widths based on normalized attention scores, reflecting their downstream impact (He et al., 2024).
- Residual/distortion-based: Dynamically increase bits for a tile until its local distortion (e.g., MSE) falls below a threshold (Minnen et al., 2018).
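The entropy-based criterion above can be sketched as follows. This is an illustrative example, not any cited method's exact procedure: the helper names, the 3-bit/4-bit choice, and the rule "the top half of tiles by entropy gets the higher bit-width" are all assumptions.

```python
import math
from typing import List, Sequence


def activation_entropy(x: Sequence[float]) -> float:
    """Shannon entropy of the normalized component magnitudes of one tile."""
    total = sum(abs(v) for v in x)
    if total == 0.0:
        return 0.0
    probs = [abs(v) / total for v in x]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def allocate_bits(
    tiles: Sequence[Sequence[float]],
    low_bits: int = 3,
    high_bits: int = 4,
    high_fraction: float = 0.5,
) -> List[int]:
    """Give `high_bits` to the highest-entropy tiles and `low_bits` to the rest."""
    entropies = [activation_entropy(t) for t in tiles]
    n_high = round(high_fraction * len(tiles))
    ranked = sorted(range(len(tiles)), key=lambda i: entropies[i], reverse=True)
    high = set(ranked[:n_high])
    return [high_bits if i in high else low_bits for i in range(len(tiles))]


tiles = [
    [1.0, 1.0, 1.0, 1.0],   # evenly spread energy: high entropy
    [10.0, 0.0, 0.0, 0.0],  # single dominant value: zero entropy
    [1.0, 2.0, 3.0, 4.0],   # fairly spread: high entropy
    [0.0, 0.0, 5.0, 0.0],   # single dominant value: zero entropy
]
bit_widths = allocate_bits(tiles)
```

Tiles dominated by a single large value have near-zero entropy under this measure and therefore land in the low-bit group, which matches the description of lower bits where activations are sparse or outlier-dominated.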
Routing and MoE-style Quantization
Several frameworks employ token- or tile-aware routers: small neural networks that predict, per data element, the necessary quantization configuration—such as number of bit slices, selection of specialized experts for error compensation, or block-wise allocation (Wang et al., 21 Feb 2026, Jia et al., 27 Feb 2026).
A common structure is a gating function applied to router logits $r_{t,k}$, which indicate the need for slice $k$ for token $t$. The bit allocation for token $t$ then follows from the set of slices that the gate activates.
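One way such a router's output might be turned into a per-token bit allocation is sketched below. The sigmoid gate, the 0.5 threshold, the always-on base slice, and the prefix rule (a slice is used only if all coarser slices are used) are assumptions made for illustration; `route_token` is a hypothetical name, not an API from the cited work.

```python
import math
from typing import List, Tuple


def route_token(
    logits: List[float],
    slice_bits: List[int],
    threshold: float = 0.5,
) -> Tuple[List[int], int]:
    """Convert per-slice router logits for one token into gates and a bit-width."""
    gates: List[int] = []
    still_open = True
    for k, logit in enumerate(logits):
        prob = 1.0 / (1.0 + math.exp(-logit))
        # Assumption: slice 0 is always kept; each later slice refines the
        # previous ones, so selection stops at the first closed gate.
        still_open = still_open and (k == 0 or prob > threshold)
        gates.append(1 if still_open else 0)
    total_bits = sum(g * b for g, b in zip(gates, slice_bits))
    return gates, total_bits


# Hypothetical logits for one token over four slices (2-bit base + three 1-bit slices).
gates, total_bits = route_token([0.0, 2.0, -1.0, 3.0], [2, 1, 1, 1])
```

In this example the third gate is closed, so the fourth slice is skipped even though its logit is high, and the token is served at 3 bits.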
3. Key Techniques and Methodological Innovations
Recursive Residual Quantization
MoBiQuant introduces a "many-in-one" recursive residual quantization scheme (MoBiSlice) that decomposes weights into multiple quantized residual slices. At inference, the cumulative sum of a selected subset of slices yields the quantized weights at the requested bit-width. This enables smooth, calibration-free precision switching and token-adaptive bit-width allocation, mitigating "precision-dependent outlier migration" (Wang et al., 21 Feb 2026).
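The general idea of recursive residual quantization can be sketched in a few lines. This is a simplified illustration and not the MoBiSlice algorithm itself: the symmetric uniform quantizer, the per-slice rescaling to the residual's peak, and all function names are assumptions.

```python
from typing import List


def quantize_symmetric(x: List[float], bits: int) -> List[float]:
    """Symmetric uniform quantizer; returns the dequantized values."""
    qmax = (1 << (bits - 1)) - 1
    peak = max(abs(v) for v in x)
    if peak == 0.0:
        return [0.0] * len(x)
    scale = peak / qmax
    return [scale * max(-qmax, min(qmax, round(v / scale))) for v in x]


def residual_slices(w: List[float], bits_per_slice: int, num_slices: int) -> List[List[float]]:
    """Quantize w, then repeatedly quantize whatever error is left over."""
    slices: List[List[float]] = []
    residual = list(w)
    for _ in range(num_slices):
        q = quantize_symmetric(residual, bits_per_slice)
        slices.append(q)
        residual = [r - v for r, v in zip(residual, q)]
    return slices


def reconstruct(slices: List[List[float]], k: int) -> List[float]:
    """Cumulative sum of the first k slices gives the weights at that precision."""
    return [sum(vals) for vals in zip(*slices[:k])]


weights = [0.9, -0.35, 0.1, -0.8]
slices = residual_slices(weights, bits_per_slice=2, num_slices=3)
errors = [
    max(abs(a - b) for a, b in zip(weights, reconstruct(slices, k)))
    for k in (1, 2, 3)
]
```

Here `errors` shrinks as more slices are summed, so precision can be changed at inference by choosing how many slices to accumulate, without re-quantizing the weights.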
Tile-wise Adaptive Quantization in Vision
TAH-Quant partitions activations into tiles, and allocates bit-width per token based on a tile's entropy, assigning higher bits to "information-rich" tokens/tiles and lower bits where activations are sparse or dominated by outliers. A pivot-based Hadamard transform further suppresses the effect of outliers within a tile (He et al., 2 Jun 2025).
In image compression, spatial tiles are processed by a recurrent autoencoder, with early stopping based on local distortion thresholds. Bit-rate is thus spatially adapted without an explicit per-tile scale or step size (Minnen et al., 2018).
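The early-stopping logic can be illustrated with a toy stand-in. In the sketch below, each iteration of the recurrent autoencoder is replaced by one pass of a uniform quantizer on the remaining residual, with the step halved each pass; this substitution, the MSE target of 1e-3, and the function names are assumptions, and only the stopping rule mirrors the description above.

```python
from typing import List, Tuple


def mse(a: List[float], b: List[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def encode_tile(
    tile: List[float],
    target_mse: float = 1e-3,
    max_iters: int = 8,
) -> Tuple[List[float], int]:
    """Refine a tile until its distortion meets the target; return (recon, passes)."""
    recon = [0.0] * len(tile)
    step = 1.0
    for iteration in range(1, max_iters + 1):
        # Stand-in for one refinement iteration: quantize the current residual.
        recon = [r + step * round((t - r) / step) for t, r in zip(tile, recon)]
        if mse(tile, recon) <= target_mse:
            return recon, iteration  # early stop: this tile needs no more bits
        step /= 2.0
    return recon, max_iters


flat_tile = [1.0, 1.0, 1.0, 1.0]
textured_tile = [0.13, 0.71, 0.38, 0.94]
_, flat_passes = encode_tile(flat_tile)
textured_recon, textured_passes = encode_tile(textured_tile)
```

The flat tile stops after a single pass while the textured tile needs several, so the number of passes (and hence the bit-rate) varies per tile without a separately tuned step size for each one.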
Token/Tile-aware Error Compensation
Quant Experts applies a mixture-of-experts approach: channels are partitioned into globally important (token-independent) vs. locally important (token-dependent) sets using frequency statistics over token activations. A shared expert compensates quantization error on token-independent channels, while routed experts adaptively compensate errors for input-dependent (token/tile) channel patterns, with a lightweight routing matrix selecting the best expert per input (Jia et al., 27 Feb 2026).
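A minimal sketch of how frequency statistics could separate the two channel sets is shown below. It assumes that a channel is "important" for a token when it is among that token's top-k channels by magnitude, and that a channel is token-independent when this happens for at least a given fraction of tokens; these rules, the thresholds, and the name `partition_channels` are illustrative assumptions rather than the published procedure.

```python
from collections import Counter
from typing import List, Set, Tuple


def partition_channels(
    acts: List[List[float]],
    top_k: int = 2,
    freq_threshold: float = 0.75,
) -> Tuple[Set[int], Set[int]]:
    """Split important channels into token-independent and token-dependent sets."""
    counts: Counter = Counter()
    for token in acts:
        ranked = sorted(range(len(token)), key=lambda c: abs(token[c]), reverse=True)
        counts.update(ranked[:top_k])
    num_tokens = len(acts)
    # Channels that are important for most tokens go to the shared expert.
    shared = {c for c, n in counts.items() if n / num_tokens >= freq_threshold}
    # Channels that are important only for some tokens go to routed experts.
    routed = set(counts) - shared
    return shared, routed


# Rows are tokens, columns are channels (toy calibration activations).
acts = [
    [9.0, 0.1, 0.2, 5.0],
    [8.0, 0.3, 6.0, 0.1],
    [7.0, 4.0, 0.2, 0.1],
    [9.0, 0.1, 0.2, 3.0],
]
shared_channels, routed_channels = partition_channels(acts)
```

Channel 0 is large for every token and ends up in the shared set, while channels 1 to 3 are large only for particular tokens and are left for input-dependent compensation.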
4. Salient Token/Tile Identification and Outlier Handling
Dynamic token/tile quantization often relies on identification of critical elements:
- Saliency metrics (ZipCache): Normalized attention scores provide an unbiased estimate of each token's importance, mitigating position bias from causal attention masking. Tokens are ranked and assigned either high or low bit-width quantization accordingly (He et al., 2024).
- Outlier isolation (PrefixQuant): PrefixQuant isolates rare token-wise outliers by identifying, via a fixed calibration pass, which tokens exhibit activation maxima exceeding a factor threshold over the median. These tokens are prefixed in the KV cache, separating them from the quantization scale used for the rest, thereby enabling static low-bit quantization for non-outlier tokens without per-token dynamic scaling (Chen et al., 2024).
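Both identification rules above can be sketched compactly. The saliency function divides each key token's accumulated attention by the number of queries that can attend to it under a causal mask, and the outlier rule flags tokens whose peak activation exceeds a multiple of the median peak. The function names and the default factor of 64 are assumptions for illustration.

```python
from statistics import median
from typing import List


def normalized_saliency(attn: List[List[float]]) -> List[float]:
    """Per-key attention mass, averaged over the queries allowed to see that key."""
    n = len(attn)
    # Under a causal mask, key k is visible only to queries k, k+1, ..., n-1.
    return [sum(attn[q][k] for q in range(k, n)) / (n - k) for k in range(n)]


def find_outlier_tokens(acts: List[List[float]], factor: float = 64.0) -> List[int]:
    """Indices of tokens whose peak magnitude exceeds factor * median peak."""
    peaks = [max(abs(v) for v in token) for token in acts]
    cutoff = factor * median(peaks)
    return [i for i, p in enumerate(peaks) if p > cutoff]


# Causal (lower-triangular) attention over three tokens; each row sums to 1.
attn = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
]
saliency = normalized_saliency(attn)

# Four tokens with two channels each; token 0 carries a massive activation.
acts = [[100.0, 1.0], [1.0, 0.5], [0.8, 1.2], [1.1, 0.9]]
outliers = find_outlier_tokens(acts)
```

Raw column sums for `attn` would be 1.7, 0.8, and 0.5, favoring the first token simply because more queries can see it; after normalization the scores are much closer, which is the position-bias correction described above.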
5. Memory, Computational, and Implementation Considerations
Adaptive quantization methods must carefully balance overhead:
- Parameter storage: Fine-grained groupwise quantization incurs high storage for per-token/group scalars. Channel-separable schemes (ZipCache) employ per-channel normalization before quantization, reducing the quantization parameter count by orders of magnitude relative to groupwise per-token scalars (He et al., 2024).
- Computation: Online routers and reconstruction experts contribute little overhead compared to matrix-matrix multiplications when configured with a modest adapter rank and a small number of experts, keeping their total compute below 1% of the baseline (Jia et al., 27 Feb 2026, Wang et al., 21 Feb 2026).
- Memory: Methods such as TAH-Quant avoid extra activation or gradient storage by working on current tiles only, in contrast to methods requiring activation error compensation buffers (He et al., 2 Jun 2025).
Implementations for large-scale or bandwidth-limited settings incorporate hardware awareness, e.g., tensor packing for coalesced memory access on GPUs (ZipCache), bit-major kernel optimization (MoBiQuant), and fusion of quantization normalization into kernel launches.
6. Empirical Performance and Trade-offs
Dynamic token/tile adaptive quantization methods consistently demonstrate that careful, input-adaptive allocation achieves superior accuracy-compression trade-offs over static quantization:
| Method | Setting and Key Metric | Speedup/Compression | Accuracy Impact |
|---|---|---|---|
| MoBiQuant | LLaMA3-8B, PPL=7.31 | 2.7× vs FP16 | None at matched bpp |
| ZipCache | KV cache, 4.98× compress | 19.8% memory save | 0.38% drop |
| PrefixQuant | LLaMA3-8B, 2.81× speedup | >2.7× GEMM speedup | ≤1% (with finetune) |
| QuantExperts | Qwen2VL-72B, W4A6 | 3.5–4.5× vs full-16 | +5.44% over MBQ |
| TAH-Quant | GPT-2XL pipeline, INT3/4 | Up to 4.3× | Matches FP16, AQ-SGD |
A common finding is that the memory and throughput savings scale nearly linearly with the fraction of elements assigned low bit-width, while empirical accuracy displays an "elbow": beyond a threshold reduction in high-precision assignments, error begins to rise sharply (He et al., 2024).
7. Limitations, Applicability, and Extensions
Applicability is greatest in domains with high variability in token/tile information content, notably LLM inference/generation (where token-wise activation distribution can vary widely with prompt structure, language, etc.) and image/vision tasks exhibiting spatial heterogeneity.
Limitations include:
- Dependence on precomputed or stable outlier/token saliency statistics (PrefixQuant): if outlier identities are highly dynamic, offline isolation may not capture all problematic elements (Chen et al., 2024).
- Overhead: For models with massive input spaces, the per-token routing logic and scale/bias updates may impose nontrivial inference cost.
- Calibration: While calibration-free switching is enabled in some frameworks (MoBiQuant), others require per-configuration grid search or block-wise finetuning to optimally set static quantization parameters.
Generalizations include extension to other axes (time, batch, etc.), and across modalities (tile-adaptive quantization of vision-layer patches; channel-adaptive quantization for multi-head attention). Several methods directly support both token-adaptive (sequence) and tile-adaptive (spatial/image) regimes (Jia et al., 27 Feb 2026, He et al., 2 Jun 2025).
In summary, dynamic token/tile adaptive quantization represents a confluence of input-aware resource allocation, hierarchical or MoE compensation, and context-sensitive error control, supporting robust low-bit inference and training with minimal accuracy loss across a spectrum of neural architectures and modalities (Wang et al., 21 Feb 2026, He et al., 2 Jun 2025, Chen et al., 2024, Minnen et al., 2018, Jia et al., 27 Feb 2026, He et al., 2024).