AccKV: Adaptive KV Cache for AV-LLMs
- AccKV is an adaptive caching method that dynamically selects, compresses, and evicts modality-specific key-value entries to reduce memory usage by 80–90% while preserving accuracy.
- It leverages a three-stage process—attention redistribution, adaptive focusing, and cross-calibration—to achieve nearly lossless 2-bit quantization and a 10–15% speedup in inference.
- By dynamically calibrating per-layer attention profiles and applying channel-wise precision boosting, AccKV advances scalable and efficient multimodal model deployment.
AccKV refers to an Adaptive-Focusing and Cross-Calibration Key-Value (KV) cache optimization framework tailored for accelerating Audio-Video LLM (AV-LLM) inference. It operates by dynamically choosing, compressing, and evicting modality-specific KV cache entries with awareness of per-layer and cross-modal attention profiles, enabling aggressive KV cache reduction (80–90%) for long temporal contexts common to video and audio, while preserving accuracy and alignment. AccKV is also the name of a highly efficient 2-bit KV quantization scheme, integrated into systems such as Kitty, that employs dynamic channel-wise precision boosting for nearly lossless ultra-low-bit cache representation in high-throughput settings. The following sections present a technical overview of AccKV across both multimodal inference and quantization system contexts with explicit references to methodology, analysis, and empirical results (Jiang et al., 14 Nov 2025, Xia et al., 23 Nov 2025).
1. Motivation and Context: KV Cache Bottlenecks in AV-LLMs
In transformer-based LLM inference, autoregressive decoding requires access to the complete sequence of previously computed key-value pairs, maintained as the "KV cache". For conventional (text-only or single-image) models, the memory and compute costs are proportional to the sequence length $L$, i.e., $O(L)$ per new token. Audio-video scenarios involve long-form sequences across two modalities, audio and video, whose combined token counts escalate rapidly and create prohibitive GPU memory and inference throughput demands. Naive strategies for cache retention (such as applying existing KV pruning techniques independently to each modality, or merging tokens across modalities without discrimination) are insufficient due to modal collapse, modality structure conflicts, and cross-alignment failure (Jiang et al., 14 Nov 2025).
2. Core Framework: AccKV’s Three-Stage Methodology
The AccKV framework, as instantiated for AV-LLMs, integrates three modules per transformer layer: Attention Redistribution, Adaptive-Focusing, and Cross-Calibration (Jiang et al., 14 Nov 2025).
- Attention Redistribution: Corrects the natural lower-triangular (positional) bias of the cumulative attention score, which is computed from the attention matrix $A$, where row $i$ indexes the query, column $j$ indexes the key, and $L$ is the sequence length.
- Layer-wise Adaptive-Focusing: Computes, for each layer, one score for the video modality and one for the audio modality from the normalized, aggregated attention each modality receives. Subsequent attention weights are re-scaled by these scores to prioritize the modality that dominates in a given layer.
- Cross-Calibration: Within each modality, selects top-N “heavy-hitter” key indices; merges remaining low-priority tokens within each modality; then, for the lower-priority modality, computes cosine similarity to anchors in the higher-priority modality and evicts low-similarity tokens. This preserves internal modal structure and yields modality-aligned, cross-calibrated cache reduction.
The stepwise algorithm is executed independently at each decoding step and per layer, and maintains only a tightly bounded set of curated key-value entries per token.
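The three stages can be illustrated with a minimal NumPy sketch for a single layer. This is not the authors' implementation: the function names (`redistribute`, `modality_scores`, `cross_calibrate`), the parameters `top_n` and `sim_threshold`, the choice to normalize each key's column sum by the number of queries that can see it, and the use of mean attention as the modality score are all assumptions made for illustration. The intra-modality merge step is omitted for brevity.

```python
import numpy as np

def redistribute(attn):
    """Average each key's attention over the queries that can see it.

    attn: (L, L) causal attention matrix, rows = queries, columns = keys.
    Under a causal mask, key j is visible to L - j queries, so a plain
    column sum favours early keys. Dividing by that count is one
    (assumed) way to offset the positional bias.
    """
    L = attn.shape[0]
    visible = L - np.arange(L)  # number of queries that attend to key j
    return attn.sum(axis=0) / visible

def modality_scores(score, is_video):
    """Normalized share of attention per modality (assumed definition)."""
    v = score[is_video].mean()
    a = score[~is_video].mean()
    return v / (v + a), a / (v + a)

def cross_calibrate(keys, score, is_video, top_n, sim_threshold):
    """Return sorted indices of the cache entries to keep.

    keys: (L, d) key vectors (assumed non-zero), score: (L,) per-key
    scores, is_video: (L,) boolean mask, True for video tokens.
    """
    s_v, s_a = modality_scores(score, is_video)
    video_idx = np.flatnonzero(is_video)
    audio_idx = np.flatnonzero(~is_video)
    # The modality with the larger score is treated as higher priority.
    hi_idx, lo_idx = (video_idx, audio_idx) if s_v >= s_a else (audio_idx, video_idx)

    def top(idx):
        # Heavy hitters: the top_n highest-scoring keys within one modality.
        order = np.argsort(score[idx])[::-1][:top_n]
        return idx[order]

    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    anchors = top(hi_idx)
    candidates = top(lo_idx)
    # Cosine similarity of each low-priority candidate to every anchor.
    sim = unit(keys[candidates]) @ unit(keys[anchors]).T
    # Evict candidates that are not similar enough to any anchor.
    kept = candidates[sim.max(axis=1) >= sim_threshold]
    return np.sort(np.concatenate([anchors, kept]))
```

In this sketch the low-priority modality never contributes more than `top_n` entries, and it contributes fewer when its heavy hitters do not align with the anchors of the dominant modality.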
3. Theoretical and Empirical Analysis
AccKV's computational efficiency is quantified by comparing the total number of floating-point operations (FLOPs) in full-cache versus AccKV-pruned decoding. For a fixed retention budget, the cost reduction grows linearly with the number of generated tokens, which yields large gains in long-sequence tasks (Jiang et al., 14 Nov 2025).
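The full-cache versus pruned comparison can be illustrated with a rough per-layer count. The sketch below is a simplified cost model, not the paper's formula: it assumes attention for one decoded token costs about `4 * n * d` FLOPs for `n` cached entries of width `d` (query-key scores plus the weighted value sum) and ignores projections, softmax, and MLP costs. The names `attention_flops`, `decode_flops`, and `budget` are hypothetical.

```python
def attention_flops(n_cached: int, d_model: int) -> int:
    """Approximate attention FLOPs for one decoded token in one layer.

    Assumes 2*n*d for the query-key scores plus 2*n*d for the weighted
    sum over values.
    """
    return 4 * n_cached * d_model

def decode_flops(prompt_len: int, new_tokens: int, d_model: int, budget=None) -> int:
    """Total attention FLOPs over `new_tokens` decoding steps.

    With budget=None the cache grows by one entry per step (full cache).
    With a budget, the cache is capped at `budget` entries (pruned).
    """
    total = 0
    for t in range(new_tokens):
        n = prompt_len + t
        if budget is not None:
            n = min(n, budget)
        total += attention_flops(n, d_model)
    return total

full = decode_flops(prompt_len=1000, new_tokens=10, d_model=64)
pruned = decode_flops(prompt_len=1000, new_tokens=10, d_model=64, budget=200)
saving = full - pruned
```

Under this model each decoding step saves at least `4 * d * (prompt_len - budget)` FLOPs, so the total saving grows at least linearly with the number of generated tokens once the budget is fixed.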
Empirical evaluation (VideoLLaMA2, AVicuna; multi-task audio-video MVBench, AVSD) reports:
- Under a 20% KV budget, AccKV retains nearly all of the full-cache accuracy, outperforming all baselines by 5–30 points.
- At extreme 10% budgets or 120-token caches, performance degrades by only 5–10% (whereas baselines “collapse”).
- Memory usage decreases by 80–90%, with only a 2–3% average performance drop.
- Inference speedup of 10–15%, e.g., 944 ms saved over 1000 generated tokens (Jiang et al., 14 Nov 2025).
Ablation shows each module (attention redistribution, adaptive focusing, cross-calibration) contributes additively to efficiency and accuracy.
4. AccKV in Channel-wise Quantization: The Kitty System
AccKV's second significant usage context is channel-wise quantized KV caching for long-context LLM inference. In systems such as Kitty (Xia et al., 23 Nov 2025), AccKV denotes a method that compresses the KV cache to close to 2 bits per value by dynamically “boosting” only a small subset of channels, chosen by a sensitivity ranking, to higher precision (4 bits):
- Channel sensitivity is measured by the magnitude of each channel, or via direct impact on attention scores.
- Top-K ranked channels (by importance) retain 4-bit quantization; the rest use 2 bits.
- A two-tensor decomposition encodes the mix: one dense 2-bit tensor, one sparse high-2-bits tensor plus index.
This paradigm achieves 8× memory reduction vs. FP16 with <3 percentage point accuracy loss (r=12.5%), and overhead is limited to small index arrays and tensor storage. The approach is integrated with page-level coalesced Triton-compatible kernels to sustain high-throughput, large-batch inference (up to 8× batch size and 2.1–4.1× token throughput increase) (Xia et al., 23 Nov 2025).
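The channel-wise boosting idea can be sketched in NumPy as follows. This is an illustrative reading of the description above, not Kitty's actual kernels or storage layout: it assumes per-channel asymmetric uniform quantization, ranks channels by mean absolute magnitude, and splits each 4-bit code of a boosted channel into its low two bits (stored in the dense tensor) and its high two bits (stored in the sparse tensor alongside the channel index). All function and parameter names are hypothetical, and codes are kept in `uint8` rather than bit-packed.

```python
import numpy as np

def quantize(x, bits):
    """Per-channel asymmetric uniform quantization; x is (tokens, channels)."""
    levels = 2 ** bits - 1
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo

def mixed_precision_encode(x, boost_ratio=0.125):
    """Encode x with 2 bits per value, boosting the top channels to 4 bits."""
    n_channels = x.shape[1]
    k = max(1, int(round(boost_ratio * n_channels)))
    importance = np.abs(x).mean(axis=0)           # channel sensitivity proxy
    boosted = np.sort(np.argsort(importance)[-k:])

    dense, scale, zero = quantize(x, 2)           # 2-bit codes for every channel
    codes4, scale4, zero4 = quantize(x[:, boosted], 4)
    dense[:, boosted] = codes4 & 0b11             # low two bits of the 4-bit codes
    sparse_high = codes4 >> 2                     # high two bits, boosted channels only
    scale[:, boosted] = scale4
    zero[:, boosted] = zero4
    return dense, sparse_high, boosted, scale, zero

def mixed_precision_decode(dense, sparse_high, boosted, scale, zero):
    """Rebuild an approximation of x from the two tensors and the index."""
    codes = dense.astype(np.float32)
    codes[:, boosted] = (sparse_high << 2) | dense[:, boosted]
    return codes * scale + zero

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))
x[:, 5] *= 10.0                                   # make one channel stand out
dense, sparse_high, boosted, scale, zero = mixed_precision_encode(x)
x_hat = mixed_precision_decode(dense, sparse_high, boosted, scale, zero)
```

Because every entry of the dense tensor fits in two bits, it can be packed uniformly, while the extra cost of boosting is confined to the small sparse tensor and its channel index.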
5. Distinguishing Features, Limitations, and Future Directions
Key contrasts between AccKV and prior approaches (H2O, SnapKV, LOOK-M, FastV, KIVI) include:
- Explicit per-layer cross-modal prioritization to prevent modal collapse.
- Intra-modality merge and inter-modality calibrated eviction, avoiding destructive mixing and excessive bias towards one modality.
- Heuristic but data-driven attention redistribution to achieve uniform heavy-hitter token selection.
Limitations and open questions:
- Cross-modal similarity thresholds and modality budgets require careful (potentially automated) tuning.
- Generalization beyond A/V modalities (e.g., integrating subtitles, more general multimodal fusion) is unproven.
- Attention score redistribution, currently heuristic, could be learned for further gains.
- Hybrid-pruned and quantized caches (as in combining AccKV with channel-wise quantization) remain to be systematically explored (Jiang et al., 14 Nov 2025, Xia et al., 23 Nov 2025).
6. Broader Impact and Related Work
AccKV's innovations in careful, modality-aware and channel-wise allocation of limited cache resources set new scalability and efficiency standards for both AV-LLM and long-context LLM serving. In the multimodal context, it enables practical inference on resource-constrained hardware without incurring modal collapse or alignment loss. In the quantization domain, it closes the gap between high compression and task accuracy by leveraging dynamic sensitivity at channel granularity.
Comparisons against state-of-the-art systems validate its general applicability, both as a high-level cache policy (for AV-LLMs) and as a low-level quantization/encoding scheme (in Kitty). The separation of channel-level and modality-level prioritization, along with the unified system-level design, marks AccKV as a foundational element in current and next-generation efficient LLM deployment (Jiang et al., 14 Nov 2025, Xia et al., 23 Nov 2025).