MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation

Published 31 Mar 2026 in cs.LG, cs.AI, and cs.DC | (2604.00235v1)

Abstract: Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts what remains accessible, and both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), compared to the latest FlashInfer library, MAC-Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves over 14.3x attention-phase speedups, up to 2.6x end-to-end, while maintaining full-attention quality. By reusing computation, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available here: https://github.com/YJHMITWEB/MAC-Attention.git

Summary

  • The paper introduces MAC-Attention, a match-amend-complete framework that reduces KV accesses by up to 99% without sacrificing accuracy.
  • It decomposes attention into three stages (match, amend, complete), achieving 14.3x to 46x attention-phase speedups in long-context LLM inference.
  • The method is model-agnostic, integrating with existing kernels to deliver scalable, high-throughput performance improvements.

Introduction and Motivation

Long-context inference in LLMs introduces a substantial I/O bottleneck due to the repeated streaming and processing of expanding KV caches. Prior acceleration paradigms, primarily aggressive KV compression and selection/eviction strategies, lower KV memory and bandwidth footprints but compromise access fidelity or restrict long-range dependencies, thereby degrading performance on tasks requiring delayed recall, long-form reasoning, or robust cross-document retrieval. The solution presented in "MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation" (2604.00235) is a fidelity-preserving, model-agnostic approach that targets computational and I/O bottlenecks via semantic-level computation reuse within the canonical attention stack.

MAC-Attention: Algorithmic Structure

The MAC-Attention mechanism is driven by the observation that queries in generative decoding frequently demonstrate high self-similarity within a short horizon, particularly in long-form generation, multi-turn interaction, and systematic retrieval. MAC introduces an online per-request micro-pipeline, decomposed into three main algorithmic stages:

  1. Match: At each decode step, the pre-RoPE query vector is L2-compared within a constant-size, recent-horizon candidate ring. On a hit (when the L2 distance passes a dimension-aware threshold), a previously computed rectified attention summary is selected for reuse. Importantly, MAC matches in pre-RoPE space, not post-RoPE, significantly improving hit rates compared to phase-sensitive matching.
  2. Amend: To mitigate high-mass drift, which arises from positional encoding sensitivity and local recency bias, a small prefix band (typically $r \in [128, 512]$ tokens) around the reuse boundary is recomputed, ensuring that tokens with the largest softmax contribution are freshly processed. This correction step effectively bounds approximation error.
  3. Complete: The rectified prefix summary is then merged with a newly computed tail via log-domain summation, yielding the attention output in an associative and numerically stable manner. On a match, compute complexity reduces to $O(1)$ for the prefix, regardless of sequence length.

    Figure 1: Schematic of MAC-Attention's Match-Amend-Complete pipeline, showing how computation is skipped for reused prefixes, only amending a band and completing with a tail merge.
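The match stage can be illustrated with a small, dependency-free sketch. It assumes the dimension-aware threshold has the form tau * sqrt(d), where d is the head dimension; that form, the parameter name tau, and the helper names l2_distance and find_match are illustrative assumptions, not definitions taken from the paper.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_match(query, candidates, tau):
    """Return (index, distance) of the nearest cached pre-RoPE query,
    or None if no candidate lies within the threshold.

    ASSUMPTION: the threshold scales as tau * sqrt(d). The paper only
    says the threshold is dimension-aware.
    """
    threshold = tau * math.sqrt(len(query))
    best = None
    for idx, cand in enumerate(candidates):
        dist = l2_distance(query, cand)
        if dist <= threshold and (best is None or dist < best[1]):
            best = (idx, dist)
    return best

# Toy window of two cached pre-RoPE queries (hypothetical values).
window = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
hit = find_match([0.9, 0.1, 0.0, 0.0], window, tau=0.1)   # close to entry 0
miss = find_match([0.5, 0.5, 0.0, 0.0], window, tau=0.1)  # far from both
```

Because the candidate window has a fixed size, the scan costs the same at every decode step, independent of the context length.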

This pipeline is composable with existing attention kernels (e.g., FlashAttention, FlashInfer), paged-KV memory managers, and architectures with grouped or multi-query attention.
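The complete stage relies on a standard identity: softmax attention over two disjoint KV segments can be combined exactly from each segment's normalized output and log-sum-exp. The sketch below shows that merge in plain Python; the function names partial_attention and merge are illustrative, and the code is a generic log-domain merge, not the paper's kernel.

```python
import math
import random

def partial_attention(q, keys, values):
    """Softmax attention of q over one KV segment.
    Returns (normalized output, log-sum-exp of the scaled scores)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max before exp for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / total
           for j in range(dim)]
    return out, top + math.log(total)

def merge(state_a, state_b):
    """Combine two partial attention states in the log domain."""
    (out_a, lse_a), (out_b, lse_b) = state_a, state_b
    top = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - top), math.exp(lse_b - top)
    total = w_a + w_b
    out = [(w_a * x + w_b * y) / total for x, y in zip(out_a, out_b)]
    return out, top + math.log(total)

random.seed(0)
d, n, split = 4, 12, 8
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
q = [random.gauss(0, 1) for _ in range(d)]

full_out, full_lse = partial_attention(q, keys, values)
merged_out, merged_lse = merge(
    partial_attention(q, keys[:split], values[:split]),  # prefix
    partial_attention(q, keys[split:], values[split:]),  # tail
)
```

The merge is associative, so a prefix summary, an amended band, and a fresh tail can be folded together in any order and still match attention over the full sequence up to floating-point error.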

Systems Design and Implementation

MAC-Attention is implemented via per-request ring buffers for both query and rectified summary caching, enabling $O(1)$ insertions and bounded auxiliary memory ($O(K)$ per sequence, with $K \ll L$, where $L$ is the context length). Per-layer and per-head match acceptance rates vary widely, so match thresholds and amendment spans can be tuned per layer to maximize reuse (Figure 2).

Figure 3: Decode-phase micro-pipeline. Match operates in parallel with subsequent amend/complete stages; kernels are load-balanced based on reuse span across query heads.
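A fixed-capacity ring buffer gives the insertion and memory behavior described above. The sketch below is a minimal illustration; the class name SummaryRing and the (position, query, summary) entry layout are hypothetical and chosen only for readability.

```python
class SummaryRing:
    """Fixed-capacity ring of (position, query, summary) entries.

    Hypothetical layout: the actual per-entry contents used by
    MAC-Attention are not specified here.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity  # storage is allocated once
        self.count = 0                  # total insertions so far

    def insert(self, position, query, summary):
        # Overwrite the oldest slot: constant time, no reallocation.
        self.slots[self.count % self.capacity] = (position, query, summary)
        self.count += 1

    def entries(self):
        """Live entries, oldest first."""
        live = min(self.count, self.capacity)
        start = self.count - live
        return [self.slots[i % self.capacity]
                for i in range(start, self.count)]

ring = SummaryRing(capacity=3)
for pos in range(5):
    ring.insert(pos, query=[float(pos)], summary=None)
positions = [entry[0] for entry in ring.entries()]  # the 3 most recent
```

Storage stays at `capacity` slots however many tokens are decoded, which is what keeps the auxiliary memory bounded per sequence.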

Systems-level efficiency is ensured by:

  • Dense, single-pass matching kernels that exploit vectorized (bf16/fp16) reads with fp32 accumulation.
  • Work flattening and CTA allocation proportional to rectification span, achieving near-perfect load balancing and mitigating overlapped IO-bound and compute-bound workloads.
  • Auxiliary stream scheduling for off-path summary construction.

Auxiliary memory overhead is typically $\leq 5\%$ of the KV cache at $K = 1024$ and $L = 120\text{K}$, scaling sub-linearly with context length.
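A back-of-the-envelope estimate shows how an overhead of this order can arise. The sketch assumes each ring entry stores, per layer and per query head, one pre-RoPE query vector plus one summary (an output vector and a log-sum-exp scalar), in the same dtype as the KV cache. That storage layout is an assumption made for illustration, not the paper's stated design. The head counts used are those of LLaMA-3.1-8B (32 query heads, 8 KV heads, head dimension 128).

```python
def aux_to_kv_ratio(ring_size, context_len, n_q_heads, n_kv_heads, head_dim):
    """Ratio of ring-buffer elements to KV-cache elements for one layer.

    ASSUMED layout per ring entry and query head:
    query (head_dim) + summary output (head_dim) + log-sum-exp (1).
    """
    aux = ring_size * n_q_heads * (2 * head_dim + 1)
    kv = context_len * 2 * n_kv_heads * head_dim  # keys and values
    return aux / kv

# K = 1024 entries, L = 120,000 tokens, LLaMA-3.1-8B head configuration.
ratio = aux_to_kv_ratio(ring_size=1024, context_len=120_000,
                        n_q_heads=32, n_kv_heads=8, head_dim=128)
print(f"auxiliary / KV cache: {ratio:.1%}")
```

Under these assumptions the ratio comes out at roughly 3.4%, and for a fixed ring size it shrinks as the context grows because only the KV cache scales with the context length.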

Empirical Evaluation

Quality-Preserving Efficiency

On LongBench v2 (120K), RULER (120K), and LongGenBench (16K), MAC:

  • Achieves KV access reductions of up to 99% compared to full attention,
  • Yields attention phase speedups of 14.3x to 46x and end-to-end speedups of up to 2.6x at 128K tokens (LLaMA-3.1-8B/70B),
  • Maintains or marginally surpasses full-attention accuracy across all metrics and tasks,
  • Outperforms state-of-the-art compression and selection baselines (e.g., Quest, RocketKV, Multipole) on both absolute accuracy and throughput (Figure 4).

    Figure 4: Attention quality (accuracy) vs. KV budget on LongBench v2, RULER, and LongGenBench. MAC-Attention achieves top accuracy while reducing KV accesses by up to 99%.

Per-layer and per-head acceptance rates often exceed 99% (with proper thresholding), yielding near-constant-time prefix computation regardless of $L$ (Figure 2), and dominating the "Full Attention" baseline for all context lengths under comparison.

Figure 5: MAC-Attention's batch-size-scaled speedup across increasing context lengths. Speedup correlates with KV skip ratio, peaking at 46x at 256K.

Fidelity and Error Analysis

Numerical analysis of the rectification strategy demonstrates that amending a narrow prefix band drives the normed output error to near zero, with error decay tracking the cumulative attention mass outside the rectified interval (Figure 6). Sensitivity to layer depth and window size further motivates layerwise adaptation of the match and amend parameters.

Figure 6: Layerwise heatmaps of rectification error versus reuse gap and band width. Error drops rapidly as the band widens, with large gaps requiring slightly larger rectification.
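The effect of the band width on error can be explored with a toy experiment. The sketch below reuses an older, similar query's summary on tokens [0, m - r) and recomputes everything from m - r onward with the new query, so the band and the tail are handled in one fresh pass. It ignores RoPE and uses random data; the function amended_output and all sizes are illustrative assumptions, not the paper's procedure or measurements.

```python
import math
import random

def partial_attention(q, keys, values):
    """Softmax attention of q over one KV segment: (output, log-sum-exp)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    out = [sum(w * v[j] for w, v in zip(weights, values)) / total
           for j in range(len(values[0]))]
    return out, top + math.log(total)

def merge(state_a, state_b):
    """Log-domain merge of two partial attention states."""
    (out_a, lse_a), (out_b, lse_b) = state_a, state_b
    top = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - top), math.exp(lse_b - top)
    total = w_a + w_b
    return [(w_a * x + w_b * y) / total
            for x, y in zip(out_a, out_b)], top + math.log(total)

def amended_output(q_old, q_new, keys, values, m, r):
    """Reuse q_old's summary on [0, m - r); recompute the rest with q_new."""
    cut = max(m - r, 0)
    fresh = partial_attention(q_new, keys[cut:], values[cut:])
    if cut == 0:  # the band covers the whole prefix: nothing is reused
        return fresh[0]
    reused = partial_attention(q_old, keys[:cut], values[:cut])
    return merge(reused, fresh)[0]

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

random.seed(0)
d, n, m = 8, 64, 48  # head dim, current length, match boundary
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
q_old = [random.gauss(0, 1) for _ in range(d)]
q_new = [x + 0.05 * random.gauss(0, 1) for x in q_old]  # a near match

exact = partial_attention(q_new, keys, values)[0]
errors = {r: max_abs_error(amended_output(q_old, q_new, keys, values, m, r),
                           exact)
          for r in (0, 8, 16, 32, 48)}
```

Only the reused segment can contribute error in this setup, so the deviation from exact attention is limited by how much softmax mass that segment carries. When the band covers the entire prefix, or when the two queries are identical, the result matches exact attention up to floating-point error.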

Micro-Performance and Abstraction Overheads

Fine-grained latency profiling (Figure 7) confirms that the match kernel, amendment, and merging are fixed-cost components independent of $L$. Thus, in high-skip regimes, the overall MAC decode path is essentially flat in $L$, while baseline attention grows linearly.

Figure 8: Full decode latency breakdown and MAC-Attention speedup. The largest gains accrue in attention, with overall speedup bottlenecked by non-attention phases as predicted by Amdahl's law.
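Amdahl's law makes the gap between attention-phase and end-to-end speedup concrete. In the sketch below, attention_fraction is the share of baseline decode time spent in attention; the value 0.6 is a hypothetical figure chosen for illustration, not a measurement from the paper.

```python
def end_to_end_speedup(attention_fraction, attention_speedup):
    """Amdahl's law: overall speedup when only the attention phase is
    accelerated and the remaining work runs at its original speed."""
    remaining = 1.0 - attention_fraction
    return 1.0 / (remaining + attention_fraction / attention_speedup)

# Hypothetical: attention takes 60% of baseline decode time.
overall = end_to_end_speedup(attention_fraction=0.6, attention_speedup=14.3)
ceiling = 1.0 / (1.0 - 0.6)  # limit as the attention speedup grows
print(f"overall: {overall:.2f}x, ceiling: {ceiling:.2f}x")
```

With these illustrative inputs the overall gain is about 2.26x against a ceiling of 2.5x, which shows why even very large attention-phase speedups translate into more modest end-to-end numbers.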

Generality: MoE Models and Robustness

Evaluation on MoE-backed architectures (e.g., Qwen3-30B-A3B-Instruct) demonstrates that the semantic redundancy exploited by MAC is not diminished by conditional computation or expert partitioning. Hit/skip rates and accuracy are virtually unaffected, which underscores that the method is agnostic to architectural specialization.

Relation to Prior Work

MAC-Attention is distinct from prior I/O and compute reduction strategies:

  • Compression/Selection: Orthogonal to low-rank or quantization methods (e.g., PALU, LoRC) and selection/eviction methods (e.g., Quest, SnapKV), MAC does not discard or downsample context; all tokens remain accessible, and high fidelity is approached asymptotically via local amendment.
  • Structural/Stepwise Reuse: It is independent of prefix or request-pair caching (e.g., PromptCache, DeFT) and statistical partial recycling (e.g., Recycled Attention).
  • Kernel-Accelerated: Composes natively with existing I/O-aware and paged-KV kernels, improving their wall-clock efficiency without retraining or weight modification.

Implications and Future Directions

Theoretical: MAC broadens the design space for sub-linear/deep-inference attention by amortizing compute and KV streaming costs in the temporal domain. It provides a rigorous, numerically stable route to $O(1)$ decode complexity under strong match rates, while maintaining fallback to $O(L)$ worst-case.

Practical: MAC is particularly suited for deployments requiring high-throughput, long-context serving on memory-bound hardware, as it is model-agnostic, training-free, and introduces negligible auxiliary compute/memory overhead.

Future trajectories include adaptive per-layer parameterization (e.g., learnable thresholds or dynamic band sizing), robust prefill integration (beyond decode), and hybrid schemes combining semantic reuse with lossy compression or token selection for even greater resource efficiency.

Conclusion

MAC-Attention establishes a new paradigm for accelerating long-context inference in LLMs: it leverages semantic-level temporal redundancy to amortize redundant compute and radically reduce memory traffic, all while preserving task fidelity and architectural compatibility. By introducing the Match-Amend-Complete pipeline, MAC achieves unprecedented end-to-end throughput improvements at true long context scales, without sacrificing accuracy or access. Its algorithmic simplicity, system robustness, and empirical superiority position it as a central component in the next generation of high-performance LLM serving stacks.
