
Head-Wise Temporal Merging (HTTM)

  • HTTM is a technique in multi-head attention that merges head-level temporal and spatial representations to reduce computational and memory overhead.
  • It leverages hyper-networks, block reordering, and adaptive outlier filtering to preserve the diversity of attention heads while enhancing efficiency.
  • Empirical results in tasks like translation, summarization, ASR, and 3D scene reconstruction show substantial speedups and memory reductions with minimal accuracy loss.

Head-Wise Temporal Merging (HTTM) encompasses a family of techniques for reducing the memory and compute footprint of attention mechanisms along the temporal and spatial dimensions, with particular emphasis on multi-head architectures. Characterized by per-head granularity in token or latent vector merging, HTTM preserves diversity across attention heads while exploiting both spatial locality and temporal correlation within input sequences. This enables substantial acceleration and memory savings in tasks that involve long input sequences, such as speech recognition, machine translation, summarization, and 3D scene reconstruction. Recent implementations include the Multi-Head Temporal Latent Attention (MTLA) framework for sequence modeling (2505.13544) and the HTTM algorithm for the Visual Geometry Grounded Transformer (VGGT) in 3D computer vision (Wang et al., 26 Nov 2025).

1. Mathematical Foundations of Head-Wise Temporal Merging

HTTM techniques operate over multi-head attention mechanisms by merging representations at the head level, rather than uniformly across all heads. In MTLA, temporal latent compression is accomplished as follows (2505.13544):

Given an input $\bm x_t \in \mathbb{R}^d$ at time $t$, a low-dimensional latent vector $\bm c_t \in \mathbb{R}^r$ is computed as
$$\bm c_t = \bm x_t W_r, \qquad W_r \in \mathbb{R}^{d \times r}.$$
For merging, a small hyper-network produces a weight $w_i$ for temporal position $i$:
$$w_i = \sigma\big((\bm c_i W_c) \odot (\bm{pe}_{\lceil i/s \rceil} W_p)\big) \in (0,1),$$
where $W_c$ and $W_p$ are trainable and $\odot$ denotes elementwise multiplication.

With stride $s$, a merged latent block $\hat{\bm c}_j$ is formed:
$$\hat{\bm c}_j = \sum_{i=(j-1)s+1}^{js} w_i \, \bm c_i, \qquad j = 1, \dots, \left\lceil \tfrac{T}{s} \right\rceil.$$
Keys and values for attention ($K$, $V$) are then obtained via per-head up-projection:
$$K = \hat{\bm C} W_K, \qquad V = \hat{\bm C} W_V.$$
In VGGT, HTTM operates directly on the multi-head token matrices $Q, K, V \in \mathbb{R}^{h \times N \times d_{head}}$, where head-wise merging is performed on temporally reordered blocks (Wang et al., 26 Nov 2025).
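As a rough illustration of the MTLA merging step, the following PyTorch sketch computes the hyper-network gates and the strided weighted sums, under the assumption that $W_c$ and $W_p$ project to a scalar gate per position; the function and variable names are illustrative, not taken from the released implementation.

```python
import math
import torch

def merge_latents(c, pe, W_c, W_p, stride):
    """Merge temporally adjacent latents within each stride block (MTLA-style sketch).

    c:        (T, r) low-dimensional latents c_t = x_t W_r
    pe:       (ceil(T/stride), r) positional embeddings, one per merged block
    W_c, W_p: (r, 1) hyper-network projections (assumed scalar gate per position)
    Returns:  (ceil(T/stride), r) merged latents, one per stride block
    """
    T, r = c.shape
    n_blocks = math.ceil(T / stride)
    block_idx = torch.arange(T) // stride                 # 0-based block index of each position
    w = torch.sigmoid((c @ W_c) * (pe[block_idx] @ W_p))  # (T, 1) gates in (0, 1)
    merged = torch.zeros(n_blocks, r)
    merged.index_add_(0, block_idx, w * c)                # weighted sum of latents within each block
    return merged
```

Per-head keys and values are then obtained by up-projecting the merged latents with $W_K$ and $W_V$, as in the equations above.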

2. Algorithmic Design and Workflow

MTLA merges temporally adjacent latents in each attention head via a hyper-network, then up-projects merged representations to obtain head-wise keys and values. The stride-aware causal mask $M$ maintains training/inference consistency in sequence-to-sequence modeling:
$$M_{t,j} = \begin{cases} 0, & j \leq \lfloor (t-1)/s \rfloor \\ 0, & j = \lceil t/s \rceil \\ -\infty, & \text{otherwise} \end{cases}$$
This ensures each query attends only to valid merged blocks.
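A minimal sketch of how such a stride-aware causal mask could be constructed, assuming 1-based step and block indices as in the formula (the helper name is illustrative):

```python
import torch

def stride_aware_causal_mask(T, stride):
    """(T, ceil(T/stride)) mask: 0 where query step t may attend to merged block j,
    -inf elsewhere, following M_{t,j} above."""
    n_blocks = -(-T // stride)                        # ceil(T / stride)
    t = torch.arange(1, T + 1).unsqueeze(1)           # query steps, (T, 1)
    j = torch.arange(1, n_blocks + 1).unsqueeze(0)    # merged-block indices, (1, n_blocks)
    allowed = (j <= (t - 1) // stride) | (j == (t + stride - 1) // stride)
    mask = torch.full((T, n_blocks), float("-inf"))
    mask[allowed] = 0.0
    return mask
```

For example, with `stride=2` the query at step 3 sees the completed first block and the partially filled second block, matching the two zero cases of the definition.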

In VGGT, HTTM applies a bipartite soft-matching scheme per attention head (a sketch follows this list):

  • Temporal and spatial block partitioning of tokens
  • Merging via similarity-based pairing and averaging, preserving head-level diversity
  • Adaptive outlier filtering prevents degradation from merging non-similar tokens
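The following PyTorch sketch illustrates per-head bipartite soft matching with a crude similarity-threshold stand-in for the adaptive outlier filter; it is illustrative only and not the released CUDA implementation.

```python
import torch
import torch.nn.functional as F

def headwise_bipartite_merge(x, merge_ratio=0.7, sim_floor=None):
    """x: (h, N, d) per-head token features. Returns a list with one reduced
    (N_kept, d) tensor per head; each head merges its own token pairs."""
    h, N, d = x.shape
    n_merge = int(merge_ratio * (N // 2))
    outputs = []
    for head in range(h):
        src, dst = x[head, 0::2], x[head, 1::2]            # alternating bipartite split
        sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).T
        best_sim, best_dst = sim.max(dim=-1)               # most similar dst for each src token
        merge_idx = best_sim.argsort(descending=True)[:n_merge]
        if sim_floor is not None:                          # outlier filter: skip dissimilar pairs
            merge_idx = merge_idx[best_sim[merge_idx] >= sim_floor]
        keep = torch.ones(src.shape[0], dtype=torch.bool)
        keep[merge_idx] = False
        dst = dst.clone()                                  # average merged src into matched dst
        dst[best_dst[merge_idx]] = 0.5 * (dst[best_dst[merge_idx]] + src[merge_idx])
        outputs.append(torch.cat([src[keep], dst], dim=0)) # each head keeps its own distinct tokens
    return outputs
```

In the released kernels, matching is performed block-wise within the reordered spatio-temporal blocks (see Section 4), followed by merge/unmerge operations around the attention call.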

Table: Core distinctions of HTTM in MTLA and VGGT

| Variant | Merge Granularity | Temporal Block | Outlier Handling |
|---|---|---|---|
| MTLA (sequence tasks) | Latent, per head | Fixed stride | None |
| VGGT (vision) | Token, per head | Spatio-temporal | Adaptive filtering |

3. Computational Complexity and Memory Analysis

HTTM yields near-linear reductions in compute and storage under high merging ratios.

In MTLA (2505.13544):

  • Standard MHA requires $O(L \times H d_h)$ compute per step and stores $2 L H d_h$ values in the key-value cache.
  • With stride $s$, MTLA uses $O((L/s) \times H d_h)$ attention time and stores $2(L/s) H d_h$ or $L r / s$ values.
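As a concrete, purely hypothetical illustration of this arithmetic (assuming $L=1024$ decoded steps, $H=16$ heads, $d_h=64$, latent width $r=128$, and $s=2$, and interpreting the $Lr/s$ option as caching only the merged latents):

```python
L, H, d_h, s, r = 1024, 16, 64, 2, 128        # hypothetical sizes; r is the latent width
mha_cache    = 2 * L * H * d_h                # standard MHA key-value cache entries
mtla_cache   = 2 * (L // s) * H * d_h         # MTLA cache of merged per-head keys/values
latent_cache = L * r // s                     # alternative: cache only the merged latents
print(mha_cache / mtla_cache)                 # 2.0 -> an s-fold reduction
print(mha_cache / latent_cache)               # 32.0 under these assumed sizes
```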

In VGGT (Wang et al., 26 Nov 2025):

  • Full global attention has $O(N^2 d)$ FLOPs.
  • HTTM reduces this to $O(N n_b d_{head} + h((1-r)N)^2 d_{head})$, where $n_b$ is the block size and $r$ is the merge ratio. For $r = 0.8$, the dominant attention cost drops to $4\%$ of the original. Block-wise matching is $>4.5\times$ faster than baseline methods.
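Plugging hypothetical sizes into the two cost expressions above (and assuming $d = h\,d_{head}$ for the full-attention term) shows where the $4\%$ figure comes from:

```python
N, h, d_head, n_b, r = 100_000, 16, 64, 128, 0.8   # hypothetical token count, heads, head dim, block size, merge ratio
full = N**2 * (h * d_head)                         # full global attention, O(N^2 d) with d = h * d_head
httm = N * n_b * d_head + h * ((1 - r) * N) ** 2 * d_head
print(httm / full)                                 # ~0.04: the quadratic term scales with (1 - r)^2
```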

4. Implementation Strategies

MTLA integrates merging and unmerging within the attention cache, using stride-wise blocks and a lightweight hyper-network. A block-wise prefix sum accumulates latent states incrementally at inference, and fusion into the output matrices further reduces compute steps.
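A minimal sketch of this incremental accumulation during decoding, assuming the scalar-gate formulation from Section 1; the class and method names are illustrative, and the released implementation fuses these steps:

```python
import torch

class MergedLatentCache:
    """Accumulates latents into stride-wise merged blocks as decoding proceeds."""

    def __init__(self, stride):
        self.stride = stride
        self.blocks = []          # list of (r,)-shaped merged latents, one per block
        self.t = 0                # number of decoding steps seen so far

    def append(self, c_t, w_t):
        """c_t: (r,) latent for the new step; w_t: gate in (0, 1) from the hyper-network."""
        if self.t % self.stride == 0:
            self.blocks.append(w_t * c_t)                   # open a new merged block
        else:
            self.blocks[-1] = self.blocks[-1] + w_t * c_t   # running prefix-sum inside the block
        self.t += 1

    def merged(self):
        """Stack to (ceil(t/s), r); keys/values follow by up-projection with W_K, W_V."""
        return torch.stack(self.blocks)
```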

HTTM in VGGT is applied as a plug-in at inference, requiring no retraining:

  • Temporal/spatial reordering is achieved by block enumeration (block size $n_s = 128$, temporal window $n_t = 30$).
  • Two custom CUDA kernels handle block-wise similarity, matching, merge/unmerge operations, and outlier filtering.
  • FlashAttention and BF16 precision are employed for efficient GPU execution.

Parameter choices ($r_q = 0.90$, $r_{kv} = 0.70$, outlier budget $d = 10\%$) are optimized along a Pareto front of merge quality versus cost.
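For reference, these inference-time settings might be collected in a small configuration object such as the following; the field names are hypothetical and only mirror the values quoted above:

```python
from dataclasses import dataclass

@dataclass
class HTTMConfig:
    block_size: int = 128         # n_s, spatial block size used for reordering
    temporal_window: int = 30     # n_t, temporal window for block enumeration
    r_q: float = 0.90             # merge ratio applied to queries
    r_kv: float = 0.70            # merge ratio applied to keys/values
    outlier_budget: float = 0.10  # fraction of tokens exempted from merging (d = 10%)
```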

5. Empirical Assessment Across Tasks

Across both sequence and vision applications, HTTM demonstrates minimal trade-off in output performance against large efficiency gains.

MTLA (2505.13544) empirical results (all on an RTX 6000 Ada GPU, default $s=2$):

| Task | Metric | MHA (metric, time, memory) | MTLA, $s=2$ (metric, time, memory) | Speedup | Memory Reduction |
|---|---|---|---|---|---|
| En→De translation | BLEU↑ | 23.18, 281 s, 18.6 GiB | 23.28, 65.6 s, 2.8 GiB | 4.29× | 6.58× |
| XSum summarisation | ROUGE-L↑ | 23.33, 352 s, 16.1 GiB | 23.60, 105 s, 2.2 GiB | 3.35× | 7.34× |
| AMI ASR | WER↓ | 12.98%, 269 s, 17.5 GiB | 12.66%, 71.8 s, 2.4 GiB | 3.75× | 7.41× |

Ablation shows negligible loss through $s=4$, with further speed and memory improvements.

VGGT+HTTM (Wang et al., 26 Nov 2025) matches the 3D reconstruction accuracy of the full model (≤1–2 mm error) and occasionally surpasses FastVGGT on high-resolution dense scenes. Long-sequence speedup reaches $7\times$ on 1000-frame 3D scenes. Adaptive outlier filtering is essential for preventing catastrophic representational degradation.

6. Significance and Implications

HTTM establishes a framework for scalable, fast attention in domains with substantial temporal or spatial redundancy. The technique's training-free deployment is especially valuable for large pretrained systems such as VGGT or speech transformers. Head-wise merging preserves head diversity, avoiding limitations of prior uniform merging schemes that potentially sacrifice expressiveness. Block-wise temporal reordering and outlier filtering further enhance merge quality, supporting highly aggressive reduction ratios with minimal accuracy compromise.

A plausible implication is that HTTM methodologies can extend to other multi-head architectures, including those for multimodal fusion, large-scale video understanding, and real-time sequence tasks, wherever block-level temporal and spatial coherence exists. Decoupling merging from retraining makes deployment-specific acceleration practical in production contexts. The observed Pareto front between merge cost and merge quality gives system designers a principled basis for balancing resource constraints against fidelity.

7. Common Misconceptions and Limitations

It is a misconception that merging tokens or latent vectors in attention mechanisms necessarily degrades expressiveness; HTTM overcomes this by head-wise similarity matching and adaptive outlier protection. The preservation of distinct output embeddings per head is a direct consequence of per-head merging, distinguishing HTTM from schemes that merge tokens uniformly across all heads—such uniform merging often yields near-duplicate head outputs and loss of representational richness.

Outlier filtering is non-optional for high merge ratios, as merging non-similar tokens causes severe deterioration in feature fidelity. Temporal blockwise merging leverages scene or sequence continuity, so highly non-coherent or unordered inputs may benefit less unless block parameters are appropriately tuned.

HTTM does not reduce model parameter count, and its main advantage is memory/FLOP savings at inference. Merging strategies must be calibrated to the characteristics of the input data for optimal efficacy.


HTTM, as formalized in (2505.13544) and (Wang et al., 26 Nov 2025), enables efficient, fidelity-preserving attention over long sequences and large spatial fields in neural architectures, demonstrating plug-in applicability for substantial acceleration with negligible or absent quality loss.
