Head-Wise Temporal Merging (HTTM)
- HTTM is a technique in multi-head attention that merges head-level temporal and spatial representations to reduce computational and memory overhead.
- It leverages hyper-networks, block reordering, and adaptive outlier filtering to preserve the diversity of attention heads while enhancing efficiency.
- Empirical results in tasks like translation, summarization, ASR, and 3D scene reconstruction show substantial speedups and memory reductions with minimal accuracy loss.
Head-Wise Temporal Merging (HTTM) encompasses a family of techniques for reducing the temporal and spatial memory and compute footprint in attention mechanisms, with particular emphasis on multi-head architectures. Characterized by per-head granularity in token or latent vector merging, HTTM preserves diversity across attention heads while exploiting both spatial locality and temporal correlation within input sequences. This enables substantial acceleration and memory savings in tasks that involve long input sequences, such as speech recognition, machine translation, summarization, and 3D scene reconstruction. Recent implementations include the Multi-Head Temporal Latent Attention (MTLA) framework for sequence modeling (2505.13544) and the HTTM algorithm for Visual Geometry Grounded Transformer (VGGT) in 3D computer vision (Wang et al., 26 Nov 2025).
1. Mathematical Foundations of Head-Wise Temporal Merging
HTTM techniques operate over multi-head attention mechanisms by merging representations at the head level, rather than uniformly across all heads. In MTLA, temporal latent compression is accomplished as follows (2505.13544):
Given input $h_t$ at time $t$, a low-dimensional latent vector is computed as $z_t = W_D h_t$, with down-projection $W_D \in \mathbb{R}^{d_z \times d}$ and $d_z \ll d$. For merging, a small hyper-network produces a weight for temporal position $t$, e.g. $w_t = \sigma(W e_t + b)$ for a positional encoding $e_t$, where $W$, $b$ are trainable and $\odot$ denotes elementwise multiplication.
With stride $s$, a merged latent block is formed as $\bar{z}_m = \sum_{t=(m-1)s+1}^{ms} w_t \odot z_t$. Keys and values for attention head $i$ are then unfolded via per-head up-projection, $k_m^{(i)} = W_K^{(i)} \bar{z}_m$ and $v_m^{(i)} = W_V^{(i)} \bar{z}_m$. In VGGT, HTTM operates directly on the per-head token matrices, where head-wise merging is performed on temporally reordered blocks (Wang et al., 26 Nov 2025).
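A minimal PyTorch sketch of the MTLA-style merging above, assuming the down-projection, a sigmoid hyper-network over a learned within-block position embedding, and head-wise up-projections as described in this section; the class, parameter names, and exact hyper-network form are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class TemporalLatentMerge(nn.Module):
    """Illustrative sketch of MTLA-style head-wise temporal merging."""

    def __init__(self, d_model: int, d_latent: int, n_heads: int, d_head: int, stride: int):
        super().__init__()
        self.stride, self.n_heads, self.d_head = stride, n_heads, d_head
        self.down = nn.Linear(d_model, d_latent, bias=False)           # W_D: down-projection to latent
        self.pos_emb = nn.Embedding(stride, d_model)                    # within-block position embedding e_t (assumed form)
        self.hyper = nn.Linear(d_model, d_latent)                       # hyper-network producing merge weights w_t
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # per-head key up-projection W_K^(i)
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # per-head value up-projection W_V^(i)

    def forward(self, h: torch.Tensor):
        # h: (batch, T, d_model); T is assumed divisible by the stride for brevity.
        B, T, _ = h.shape
        s = self.stride
        z = self.down(h)                                                # z_t: (B, T, d_latent)
        pos = torch.arange(T, device=h.device) % s                      # temporal position within each stride block
        w = torch.sigmoid(self.hyper(self.pos_emb(pos)))                # w_t: (T, d_latent) merge weights
        z_bar = (w * z).view(B, T // s, s, -1).sum(dim=2)               # merged latent per stride block
        k = self.up_k(z_bar).view(B, T // s, self.n_heads, self.d_head)  # head-wise keys
        v = self.up_v(z_bar).view(B, T // s, self.n_heads, self.d_head)  # head-wise values
        return k, v
```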
2. Algorithmic Design and Workflow
MTLA merges temporally adjacent latents in each attention head via a hyper-network, then up-projects merged representations to obtain head-wise keys and values. A stride-aware causal mask maintains training/inference consistency in sequence-to-sequence modeling: a query at step $t$ may attend to merged block $m$ only if that block already exists in the incremental cache at step $t$, i.e. $M_{t,m} = 0$ if $m \le \lceil t/s \rceil$ and $-\infty$ otherwise. This ensures each query attends only to valid merged blocks.
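A sketch of one consistent formulation of this mask, under the assumption that a query at step $t$ attends to merged block $m$ exactly when $m \le \lceil t/s \rceil$; the function name and tensor layout are illustrative.

```python
import torch

def stride_aware_causal_mask(T: int, stride: int) -> torch.Tensor:
    """Additive attention mask over merged KV blocks (illustrative form).

    Query at step t may attend to merged block m only if that block would
    already exist in the incremental cache at step t, i.e. m <= ceil(t/stride).
    Returns a (T, ceil(T/stride)) float mask: 0 for allowed, -inf otherwise.
    """
    n_blocks = -(-T // stride)                                   # ceil(T / stride)
    t = torch.arange(1, T + 1).unsqueeze(1)                      # query steps, 1-indexed
    m = torch.arange(1, n_blocks + 1).unsqueeze(0)               # merged block indices
    allowed = m <= torch.div(t + stride - 1, stride, rounding_mode="floor")  # m <= ceil(t/s)
    mask = torch.zeros(T, n_blocks)
    mask[~allowed] = float("-inf")
    return mask
```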
In VGGT, HTTM applies a bipartite soft-matching scheme per attention head (a simplified sketch follows this list):
- Temporal and spatial block partitioning of tokens
- Merging via similarity-based pairing and averaging, preserving head-level diversity
- Adaptive outlier filtering prevents degradation from merging non-similar tokens
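The following sketch illustrates head-wise bipartite soft matching with a simple similarity-threshold stand-in for adaptive outlier filtering. It operates on a single head's tokens and is a loop-based simplification for clarity, not the fused CUDA-kernel implementation used in VGGT; the function name and threshold are assumptions.

```python
import torch
import torch.nn.functional as F

def headwise_soft_merge(x: torch.Tensor, r: int, sim_threshold: float = 0.5) -> torch.Tensor:
    """Bipartite soft matching within one attention head (illustrative).

    x: (N, d) token features for a single head, assumed ordered so adjacent
    tokens alternate between the two bipartite sets. Merges up to r most
    similar pairs by averaging; pairs below `sim_threshold` cosine similarity
    are left unmerged as a stand-in for adaptive outlier filtering.
    """
    a, b = x[0::2], x[1::2]                                   # bipartite split of tokens
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T   # cosine similarity matrix
    best_sim, best_dst = sim.max(dim=-1)                      # best partner in b for each token in a
    order = best_sim.argsort(descending=True)                 # merge the most similar pairs first
    merged_b = b.clone()
    keep_a = torch.ones(a.shape[0], dtype=torch.bool)
    merges = 0
    for i in order.tolist():
        if merges >= r:
            break
        if best_sim[i] < sim_threshold:                       # outlier: too dissimilar to merge
            continue
        j = best_dst[i].item()
        merged_b[j] = 0.5 * (merged_b[j] + a[i])              # average the matched pair
        keep_a[i] = False
        merges += 1
    return torch.cat([a[keep_a], merged_b], dim=0)            # reduced token set for this head
```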
Table: Core distinctions of HTTM in MTLA and VGGT
| Variant | Merge granularity | Temporal blocking | Outlier handling |
|---|---|---|---|
| MTLA (sequence tasks) | Latent vectors, per head | Fixed stride $s$ | None |
| VGGT (vision) | Tokens, per head | Spatio-temporal blocks | Adaptive filtering |
3. Computational Complexity and Memory Analysis
HTTM yields near-linear reductions in compute and storage under high merging ratios.
In MTLA (2505.13544):
- Standard MHA requires $O(T\,d)$ attention compute per decoding step and stores $O(T\,n_h d_h)$ keys and values in the cache, where $T$ is the current sequence length.
- With stride $s$, MTLA reduces the per-step attention time to $O(T\,d/s)$ and stores only $O(T\,d_z/s)$ latent values, i.e. roughly an $s$-fold smaller cache, compressed further by the low-rank latent ($d_z \ll n_h d_h$).
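As a rough worked example under assumed dimensions (sequence length, head count, latent width, and stride chosen purely for illustration, not taken from the paper), the cache savings can be estimated as follows:

```python
# Illustrative cache-size comparison under assumed dimensions.
T, n_heads, d_head, d_latent, stride = 4096, 16, 64, 512, 2
bytes_per_value = 2                                              # BF16

mha_cache = 2 * T * n_heads * d_head * bytes_per_value           # keys + values per layer
mtla_cache = (T // stride) * d_latent * bytes_per_value          # one low-rank latent per merged block

print(f"MHA KV cache / layer : {mha_cache / 2**20:.1f} MiB")
print(f"MTLA latent cache    : {mtla_cache / 2**20:.1f} MiB "
      f"({mha_cache / mtla_cache:.1f}x smaller)")
```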
In VGGT (Wang et al., 26 Nov 2025):
- Full global attention over $N$ tokens of width $d$ has $O(N^2 d)$ FLOPs.
- HTTM reduces this according to the block size $b$ and merge ratio $r$: merging removes a fraction $r$ of tokens, so the dominant quadratic attention term scales roughly as $(1-r)^2 N^2 d$, while block-wise matching adds only a cost linear in $N$ (proportional to $b$). At high merge ratios the attention cost therefore drops to a small fraction of the original, and block-wise matching is substantially cheaper than global matching baselines.
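A back-of-the-envelope illustration of this quadratic scaling, assuming attention FLOPs grow as $N^2 d$ and ignoring block-matching overhead; the token count, width, and merge ratios below are arbitrary:

```python
# Rough scaling of global-attention FLOPs with token merging (assumed quadratic model).
def attention_flops(n_tokens: int, d_model: int) -> float:
    return 2 * n_tokens**2 * d_model           # QK^T plus attention-times-V, up to constants

N, d = 100_000, 1024                            # illustrative token count and width
for merge_ratio in (0.0, 0.5, 0.75, 0.9):
    kept = int(N * (1 - merge_ratio))           # tokens remaining after merging
    frac = attention_flops(kept, d) / attention_flops(N, d)
    print(f"merge ratio {merge_ratio:.2f}: attention cost {frac:.1%} of the unmerged baseline")
```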
4. Implementation Strategies
MTLA integrates merging and unmerging within the attention cache, using stride-wise blocks and a lightweight hyper-network. At inference, a block prefix-sum accumulates latent states incrementally as new tokens arrive, and fusing projections into the output matrices further reduces compute steps.
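A minimal sketch of such an incremental merged-latent cache, assuming the same weighting scheme as the earlier merging sketch; the class and method names are hypothetical.

```python
import torch

class IncrementalLatentCache:
    """Illustrative incremental merged-latent cache for streaming decoding.

    Accumulates the weighted latent of the current stride block in place, so
    only finished blocks plus one partially merged entry are ever stored.
    """

    def __init__(self, stride: int):
        self.stride = stride
        self.blocks: list[torch.Tensor] = []      # finished merged latents
        self.partial = None                        # running sum for the current block
        self.count = 0

    def append(self, z_t: torch.Tensor, w_t: torch.Tensor) -> torch.Tensor:
        """Add one weighted latent w_t * z_t (each of shape (d_latent,)); return the cache."""
        contrib = w_t * z_t
        self.partial = contrib if self.partial is None else self.partial + contrib
        self.count += 1
        if self.count == self.stride:              # block complete: freeze it
            self.blocks.append(self.partial)
            self.partial, self.count = None, 0
        tail = [] if self.partial is None else [self.partial]
        return torch.stack(self.blocks + tail, dim=0)   # (num_blocks, d_latent)
```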
HTTM in VGGT is applied as a plug-in at inference, requiring no retraining:
- Temporal/spatial reordering is achieved by block enumeration over block size $b$ and temporal window $w$ (see the sketch after this list).
- Two custom CUDA kernels handle block-wise similarity, matching, merge/unmerge operations, and outlier filtering.
- FlashAttention and BF16 precision are employed for efficient GPU execution.
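A sketch of the block enumeration step referenced above, assuming a frame-major token layout and a simple grouping rule (spatial blocks of size $b$ across temporal windows of $w$ frames); the exact reordering performed by the CUDA kernels may differ.

```python
import torch

def spatiotemporal_block_reorder(n_frames: int, tokens_per_frame: int,
                                 block: int, window: int) -> torch.Tensor:
    """Illustrative index permutation grouping tokens into spatio-temporal blocks.

    Tokens are assumed frame-major, `tokens_per_frame` per frame. The permutation
    makes each group of `block` spatial positions, taken across a temporal `window`
    of consecutive frames, contiguous so block-wise matching can run on slices.
    """
    idx = torch.arange(n_frames * tokens_per_frame).view(n_frames, tokens_per_frame)
    chunks = []
    for f0 in range(0, n_frames, window):                      # temporal windows
        frames = idx[f0:f0 + window]                           # (<=window, tokens_per_frame)
        for p0 in range(0, tokens_per_frame, block):           # spatial blocks within the window
            chunks.append(frames[:, p0:p0 + block].reshape(-1))
    return torch.cat(chunks)                                   # permutation of token indices

# Usage (illustrative sizes): perm = spatiotemporal_block_reorder(100, 1024, block=64, window=4)
```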
Parameter choices (block size $b$, window $w$, and the outlier budget) are tuned along a Pareto front of merge quality versus cost.
5. Empirical Assessment Across Tasks
Across both sequence and vision applications, HTTM demonstrates minimal trade-off in output performance against large efficiency gains.
MTLA (2505.13544) empirical results (all measured on an RTX 6000 Ada GPU with the default stride):
| Task | Metric | MHA (score / time / memory) | MTLA (score / time / memory) | Speedup | Memory reduction |
|---|---|---|---|---|---|
| En→De translation | BLEU ↑ | 23.18 / 281 s / 18.6 GiB | 23.28 / 65.6 s / 2.8 GiB | 4.29× | 6.58× |
| XSum summarisation | ROUGE-L ↑ | 23.33 / 352 s / 16.1 GiB | 23.60 / 105 s / 2.2 GiB | 3.35× | 7.34× |
| AMI ASR | WER ↓ | 12.98% / 269 s / 17.5 GiB | 12.66% / 71.8 s / 2.4 GiB | 3.75× | 7.41× |
Ablations show negligible loss as the stride is increased further, with additional speed and memory gains.
VGGT+HTTM (Wang et al., 26 Nov 2025) matches the full model's 3D reconstruction accuracy (within 1–2 mm of error) and occasionally surpasses FastVGGT on high-resolution dense scenes. Long-sequence speedups are largest on 1000-frame 3D scenes. Adaptive outlier filtering is essential for preventing catastrophic representational degradation.
6. Significance and Implications
HTTM establishes a framework for scalable, fast attention in domains with substantial temporal or spatial redundancy. The technique's training-free deployment is especially valuable for large pretrained systems such as VGGT or speech transformers. Head-wise merging preserves head diversity, avoiding limitations of prior uniform merging schemes that potentially sacrifice expressiveness. Block-wise temporal reordering and outlier filtering further enhance merge quality, supporting highly aggressive reduction ratios with minimal accuracy compromise.
A plausible implication is that HTTM methodologies can extend to other multi-head architectures, including those dealing with multimodal fusion, large-scale video understanding, and real-time sequence tasks, wherever block-level temporal and spatial coherence exists. The decoupling of merging from retraining makes acceleration of specific deployed systems practical in production contexts. The observed Pareto front between merge cost and merge quality provides a principled basis for system designers to balance resource constraints and fidelity.
7. Common Misconceptions and Limitations
It is a misconception that merging tokens or latent vectors in attention mechanisms necessarily degrades expressiveness; HTTM overcomes this by head-wise similarity matching and adaptive outlier protection. The preservation of distinct output embeddings per head is a direct consequence of per-head merging, distinguishing HTTM from schemes that merge tokens uniformly across all heads—such uniform merging often yields near-duplicate head outputs and loss of representational richness.
Outlier filtering is non-optional for high merge ratios, as merging non-similar tokens causes severe deterioration in feature fidelity. Temporal blockwise merging leverages scene or sequence continuity, so highly non-coherent or unordered inputs may benefit less unless block parameters are appropriately tuned.
HTTM does not reduce model parameter count, and its main advantage is memory/FLOP savings at inference. Merging strategies must be calibrated to the characteristics of the input data for optimal efficacy.
HTTM, as formalized in (2505.13544) and (Wang et al., 26 Nov 2025), enables efficient, fidelity-preserving attention over long sequences and large spatial fields in neural architectures, demonstrating plug-in applicability for substantial acceleration with negligible or absent quality loss.