Token Temporal Merging (TTM)
- Token Temporal Merging is a method that collapses redundant tokens over time in sequence models to improve efficiency.
- TTM enhances computational and memory performance in video, time series, and 3D scene tasks with minimal performance loss.
- Variants employ cosine similarity, adaptive thresholds, and dynamic programming for effective plug-and-play token reduction.
Token Temporal Merging (TTM) is a class of algorithms, architectures, and theoretical techniques for reducing token count in sequence models by collapsing tokens deemed redundant across the temporal dimension. TTM aims to exploit the high degree of temporal correlation in video, multi-frame, time series, and related sequential transformer workloads. Across video understanding, generation, retrieval, time-series modeling, and 3D scene reconstruction, TTM achieves substantial reductions in computational and memory cost—often with minimal loss, and at times measurable gains, in qualitative and quantitative performance. State-of-the-art TTM variants rely on similarity metrics (typically cosine similarity), adaptive schedules or thresholds, spatio-temporal heuristics, or dynamic program-based segmentations to decide when and which tokens to merge, and are notable both for their training-free (plug-and-play) nature and their compatibility with existing transformer backbones.
1. Principles and Formulations of Token Temporal Merging
TTM generalizes the “token merging” idea (as in ToMe for images) to temporal or spatio-temporal data. For a set of embeddings $X \in \mathbb{R}^{T \times N \times d}$, where $T$ is the number of frames (or time steps), $N$ the number of spatial tokens per frame, and $d$ the channel dimension, TTM identifies groups of tokens across time that remain highly similar under some criterion, and merges them into a single representative. This can be formalized as matching token pairs or chains across time using cosine similarity,

$$s(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\|\,\|x_j\|},$$

where merging proceeds if $s(x_i, x_j)$ exceeds a layer- or schedule-dependent threshold $\tau$. In most architectures, the merged representation is the (possibly weighted) average

$$\bar{x} = \frac{1}{|\mathcal{S}|} \sum_{i \in \mathcal{S}} x_i,$$

where $\mathcal{S}$ denotes the set of tokens deemed redundant over a temporal window.
TTM can employ fixed merge budgets, adaptive thresholds (e.g., the 0.7 quantile of similarities per step (Saghatchian et al., 1 Jan 2025)), or dynamic-programming-based global segmentation (Shao et al., 27 May 2025). Merging is most efficient when applied in early, token-rich layers (e.g., the topmost U-Net blocks or pre-LLM visual token staging), but it also enables inner-layer or inner-head reduction (Wang et al., 26 Nov 2025).
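The adaptive-threshold variant can be made concrete: rather than a fixed cutoff, each step takes a quantile of the current similarity distribution, so merging aggressiveness tracks the actual redundancy present at that step. A minimal sketch (the standalone function name is illustrative, not CA-ToMe's API):

```python
import numpy as np

def adaptive_threshold(sim, q=0.7):
    """Per-step adaptive merge threshold: the q-quantile of the current
    pairwise token similarities (the 0.7-quantile rule described for
    CA-ToMe). Only pairs above the threshold merge, so the merge count
    shrinks automatically when the step is less redundant."""
    return np.quantile(sim, q)

sim = np.array([[0.99, 0.10],
                [0.20, 0.95]])
tau = adaptive_threshold(sim)
merge_mask = sim > tau      # only pairs above the adaptive threshold merge
```

With a highly redundant step, the quantile rises and fewer borderline pairs qualify; with a dynamic scene, it falls and merging backs off.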
2. Algorithmic Variants and Implementation Details
Multiple TTM instantiations have emerged, tailored to differing sequence modeling domains:
- Progressive Multi-Granularity (PMG): Alternates fine-grained spatial (frame-level) and coarse temporal (clip-level) merging in blockwise stages (TempMe (Shen et al., 2024)). Spatial merging first reduces tokens within frames, followed by cross-clip and intra-clip merges—each controlled by explicit retain ratios tuned per stage.
- Global Redundancy-Aware Segmentation: Uses dynamic programming over the sequence to segment frame ranges into “high similarity” chunks, maximizing prunable tokens (HoliTom (Shao et al., 27 May 2025)). Within each segment, temporal merging eliminates tokens redundant across all frames; non-redundant tokens are further subjected to spatial/cluster merging.
- Head-wise and Block-local TTM: Merges tokens independently in each attention head, after reordering tokens into space-time blocks (HTTM (Wang et al., 26 Nov 2025)). This avoids feature collapse across heads and achieves high merge ratios at minimal cost by exploiting local temporal coherency, with matching cost quadratic only in the block size $b$ rather than in the full token count.
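The global segmentation idea can be sketched as a small dynamic program over frame boundaries: segments of consecutive high-similarity frames score the sum of their internal frame-to-frame similarity links, and a segment-length cap forces cuts at the weakest links. This is an illustrative objective, not HoliTom's exact formulation:

```python
def segment_frames(frame_sims, max_seg=8):
    """frame_sims[i] is the similarity of frame i to frame i-1
    (frame_sims[0] is unused). dp[i] = best total within-segment
    similarity for frames[:i]; returns segment start boundaries."""
    n = len(frame_sims)
    dp = [0.0] * (n + 1)
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        best, arg = -1.0, i - 1
        for j in range(max(0, i - max_seg), i):       # candidate segment frames[j:i]
            score = dp[j] + sum(frame_sims[j + 1:i])  # count internal links only
            if score > best:
                best, arg = score, j
        dp[i], back[i] = best, arg
    cuts, i = [], n            # walk back-pointers to recover boundaries
    while i > 0:
        cuts.append(back[i])
        i = back[i]
    return sorted(cuts)
```

On a sequence whose only weak link sits between frames 2 and 3, the DP places the single cut exactly there; within each recovered segment, temporal merging can then eliminate tokens that are redundant across all of the segment's frames.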
Pseudocode Abstractions
Below is a minimal sketch of one TTM layer processing step, in the style of ToMe's bipartite soft matching applied to flattened space-time tokens (illustrative only; real implementations are batched and track token sizes for weighted averaging):

```python
import numpy as np

def ttm_step(x, r):
    # x: (N, C) flattened space-time tokens; r: number of tokens to merge away
    a, b = x[0::2], x[1::2]                          # alternating bipartition
    an = a / np.linalg.norm(a, axis=-1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=-1, keepdims=True)
    sim = an @ bn.T                                  # cosine similarity, A vs. B
    best = sim.argmax(axis=-1)                       # best partner in B per A-token
    score = sim[np.arange(len(a)), best]
    order = np.argsort(-score)                       # most redundant A-tokens first
    src, keep = order[:r], order[r:]
    merged_b = b.astype(float).copy()
    counts = np.ones(len(b))
    for s in src:                                    # average each merged token into B
        merged_b[best[s]] += a[s]
        counts[best[s]] += 1
    merged_b /= counts[:, None]
    return np.concatenate([a[keep], merged_b], axis=0)   # N - r tokens survive
```
Advanced variants cache merge-pairs across denoising steps (CA-ToMe (Saghatchian et al., 1 Jan 2025)), use union–find to build tracklets (STTM (Hyun et al., 10 Jul 2025)), or merge only within causal windows for time-series decoders (Götz et al., 2024).
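The tracklet construction can be sketched with a standard union–find: token pairs matched across adjacent frames are unioned, and each resulting group (tracklet) is later collapsed to one token. The matching criterion here is a stand-in, not STTM's exact multi-granular procedure:

```python
def build_tracklets(matches, n_tokens):
    """matches: (i, j) token-index pairs judged similar across adjacent
    frames. Union-find groups transitively matched tokens into tracklets."""
    parent = list(range(n_tokens))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for i, j in matches:
        parent[find(i)] = find(j)

    groups = {}                             # root -> member token indices
    for t in range(n_tokens):
        groups.setdefault(find(t), []).append(t)
    return list(groups.values())
```

Because matches chain transitively (0~1 and 1~2 imply one tracklet {0, 1, 2}), a token persisting across many frames collapses to a single representative regardless of how many pairwise matches produced it.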
3. Integration into Model Architectures
TTM is highly modular and appears in diverse positions within modern deep learning architectures:
- Video Transformers: TTM can be plugged either after the patch embedding or after each transformer block. In joint space-time models (ViViT, VideoMAE), tokens from all frames are eligible for merging in a global stage, whereas divided models (TimeSformer) perform TTM per frame for compatibility with spatial and temporal attention factoring (Pollard et al., 4 Jun 2025).
- Diffusion Models: CA-ToMe applies TTM before MHSA in the topmost encoder/decoder UNet blocks, using adaptive similarity thresholds and caching merge-pairs to minimize redundant computation (Stable Diffusion v1.5: 1.24× speedup, minimal FID change (Saghatchian et al., 1 Jan 2025)). VidToMe interleaves chunkwise local and global token alignment, merging, and delayed unmerging for efficient video editing with strong temporal consistency (Li et al., 2023).
- Video LLMs: HoliTom and STTM both integrate outer-LLM TTM for aggressive pre-LLM token downsampling, followed by finer inner-LLM merging (Shao et al., 27 May 2025, Hyun et al., 10 Jul 2025). KV-cache reuse is possible due to TTM’s query-agnostic nature.
- Time Series and SSMs: Local merging performs TTM over a bounded window of size $w$, enforcing causal constraints in decoder blocks (no future-to-past merges), and interpolating between quadratic and linear cost (Götz et al., 2024).
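The causal constraint can be illustrated directly: scanning left to right, a token may only be absorbed into an earlier kept token within a window of $w$ positions, so no future information flows backward. This is a minimal sketch under assumed names; the local-merging paper's exact algorithm differs:

```python
import numpy as np

def causal_local_merge(x, w=2, tau=0.95):
    """x: (N, C) time-ordered tokens. Merge token t into the nearest
    earlier kept token within w positions if cosine similarity > tau;
    future tokens never modify past representatives."""
    kept, counts = [0], [1]
    out = [x[0].astype(float).copy()]
    for t in range(1, len(x)):
        merged = False
        for k in range(len(out) - 1, -1, -1):
            if t - kept[k] > w:                  # outside the causal window
                break
            a, b = out[k] / counts[k], x[t]
            sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
            if sim > tau:
                out[k] += b
                counts[k] += 1
                merged = True
                break
        if not merged:                           # start a new representative
            kept.append(t)
            counts.append(1)
            out.append(x[t].astype(float).copy())
    return np.stack([o / c for o, c in zip(out, counts)])
```

The window size $w$ interpolates between no merging ($w = 0$) and fully global matching, mirroring the quadratic-to-linear cost tradeoff noted above.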
4. Computational and Practical Benefits
TTM addresses the quadratic scaling of self-attention and broad memory bottlenecks:
| Model/System | Token Reduction | Throughput Gain | Accuracy Change |
|---|---|---|---|
| TempMe (ViT-B/16, 12 fr) | 95% | 1.8× | +5.3 R-Sum |
| CA-ToMe (diffusion) | ~30% | 1.24× | ΔFID <0.4 |
| Video-TTM, ViViT | ~60% | 2.5× | <1% |
| HoliTom (video LLM) | 93% (prefill) | 2.3× TTFT, 1.3× decode | 0.9% |
| VidToMe (video editing) | ~60% (mem.) | ~3× | -- (improved consistency) |
TTM generally yields speedups of at least 1.3× on video transformers, at least 1.2× on 3D scene reconstruction, substantial gains on large time-series foundation models, and 2× or greater memory savings in multi-frame diffusion models. Fine-tuning or retraining is often not required; in some cases, inference-only merging acts as a low-pass filter in time series, actually improving prediction error metrics when high-frequency noise dominates (Götz et al., 2024).
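The quadratic payoff is easy to quantify: if self-attention cost scales as $N^2$ in the token count, keeping a fraction $\rho$ of tokens cuts the attention term to $\rho^2$ of the original. A back-of-envelope model (ignoring MLP and projection cost, which scale linearly):

```python
def attention_speedup(keep_ratio):
    """Idealized attention-only speedup from keeping a fraction of tokens."""
    return 1.0 / keep_ratio ** 2

# Merging 60% of tokens (keep 40%) leaves 0.4^2 = 16% of attention FLOPs,
# i.e., a 6.25x speedup on the attention term alone.
speedup = attention_speedup(0.4)
```

End-to-end speedups in the table above are smaller than this idealized number precisely because the linear MLP and projection terms dominate once attention has been shrunk.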
5. Design Choices, Schedules, and Limitations
TTM exposes several key degrees of freedom:
- Merge Budget and Scheduling: Merge ratio per layer governs the speed–accuracy tradeoff: constant, increasing (late), and decreasing (early) schedules yield varying loss profiles, with increasing schedules being safest for fine-grained tasks (Pollard et al., 4 Jun 2025).
- Similarity Thresholding: Adaptive, per-step or global similarity thresholds (e.g., the per-step 0.7-quantile rule in CA-ToMe) control the aggressiveness of merging (Saghatchian et al., 1 Jan 2025).
- Block and Head Granularity: Merging can be performed at the block-level in spatio-temporal space, and independently per head to preserve representational diversity (Wang et al., 26 Nov 2025).
- Window and Causality Constraints: In time-series, merging is typically restricted to local neighborhoods and causal orders to prevent violation of temporal dependencies (Götz et al., 2024).
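The three schedule shapes can be written down concretely: given a total merge budget across $L$ layers, distribute it uniformly, increasingly (merge late), or decreasingly (merge early). The parameterization below is illustrative, not the exact one used in the cited comparison:

```python
def merge_schedule(total_r, n_layers, shape="constant"):
    """Split a total merge budget of total_r tokens across n_layers."""
    if shape == "constant":
        w = [1.0] * n_layers
    elif shape == "increasing":      # gentle early, aggressive late
        w = [i + 1 for i in range(n_layers)]
    elif shape == "decreasing":      # aggressive early, gentle late
        w = [n_layers - i for i in range(n_layers)]
    else:
        raise ValueError(shape)
    s = sum(w)
    r = [round(total_r * wi / s) for wi in w]
    r[-1] += total_r - sum(r)        # absorb rounding so the budget is exact
    return r
```

Increasing schedules defer merging until deep layers have already mixed fine-grained information, which is consistent with their being the safest choice for fine-grained tasks.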
Limitations include possible loss of detail in highly dynamic content when merging is too aggressive, difficulty handling arbitrary sequence lengths or scene changes in fixed-schedule frameworks, and—except in architectures like STTM or HoliTom—no built-in mechanism for global, content-sensitive scheduling (Shao et al., 27 May 2025, Hyun et al., 10 Jul 2025). TTM in decoders remains a less-explored area, with causal merging mandatory to preserve autoregressive correctness (Götz et al., 2024).
6. Empirical Evaluation and Comparisons
Comprehensive benchmarking of TTM variants demonstrates systematic advantages over hard-dropping or random replacement baselines. In video understanding and action recognition, attention-based TTM outperforms token dropping by ~4–5% accuracy at the same reduction ratio (Pollard et al., 4 Jun 2025). TempMe reduces GFLOPs and memory more than ToMe at equivalent accuracy, and HoliTom achieves near-lossless performance (99.1% retention) at <7% FLOPs (Shen et al., 2024, Shao et al., 27 May 2025). VidToMe establishes that TTM is critical for eliminating “flicker” and ensuring cross-frame consistency in diffusion-based video editing, as ablation of TTM steps sharply degrades interpolation and perceptual metrics (Li et al., 2023). In time series, spectral analysis reveals classes of problems (with redundant or low-rank dynamics) for which moderate token merging can even improve downstream error (Götz et al., 2024).
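The low-pass interpretation can be checked numerically: averaging adjacent tokens attenuates high-frequency components while largely preserving a slow underlying signal, so merging can beat naive subsampling when noise is high-frequency. A toy demonstration, not the cited paper's analysis:

```python
import numpy as np

# A slow signal plus high-frequency noise, sampled as 256 1-D "tokens"
t = np.linspace(0, 1, 256, endpoint=False)
signal = np.sin(2 * np.pi * 2 * t)
noisy = signal + 0.5 * np.sin(2 * np.pi * 60 * t)

# Pairwise temporal merging = averaging adjacent tokens (2x reduction)
merged = noisy.reshape(-1, 2).mean(axis=1)    # 128 merged tokens
target = signal.reshape(-1, 2).mean(axis=1)   # clean signal, same merging

# Compare against naive 2x subsampling, which keeps the noise at full strength
err_subsample = np.abs(noisy[::2] - signal[::2]).mean()
err_merged = np.abs(merged - target).mean()
```

Here merging acts as a two-tap averaging filter with response $\cos(\pi f / f_s)$, damping the 60-cycle component that subsampling passes through untouched; this matches the observation that moderate merging can improve error when high-frequency noise dominates.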
7. Extensions and Future Directions
Emerging research seeks to (1) further generalize TTM with content-aware or adaptive length scheduling (Shen et al., 2024), (2) fuse or replace averaging-based merges with more flexible, learned gating for potentially superior information retention, (3) exploit per-head and per-block adaptivity for multi-modal or scene-graph models (Wang et al., 26 Nov 2025), (4) extend to tasks such as video question answering and captioning (TempMe, STTM), and (5) analyze TTM’s relation to low-pass filtering and data compressibility via spectral signatures (Götz et al., 2024). A plausible implication is that further harmonizing spectral analysis with dynamic merging schedules may enable even greater efficiency with minimal accuracy loss, particularly in streaming or real-time deployment settings.
References
- (Saghatchian et al., 1 Jan 2025): Cached Adaptive Token Merging: Dynamic Token Reduction and Redundant Computation Elimination in Diffusion Model
- (Shao et al., 27 May 2025): HoliTom: Holistic Token Merging for Fast Video LLMs
- (Shen et al., 2024): TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval
- (Pollard et al., 4 Jun 2025): Video, How Do Your Tokens Merge?
- (Wang et al., 26 Nov 2025): HTTM: Head-wise Temporal Token Merging for Faster VGGT
- (Götz et al., 2024): Efficient Time Series Processing for Transformers and State-Space Models through Token Merging
- (Hyun et al., 10 Jul 2025): Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
- (Li et al., 2023): VidToMe: Video Token Merging for Zero-Shot Video Editing