Video Token Merging Techniques

Updated 30 June 2025
  • Video token merging is a family of methods that combine similar or redundant tokens in video transformers to reduce computational cost while preserving key information.
  • Techniques use similarity measures and saliency estimation to intelligently aggregate token representations across spatial and temporal dimensions.
  • These methods enable efficient long-video processing with significant speedup and memory reduction, supporting tasks like retrieval, classification, and video generation.

Video token merging is a family of methods for reducing the computational cost of video transformer models by intelligently combining similar or redundant tokens throughout a video sequence, rather than processing every patch or frame at a fixed granularity. These techniques are central to scaling video transformers to long sequences and enabling more efficient deployment while preserving accuracy.

1. Fundamental Principles and Goals

Video transformers operate on sequences of tokens—typically patch embeddings from each frame—resulting in quadratic growth in computation and memory with sequence length and spatial size. Many of these tokens are highly redundant, especially across similar frames or static regions. Video token merging algorithms exploit this redundancy by combining (merging) tokens that are similar in their representation, effectively compressing the sequence in both spatial and temporal dimensions.
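
To make the scaling concrete, the short Python sketch below uses illustrative numbers (16 frames of 14×14 patches and a 768-dimensional embedding, assumptions chosen for the example rather than taken from any cited paper) to estimate per-layer self-attention cost and show that halving the token count cuts that cost roughly fourfold.

```python
# Rough illustration (assumed numbers) of why token count dominates cost:
# self-attention FLOPs grow roughly as N^2 * D in the token count N.
frames, patches, dim = 16, 196, 768        # hypothetical clip: 16 frames of 14x14 patches
n_tokens = frames * patches                # 3136 tokens before any merging

def attn_flops(n: int, d: int = dim) -> int:
    # Approximate per-layer cost of QK^T plus the attention-weighted V product.
    return 2 * n * n * d

full = attn_flops(n_tokens)
half = attn_flops(n_tokens // 2)
print(f"{full / half:.1f}x")               # 4.0x: halving the tokens quarters attention compute
```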

Key goals of video token merging:

  • Reduce the number of tokens entering costly self-attention modules.
  • Maintain accuracy, or minimize its loss, by preserving semantically important tokens and aggregating information rather than simply dropping content.
  • Enable faster inference and/or training with existing transformer architectures, often in a plug-and-play, training-free fashion.

2. Core Methodologies

2.1 Layerwise Bipartite Soft Matching and Merging

ToMe (2210.09461) is representative of this approach: at each transformer block, tokens are split into two sets (e.g., by alternating index), the most similar pairs across the sets are identified via cosine similarity of their attention keys, and each selected pair is merged. Repeating this at every layer progressively reduces the sequence length.

Mathematically:

  • Given token representations $X \in \mathbb{R}^{N \times D}$, split the tokens into two sets $A$ and $B$.
  • For each $a_i \in A$, find $\operatorname*{arg\,max}_{b_j \in B} \operatorname{sim}(a_i, b_j)$, where $\operatorname{sim}$ denotes cosine similarity of the attention keys.
  • Select the top $r$ pairs and replace each with a weighted average, weighted by how many source patches each token already represents.
  • Track each token's "size" $s$ and use it to adjust subsequent attention scores (proportional attention; see the sketch below).

This merging can be performed not only spatially, but also temporally, to compress across adjacent frames when redundancy is highest.
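
The following PyTorch sketch illustrates one such merge step under simplifying assumptions (a single sample, an even/odd index split, and made-up tensor shapes); it follows the recipe above but is not the reference ToMe implementation.

```python
import torch
import torch.nn.functional as F

def bipartite_soft_merge(x: torch.Tensor, keys: torch.Tensor,
                         size: torch.Tensor, r: int):
    """One ToMe-style merge step (simplified, single sample).
    x:    (N, D) token features
    keys: (N, Dk) attention keys used to measure similarity
    size: (N,) number of source patches each token already represents
    r:    number of token pairs to merge in this layer
    """
    n = x.size(0)
    a_idx = torch.arange(0, n, 2)          # set A: even-indexed tokens
    b_idx = torch.arange(1, n, 2)          # set B: odd-indexed tokens

    # Cosine similarity between every A token and every B token.
    sim = F.normalize(keys[a_idx], dim=-1) @ F.normalize(keys[b_idx], dim=-1).T
    best_sim, best_b = sim.max(dim=-1)     # most similar B partner for each A token
    merged_a = best_sim.topk(r).indices    # the r most redundant A tokens
    src, dst = a_idx[merged_a], b_idx[best_b[merged_a]]

    # Size-weighted averaging: accumulate weighted features into destinations,
    # so tokens that already absorbed many patches keep proportional influence.
    num = x * size[:, None]
    den = size.clone()
    num.index_add_(0, dst, num[src])
    den.index_add_(0, dst, den[src])

    keep = torch.ones(n, dtype=torch.bool)
    keep[src] = False                      # the merged-away A tokens are removed
    return (num / den[:, None])[keep], den[keep]

# Example: 196 tokens from one frame, merging r = 16 pairs in this layer.
x, keys, size = torch.randn(196, 768), torch.randn(196, 64), torch.ones(196)
x2, size2 = bipartite_soft_merge(x, keys, size, r=16)
print(x2.shape, size2.shape)               # torch.Size([180, 768]) torch.Size([180])
```

The returned sizes feed proportional attention in later layers, e.g., by adding $\log s$ to the attention logits so that a merged token counts for the patches it has absorbed.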

2.2 Saliency-Aware and Context-Adaptive Token Merging

Later methods (e.g., vid-TLDR (2403.13347), Learnable VTM (2410.23782)) improve upon similarity-based approaches by explicitly estimating token saliency—determining which tokens represent important content (e.g., foreground or moving objects) using mechanisms such as attention entropy or small neural networks. Saliency is used to:

  • Drop background or uninformative tokens early.
  • Weight token merges so that salient tokens dominate the merged representation.
  • Dynamically adapt the merge rate, preserving more tokens in complex or action-rich regions.

Saliency can be computed via:

  • Attention sharpness: Negative entropy of early attention distributions (see the sketch after this list).
  • Learned scores: Small MLPs trained to predict oracle token importance.
  • Application-specific proxies: e.g., classifier-free guidance magnitudes in diffusion models (2411.16720).
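
As a concrete illustration of the attention-sharpness route, the sketch below scores tokens by the negative entropy of their attention rows; the head-averaging and min-max normalization are simplifying assumptions rather than vid-TLDR's exact formulation.

```python
import torch

def attention_saliency(attn: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Sharpness-based saliency from a post-softmax attention map.
    attn: (H, N, N) attention from an early block; heads are averaged for simplicity.
    Tokens whose attention is sharply focused score high; diffuse,
    background-like tokens score low.
    """
    p = attn.mean(dim=0)                               # (N, N) head-averaged attention
    entropy = -(p * (p + eps).log()).sum(dim=-1)       # entropy of each token's attention row
    saliency = -entropy                                # negative entropy = sharpness
    # Min-max normalize so the scores can weight merges or set drop thresholds.
    return (saliency - saliency.min()) / (saliency.max() - saliency.min() + eps)

# Example: mark the 25% least salient tokens as candidates for early dropping.
attn = torch.softmax(torch.randn(12, 196, 196), dim=-1)
scores = attention_saliency(attn)
drop_candidates = scores.argsort()[: 196 // 4]
```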

2.3 Temporal and Semantic Token Merging

Frameworks such as TempMe (2409.01156) and PruneVid (2412.16117) focus on temporal redundancy, merging similar tokens across frames or over entire segments where content remains mostly static. More recent paradigms, such as trajectory-based tokenization (TrajViT (2505.23617)), group tokens according to object or part trajectories instead of spatio-temporal patches. This yields token counts that scale with scene complexity rather than video length.
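
A minimal sketch of the temporal idea, assuming patch tokens aligned on a fixed spatial grid and a hand-picked similarity threshold (a simplification, not TempMe's or PruneVid's actual procedure): each spatial position keeps a running representative token that absorbs subsequent frames for as long as the region stays effectively static.

```python
import torch
import torch.nn.functional as F

def merge_static_tokens(frames: torch.Tensor, thresh: float = 0.9) -> torch.Tensor:
    """frames: (T, N, D) patch tokens on a fixed spatial grid.
    Returns a merged token sequence whose length shrinks with temporal redundancy.
    """
    T, N, D = frames.shape
    rep = frames[0].clone()        # current representative token per spatial position
    cnt = torch.ones(N, 1)         # how many frames each representative covers
    out = []
    for t in range(1, T):
        sim = F.cosine_similarity(rep, frames[t], dim=-1)   # (N,) per-position similarity
        static = sim > thresh
        # Static positions: fold the new frame into the running average.
        rep[static] = (rep[static] * cnt[static] + frames[t][static]) / (cnt[static] + 1)
        cnt[static] += 1
        # Changed positions: emit the old representative and start a fresh one.
        changed = ~static
        out.append(rep[changed].clone())
        rep[changed], cnt[changed] = frames[t][changed], 1
    out.append(rep)
    return torch.cat(out)          # token count scales with motion, not with T * N

# Example: a nearly static 8-frame clip collapses to roughly one frame's worth of tokens.
base = torch.randn(196, 768)
clip = base.unsqueeze(0).repeat(8, 1, 1) + 0.01 * torch.randn(8, 196, 768)
print(merge_static_tokens(clip).shape[0])   # close to 196, far fewer than 8 * 196
```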

QuoTA (2503.08689) introduces task- or query-aware token assignment by scoring the relevance of each frame or object segment to a specific instruction, allocating more tokens only to crucial segments.
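
The allocation step can be pictured with a small sketch; the relevance scores and budget below are hypothetical, and QuoTA's actual query-relevance scoring is more involved, but the proportional-budget idea is the same.

```python
import torch

def allocate_token_budget(relevance: torch.Tensor, total_budget: int,
                          min_per_frame: int = 1) -> torch.Tensor:
    """relevance: (T,) query-relevance score per frame (higher = more relevant).
    Returns an integer token budget per frame, summing to roughly total_budget.
    """
    weights = torch.softmax(relevance, dim=0)                        # normalize scores
    budget = (weights * total_budget).floor().long().clamp(min=min_per_frame)
    return budget

# Example: 8 frames, a query that makes frames 4-5 most relevant, 512 tokens overall.
relevance = torch.tensor([0.1, 0.2, 0.3, 2.5, 2.0, 0.4, 0.2, 0.1])
budget = allocate_token_budget(relevance, total_budget=512)
print(budget, budget.sum())   # the relevant frames receive most of the 512-token budget
```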

3. Impact on Computational Efficiency and Model Performance

Methods such as ToMe have demonstrated up to 2.2× throughput increase for ViT-L on video with negligible (~0.3%) accuracy drop on Kinetics-400 (2210.09461), and similar accelerations for other transformers (ViViT, VideoMAE, TimeSformer) with marginal losses (2506.03885).

Saliency-aware methods (vid-TLDR, Learnable VTM) routinely provide up to 84% memory reduction and 6.89× throughput gains on long-form video without harming, and often improving, test set accuracy (2410.23782, 2403.13347).

Trajectory- and query-driven schemes (TrajViT, QuoTA) enable even more aggressive reduction, up to 10×–18× fewer tokens (2505.23617), while matching or exceeding previous state-of-the-art accuracy, particularly on long or complex videos, with gains reported across retrieval, classification, and generative video tasks.

4. Comparison with Token Pruning

Token merging differs fundamentally from pruning: rather than discarding tokens outright (which risks losing crucial content and produces variable tensor shapes that complicate batching), merging aggregates information from redundant patches or frames into fewer, more informative tokens. This results in:

  • Higher accuracy under strong compression: Merged representations integrate features, avoiding information loss typical of pruning (2210.09461).
  • Determinate sequence length: Output token count can be controlled and remains batch-friendly.
  • Plug-and-play deployment: Most schemes require no retraining and can be applied to pretrained transformers off-the-shelf.
  • Superior speed-accuracy trade-off: Outperforms SOTA pruning and clustering strategies both in throughput and in top-1/top-5 accuracy across benchmarks.

Table: Comparison of Key Characteristics

| Method | Merge/Prune | Saliency/Similarity | Training Required | Accuracy Retention | Efficiency |
|---|---|---|---|---|---|
| ToMe (2210.09461) | Merging | Similarity (attention keys) | No | High (−0.3%) | 2×–2.2× throughput |
| vid-TLDR (2403.13347) | Merging | Attention sharpness | No | No loss / improved | Up to 70% FLOPs cut |
| Pruning (DynamicViT) | Pruning | Learned / rule-based | Yes | Lower | Good |
| TrajViT (2505.23617) | Merging | Object trajectories | Yes (pretraining) | +6% over ViT3D | 10×–18× fewer tokens |

5. Applications and Extensions

Video token merging has broad utility across domains:

  • Efficient long video understanding: Enables multi-minute or high frame-rate analysis (e.g., in LVU or COIN) (2410.23782).
  • Video LLMs (VideoLLMs): Scalable processing of long sequences with reduced context bottleneck (HoliTom (2505.21334), Token Dynamics (2503.16980)).
  • Video generation and editing with diffusion models: Lower latency and improved temporal consistency by merging temporal tokens during denoising (VidToMe (2312.10656), ReToMe-VA (2408.05479), Importance-based merging (2411.16720)).
  • Text-video retrieval: Reduces inference complexity and memory, permits higher throughput and larger batch sizes (TempMe (2409.01156)).
  • Zero-shot and efficient multimodal QA: Query-aware merging (QuoTA (2503.08689), AIM (2412.03248)) allows resource allocation focused on relevant segments for video question answering and dialogue.

6. Challenges, Best Practices, and Future Directions

Scheduling and Placement: Merging schedules (constant or increasing with depth) and layer placement are critical. Best practices suggest that merging up to 10% of tokens per layer (roughly 60% in total) produces minimal accuracy loss with maximal speedup (2506.03885), and that early-layer merging provides the largest efficiency gains.
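
As a back-of-the-envelope check of such schedules (the layer count and rates below are assumptions for illustration, not the cited paper's exact setup), the snippet compares merging a fixed fraction of the remaining tokens per layer with merging a fixed count per layer.

```python
def remaining_tokens(n0, layers, rate=None, r=None):
    """Tokens left after `layers` merge steps: either merge a fraction `rate`
    of the remaining tokens per layer, or a fixed count `r` per layer."""
    n = n0
    for _ in range(layers):
        n -= int(rate * n) if rate is not None else r
    return max(n, 1)

n0 = 8 * 196                                        # e.g., 8 frames of 14x14 patches
print(remaining_tokens(n0, layers=12, rate=0.10))   # 446 remain (~72% merged)
print(remaining_tokens(n0, layers=12, r=n0 // 20))  # 632 remain (~60% merged)
```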

Saliency Estimation: The reliability of learned or derived saliency measures impacts performance. Attention sharpness and task-driven or query-driven scoring have yielded robust improvements; proxy cues (visual saliency, simple motion) are less predictive of oracle token value (2411.13626).

Compositional Merging: When carefully integrated, combining outer-LLM merging (applied before tokens enter the language model) with inner-LLM merging (applied within its layers) yields further redundancy reduction, to as little as 6.9% of the original FLOPs, with negligible performance drop (2505.21334).

Limitations and Open Problems: Some dynamic merging methods require additional computation for information density estimation, and very aggressive merging can harm small-object or fine-grained recognition. The learning of token value, even with sophisticated visual cues, remains challenging (2411.13626). Scaling token merging to fully online, streaming, and cross-modal video tasks is an active area for future research.

7. Summary Table: Video Token Merging—Selected Benchmarks

| Paper (Model/Method) | Token Reduction | Accuracy Δ | Throughput/FLOPs Gain | Video Task |
|---|---|---|---|---|
| ToMe (2210.09461) | 96–98% of tokens | −0.3% | 2×–2.2× | Kinetics-400, AudioSet |
| vid-TLDR (2403.13347) | Up to 50% FLOPs | 0% or better | Up to 70% FLOPs ↓ | VideoCLIP / retrieval |
| TempMe (2409.01156) | 95% fewer tokens | +4.4–7.9% | 1.8–13.7× speedup | Text-video retrieval |
| HoliTom (2505.21334) | 90% of tokens | −0.9% | 2.28× TTFT; 1.32× decoding | VideoLLM benchmarks |
| TrajViT (2505.23617) | 10× fewer tokens | +6% R@5 | 4×–18× faster | Retrieval, QA, VideoLLM |

Video token merging has emerged as a foundational approach for resource-efficient video transformer modeling, supporting advances in large-scale video understanding, real-time inference, and generalizable multimodal applications across the spectrum of video AI research.