Video Token Merging Techniques
- Video token merging is a family of methods that combine similar or redundant tokens in video transformers to reduce computational cost while preserving key information.
- Techniques use similarity measures and saliency estimation to intelligently aggregate token representations across spatial and temporal dimensions.
- These methods enable efficient long-video processing with significant speedup and memory reduction, supporting tasks like retrieval, classification, and video generation.
Video token merging is a family of methods for reducing the computational cost of video transformer models by intelligently combining similar or redundant tokens throughout a video sequence, rather than processing every patch or frame at a fixed granularity. These techniques are central to scaling video transformers to long sequences and enabling more efficient deployment while preserving accuracy.
1. Fundamental Principles and Goals
Video transformers operate on sequences of tokens, typically patch embeddings from each frame, so the token count grows with both frame count and spatial resolution, and self-attention compute and memory grow quadratically with that count. For example, a 16-frame clip at 224×224 resolution with 16×16 patches already yields 16 × 196 = 3,136 tokens. Many of these tokens are highly redundant, especially across similar frames or static regions. Video token merging algorithms exploit this redundancy by combining (merging) tokens that are similar in their representation, effectively compressing the sequence in both spatial and temporal dimensions.
Key goals of video token merging:
- Reduce the number of tokens entering costly self-attention modules.
- Preserve accuracy, or minimize its loss, by keeping semantically important tokens and aggregating information rather than simply dropping content.
- Enable faster inference and/or training with existing transformer architectures, often in a plug-and-play, training-free fashion.
2. Core Methodologies
2.1 Layerwise Bipartite Soft Matching and Merging
Representative of this approach is ToMe (2210.09461), where, at each transformer block, tokens are split into two sets (e.g., by index), and the most similar pairs (via cosine similarity of attention keys) across the sets are selected and merged. This is repeated per layer, progressively reducing the sequence length.
Mathematically:
- Given token representations x_1, ..., x_N, split them into two sets A and B (e.g., alternating by index).
- For each token a in A, find its most similar token b in B via cosine similarity of their attention keys.
- Select the top-r pairs and replace each pair by a weighted average, weighted by how many source patches each token represents.
- Track token "size" to adjust subsequent attention scores (proportional attention).
This merging can be performed not only spatially, but also temporally, to compress across adjacent frames when redundancy is highest.
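To make the procedure concrete, the following is a minimal PyTorch sketch of one layer of bipartite soft matching with size-weighted merging and proportional attention. It is an illustration under simplifying assumptions (single sample, no CLS-token handling); the names tome_merge and proportional_attention and the tensor shapes are chosen here for exposition and are not the official ToMe implementation.

```python
import torch
import torch.nn.functional as F

def tome_merge(x, keys, sizes, r):
    """Illustrative bipartite soft matching for one transformer block.

    x:     (N, d)  token features entering the block
    keys:  (N, dk) attention keys used to measure token similarity
    sizes: (N,)    number of original patches each token already represents
    r:     number of token pairs to merge in this layer
    Returns the (N - r) merged tokens and their updated sizes.
    """
    N = x.shape[0]
    a_idx = torch.arange(0, N, 2)            # set A: even-indexed tokens
    b_idx = torch.arange(1, N, 2)            # set B: odd-indexed tokens

    # Cosine similarity of attention keys between the two sets.
    sim = F.normalize(keys[a_idx], dim=-1) @ F.normalize(keys[b_idx], dim=-1).T

    # Each A-token proposes its most similar B-token; keep the r strongest edges.
    best_sim, best_b = sim.max(dim=-1)
    order = best_sim.argsort(descending=True)
    merged_a = a_idx[order[:r]]              # A-tokens merged away
    kept_a = a_idx[order[r:]]                # A-tokens that survive unchanged
    dst_b = b_idx[best_b[order[:r]]]         # their destination B-tokens

    # Size-weighted average: a token standing for s patches counts s times.
    weighted = x * sizes.float()[:, None]
    merged = weighted.clone()
    merged.index_add_(0, dst_b, weighted[merged_a])
    new_sizes = sizes.float().clone()
    new_sizes.index_add_(0, dst_b, new_sizes[merged_a])
    x_merged = merged / new_sizes[:, None]

    keep = torch.cat([kept_a, b_idx]).sort().values
    return x_merged[keep], new_sizes[keep]

def proportional_attention(q, k, sizes):
    """Bias attention logits by log(size) so merged tokens retain the
    aggregate influence of the patches they represent."""
    logits = (q @ k.T) / (q.shape[-1] ** 0.5)
    return torch.softmax(logits + sizes.float().log()[None, :], dim=-1)
```

Calling tome_merge once per block with a fixed r progressively shrinks the sequence; comparing keys of tokens at the same spatial location in adjacent frames instead gives the temporal variant described above.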
2.2 Saliency-Aware and Context-Adaptive Token Merging
Later methods (e.g., vid-TLDR (2403.13347), Learnable VTM (2410.23782)) improve upon similarity-based approaches by explicitly estimating token saliency—determining which tokens represent important content (e.g., foreground or moving objects) using mechanisms such as attention entropy or small neural networks. Saliency is used to:
- Drop background or uninformative tokens early.
- Weight token merges so that salient tokens dominate the merged representation.
- Dynamically adapt the merge rate, preserving more tokens in complex or action-rich regions.
Saliency can be computed via:
- Attention sharpness: Negative entropy of early attention distributions.
- Learned scores: Small MLPs trained to predict oracle token importance.
- Application-specific proxies: e.g., classifier-free guidance magnitudes in diffusion models (2411.16720).
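As a rough illustration of the attention-sharpness idea (a hedged sketch, not the exact vid-TLDR formulation; attention_saliency and drop_uninformative are names chosen here), saliency can be scored as the negative entropy of each token's attention distribution and used to discard uninformative tokens before merging:

```python
import torch

def attention_saliency(attn, eps=1e-8):
    """Saliency as attention sharpness: negative entropy of each token's
    post-softmax attention row, averaged over heads.

    attn: (heads, N, N) attention probabilities from an early block.
    Returns an (N,) score; sharper (lower-entropy) tokens score higher.
    """
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)   # (heads, N)
    return -entropy.mean(dim=0)

def drop_uninformative(x, saliency, keep_ratio=0.5):
    """Keep only the most salient tokens before any merging step."""
    k = max(1, int(x.shape[0] * keep_ratio))
    keep = saliency.topk(k).indices.sort().values        # preserve token order
    return x[keep], keep
```

The same scores can also serve as merge weights, so that salient tokens dominate the merged representation instead of being averaged away.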
2.3 Temporal and Semantic Token Merging
Frameworks such as TempMe (2409.01156) and PruneVid (2412.16117) focus on temporal redundancy, merging similar tokens across frames or over entire segments where content remains mostly static. More recent paradigms, such as trajectory-based tokenization (TrajViT (2505.23617)), group tokens according to object or part trajectories instead of spatio-temporal patches. This yields token counts that scale with scene complexity rather than video length.
QuoTA (2503.08689) introduces task- or query-aware token assignment by scoring the relevance of each frame or object segment to a specific instruction, allocating more tokens only to crucial segments.
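The sketch below illustrates the underlying idea of temporal merging in its simplest form (merge_static_tokens and the threshold tau are illustrative; TempMe and PruneVid use more elaborate, segment-level and provenance-aware merging): tokens that remain nearly identical at the same spatial position across consecutive frames are folded into a running average, and only changed tokens are emitted.

```python
import torch
import torch.nn.functional as F

def merge_static_tokens(frames, tau=0.9):
    """Collapse temporally static tokens across consecutive frames.

    frames: (T, P, d) patch tokens for T frames with P patches each.
    tau:    cosine-similarity threshold above which a token is treated as
            static and merged into the running token at that position.
    Returns a single tensor of surviving tokens (static runs collapsed).
    """
    T, P, _ = frames.shape
    ref = frames[0].clone()            # running merged token per position
    count = torch.ones(P)              # frames merged into each running token
    emitted = []                       # tokens flushed when content changes
    for t in range(1, T):
        sim = F.cosine_similarity(frames[t], ref, dim=-1)      # (P,)
        static = sim > tau
        # Static positions: incremental mean update of the running token.
        count = count + static.float()
        ref = torch.where(static[:, None],
                          ref + (frames[t] - ref) / count[:, None],
                          ref)
        # Dynamic positions: flush the old running token, start a new one.
        emitted.append(ref[~static])
        ref[~static] = frames[t][~static]
        count[~static] = 1.0
    emitted.append(ref)                # flush whatever is still running
    return torch.cat(emitted, dim=0)
```

The surviving token count then grows with scene change rather than raw frame count, the same intuition that trajectory-based tokenization pushes further by grouping tokens along object trajectories.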
3. Impact on Computational Efficiency and Model Performance
Methods such as ToMe have demonstrated up to a 2.2× throughput increase for ViT-L on video with a negligible (0.3%) accuracy drop on Kinetics-400 (2210.09461), and similar accelerations for other video transformers (ViViT, VideoMAE, TimeSformer) with marginal losses (2506.03885).
Saliency-aware methods (vid-TLDR, Learnable VTM) report up to 84% memory reduction and 6.89× throughput gains on long-form video without degrading test-set accuracy, and often improving it (2410.23782, 2403.13347).
Trajectory- and query-driven schemes (TrajViT, QuoTA) enable even more aggressive reduction (10×–18× fewer tokens; 2505.23617) while matching or exceeding prior state-of-the-art accuracy, particularly on long or complex videos, with gains reported in retrieval, classification, and generative video tasks.
4. Comparison to Traditional Token Pruning and Related Approaches
Token merging differs fundamentally from pruning: rather than discarding tokens outright (which risks loss of crucial content and variable tensor shapes for batching), merging aggregates information from redundant patches or frames into fewer, more informative tokens. This results in:
- Higher accuracy under strong compression: Merged representations integrate features, avoiding information loss typical of pruning (2210.09461).
- Determinate sequence length: Output token count can be controlled and remains batch-friendly.
- Plug-and-play deployment: Most schemes require no retraining and can be applied to pretrained transformers off-the-shelf.
- Superior speed-accuracy trade-off: Outperforms SOTA pruning and clustering strategies both in throughput and in top-1/top-5 accuracy across benchmarks.
Table: Comparison of Key Characteristics
| Method | Merge/Prune | Selection Criterion | Training Required | Accuracy Impact | Efficiency Gain |
|---|---|---|---|---|---|
| ToMe (2210.09461) | Merging | Key similarity | No | −0.3% | 2×–2.2× throughput |
| vid-TLDR (2403.13347) | Merging | Attention sharpness (saliency) | No | No drop, often improved | Up to 70% FLOPs reduction |
| Pruning (e.g., DynamicViT) | Pruning | Learned/rule-based scores | Yes | Larger drop | Good |
| TrajViT (2505.23617) | Merging | Object trajectories | Yes (pretraining) | +6% over ViT3D | 10×–18× fewer tokens |
5. Applications and Extensions
Video token merging has broad utility across domains:
- Efficient long video understanding: Enables multi-minute or high frame-rate analysis (e.g., in LVU or COIN) (2410.23782).
- Video LLMs (VideoLLMs): Scalable processing of long sequences with reduced context bottleneck (HoliTom (2505.21334), Token Dynamics (2503.16980)).
- Video generation and editing with diffusion models: Lower latency and improved temporal consistency by merging temporal tokens during denoising (VidToMe (2312.10656), ReToMe-VA (2408.05479), Importance-based merging (2411.16720)).
- Text-video retrieval: Reduces inference complexity and memory, permits higher throughput and larger batch sizes (TempMe (2409.01156)).
- Zero-shot and efficient multimodal QA: Query-aware merging (QuoTA (2503.08689), AIM (2412.03248)) allows resource allocation focused on relevant segments for video question answering and dialogue.
6. Challenges, Best Practices, and Future Directions
Scheduling and Placement: Merging schedules (constant vs. increasing per layer) and where merging is applied are critical. Reported best practice is to merge up to roughly 10% of tokens per layer (about 60% in total), which yields minimal accuracy loss with maximal speedup (2506.03885), and early-layer merging provides the largest efficiency gains.
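A minimal sketch of such a constant schedule, assuming the merge count per block is a fixed fraction of the input token count and capped by an overall reduction budget (constant_merge_schedule and its defaults are illustrative):

```python
def constant_merge_schedule(n_tokens, n_layers, per_layer=0.10, total=0.60):
    """Number of tokens to merge at each layer under a constant schedule.

    Merges ~per_layer of the original token count in every block until
    ~total of the tokens have been removed overall, then stops merging.
    """
    budget = int(n_tokens * total)          # total tokens to remove
    r = int(n_tokens * per_layer)           # tokens removed per block
    schedule = []
    for _ in range(n_layers):
        step = min(r, budget)
        schedule.append(step)
        budget -= step
    return schedule

# Example: 3,136 tokens (16 frames of 14x14 patches) in a 12-block encoder
# -> [313, 313, 313, 313, 313, 313, 3, 0, 0, 0, 0, 0]
```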
Saliency Estimation: The reliability of learned or derived saliency measures impacts performance. Attention sharpness and task-driven or query-driven scoring have yielded robust improvements; proxy cues (visual saliency, simple motion) are less predictive of oracle token value (2411.13626).
Compositional Merging: Combining outer-LLM merging (applied before the language model) with inner-LLM merging (applied inside it), when carefully integrated, yields further redundancy reduction (to as little as 6.9% of the original FLOPs) with negligible performance drop (2505.21334).
Limitations and Open Problems: Some dynamic merging methods require additional computation for information density estimation, and very aggressive merging can harm small-object or fine-grained recognition. The learning of token value, even with sophisticated visual cues, remains challenging (2411.13626). Scaling token merging to fully online, streaming, and cross-modal video tasks is an active area for future research.
7. Summary Table: Video Token Merging—Selected Benchmarks
| Paper (Model/Method) | Token/Compute Reduction | Accuracy Δ | Throughput/FLOPs Gain | Video Task |
|---|---|---|---|---|
| ToMe (2210.09461) | 96–98% of tokens merged | −0.3% | 2×–2.2× throughput | Kinetics-400, AudioSet |
| vid-TLDR (2403.13347) | Up to 50% of FLOPs | 0% or improved | Up to 70% FLOPs reduction | VideoCLIP/Retrieval |
| TempMe (2409.01156) | 95% fewer tokens | +4.4–7.9% | 1.8–13.7× speedup | Text-video retrieval |
| HoliTom (2505.21334) | ≈90% of tokens | −0.9% | 2.28× TTFT; 1.32× decoding | VideoLLM benchmarks |
| TrajViT (2505.23617) | 10× fewer tokens | +6% R@5 | 4×–18× faster | Retrieval, QA, VideoLLM |
Video token merging has emerged as a foundational approach for resource-efficient video transformer modeling, supporting advances in large-scale video understanding, real-time inference, and generalizable multimodal applications across the spectrum of video AI research.