Video Token Merging Techniques
- Video token merging is a family of methods that combine similar or redundant tokens in video transformers to reduce computational cost while preserving key information.
- Techniques use similarity measures and saliency estimation to intelligently aggregate token representations across spatial and temporal dimensions.
- These methods enable efficient long-video processing with significant speedup and memory reduction, supporting tasks like retrieval, classification, and video generation.
Video token merging is a family of methods for reducing the computational cost of video transformer models by intelligently combining similar or redundant tokens throughout a video sequence, rather than processing every patch or frame at a fixed granularity. These techniques are central to scaling video transformers to long sequences and enabling more efficient deployment while preserving accuracy.
1. Fundamental Principles and Goals
Video transformers operate on sequences of tokens—typically patch embeddings from each frame—resulting in quadratic growth in computation and memory with sequence length and spatial size. Many of these tokens are highly redundant, especially across similar frames or static regions. Video token merging algorithms exploit this redundancy by combining (merging) tokens that are similar in their representation, effectively compressing the sequence in both spatial and temporal dimensions.
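To make this scaling concrete, the following back-of-the-envelope sketch shows how quickly the token count, and hence the quadratic attention cost, grows. The clip dimensions are illustrative assumptions, not figures from any cited paper.

```python
# Illustrative token-count arithmetic (assumed clip dimensions).
frames, height, width, patch = 64, 224, 224, 16
tokens_per_frame = (height // patch) * (width // patch)  # 14 * 14 = 196
total_tokens = frames * tokens_per_frame                 # 64 * 196 = 12,544

# Self-attention cost grows with the square of the token count,
# so halving the tokens roughly quarters the attention FLOPs.
attention_pairs = total_tokens ** 2                      # ~1.57e8 pairwise interactions
print(total_tokens, attention_pairs)
```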
Key goals of video token merging:
- Reduce the number of tokens entering costly self-attention modules.
- Maintain accuracy (or minimize its loss) by preserving semantically important tokens and aggregating information rather than simply dropping content.
- Enable faster inference and/or training with existing transformer architectures, often in a plug-and-play, training-free fashion.
2. Core Methodologies
2.1 Layerwise Bipartite Soft Matching and Merging
Representative of this approach is ToMe (Bolya et al., 2022), where, at each transformer block, tokens are split into two sets (e.g., by index), and the most similar pairs (via cosine similarity of attention keys) across the sets are selected and merged. This is repeated per layer, progressively reducing the sequence length.
Mathematically:
- Given token representations $X = \{x_1, \dots, x_N\}$, split them into two sets $\mathbb{A}$ and $\mathbb{B}$ (e.g., by alternating index).
- For each token $a \in \mathbb{A}$, find its most similar token $b \in \mathbb{B}$ under cosine similarity of the attention keys, $\mathrm{sim}(a, b) = \frac{K_a \cdot K_b}{\lVert K_a \rVert\, \lVert K_b \rVert}$.
- Select the top-$r$ most similar pairs and replace each pair by a weighted average, weighted by how many source patches each token represents.
- Track token "size" to adjust subsequent attention scores (proportional attention).
This merging can be performed not only spatially, but also temporally, to compress across adjacent frames when redundancy is highest.
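The following PyTorch sketch illustrates the layerwise matching-and-merging step in a single-head, unbatched setting. It follows the description above but is a simplified illustration under assumed tensor shapes, not the reference ToMe implementation.

```python
import torch

def bipartite_soft_matching(keys: torch.Tensor, tokens: torch.Tensor,
                            sizes: torch.Tensor, r: int):
    """Minimal sketch of ToMe-style bipartite soft matching for one layer.

    keys:   (N, d) attention keys used for similarity
    tokens: (N, d) token representations to be merged
    sizes:  (N,)   number of source patches each token already represents
    r:      number of token pairs to merge at this layer
    """
    # 1. Split tokens into two sets A and B by alternating index.
    a_idx = torch.arange(0, keys.size(0), 2)
    b_idx = torch.arange(1, keys.size(0), 2)
    ka = torch.nn.functional.normalize(keys[a_idx], dim=-1)
    kb = torch.nn.functional.normalize(keys[b_idx], dim=-1)

    # 2. Cosine similarity of each A token to every B token.
    scores = ka @ kb.t()                      # (|A|, |B|)
    best_sim, best_b = scores.max(dim=-1)     # best partner in B for each A token

    # 3. Merge the r most similar A tokens into their B partners,
    #    weighting by token "size" so previously merged patches keep their influence.
    merge_a = best_sim.topk(r).indices
    src, dst = a_idx[merge_a], b_idx[best_b[merge_a]]
    merged, new_sizes = tokens.clone(), sizes.clone()
    for s, d in zip(src.tolist(), dst.tolist()):
        total = new_sizes[s] + new_sizes[d]
        merged[d] = (merged[s] * new_sizes[s] + merged[d] * new_sizes[d]) / total
        new_sizes[d] = total

    # 4. Drop the merged A tokens; return the reduced sequence and sizes.
    keep = torch.ones(tokens.size(0), dtype=torch.bool)
    keep[src] = False
    return merged[keep], new_sizes[keep]

# Usage sketch (one layer, 196 tokens of dim 64, merging r=16 pairs):
# x, k, s = torch.randn(196, 64), torch.randn(196, 64), torch.ones(196)
# x_reduced, s_reduced = bipartite_soft_matching(k, x, s, r=16)
```

In the full method, the tracked sizes also re-enter later attention layers as an additive log-size term inside the softmax (proportional attention), so that merged tokens retain influence proportional to the patches they summarize.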
2.2 Saliency-Aware and Context-Adaptive Token Merging
Later methods (e.g., vid-TLDR (Choi et al., 20 Mar 2024), Learnable VTM (Lee et al., 31 Oct 2024)) improve upon similarity-based approaches by explicitly estimating token saliency—determining which tokens represent important content (e.g., foreground or moving objects) using mechanisms such as attention entropy or small neural networks. Saliency is used to:
- Drop background or uninformative tokens early.
- Weight token merges so that salient tokens dominate the merged representation.
- Dynamically adapt the merge rate, preserving more tokens in complex or action-rich regions.
Saliency can be computed via:
- Attention sharpness: negative entropy of early attention distributions (see the sketch after this list).
- Learned scores: Small MLPs trained to predict oracle token importance.
- Application-specific proxies: e.g., Classifier-free guidance magnitudes in diffusion models (Wu et al., 23 Nov 2024).
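A minimal sketch of the attention-sharpness score, assuming access to the attention probabilities of an early transformer layer; the hook shown in the usage comment is hypothetical and the exact normalization differs across methods such as vid-TLDR.

```python
import torch

def attention_sharpness_saliency(attn: torch.Tensor) -> torch.Tensor:
    """Saliency as the negative entropy of each token's attention distribution.

    attn: (heads, N, N) attention probabilities from an early layer.
    Returns an (N,) score; higher means sharper (more focused) attention.
    """
    eps = 1e-8
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)  # (heads, N) entropy per query token
    return -entropy.mean(dim=0)                         # negative entropy, averaged over heads

# Usage sketch: keep the most salient half of the tokens, merge or drop the rest.
# attn = model.blocks[0].attn_probs          # hypothetical hook capturing attention maps
# scores = attention_sharpness_saliency(attn)
# keep_idx = scores.topk(k=scores.numel() // 2).indices
```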
2.3 Temporal and Semantic Token Merging
Frameworks such as TempMe (Shen et al., 2 Sep 2024) and PruneVid (Huang et al., 20 Dec 2024) focus on temporal redundancy, merging similar tokens across frames or over entire segments where content remains mostly static. More recent paradigms, such as trajectory-based tokenization (TrajViT (Zheng et al., 29 May 2025)), group tokens according to object or part trajectories instead of spatio-temporal patches. This yields token counts that scale with scene complexity rather than video length.
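The sketch below illustrates the underlying temporal-redundancy idea in its simplest form: co-located tokens in adjacent frames that barely change are folded into a running representative. The threshold, the position-wise matching, and the returned structure are illustrative assumptions, not the actual TempMe or PruneVid procedure.

```python
import torch

def merge_static_tokens_across_frames(tokens: torch.Tensor, threshold: float = 0.9):
    """Generic temporal merging sketch.

    tokens: (T, N, d) patch tokens for T frames with N spatial positions each.
    Returns per-frame dynamic tokens plus the merged static-region representatives.
    """
    T, N, d = tokens.shape
    kept = [(tokens[0], torch.arange(N))]      # frame 0 keeps all tokens
    reference = tokens[0].clone()              # running representative per position
    counts = torch.ones(N, 1)
    for t in range(1, T):
        sim = torch.nn.functional.cosine_similarity(tokens[t], reference, dim=-1)
        static = sim >= threshold              # positions that barely changed
        # Fold static tokens into the running representative (weighted average).
        reference[static] = (reference[static] * counts[static]
                             + tokens[t][static]) / (counts[static] + 1)
        counts[static] += 1
        # Dynamic positions reset the reference and survive as new tokens.
        reference[~static] = tokens[t][~static]
        counts[~static] = 1
        kept.append((tokens[t][~static], torch.arange(N)[~static]))
    return kept, reference, counts
```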
QuoTA (Luo et al., 11 Mar 2025) introduces task- or query-aware token assignment by scoring the relevance of each frame or object segment to a specific instruction, allocating more tokens only to crucial segments.
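In the same spirit, a query-aware allocation can be sketched as scoring frames against the instruction embedding and splitting a fixed token budget accordingly. This is an illustrative stand-in for the idea, not QuoTA's actual scoring pipeline; the temperature and budget rounding are assumptions.

```python
import torch

def query_aware_token_budget(frame_feats: torch.Tensor, query_feat: torch.Tensor,
                             total_budget: int) -> torch.Tensor:
    """Allocate more tokens to frames that are relevant to the query.

    frame_feats: (T, d) per-frame embeddings
    query_feat:  (d,)   embedding of the instruction or query
    Returns a (T,) integer tensor of per-frame token budgets (sums to <= total_budget).
    """
    relevance = torch.nn.functional.cosine_similarity(
        frame_feats, query_feat.unsqueeze(0), dim=-1)
    weights = torch.softmax(relevance / 0.1, dim=0)  # temperature sharpens the allocation
    return (weights * total_budget).floor().long()
```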
3. Impact on Computational Efficiency and Model Performance
Methods such as ToMe have demonstrated up to 2.2× throughput increase for ViT-L on video with negligible (0.3%) accuracy drop on Kinetics-400 (Bolya et al., 2022), and similar accelerations for other transformers (ViViT, VideoMAE, TimeSformer) with marginal losses (Pollard et al., 4 Jun 2025).
Saliency-aware methods (vid-TLDR, Learnable VTM) routinely provide up to 84% memory reduction and 6.89× throughput gains on long-form video without harming, and often improving, test set accuracy (Lee et al., 31 Oct 2024, Choi et al., 20 Mar 2024).
Trajectory- and query-driven schemes (TrajViT, QuoTA) enable even more aggressive reduction (up to 10×–18× fewer tokens (Zheng et al., 29 May 2025)) while either matching or exceeding previous SOTA accuracy—particularly on long or complex video tasks—including improvements in retrieval, classification, and generative video tasks.
4. Comparison to Traditional Token Pruning and Related Approaches
Token merging differs fundamentally from pruning: rather than discarding tokens outright (which risks loss of crucial content and variable tensor shapes for batching), merging aggregates information from redundant patches or frames into fewer, more informative tokens. This results in:
- Higher accuracy under strong compression: Merged representations integrate features, avoiding information loss typical of pruning (Bolya et al., 2022).
- Determinate sequence length: Output token count can be controlled and remains batch-friendly.
- Plug-and-play deployment: Most schemes require no retraining and can be applied to pretrained transformers off-the-shelf.
- Superior speed-accuracy trade-off: Outperforms SOTA pruning and clustering strategies both in throughput and in top-1/top-5 accuracy across benchmarks.
Table: Comparison of Key Characteristics
| Method | Merge/Prune | Saliency/Similarity | Training Required | Accuracy Retention | Efficiency |
|---|---|---|---|---|---|
| ToMe (Bolya et al., 2022) | Merging | Similarity (attention keys) | No | High (−0.3%) | 2×–2.2× throughput |
| vid-TLDR (Choi et al., 20 Mar 2024) | Merging | Attention sharpness | No | No drop / improved | Up to 70% FLOPs cut |
| Pruning (DynamicViT) | Pruning | Learned/rule-based | Yes | Lower | Good |
| TrajViT (Zheng et al., 29 May 2025) | Merging | Object trajectories | Yes (pretraining) | +6% over ViT3D | 10×–18× efficiency |
5. Applications and Extensions
Video token merging has broad utility across domains:
- Efficient long video understanding: Enables multi-minute or high frame-rate analysis (e.g., in LVU or COIN) (Lee et al., 31 Oct 2024).
- Video LLMs (VideoLLMs): Scalable processing of long sequences with reduced context bottleneck (HoliTom (Shao et al., 27 May 2025), Token Dynamics (Zhang et al., 21 Mar 2025)).
- Video generation and editing with diffusion models: Lower latency and improved temporal consistency by merging temporal tokens during denoising (VidToMe (Li et al., 2023), ReToMe-VA (Gao et al., 10 Aug 2024), Importance-based merging (Wu et al., 23 Nov 2024)).
- Text-video retrieval: Reduces inference complexity and memory, permits higher throughput and larger batch sizes (TempMe (Shen et al., 2 Sep 2024)).
- Zero-shot and efficient multimodal QA: Query-aware merging (QuoTA (Luo et al., 11 Mar 2025), AIM (Zhong et al., 4 Dec 2024)) allows resource allocation focused on relevant segments for video question answering and dialogue.
6. Challenges, Best Practices, and Future Directions
Scheduling and Placement: Merging schedules (constant or increasing with depth) and layer placement are critical. Best practices suggest that merging up to 10% of tokens per layer (roughly 60% in total) yields minimal accuracy loss with maximal speedup (Pollard et al., 4 Jun 2025), and that early-layer merging provides the largest efficiency gains.
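A minimal sketch of how such a constant schedule could be instantiated. The 10%/60% figures are the rule of thumb above; the helper, its defaults, and the example sizes are illustrative assumptions rather than code from any cited work.

```python
def constant_merge_schedule(num_tokens: int, num_layers: int,
                            per_layer_fraction: float = 0.10,
                            total_fraction: float = 0.60) -> list[int]:
    """Constant-r schedule: merge the same number of tokens at every layer
    until the overall reduction budget is exhausted."""
    r = int(num_tokens * per_layer_fraction)   # e.g., 10% of the input tokens per layer
    budget = int(num_tokens * total_fraction)  # e.g., 60% of tokens removed overall
    schedule, removed = [], 0
    for _ in range(num_layers):
        step = max(min(r, budget - removed), 0)
        schedule.append(step)
        removed += step
    return schedule

# Example: a 12-layer ViT processing 1,568 video tokens merges a constant
# number of tokens per layer until ~60% of the sequence has been removed.
print(constant_merge_schedule(1568, 12))
```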
Saliency Estimation: The reliability of learned or derived saliency measures impacts performance. Attention sharpness and task-driven or query-driven scoring have yielded robust improvements; proxy cues (visual saliency, simple motion) are less predictive of oracle token value (Hao et al., 20 Nov 2024).
Compositional Merging: Combining outer-LLM merging (before the language model) with inner-LLM merging (inside the language model), when carefully integrated, yields further redundancy reduction (to as little as 6.9% of the original FLOPs) with negligible performance drop (Shao et al., 27 May 2025).
Limitations and Open Problems: Some dynamic merging methods require additional computation for information density estimation, and very aggressive merging can harm small-object or fine-grained recognition. The learning of token value, even with sophisticated visual cues, remains challenging (Hao et al., 20 Nov 2024). Scaling token merging to fully online, streaming, and cross-modal video tasks is an active area for future research.
7. Summary Table: Video Token Merging—Selected Benchmarks
| Paper (Model/Method) | Token Reduction | Accuracy Δ | Throughput/FLOPs Gain | Video Task |
|---|---|---|---|---|
| ToMe (Bolya et al., 2022) | 96–98% of tokens | −0.3% | 2×–2.2× | Kinetics-400, AudioSet |
| vid-TLDR (Choi et al., 20 Mar 2024) | Up to 50% of FLOPs | 0% or improved | Up to 70% FLOPs ↓ | VideoCLIP / retrieval |
| TempMe (Shen et al., 2 Sep 2024) | 95% fewer tokens | +4.4–7.9% | 1.8×–13.7× speedup | Text-video retrieval |
| HoliTom (Shao et al., 27 May 2025) | 90% of tokens | −0.9% | 2.28× TTFT; 1.32× decoding | VideoLLM benchmarks |
| TrajViT (Zheng et al., 29 May 2025) | 10× fewer tokens | +6% R@5 | 4×–18× faster | Retrieval, QA, VideoLLM |
Video token merging has emerged as a foundational approach for resource-efficient video transformer modeling, supporting advances in large-scale video understanding, real-time inference, and generalizable multimodal applications across the spectrum of video AI research.