Video Token Merging Techniques
- Video token merging is a family of methods that combine similar or redundant tokens in video transformers to reduce computational cost while preserving key information.
- Techniques use similarity measures and saliency estimation to intelligently aggregate token representations across spatial and temporal dimensions.
- These methods enable efficient long-video processing with significant speedup and memory reduction, supporting tasks like retrieval, classification, and video generation.
Video token merging is a family of methods for reducing the computational cost of video transformer models by intelligently combining similar or redundant tokens throughout a video sequence, rather than processing every patch or frame at a fixed granularity. These techniques are central to scaling video transformers to long sequences and enabling more efficient deployment while preserving accuracy.
1. Fundamental Principles and Goals
Video transformers operate on sequences of tokens, typically patch embeddings from each frame, so the token count grows with both frame count and spatial resolution, and self-attention compute and memory grow quadratically with that count. For example, a 16-frame clip at 224×224 resolution with 16×16 patches already yields 16 × 196 = 3,136 tokens. Many of these tokens are highly redundant, especially across similar frames or static regions. Video token merging algorithms exploit this redundancy by combining (merging) tokens that are similar in their representation, effectively compressing the sequence in both spatial and temporal dimensions.
Key goals of video token merging:
- Reduce the number of tokens entering costly self-attention modules.
- Preserve accuracy, or minimize its loss, by keeping semantically important tokens and aggregating information rather than simply dropping content.
- Enable faster inference and/or training with existing transformer architectures, often in a plug-and-play, training-free fashion.
2. Core Methodologies
2.1 Layerwise Bipartite Soft Matching and Merging
Representative of this approach is ToMe (2210.09461), where, at each transformer block, tokens are split into two sets (e.g., by index), and the most similar pairs (via cosine similarity of attention keys) across the sets are selected and merged. This is repeated per layer, progressively reducing the sequence length.
Mathematically:
- Given token representations x_1, ..., x_N, split them into two sets A and B (e.g., alternating by index).
- For each token a in A, find its most similar token b in B via cosine similarity of their attention keys.
- Select the top-r pairs and replace each pair by a weighted average, weighted by how many source patches each token represents.
- Track token "size" to adjust subsequent attention scores (proportional attention).
This merging can be performed not only spatially, but also temporally, to compress across adjacent frames when redundancy is highest.
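To make the procedure concrete, the following is a minimal PyTorch sketch of one layer of bipartite soft matching with size-weighted merging and proportional attention. It is an illustration under simplifying assumptions (single sample, no CLS-token handling); the names tome_merge and proportional_attention and the tensor shapes are chosen here for exposition and are not the official ToMe implementation.

```python
import torch
import torch.nn.functional as F

def tome_merge(x, keys, sizes, r):
    """Illustrative bipartite soft matching for one transformer block.

    x:     (N, d)  token features entering the block
    keys:  (N, dk) attention keys used to measure token similarity
    sizes: (N,)    number of original patches each token already represents
    r:     number of token pairs to merge in this layer
    Returns the (N - r) merged tokens and their updated sizes.
    """
    N = x.shape[0]
    a_idx = torch.arange(0, N, 2)            # set A: even-indexed tokens
    b_idx = torch.arange(1, N, 2)            # set B: odd-indexed tokens

    # Cosine similarity of attention keys between the two sets.
    sim = F.normalize(keys[a_idx], dim=-1) @ F.normalize(keys[b_idx], dim=-1).T

    # Each A-token proposes its most similar B-token; keep the r strongest edges.
    best_sim, best_b = sim.max(dim=-1)
    order = best_sim.argsort(descending=True)
    merged_a = a_idx[order[:r]]              # A-tokens merged away
    kept_a = a_idx[order[r:]]                # A-tokens that survive unchanged
    dst_b = b_idx[best_b[order[:r]]]         # their destination B-tokens

    # Size-weighted average: a token standing for s patches counts s times.
    weighted = x * sizes.float()[:, None]
    merged = weighted.clone()
    merged.index_add_(0, dst_b, weighted[merged_a])
    new_sizes = sizes.float().clone()
    new_sizes.index_add_(0, dst_b, new_sizes[merged_a])
    x_merged = merged / new_sizes[:, None]

    keep = torch.cat([kept_a, b_idx]).sort().values
    return x_merged[keep], new_sizes[keep]

def proportional_attention(q, k, sizes):
    """Bias attention logits by log(size) so merged tokens retain the
    aggregate influence of the patches they represent."""
    logits = (q @ k.T) / (q.shape[-1] ** 0.5)
    return torch.softmax(logits + sizes.float().log()[None, :], dim=-1)
```

Calling tome_merge once per block with a fixed r progressively shrinks the sequence; comparing keys of tokens at the same spatial location in adjacent frames instead gives the temporal variant described above.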
2.2 Saliency-Aware and Context-Adaptive Token Merging
Later methods (e.g., vid-TLDR (2403.13347), Learnable VTM (2410.23782)) improve upon similarity-based approaches by explicitly estimating token saliency—determining which tokens represent important content (e.g., foreground or moving objects) using mechanisms such as attention entropy or small neural networks. Saliency is used to:
- Drop background or uninformative tokens early.
- Weight token merges so that salient tokens dominate the merged representation.
- Dynamically adapt the merge rate, preserving more tokens in complex or action-rich regions.
Saliency can be computed via:
- Attention sharpness: Negative entropy of early attention distributions.
- Learned scores: Small MLPs trained to predict oracle token importance.
- Application-specific proxies: e.g., classifier-free guidance magnitudes in diffusion models (2411.16720).
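As a rough illustration of the attention-sharpness idea (a hedged sketch, not the exact vid-TLDR formulation; attention_saliency and drop_uninformative are names chosen here), saliency can be scored as the negative entropy of each token's attention distribution and used to discard uninformative tokens before merging:

```python
import torch

def attention_saliency(attn, eps=1e-8):
    """Saliency as attention sharpness: negative entropy of each token's
    post-softmax attention row, averaged over heads.

    attn: (heads, N, N) attention probabilities from an early block.
    Returns an (N,) score; sharper (lower-entropy) tokens score higher.
    """
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)   # (heads, N)
    return -entropy.mean(dim=0)

def drop_uninformative(x, saliency, keep_ratio=0.5):
    """Keep only the most salient tokens before any merging step."""
    k = max(1, int(x.shape[0] * keep_ratio))
    keep = saliency.topk(k).indices.sort().values        # preserve token order
    return x[keep], keep
```

The same scores can also serve as merge weights, so that salient tokens dominate the merged representation instead of being averaged away.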
2.3 Temporal and Semantic Token Merging
Frameworks such as TempMe (2409.01156) and PruneVid (2412.16117) focus on temporal redundancy, merging similar tokens across frames or over entire segments where content remains mostly static. More recent paradigms, such as trajectory-based tokenization (TrajViT (2505.23617)), group tokens according to object or part trajectories instead of spatio-temporal patches. This yields token counts that scale with scene complexity rather than video length.
QuoTA (2503.08689) introduces task- or query-aware token assignment by scoring the relevance of each frame or object segment to a specific instruction, allocating more tokens only to crucial segments.
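The sketch below illustrates the underlying idea of temporal merging in its simplest form (merge_static_tokens and the threshold tau are illustrative; TempMe and PruneVid use more elaborate, segment-level and provenance-aware merging): tokens that remain nearly identical at the same spatial position across consecutive frames are folded into a running average, and only changed tokens are emitted.

```python
import torch
import torch.nn.functional as F

def merge_static_tokens(frames, tau=0.9):
    """Collapse temporally static tokens across consecutive frames.

    frames: (T, P, d) patch tokens for T frames with P patches each.
    tau:    cosine-similarity threshold above which a token is treated as
            static and merged into the running token at that position.
    Returns a single tensor of surviving tokens (static runs collapsed).
    """
    T, P, _ = frames.shape
    ref = frames[0].clone()            # running merged token per position
    count = torch.ones(P)              # frames merged into each running token
    emitted = []                       # tokens flushed when content changes
    for t in range(1, T):
        sim = F.cosine_similarity(frames[t], ref, dim=-1)      # (P,)
        static = sim > tau
        # Static positions: incremental mean update of the running token.
        count = count + static.float()
        ref = torch.where(static[:, None],
                          ref + (frames[t] - ref) / count[:, None],
                          ref)
        # Dynamic positions: flush the old running token, start a new one.
        emitted.append(ref[~static])
        ref[~static] = frames[t][~static]
        count[~static] = 1.0
    emitted.append(ref)                # flush whatever is still running
    return torch.cat(emitted, dim=0)
```

The surviving token count then grows with scene change rather than raw frame count, the same intuition that trajectory-based tokenization pushes further by grouping tokens along object trajectories.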
3. Impact on Computational Efficiency and Model Performance
Methods such as ToMe have demonstrated up to a 2.2× throughput increase for ViT-L on video with a negligible (0.3%) accuracy drop on Kinetics-400 (2210.09461), and similar accelerations for other video transformers (ViViT, VideoMAE, TimeSformer) with marginal losses (2506.03885).
Saliency-aware methods (vid-TLDR, Learnable VTM) report up to 84% memory reduction and 6.89× throughput gains on long-form video without degrading test-set accuracy, and often improving it (2410.23782, 2403.13347).
Trajectory- and query-driven schemes (TrajViT, QuoTA) enable even more aggressive reduction (10×–18× fewer tokens; 2505.23617) while matching or exceeding prior state-of-the-art accuracy, particularly on long or complex videos, with gains reported in retrieval, classification, and generative video tasks.
4. Comparison to Traditional Token Pruning and Related Approaches
Token merging differs fundamentally from pruning: rather than discarding tokens outright (which risks loss of crucial content and variable tensor shapes for batching), merging aggregates information from redundant patches or frames into fewer, more informative tokens. This results in:
- Higher accuracy under strong compression: Merged representations integrate features, avoiding information loss typical of pruning (2210.09461).
- Determinate sequence length: Output token count can be controlled and remains batch-friendly.
- Plug-and-play deployment: Most schemes require no retraining and can be applied to pretrained transformers off-the-shelf.
- Superior speed-accuracy trade-off: Outperforms SOTA pruning and clustering strategies both in throughput and in top-1/top-5 accuracy across benchmarks.
Table: Comparison of Key Characteristics
| Method | Merge/Prune | Selection Criterion | Training Required | Accuracy Impact | Efficiency Gain |
|---|---|---|---|---|---|
| ToMe (2210.09461) | Merging | Key similarity | No | −0.3% | 2×–2.2× throughput |
| vid-TLDR (2403.13347) | Merging | Attention sharpness (saliency) | No | No drop, often improved | Up to 70% FLOPs reduction |
| Pruning (e.g., DynamicViT) | Pruning | Learned/rule-based scores | Yes | Larger drop | Good |
| TrajViT (2505.23617) | Merging | Object trajectories | Yes (pretraining) | +6% over ViT3D | 10×–18× fewer tokens |
5. Applications and Extensions
Video token merging has broad utility across domains:
- Efficient long video understanding: Enables multi-minute or high frame-rate analysis (e.g., in LVU or COIN) (2410.23782).
- Video LLMs (VideoLLMs): Scalable processing of long sequences with reduced context bottleneck (HoliTom (2505.21334), Token Dynamics (2503.16980)).
- Video generation and editing with diffusion models: Lower latency and improved temporal consistency by merging temporal tokens during denoising (VidToMe (2312.10656), ReToMe-VA (2408.05479), Importance-based merging (2411.16720)).
- Text-video retrieval: Reduces inference complexity and memory, permits higher throughput and larger batch sizes (TempMe (2409.01156)).
- Zero-shot and efficient multimodal QA: Query-aware merging (QuoTA (2503.08689), AIM (2412.03248)) allows resource allocation focused on relevant segments for video question answering and dialogue.
6. Challenges, Best Practices, and Future Directions
Scheduling and Placement: Merging schedules (constant vs. increasing per layer) and where merging is applied are critical. Reported best practice is to merge up to roughly 10% of tokens per layer (about 60% in total), which yields minimal accuracy loss with maximal speedup (2506.03885), and early-layer merging provides the largest efficiency gains.
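A minimal sketch of such a constant schedule, assuming the merge count per block is a fixed fraction of the input token count and capped by an overall reduction budget (constant_merge_schedule and its defaults are illustrative):

```python
def constant_merge_schedule(n_tokens, n_layers, per_layer=0.10, total=0.60):
    """Number of tokens to merge at each layer under a constant schedule.

    Merges ~per_layer of the original token count in every block until
    ~total of the tokens have been removed overall, then stops merging.
    """
    budget = int(n_tokens * total)          # total tokens to remove
    r = int(n_tokens * per_layer)           # tokens removed per block
    schedule = []
    for _ in range(n_layers):
        step = min(r, budget)
        schedule.append(step)
        budget -= step
    return schedule

# Example: 3,136 tokens (16 frames of 14x14 patches) in a 12-block encoder
# -> [313, 313, 313, 313, 313, 313, 3, 0, 0, 0, 0, 0]
```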
Saliency Estimation: The reliability of learned or derived saliency measures impacts performance. Attention sharpness and task-driven or query-driven scoring have yielded robust improvements; proxy cues (visual saliency, simple motion) are less predictive of oracle token value (2411.13626).
Compositional Merging: Combining outer-LLM merging (applied before the language model) with inner-LLM merging (applied inside it), when carefully integrated, yields further redundancy reduction (to as little as 6.9% of the original FLOPs) with negligible performance drop (2505.21334).
Limitations and Open Problems: Some dynamic merging methods require additional computation for information density estimation, and very aggressive merging can harm small-object or fine-grained recognition. The learning of token value, even with sophisticated visual cues, remains challenging (2411.13626). Scaling token merging to fully online, streaming, and cross-modal video tasks is an active area for future research.
7. Summary Table: Video Token Merging—Selected Benchmarks
| Paper (Model/Method) | Token/Compute Reduction | Accuracy Δ | Throughput/FLOPs Gain | Video Task |
|---|---|---|---|---|
| ToMe (2210.09461) | 96–98% of tokens merged | −0.3% | 2×–2.2× throughput | Kinetics-400, AudioSet |
| vid-TLDR (2403.13347) | Up to 50% of FLOPs | 0% or improved | Up to 70% FLOPs reduction | VideoCLIP/Retrieval |
| TempMe (2409.01156) | 95% fewer tokens | +4.4–7.9% | 1.8–13.7× speedup | Text-video retrieval |
| HoliTom (2505.21334) | ≈90% of tokens | −0.9% | 2.28× TTFT; 1.32× decoding | VideoLLM benchmarks |
| TrajViT (2505.23617) | 10× fewer tokens | +6% R@5 | 4×–18× faster | Retrieval, QA, VideoLLM |
Video token merging has emerged as a foundational approach for resource-efficient video transformer modeling, supporting advances in large-scale video understanding, real-time inference, and generalizable multimodal applications across the spectrum of video AI research.