MMG-Vid: Efficient Visual Token Pruning
- MMG-Vid is a training-free visual token pruning framework that reduces computational load in video LLMs by dynamically optimizing token selection at both segment and token levels.
- It segments videos based on frame similarity, allocates dynamic token budgets, and applies temporal-guided clustering to preserve the most informative and diverse tokens.
- MMG-Vid achieves over 99.5% performance retention while pruning 75% of visual tokens, significantly reducing inference latency for real-world video-language applications.
MMG-Vid is a training-free visual token pruning framework for video large language models (VLLMs) that addresses the computational challenges posed by excessive visual token counts during inference. MMG-Vid removes redundancy by dynamically maximizing marginal gains at both the segment level and the token level. The method prioritizes tokens based on their representativeness and diversity within semantic segments and over time, yielding substantial reductions in token count and inference latency with negligible performance degradation on video-language tasks (Ma et al., 28 Aug 2025).
1. Motivation and Framework Overview
VLLMs process video inputs by converting each frame into hundreds or thousands of visual tokens, feeding these into quadratic-complexity self-attention layers. This approach yields excellent video understanding, but the computational cost—especially at inference—limits deployment for real-world tasks where efficiency is paramount. Token pruning is essential, yet prior work either applies naive, static strategies or ignores video dynamics that introduce temporal redundancy.
MMG-Vid operates in three stages:
- Segment-Level Pruning: Videos are divided into meaningful segments based on frame similarity metrics.
- Dynamic Token Budget Allocation: Each segment receives a tailored number of tokens, determined by its marginal informational value.
- Token-Level Pruning with Temporal Guidance: Within each segment, a temporal-guided clustering algorithm preserves tokens that are locally and temporally salient.
This training-free, plug-and-play pipeline is agnostic to underlying VLLM architectures and requires no retraining.
2. Segment-Level Division and Dynamic Token Budgeting
Videos are first segmented based on frame-level feature similarity. Consecutive frame embeddings are compared with cosine similarity, $s_t = \frac{f_t \cdot f_{t+1}}{\lVert f_t \rVert \, \lVert f_{t+1} \rVert}$, and frames where this metric drops below a threshold become segment boundaries. To avoid overly granular segmentation, single-frame segments are merged with their most similar neighbor.
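To make the boundary rule concrete, the following sketch assumes each frame is summarized by a single embedding (for example, mean-pooled visual tokens); the threshold value and the exact merging rule for single-frame segments are illustrative choices, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def segment_video(frame_feats: torch.Tensor, tau: float = 0.8) -> list[list[int]]:
    """Split a video into segments by thresholding cosine similarity between
    consecutive frame embeddings. frame_feats: (T, d), one embedding per frame."""
    # A boundary is placed wherever similarity to the previous frame falls below tau.
    sims = F.cosine_similarity(frame_feats[:-1], frame_feats[1:], dim=-1)  # (T-1,)
    segments, current = [], [0]
    for t in range(1, frame_feats.size(0)):
        if sims[t - 1] < tau:
            segments.append(current)
            current = []
        current.append(t)
    segments.append(current)

    # Merge single-frame segments into the more similar adjacent segment.
    merged: list[list[int]] = []
    for i, seg in enumerate(segments):
        if len(seg) > 1 or len(segments) == 1:
            merged.append(seg)
            continue
        t = seg[0]
        left = (F.cosine_similarity(frame_feats[t], frame_feats[merged[-1][-1]], dim=0)
                if merged else torch.tensor(-1.0))
        right = (F.cosine_similarity(frame_feats[t], frame_feats[segments[i + 1][0]], dim=0)
                 if i + 1 < len(segments) else torch.tensor(-1.0))
        if merged and bool(left >= right):
            merged[-1].append(t)
        else:
            segments[i + 1].insert(0, t)
    return merged
```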
Each segment receives a dynamically apportioned token budget via a marginal gain formulation,

$$G(s_i) = \alpha \cdot \cos\big(e_i, \bar{e}_U\big) - (1 - \alpha) \cdot \cos\big(e_i, \bar{e}_S\big),$$

where:
- $e_i$ is the segment embedding,
- $\bar{e}_U$ is the average embedding of the unselected segments,
- $\bar{e}_S$ is the average embedding of the already selected segments,
- $\alpha$ controls the trade-off between representativeness and diversity.

Budgets are computed by Z-score normalizing the gains, $z_i = (G(s_i) - \mu_G)/\sigma_G$, and allocating each segment a share of the total token budget that grows with its normalized gain. This mechanism ensures that complex, information-rich segments are prioritized with more tokens.
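The sketch below illustrates this budgeting stage. The greedy marginal-gain pass, the cosine-similarity instantiation of representativeness and diversity, and the softmax mapping from Z-scores to integer budgets are assumptions made for illustration; the paper's exact allocation rule may differ.

```python
import torch
import torch.nn.functional as F

def allocate_budgets(seg_embeds: torch.Tensor, total_budget: int,
                     alpha: float = 0.5, min_tokens: int = 1) -> torch.Tensor:
    """seg_embeds: (S, d), one embedding per segment (e.g. the mean of its frame
    embeddings). Returns an integer token budget per segment."""
    S = seg_embeds.size(0)
    selected: list[int] = []
    unselected = list(range(S))
    gains = torch.zeros(S)

    # Greedy pass: record each segment's marginal gain at the moment it is selected.
    for _ in range(S):
        u_mean = seg_embeds[unselected].mean(dim=0)
        s_mean = (seg_embeds[selected].mean(dim=0) if selected
                  else torch.zeros_like(u_mean))
        rep = F.cosine_similarity(seg_embeds[unselected], u_mean.unsqueeze(0))  # representativeness
        dup = F.cosine_similarity(seg_embeds[unselected], s_mean.unsqueeze(0))  # similarity to chosen
        g = alpha * rep - (1 - alpha) * dup
        best = int(torch.argmax(g))
        gains[unselected[best]] = g[best]
        selected.append(unselected.pop(best))

    # Z-score normalise the gains and map them to budgets (softmax is one plausible
    # mapping); information-rich segments therefore receive more tokens.
    z = (gains - gains.mean()) / (gains.std(unbiased=False) + 1e-6)
    budgets = (F.softmax(z, dim=0) * total_budget).round().long().clamp(min=min_tokens)
    return budgets
```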
3. Token-Level Pruning via Temporal-Guided DPC
Within each segment, token selection considers both intra-frame saliency and inter-frame novelty.
For the first frame in a segment, the Density-Peak Clustering with k-nearest neighbors (DPC-KNN) approach is used:
- Compute the local density of token $x_i$ from its $k$ nearest neighbors: $\rho_i = \exp\!\big(-\tfrac{1}{k}\sum_{x_j \in \mathrm{KNN}(x_i)} \lVert x_i - x_j \rVert^2\big)$.
- Determine the minimum distance to any higher-density token: $\delta_i = \min_{j:\,\rho_j > \rho_i} \lVert x_i - x_j \rVert$ (the densest token instead takes the maximum pairwise distance).
- Set the token score $s_i = \rho_i \cdot \delta_i$ and select the highest-scoring tokens (a sketch follows below).
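The sketch below follows the standard DPC-KNN formulation of density and separation; the Gaussian form of the density and the default $k$ are conventional choices assumed here rather than the paper's exact hyperparameters.

```python
import torch

def dpc_knn_select(tokens: torch.Tensor, num_keep: int, k: int = 5) -> torch.Tensor:
    """Select `num_keep` salient, mutually distant tokens from one frame with
    DPC-KNN scoring. tokens: (N, d); returns indices of the kept tokens."""
    N = tokens.size(0)
    dist = torch.cdist(tokens, tokens)                       # (N, N) pairwise distances

    # Local density: high when a token's k nearest neighbours are close to it.
    knn_dist, _ = dist.topk(min(k + 1, N), largest=False)    # row-wise; includes self (0)
    rho = torch.exp(-knn_dist[:, 1:].pow(2).mean(dim=1))     # (N,)

    # Separation: distance to the nearest token of higher density; the densest
    # token takes the maximum pairwise distance instead.
    higher = rho.unsqueeze(0) > rho.unsqueeze(1)             # higher[i, j]: rho_j > rho_i
    delta = dist.masked_fill(~higher, float("inf")).min(dim=1).values
    delta[delta == float("inf")] = dist.max()

    # Score each token and keep the top scorers.
    score = rho * delta
    return score.topk(min(num_keep, N)).indices
```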
For subsequent frames, the Temporal-Guided DPC (TG-DPC) algorithm uses reference tokens previously selected from earlier frames:
- Temporal relevance density: $\tilde{\rho}_i$ is computed as the DPC-KNN density of token $x_i$ over $\mathrm{KNN}_R(x_i)$, its $k$ nearest reference tokens from previous frames, so it is high when the token repeats content that was already kept.
- Temporal intra-frame separation: $\tilde{\delta}_i$ is computed analogously to the DPC separation, but within the current frame.
- New score: temporal density and intra-frame separation are combined into a single score that increases with $\tilde{\delta}_i$ and decreases with $\tilde{\rho}_i$, so that temporally novel, spatially distinct tokens rank highest.
Tokens are then selected to maximize temporal uniqueness and maintain intra-frame diversity, making full use of the strict per-segment token budget (a sketch follows below).
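A sketch of the temporal-guided selection is given below. Combining temporal density and intra-frame separation as a ratio is an assumption consistent with the description above, not the paper's exact scoring function.

```python
import torch

def tg_dpc_select(tokens: torch.Tensor, ref_tokens: torch.Tensor,
                  num_keep: int, k: int = 5) -> torch.Tensor:
    """Temporal-guided token selection for a non-first frame. tokens: (N, d)
    current-frame tokens; ref_tokens: (M, d) tokens already kept from earlier
    frames in the segment. Returns indices of the kept tokens."""
    N = tokens.size(0)

    # Temporal relevance density: high when a token sits close to its k nearest
    # reference tokens, i.e. when it repeats content that was already kept.
    ref_dist = torch.cdist(tokens, ref_tokens)                        # (N, M)
    knn_ref, _ = ref_dist.topk(min(k, ref_tokens.size(0)), largest=False)
    rho_temp = torch.exp(-knn_ref.pow(2).mean(dim=1))                 # (N,)

    # Intra-frame separation, computed within the current frame as in DPC:
    # distance to the nearest token with higher temporal density.
    dist = torch.cdist(tokens, tokens)
    higher = rho_temp.unsqueeze(0) > rho_temp.unsqueeze(1)
    delta = dist.masked_fill(~higher, float("inf")).min(dim=1).values
    delta[delta == float("inf")] = dist.max()

    # Favour tokens that are temporally novel (low rho_temp) yet distinct within
    # the frame (high delta).
    score = delta / (rho_temp + 1e-6)
    return score.topk(min(num_keep, N)).indices
```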
4. Performance Evaluation and Metrics
The quality of a selected token subset $\mathcal{S}$ is quantified by trading off a representativeness term $R(\mathcal{S})$, which measures how well the kept tokens cover the original token set, against a redundancy term $D(\mathcal{S})$, which quantifies duplication among the kept tokens.
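As an illustration only, the snippet below instantiates the representativeness term as the mean best-match cosine similarity of every original token to the kept subset and the redundancy term as the mean pairwise similarity among the kept tokens, combining them by simple subtraction; these choices are assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def subset_quality(all_tokens: torch.Tensor, kept_idx: torch.Tensor) -> torch.Tensor:
    """Illustrative quality score for a kept subset: representativeness (how well
    the kept tokens cover the full set) minus redundancy (mean pairwise similarity
    among the kept tokens). all_tokens: (N, d); kept_idx: (K,) long tensor."""
    x = F.normalize(all_tokens, dim=-1)
    kept = x[kept_idx]                                    # (K, d)
    rep = (x @ kept.T).max(dim=1).values.mean()           # coverage of the full set
    pair = kept @ kept.T
    K = kept.size(0)
    red = (pair.sum() - pair.diagonal().sum()) / max(K * (K - 1), 1)
    return rep - red
```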
On LLaVA-OneVision-7B, MMG-Vid retains >99.5% of the original VLLM performance even after pruning 75% of visual tokens. Prefilling latency is reduced by 3.9× and generation time by 3.1× relative to the baseline. These improvements are consistent across video benchmarks such as MVBench, LongVideoBench, MLVU, and VideoMME.
5. Comparative Analysis and Ablation Studies
Ablation results establish the necessity of dynamic segment budgeting and temporal-guided token pruning. Replacing TG-DPC with standard frame-wise clustering or using uniform token allocation yields observable drops in performance. Compared to previous approaches (FastV, VisionZip, PruneVid, FrameFusion), MMG-Vid consistently demonstrates either equivalent or superior retention of accuracy under substantially lower token counts.
6. Implications for Video-LLM Deployment
MMG-Vid enables real-time or resource-constrained deployment of VLLMs by strategically pruning the visual token sequences that dominate inference cost, while maintaining semantic and temporal comprehensiveness in the video token representation. Its training-free and modular design makes it broadly applicable and easily integrated with existing VLLM architectures.
A plausible implication is that such efficient token selection will become standard for mobile, edge, or low-latency VLLM deployments, and may further encourage research into adaptive, context-aware pruning strategies. The framework’s design—jointly maximizing intra-segment and intra-frame marginal gains—represents a generalizable approach for redundancy reduction in sequential visual token processing.
7. Future Directions
The MMG-Vid framework suggests several avenues for further exploration:
- Enhancing temporal modeling by conditioning token selection on higher-order dynamics or longer-term dependencies.
- Extending pruning strategies to multi-modal fusion, for example in joint video-text architectures.
- Incorporating training-aware or differentiable pruning mechanisms, potentially improving end-to-end integration.
- Investigating how such strategies translate to other domains where sequential redundancy is an issue (e.g., long-form document LLMs, medical video analytics).
In conclusion, MMG-Vid maximizes efficiency and maintains fidelity for video LLMs in real-world applications by jointly optimizing segment-level and token-level marginal gains—a significant contribution to the deployment and scalability of video–language understanding systems (Ma et al., 28 Aug 2025).