LLaVA-Scissor: Efficient Token Compression for VLLMs
- LLaVA-Scissor is a training-free, semantic-region-based method for token compression in Video LLMs (VLLMs).
- It uses Spatial and Temporal Semantic Connected Components (SCC) to group and represent tokens, preserving semantics while significantly reducing redundancy.
- LLaVA-Scissor improves efficiency and performance in video understanding tasks like QA and long video analysis, especially under tight resource constraints.
LLaVA-Scissor is a training-free, semantic-region-based token compression method for video multimodal large language models (VLLMs). The approach is motivated by the need to reduce computation and memory costs in video understanding tasks, where conventional vision-language models produce a large volume of redundant tokens from individually encoded frames. LLaVA-Scissor introduces Semantic Connected Components (SCC), a partitioning of tokens into mutually exclusive, semantically coherent groups, yielding a compact and comprehensive token representation for long videos and complex video-based question answering scenarios (2506.21862).
1. Motivation and Background
VLLMs process visual streams by independently encoding frames, resulting in a large, often redundant set of visual tokens. Previous compression methods typically rank and retain tokens by attention scores (e.g., from a [CLS] token); these strategies concentrate on the most salient regions and fail to cover the full scope of video semantics. Moreover, overlapping attention leads to redundant token selection, diminishing both efficiency and semantic completeness. Architecture-based or retrained compression modules offer improvements but require model-specific training and are not universally portable.
LLaVA-Scissor addresses these issues by introducing a training-free, architecture-agnostic method that targets both comprehensive semantic coverage and redundancy minimization.
2. Semantic Connected Components (SCC) Compression
Methodological Overview
LLaVA-Scissor applies the SCC strategy in two steps—spatially per video frame, and temporally across video frames:
- Spatial SCC (per frame): Each frame's token set is mapped into a similarity graph. Pairwise similarities are computed between tokens, and edges above a fixed threshold $\tau$ connect tokens representing similar semantic regions:

  $$A_{ij} = \begin{cases} 1 & \text{if } \mathrm{sim}(x_i, x_j) \geq \tau \\ 0 & \text{otherwise} \end{cases}$$

  where $A \in \{0,1\}^{N \times N}$ is the binary adjacency matrix for the $N$ tokens $x_1, \dots, x_N$.
- Connected Component Analysis: Graph components are extracted using the Union-Find data structure with path compression and union by rank for efficiency. Each connected component $C_k$ corresponds to a semantically homogeneous region in token space. The partitioning ensures that $\bigcup_k C_k = \{x_1, \dots, x_N\}$ and $C_i \cap C_j = \emptyset$ for $i \neq j$.
- Spatial Aggregation: For each component $C_k$, a representative token is computed by averaging: $z_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$.
- Temporal SCC (across frames): All spatially compressed tokens from each frame are concatenated and SCC is again applied across this set, removing temporal redundancy and yielding the final set of non-overlapping, semantically distinct tokens for the entire video.
- Token Reassignment and Finalization: Each original token $x_i$ is matched to its closest compressed token via similarity for reconstruction: $x_i \mapsto z_{k^*}$, where $k^* = \arg\max_k \mathrm{sim}(x_i, z_k)$. A code sketch of the spatial step follows this list.
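To make the spatial step concrete, the following is a minimal NumPy sketch of per-frame SCC compression. It assumes cosine similarity; the function and parameter names (`spatial_scc`, `tau=0.8`) are illustrative choices, not the paper's API.

```python
import numpy as np

def find(parent, i):
    # Path compression: point nodes directly at their root.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def union(parent, rank, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra == rb:
        return
    # Union by rank: attach the shorter tree under the taller one.
    if rank[ra] < rank[rb]:
        ra, rb = rb, ra
    parent[rb] = ra
    if rank[ra] == rank[rb]:
        rank[ra] += 1

def spatial_scc(tokens, tau=0.8):
    """Compress one frame's (N, D) token matrix into per-component means."""
    n = tokens.shape[0]
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    adj = (normed @ normed.T) >= tau          # binary adjacency A_ij

    parent, rank = list(range(n)), [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i, j]:
                union(parent, rank, i, j)

    # One representative token per connected component: the mean.
    comps = {}
    for i in range(n):
        comps.setdefault(find(parent, i), []).append(i)
    return np.stack([tokens[idx].mean(axis=0) for idx in comps.values()])
```

The temporal step reuses the same routine: the per-frame outputs are concatenated and the compression is applied once more across the whole video.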
Algorithmic Efficiency
- Exact SCC: $O(N^2 \alpha(N))$, dominated by pairwise similarity computation over all $N$ tokens;
- Approximate SCC with Sampling: $O(N M \alpha(N))$ for $M \ll N$ sampled nodes, where $\alpha(\cdot)$ is the inverse Ackermann function arising from Union-Find. Approximate connected component search subsamples nodes to accelerate decomposition while maintaining coverage.
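One plausible realization of the sampled variant is sketched below: Union-Find runs only over M randomly drawn anchor tokens, each remaining token attaches to its most similar anchor, and the similarity computation drops from N×N to N×M. The sampling scheme, names, and defaults here are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def approximate_scc(tokens, tau=0.8, num_samples=64, seed=0):
    """Approximate connected-component compression via anchor subsampling."""
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    m = min(num_samples, n)
    anchors = rng.choice(n, size=m, replace=False)

    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed[anchors].T          # (N, M) instead of (N, N)

    # Union-Find over anchors only (path compression; rank omitted for brevity).
    parent = list(range(m))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Anchors that share a sufficiently similar token are merged.
    for row in sim >= tau:
        linked = np.flatnonzero(row)
        for j in linked[1:]:
            parent[find(j)] = find(linked[0])

    # Every token (even one below tau for all anchors) joins the
    # component of its most similar anchor.
    labels = np.array([find(b) for b in sim.argmax(axis=1)])
    return np.stack([tokens[labels == c].mean(axis=0)
                     for c in np.unique(labels)])
```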
3. Empirical Performance and Benchmarking
LLaVA-Scissor is evaluated on diverse video understanding tasks, including Video Question Answering (ActivityNet-QA, VideoChatGPT, Next-QA), long video comprehension (EgoSchema, MLVU, VideoMME, VideoMMMU), and multi-choice evaluation (MVBench).
Key findings:
- Token Efficiency: LLaVA-Scissor achieves effective compression, maintaining higher accuracy at aggressive token reduction ratios (as low as 5–10%) compared to prior baselines (FastV, DyCoke, VisionZip, PLLaVA).
- Semantic Preservation: Spatial+temporal SCC minimizes performance loss even under extreme token pruning, outperforming random, uniform sampling, L2Norm, and attention-based selection in semantic recall.
- Computational Savings: Model FLOPs decrease as compression is applied post-encoding and pre-LLM, reducing inference cost especially for long or complex videos.
- Ablation Results: Only semantics-driven grouping (SCC) avoids substantial drops in performance at low token retention; spatial+temporal SCC outperforms spatial-only and attention-based methods in coverage and accuracy.
- Redundancy Law: Uniform or random sampling suffices at high retention rates, but SCC is markedly superior at lower ratios (<35%), indicating substantial redundancy in visual tokens and the stronger selectivity of SCC.
4. Applications in Video Language Modeling
LLaVA-Scissor is suitable for:
- Resource-Constrained Video QA: Reliable and efficient question answering over long or detailed video content, especially where hardware or prompt length restrictions apply.
- Long Video Understanding: Temporal reasoning, summarization, and segment-based analysis, free from the limitations imposed by token explosion.
- Edge and Mobile Deployment: Reduces memory and computation for video language applications on devices with strict resource envelopes.
- General Video Comprehension: Efficient support for classification, captioning, retrieval, and analytic tasks across a range of video analysis pipelines.
5. Comparative Context and Implications for Future Research
The SCC-based LLaVA-Scissor strategy is model-agnostic and does not require retraining or fine-tuning, enabling plug-in deployment within existing VLLM pipelines. This distinguishes it from architecture-based or trainable selection modules, which are not universally portable and often demand significant additional resources.
Notably, LLaVA-Scissor outperforms both training-free approaches (e.g., top-k attention scoring, uniform sampling) and trainable architectural solutions across all evaluated benchmarks, particularly under stringent token budgets.
Potential future directions involve exploring more refined methods for constructing token similarity graphs, optimizing threshold selection, and hybridizing SCC with dynamic attention-based or task-specific scoring functions. Further investigation into temporal component aggregation and streaming deployment for real-time applications is also suggested by these results.
6. Theoretical and Algorithmic Formulation
Central notations and algorithms:
- Token similarity adjacency: $A_{ij} = \mathbb{1}[\,\mathrm{sim}(x_i, x_j) \geq \tau\,]$; each token pair with similarity at or above the threshold $\tau$ is considered semantically linked.
- Connected components search:
Performed using the efficient Union-Find algorithm with path compression and union by rank to partition the token graph.
- Compression via component averaging: For each component $C_k$: $z_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$.
- Final compressed representation: Each token $x_i$ is mapped to its cluster's mean $z_{k^*}$, $k^* = \arg\max_k \mathrm{sim}(x_i, z_k)$, ensuring that the compressed set is both non-redundant and semantically representative. A minimal sketch of this reassignment step appears below.
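The reassignment can be sketched as follows (assuming cosine similarity; the function name `reassign_tokens` is illustrative):

```python
import numpy as np

def reassign_tokens(tokens, compressed):
    """Map each original token to its most similar compressed token.

    tokens:     (N, D) original visual tokens x_i.
    compressed: (K, D) component means z_k from SCC.
    Returns an (N, D) reconstruction where each row is the mean of the
    component that token was matched to.
    """
    tn = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    cn = compressed / np.linalg.norm(compressed, axis=1, keepdims=True)
    nearest = (tn @ cn.T).argmax(axis=1)   # k* = argmax_k sim(x_i, z_k)
    return compressed[nearest]
```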
7. Significance and Outlook
LLaVA-Scissor demonstrates that training-free, semantic-region-based token selection is an effective and robust solution to visual token redundancy in VLLMs. Its superiority at low retention ratios underscores its utility for real-world systems where compute, memory, or prompt length are primary constraints. The methodological framework invites further adaptation for other data modalities—potentially extending to audio or text sequence compression—and suggests a general paradigm for scalable, resource-efficient deep multimodal inference.
References:
LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs, arXiv:2506.21862. Key details, results, and code are available in the paper and its project repository.