
VScan: Efficient Visual Token Management

Updated 20 January 2026
  • VScan is a suite of approaches utilizing clustering, token reduction, and semantic carrier methods for efficient video understanding.
  • The framework employs dual-feature DBSCAN and a two-stage reduction process to balance token compression with high accuracy retention.
  • Applications span real-time streaming and advanced vision-language models, significantly cutting required FLOPs and GPU memory usage.

VScan encompasses a family of methods and frameworks designed for efficient and effective visual token management in video understanding and vision-language modeling. The term "VScan" refers to several distinct systems: an early clustering-based static video summarization approach using dual-feature DBSCAN (Mohamed et al., 2014), as well as recent visual token reduction frameworks for Large Vision-Language Models (LVLMs) and streaming video scenarios (Zhang et al., 28 May 2025; Li et al., 12 Mar 2025). These systems are unified by the goal of reducing redundancy or extracting informative representations from video frames, thereby improving computational efficiency without significantly sacrificing downstream performance.

1. Historical Foundations and Motivation

Static video summarization seeks to represent extended video content through a small, informative set of key-frames, enabling rapid comprehension of principal events. Early approaches, including VSCAN (Mohamed et al., 2014), addressed shortcomings in existing static summarizers such as reliance on single visual descriptors (typically color), inflexibility of partitioning clustering algorithms (requiring predefined cluster counts and failing to detect 'noise' frames), and limited evaluation metrics based solely on color feature similarity.

Later iterations of VScan, as in "VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models" (Zhang et al., 28 May 2025) and "VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers" (Li et al., 12 Mar 2025), extend these principles to the context of LVLMs, where the challenge is managing expansive visual token sequences in long or real-time video streams. The motivation is to minimize the FLOPs and GPU memory consumption arising from the quadratic scaling of transformer-based attention over visual tokens and temporal frames, while avoiding loss of the semantic fidelity critical for downstream multi-modal tasks.

2. Methodologies for Visual Token Reduction

2.1 Dual-Feature DBSCAN for Key-Frame Selection

The original VSCAN (Mohamed et al., 2014) employed a modified density-based spatial clustering (DBSCAN) that operates jointly over color and texture features. Frame descriptors consist of HSV color histograms (quantized to 256 bins) and texture summaries obtained from discrete Haar wavelet decompositions (approximation coefficients at level 3 across HSV channels). Similarity between frames is measured via the Bhattacharyya coefficient of the corresponding discrete histograms, with three thresholds: EpsColor and EpsTexture (both set to 0.97) and Eps (composite score, set to 2).

Clustering proceeds by defining a neighborhood for each frame in both the color and texture domains (S_EpsColor and S_EpsTexture), evaluating a composite similarity score(p, q), and grouping frames via indirect similarity. Core frames are those whose neighborhood cardinality meets MinPts (set to 1). Noise frames (no cluster affiliation) are discarded, and the final summary consists of one representative per ordered cluster (the "middle frame").
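The following is a minimal sketch of this clustering loop, assuming the composite score simply counts how many of the two feature-wise Bhattacharyya similarities clear their thresholds; the function names, data layout, and expansion order are illustrative rather than the exact procedure of Mohamed et al. (2014).

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalized histograms (1.0 = identical)."""
    return float(np.sum(np.sqrt(p * q)))

def vscan_keyframes(color_hists, texture_hists,
                    eps_color=0.97, eps_texture=0.97, eps=2, min_pts=1):
    """Simplified dual-feature DBSCAN: frames whose composite score (number of
    feature-wise similarities clearing their threshold) reaches `eps` are chained
    into clusters; unclustered frames are treated as noise and discarded."""
    n = len(color_hists)
    # Composite-similarity adjacency under the assumed scoring rule.
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            score = (int(bhattacharyya(color_hists[i], color_hists[j]) >= eps_color)
                     + int(bhattacharyya(texture_hists[i], texture_hists[j]) >= eps_texture))
            adj[i, j] = adj[j, i] = score >= eps

    labels = [-1] * n                      # -1 = noise / unassigned
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or adj[i].sum() < min_pts:
            continue                       # already clustered, or not a core frame
        labels[i] = cluster_id
        queue = list(np.flatnonzero(adj[i]))
        while queue:                       # expand via indirect (chained) similarity
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id
                queue.extend(np.flatnonzero(adj[j]))
        cluster_id += 1

    clusters = {}
    for idx, lab in enumerate(labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(idx)
    # One representative per cluster: the temporally middle frame.
    return [members[len(members) // 2] for _, members in sorted(clusters.items())]
```

Note that in this sketch, with MinPts = 1, any frame with at least one composite neighbor becomes a core frame, so noise is limited to frames dissimilar from every other frame.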

2.2 Two-Stage Visual Token Reduction (Modern VScan Framework)

The VScan system for LVLMs (Zhang et al., 28 May 2025) introduces a training-free, two-stage token reduction algorithm. In stage one, "global-local scan + merge" is performed within the vision encoder:

  • Global Scan: Output-layer [CLS] attention identifies globally salient tokens.
  • Local Scan: Attention scores in a shallow encoder layer partition the image into windows; top tokens per window are chosen to preserve fine-grained local details.
  • Token Merging: Dropped tokens are mapped to their most similar retained token (via cosine similarity) and averaged to form merged patch representations.

Stage two prunes merged visual tokens at a middle transformer layer in the LLM, guided by attention scores from the last instruction token. The best-performing retention rates are typically R1 ≈ 16.7–25% (encoder) and R2 ≈ 33.3% (LLM), with global-local selection mixed at a 50:50 ratio.
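Below is a minimal sketch of the stage-one selection and merging, assuming head-averaged [CLS] attention from the output layer and a shallow-layer saliency score are already available as per-token tensors; the window partitioning, tensor names, and budget handling are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def global_local_scan_merge(tokens, cls_attn, local_attn,
                            keep_ratio=0.25, num_windows=4):
    """Stage-one sketch: global-local scan followed by token merging.

    tokens:     (N, D) visual patch tokens from the vision encoder.
    cls_attn:   (N,) head-averaged [CLS]->patch attention from the output layer.
    local_attn: (N,) head-averaged saliency from a shallow encoder layer.
    """
    n, _ = tokens.shape
    budget = max(1, int(n * keep_ratio))
    g_budget = budget // 2                       # 50:50 global/local split
    l_budget = budget - g_budget

    # Global scan: most salient tokens under output-layer [CLS] attention.
    keep = set(torch.topk(cls_attn, g_budget).indices.tolist())

    # Local scan: top tokens per window under shallow-layer attention
    # (overlaps with the global picks mean the final count can drift slightly).
    per_window = max(1, l_budget // num_windows)
    for idx in torch.chunk(torch.arange(n), num_windows):
        top = torch.topk(local_attn[idx], min(per_window, len(idx))).indices
        keep.update(idx[top].tolist())

    keep_idx = torch.tensor(sorted(keep))
    drop_idx = torch.tensor([i for i in range(n) if i not in keep])

    # Token merging: fold each dropped token into its most similar retained token.
    merged = tokens[keep_idx].clone()
    if len(drop_idx) > 0:
        sim = F.cosine_similarity(tokens[drop_idx, None, :],
                                  tokens[None, keep_idx, :], dim=-1)
        nearest = sim.argmax(dim=1)              # (num_dropped,)
        counts = torch.ones(len(keep_idx), device=tokens.device)
        for src, dst in zip(drop_idx.tolist(), nearest.tolist()):
            merged[dst] += tokens[src]
            counts[dst] += 1
        merged = merged / counts[:, None]        # average retained + assigned tokens
    return merged, keep_idx
```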

2.3 Semantic Carrier Tokenization for Streaming Video

VideoScan (Li et al., 12 Mar 2025), targeting real-time scenarios, represents each frame with a single semantic carrier token. This embedding is obtained via average pooling of patch token embeddings in the visual encoder, optionally replaced by context-sensitive weighted aggregation. During inference ("prefilling" phase), carrier tokens are appended to the token sequence and their transformer key-value caches are preserved. In the "decoding" phase, only carrier tokens and instructions are attended by the LLM, preventing reprocessing and optimizing computational and memory efficiency. A bounded memory bank stores up to M carrier KV caches, evicting highly similar carriers to retain semantic diversity.
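A minimal sketch of the carrier construction, assuming per-frame patch embeddings are available from the vision encoder (the context-sensitive weighted variant is omitted):

```python
import torch

def frame_carriers(frames_patch_embeds):
    """Collapse each frame's patch embeddings (N, D) into one semantic carrier
    token (D,) by average pooling; returns a (T, D) tensor for T frames."""
    return torch.stack([embeds.mean(dim=0) for embeds in frames_patch_embeds])
```

During prefilling, only these carrier tokens (plus the instruction text) are appended to the LLM input, so the KV cache grows by one visual entry per frame rather than by the full set of patch tokens.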

3. Key Algorithms, Definitions, and Hyperparameter Choices

Modified DBSCAN (Static Summarization)

Let $D_B(P, Q) = \sum_{i=1}^{n} \sqrt{P_i Q_i}$ define the Bhattacharyya similarity between two $n$-bin histograms $P$ and $Q$. Neighborhoods and clusters are built via composite similarity scores, core-frame definitions, and indirect-similarity chaining; key-frame selection then keeps one representative per cluster, optimizing for perceptual coverage and filtering out noise frames.
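For concreteness, one scoring rule consistent with the thresholds above is

$$\mathrm{score}(p,q) = \mathbb{1}\!\left[D_B\big(H^{color}_p, H^{color}_q\big) \ge \mathrm{EpsColor}\right] + \mathbb{1}\!\left[D_B\big(H^{texture}_p, H^{texture}_q\big) \ge \mathrm{EpsTexture}\right],$$

with $q$ falling in the Eps-neighborhood of $p$ whenever $\mathrm{score}(p,q) \ge \mathrm{Eps} = 2$, i.e., when both feature-wise similarities clear their thresholds. This exact formulation is an assumption for illustration rather than a quotation from Mohamed et al. (2014).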

Two-Stage Reduction (LVLM VScan)

Let $\mathbf{x}_V = \{x_V^i\}_{i=1}^{n}$ denote the visual tokens, $x_{CLS}$ the [CLS] token, and $D$ the hidden dimension. Global and local scans leverage head-wise attention:

  • Per-head [CLS] attention: $S_{CLS}^h = \operatorname{Softmax}\big(Q_{CLS} K_V^{\top} / \sqrt{D}\big)$
  • Head-averaged score: $S_{CLS}^{avg} = \frac{1}{H} \sum_{h=1}^{H} S_{CLS}^h$
  • Token retention by thresholds $\tau_g$, window partitioning, and cosine similarity for merging.

Intermediate pruning applies attention-driven selection of visual tokens at LLM layer $k$, filtering by $S_{text}^{avg}$.
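A minimal sketch of this text-guided pruning step, assuming the layer-$k$ attention weights and the positions of the (merged) visual tokens are exposed; using the final sequence position as the last instruction token is an assumption of this sketch.

```python
import torch

def prune_visual_tokens(hidden_states, attn_weights, visual_idx, keep_ratio=1/3):
    """Stage-two sketch: drop visual tokens at an intermediate LLM layer k.

    hidden_states: (L, D) sequence hidden states at layer k.
    attn_weights:  (H, L, L) attention weights at layer k.
    visual_idx:    1-D LongTensor of positions holding the merged visual tokens.
    keep_ratio:    fraction of visual tokens to retain (R2 ~ 33.3% above).
    """
    # Head-averaged attention from the final (last-instruction) position
    # to each visual token, i.e. the S_text^avg score used for filtering.
    last_txt = attn_weights.shape[-1] - 1
    s_text = attn_weights[:, last_txt, visual_idx].mean(dim=0)   # (num_visual,)

    k = max(1, int(len(visual_idx) * keep_ratio))
    kept_visual = visual_idx[torch.topk(s_text, k).indices]

    # Retain every non-visual position plus the selected visual tokens, in order.
    mask = torch.ones(hidden_states.shape[0], dtype=torch.bool)
    mask[visual_idx] = False
    mask[kept_visual] = True
    return hidden_states[mask], mask
```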

Carrier Tokenization (VideoScan)

Let $E_t = \{v_{t,1}, \ldots, v_{t,N}\}$ denote the patch embeddings of frame $t$ and $C_t = \frac{1}{N} \sum_{i=1}^{N} v_{t,i}$ its carrier token. The memory bank retains $M$ past carrier tokens and their transformer KV caches. Carrier diversity is maintained by cosine-similarity-based eviction.
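A minimal sketch of a bounded carrier memory bank with similarity-based eviction; the rule of evicting the newer member of the most similar stored pair is an assumption, and the actual policy in Li et al. (12 Mar 2025) may differ.

```python
import torch
import torch.nn.functional as F

class CarrierMemoryBank:
    """Bounded store of carrier tokens; KV caches would be evicted alongside."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.carriers = []               # list of (D,) tensors

    def add(self, carrier: torch.Tensor) -> None:
        self.carriers.append(carrier)
        if len(self.carriers) > self.capacity:
            self._evict_most_redundant()

    def _evict_most_redundant(self) -> None:
        # Pairwise cosine similarities between stored carriers.
        stack = torch.stack(self.carriers)                       # (M+1, D)
        sims = F.cosine_similarity(stack[:, None], stack[None, :], dim=-1)
        sims.fill_diagonal_(-1.0)                                # ignore self-pairs
        # Drop the newer member of the most similar pair to keep semantic diversity.
        i, j = divmod(int(sims.argmax()), sims.shape[1])
        self.carriers.pop(max(i, j))
```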

4. Experimental Results and Benchmarks

Static Summarization (VSCAN/DBSCAN)

On 50 Open Video Project videos (1–4 min, 30 fps, 352×240) each summarized by five users, VSCAN (Mohamed et al., 2014) achieved a mean F-measure of 0.77, outperforming baselines OV (0.67), DT (0.61), STIMO (0.65), VSUMM (0.72), and DB-Color (0.74). The dual-feature DBSCAN and enhanced color+texture evaluation yielded the highest perceptual summary quality.

LVLM Visual Token Reduction

In (Zhang et al., 28 May 2025), applying VScan to LLaVA-NeXT-7B reduced FLOPs by 10× and achieved a 2.91× prefill speedup, retaining 95.4% of original performance (2,880→320 tokens). Qwen-2.5-VL-7B retained 80.7% accuracy at 25% token budget, exceeding PyramidDrop at 50.4%. Video-LLaVA-7B using 25% token budget maintained ≈100% accuracy across several video benchmarks.

Streaming Video (VideoScan)

VideoScan (Li et al., 12 Mar 2025) on LLaVA-Video-7B reduced vision-related FLOPs by ≈99%, with GPU memory consumption remaining stable at ≈18 GB regardless of video duration. In online streaming QA, VideoScan matched or exceeded ReKV in accuracy while being 1.29× faster and using 50% less VRAM.

Representative Results Table

| System | Speed | VRAM (GB) | Accuracy (%) |
| --- | --- | --- | --- |
| LLaVA-Video-7B (baseline) | 1 FPS | 36 | 63.3 |
| VideoScan [M=128] | 6 FPS | 18 | 55.1 |
| VScan (LVLM) | 2.91× prefill speedup | n/a | 95.4 (relative to baseline) |

5. Evaluation Methodologies

Static summarization is evaluated by matching automatically generated summaries to user references via Bhattacharyya thresholding (≥0.97) on color or texture, computing precision, recall, and their harmonic mean F-measure (Mohamed et al., 2014). LVLM VScan and VideoScan frameworks employ external benchmarks (MVBench, MLVU, TGIF, MSVD, MSRVTT, ActivityNet) and compare retained accuracy against baseline models at stringent token budgets (Zhang et al., 28 May 2025, Li et al., 12 Mar 2025).
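For the static-summarization protocol, a minimal sketch of the matching and F-measure computation follows, assuming greedy one-to-one matching of automatic to user key-frames whenever either feature-wise similarity reaches the 0.97 threshold; this mirrors the description above but is not the reference implementation.

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalized histograms."""
    return float(np.sum(np.sqrt(p * q)))

def summary_f_measure(auto_frames, user_frames, threshold=0.97):
    """auto_frames / user_frames: lists of (color_hist, texture_hist) pairs.
    Each automatic key-frame is greedily matched to at most one user key-frame."""
    matched_users = set()
    matches = 0
    for a_color, a_texture in auto_frames:
        for u_idx, (u_color, u_texture) in enumerate(user_frames):
            if u_idx in matched_users:
                continue
            if (bhattacharyya(a_color, u_color) >= threshold
                    or bhattacharyya(a_texture, u_texture) >= threshold):
                matched_users.add(u_idx)
                matches += 1
                break
    precision = matches / len(auto_frames) if auto_frames else 0.0
    recall = matches / len(user_frames) if user_frames else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)  # harmonic mean (F-measure)
```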

Ablation studies demonstrate sensitivity to global-local ratios, scan layers, and intermediate pruning positions, confirming that balanced token selection and mid-model attention alignment yield optimal efficiency and performance retention.

6. Practical Implications and Limitations

VScan systems enable real-time video understanding on consumer GPUs (≈6 FPS with a stable ≈18 GB VRAM footprint on an RTX 4090; Li et al., 12 Mar 2025), facilitating deployment in domains such as robotics navigation, AR glasses, and live video assistants. The token reduction approaches allow LVLMs to scale to longer video inputs and larger batch sizes without exceeding hardware constraints.

Limitations include incompatibility with pure transformer-encoder architectures (semantic carrier flow depends on autoregressive decoding paths) and reduced performance on tasks demanding pixel-level precision due to spatial detail loss in severe token compression. Future directions include hybrid carrier-local token frameworks and instruction-guided frame retrieval (with careful memory management).

7. Contextual Significance and Comparative Analysis

VScan frameworks advance visual token reduction paradigms by integrating attention-driven selection, semantic carrier representation, and density-based clustering, surpassing prior art such as ToMe, FastV, SparseVLM, PyramidDrop, and VisionZip on both efficiency and retained accuracy (Zhang et al., 28 May 2025, Li et al., 12 Mar 2025). The transition from clustering static video to semantic carrier-based streaming LVLM inference reflects the evolution of visual token management in response to demands for scalable, real-time multi-modal intelligence.

A plausible implication is that future large-scale vision-language architectures will increasingly rely on dynamic, adaptive token selection and aggregation strategies resembling those in VScan and VideoScan to mitigate the costs of high-fidelity visual processing.
