Video-to-Semantic-Tokens (VST)
- Video-to-Semantic-Tokens (VST) methods are algorithmic frameworks that convert high-dimensional video data into discrete, semantically rich token representations.
- VST methods utilize techniques like vector quantization, graph-based token merging, and trajectory tracking to ensure efficient, context-aware tokenization.
- By enabling significant token reduction with minimal accuracy loss, VST enhances video processing in large language models and multimodal systems.
Video-to-Semantic-Tokens (VST) refers to the set of algorithmic frameworks and architectural modules that convert continuous, high-dimensional raw video data into compact, discrete, semantically meaningful token representations. Unlike patch- or frame-level discretization, VST methods seek to bridge the information density mismatch between visual data and language modalities, enabling memory-efficient, scalable, and interpretable video reasoning within LLMs, video transformers, and multimodal systems.
1. Core Methodologies for Video-to-Semantic-Tokens
A diverse set of approaches has been proposed for realizing VST, unified by the objective of extracting discrete or compressed semantic units from video. The principal paradigms include:
- Semantic Vector-Quantization (VQ): E-ViLM utilizes a vector-quantized tokenizer, learning a codebook to quantize encoder outputs by nearest-neighbor assignment. Discretized patch embeddings are then supervised to predict caption words, integrating high-level semantics (Fang et al., 2023).
- Graph-Theoretic Token Merging: LLaVA-Scissor adopts unsupervised, training-free graph clustering by computing cosine similarities between dense tokens, forming non-overlapping semantic connected components at both spatial and temporal levels, and averaging component tokens to produce semantic representatives (Sun et al., 27 Jun 2025).
- Trajectory-Based Grounded Tokenization: TrajViT replaces patch-based tokens with tokens grounded in temporally coherent sub-object trajectories obtained via panoptic segmentation and tracking, yielding a semantic granularity that adapts to scene complexity rather than video duration (Zheng et al., 29 May 2025).
- Discrete Quantization in Learned Semantic Spaces: LVLM-VAR processes videos with a visual encoder backbone, applies temporal self-attention, and vector-quantizes the result into a codebook, yielding a discrete semantic token sequence for downstream reasoning and explanation via LVLMs (Peng et al., 6 Sep 2025).
- Summary-Token Compression in LLMs: Video-XL interleaves special Visual Summarization Tokens (VSTs) within intervals of visual tokens, dynamically summarizes key/value representations at each transformer layer, and discards raw visual K/Vs to enable hour-scale video context within memory constraints (Shu et al., 2024).
- Semantic-Aware Query Autoencoding: SweetTok employs a Decoupled Query AutoEncoder (DQAE) with learnable spatial and temporal queries, which are quantized via a motion-enhanced language codebook constructed from LLM-embedded video captions, enforcing both spatial and temporal semantic compression (Tan et al., 2024).
The following table summarizes representative VST paradigms:
| Approach | Compression Principle | Token Semantics |
|---|---|---|
| E-ViLM (Fang et al., 2023) | VQ over patch embeddings; MVM objective | VQ codebook, caption semantics |
| LLaVA-Scissor (Sun et al., 27 Jun 2025) | SCC clustering of dense tokens | Non-overlapping visual regions |
| TrajViT (Zheng et al., 29 May 2025) | Panoptic object trajectories | Object-centric tokens |
| LVLM-VAR (Peng et al., 6 Sep 2025) | Vector-quantization of context features | Action-centric tokens |
| Video-XL (Shu et al., 2024) | Interval-wise VST summarization | Interval-wise visual summary |
| SweetTok (Tan et al., 2024) | DQAE + Language-prior codebook | Language-grounded spatial+temporal tokens |
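Among the paradigms above, the training-free graph-theoretic merging of LLaVA-Scissor is simple enough to sketch directly. The following is a minimal illustration, not the paper's implementation: the threshold `tau` and the plain DFS component search are stand-ins for the tuned thresholds and SCC procedure described in the paper.

```python
import numpy as np

def merge_tokens_by_components(tokens: np.ndarray, tau: float = 0.8) -> np.ndarray:
    """Merge dense tokens into semantic representatives via a similarity graph.

    tokens: (N, D) array of token embeddings.
    Returns: (K, D) array, one averaged representative per connected component.
    """
    # Cosine similarity between all token pairs.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T
    adj = sim >= tau  # threshold -> undirected adjacency matrix

    # Connected components via iterative DFS.
    n = len(tokens)
    labels = -np.ones(n, dtype=int)
    k = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]
        labels[start] = k
        while stack:
            i = stack.pop()
            for j in np.nonzero(adj[i])[0]:
                if labels[j] == -1:
                    labels[j] = k
                    stack.append(j)
        k += 1

    # One representative per component: the mean of its member tokens.
    return np.stack([tokens[labels == c].mean(axis=0) for c in range(k)])
```

Because the procedure is non-parametric, the retention ratio is controlled entirely by `tau`: raising it yields more, smaller components (higher retention), lowering it merges more aggressively.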
2. Semantic Token Construction and Quantization
Fundamental to VST methods is the conversion of high-dimensional video features into compact discrete representations:
- Vector-Quantized Tokenizers: The backbone encoder generates per-patch or per-segment features, which are quantized to the nearest codebook vector. E-ViLM uses a VQ codebook of size 9,420; SweetTok's codebooks are built from language-model-derived noun/verb embeddings (Fang et al., 2023, Tan et al., 2024). The VQ process serves both as compression and as an implicit semantic clustering step.
- Graph and Trajectory Assignment: Non-parametric methods (e.g., LLaVA-Scissor) compute token–token affinity graphs, extract spatial/temporal connected components, and represent each region by averaging member tokens. TrajViT instead traces panoptic object trajectories via segmentation and tracking, ensuring that each semantic token tracks an entity or sub-object across frames (Zheng et al., 29 May 2025).
- Summary Token Mechanisms: In Video-XL, summary VST tokens absorb key/value information for entire intervals, with memory-space collapse at every transformer layer, preserving critical contextual information under vast token reduction (Shu et al., 2024).
- Language-Guided Semantic Codebooks: SweetTok explicitly aligns quantizer codebooks with high-frequency caption words separated by part-of-speech (spatial: nouns/adjectives, temporal: verbs/adverbs), with a GCN projector mapping LLM-embeddings to the visual token space, enforcing cross-modal semantic alignment (Tan et al., 2024).
- Loss Objectives: Common formulations extend VQ-VAE style commitment and reconstruction losses. Examples include multi-label classification over caption words (E-ViLM), MVM cross-entropy over quantized tokens, proxy-code classification with auxiliary supervision (SweetTok), and CLIP-style contrastive learning (TrajViT).
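The nearest-neighbor codebook assignment at the heart of the VQ tokenizers above can be sketched in a few lines. Shapes and names here are illustrative; real tokenizers learn the codebook jointly with the encoder and use straight-through gradient estimation, which this sketch omits.

```python
import numpy as np

def vector_quantize(features: np.ndarray, codebook: np.ndarray):
    """Nearest-neighbor codebook assignment, the core of VQ tokenization.

    features: (N, D) continuous encoder outputs.
    codebook: (K, D) codebook vectors.
    Returns (indices, quantized, commitment_loss).
    """
    # Squared L2 distance from each feature to each codebook entry.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = d2.argmin(axis=1)        # discrete semantic token ids
    quantized = codebook[indices]      # the tokens passed downstream
    # VQ-VAE-style commitment term pulling encoder outputs toward their codes.
    commitment_loss = ((features - quantized) ** 2).mean()
    return indices, quantized, commitment_loss
```

The returned `indices` are what downstream models consume as discrete semantic tokens; the `commitment_loss` is one of the VQ-VAE-style objectives listed above.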
3. Compression, Efficiency, and Retention Analysis
VST methods are designed for orders-of-magnitude token reduction while preserving semantic completeness and fidelity:
- Retention Ratio and Thresholds: LLaVA-Scissor achieves sub-10% token retention via precision-controlled similarity thresholds in SCC clustering while maintaining accuracy on video-QA and comprehension tasks (Sun et al., 27 Jun 2025). Video-XL demonstrates strong task retention even at aggressive compression ratios on long-video understanding benchmarks (Shu et al., 2024).
- Complexity: The computational burden is dominated by pairwise similarity-graph construction (quadratic in the number of dense tokens for LLaVA-Scissor) or per-interval key/value summarization (Video-XL), but both allow massive FLOPs and memory savings in subsequent transformer layers, empirically reducing inference cost by upwards of 80%.
- Token Scaling Behavior: TrajViT's token count is invariant to video length, instead scaling with scene complexity, as object-based tokenization avoids the redundancy of patch-based spatio-temporal grids (Zheng et al., 29 May 2025). SweetTok achieves high effective compression ratios while maintaining generative fidelity in downstream tasks (Tan et al., 2024).
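The efficiency claims above follow from self-attention's quadratic scaling in sequence length: retaining a fraction r of tokens cuts pairwise-attention work by roughly r². A back-of-the-envelope helper (illustrative only; it ignores the linear-in-n MLP terms and the cost of the merging step itself):

```python
def attention_savings(n_tokens: int, retention: float) -> dict:
    """Estimate the effect of token retention on self-attention cost.

    Self-attention scales ~O(n^2), so keeping a fraction `retention`
    of tokens reduces pairwise-attention work by ~retention^2.
    """
    kept = int(n_tokens * retention)
    return {
        "tokens_kept": kept,
        "attention_cost_ratio": (kept / n_tokens) ** 2,
    }
```

At 10% retention, for example, the pairwise-attention term shrinks to about 1% of its original cost, which is why sub-10% retention translates into order-of-magnitude savings in the transformer layers downstream of the tokenizer.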
4. Integration into Large and Multimodal LLMs
VST modules are directly integrated into modern multimodal transformers and LLMs:
- Input to LVLMs: Standard practice forms a combined multimodal sequence by concatenating (or interleaving) semantic tokens with textual prompts or instructions, processed by a pretrained or LoRA-fine-tuned LVLM (e.g., Qwen2-7B, LLaVA-13B). Token embeddings are projected to match the LVLM's hidden dimension (e.g., $1024$) (Peng et al., 6 Sep 2025, Shu et al., 2024).
- Instruction Tuning: Video-XL applies joint instruction-tuning with curriculum learning over compression ratios and composite data curation (single-image/multi-image/synthetic video QA) to facilitate robust VST summarization and reasoning (Shu et al., 2024).
- Interpretable and Few-Shot Tasks: Discrete VSTs can be inversely mapped to human-parsable keywords (SweetTok), leveraged for few-shot recognition in LLMs without extra adapters. Direct code index prompting or class-ID conditioning is used in recognition or autoregressive generation downstream (Tan et al., 2024).
- Temporal and Semantic Consistency: Temporal self-attention, trajectory-based token grouping, and codebook semantic alignment encourage the resulting token streams to be both semantically interpretable and temporally stable, as seen in LVLM-VAR and TrajViT (Peng et al., 6 Sep 2025, Zheng et al., 29 May 2025).
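The integration pattern above reduces to a projection plus concatenation. The sketch below uses hypothetical shapes and a plain linear projector; real systems typically learn an MLP projector and may interleave visual and text spans rather than prepending.

```python
import numpy as np

def build_multimodal_sequence(visual_tokens: np.ndarray,
                              text_embeddings: np.ndarray,
                              projector: np.ndarray) -> np.ndarray:
    """Form the combined sequence an LVLM consumes.

    visual_tokens:   (V, Dv) semantic tokens from the video tokenizer.
    text_embeddings: (T, Dm) embedded prompt/instruction tokens.
    projector:       (Dv, Dm) linear map into the LLM's hidden size.
    Returns: (V + T, Dm) multimodal input sequence.
    """
    # Project visual semantic tokens into the LLM embedding space.
    projected = visual_tokens @ projector
    # Prepend the visual span to the text span (interleaving also possible).
    return np.concatenate([projected, text_embeddings], axis=0)
```

Because the video side has already been compressed to a few semantic tokens, V is small, and the combined sequence fits comfortably in the LLM's context window even for long videos.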
5. Quantitative Performance and Empirical Results
Empirical studies demonstrate that VST enables compact video modeling with minimal performance degradation:
- E-ViLM: Achieves competitive top-1 accuracy on MSRVTT-QA at a fraction of the model size and compute of state-of-the-art baselines, with high codebook utilization and robust generalization on retrieval/action recognition (Fang et al., 2023).
- LLaVA-Scissor: Retains near-baseline accuracy at low retention ratios across a range of video-QA and long-video understanding benchmarks, and outperforms attention-based or uniform-sampling baselines, especially under aggressive compression (Sun et al., 27 Jun 2025).
- TrajViT: Delivers consistent top-5 recall improvements on vid2txt retrieval despite substantial token reduction, along with average gains on VideoQA using VideoLLM backbones, accompanied by faster training and reduced inference vision FLOPs (Zheng et al., 29 May 2025).
- Video-XL: Retains most of the upper-bound accuracy under heavy compression on multi-hour video workloads; supports processing of roughly 2,000 frames on a single A100 (34 GB at 2,048 frames) (Shu et al., 2024).
- SweetTok: Improves reconstruction rFVD and generation gFVD on UCF-101 while using five times fewer tokens than prior methods; supports symbolic recognition tasks directly by mapping tokens to POS-filtered caption vocabulary entries (Tan et al., 2024).
6. Design Trade-offs and Future Directions
VST consolidates several major trends in efficient multimodal learning:
- Redundancy Exploitation: Token redundancy in video is substantial: little accuracy is lost even under aggressive token removal, revealing excessive spatio-temporal overlap in patch representations (Sun et al., 27 Jun 2025).
- Adaptive Granularity: Scene complexity, not video length, is the appropriate scaling dimension for semantic tokenization, as manifested most clearly in trajectory- and SCC-based approaches (Zheng et al., 29 May 2025, Sun et al., 27 Jun 2025).
- Semantic Alignment: Incorporation of linguistic priors—either through supervised caption alignment or codebook construction based on LLM token vocabularies—enables more interpretable and transferable token semantics (Tan et al., 2024).
- Downstream Flexibility: VST representations are compatible with both generative (autoregressive video synthesis) and discriminative (retrieval, VQA, recognition) workloads, and offer interpretability via explicit mapping back to sub-object, action, or linguistic entities.
Potential lines of future research include tighter coupling between semantic token formation and instruction-conditional pruning, integration with on-the-fly resampler modules for dynamic compute allocation, and domain adaptation to handle low-shot, cross-modal, or highly dynamic content.
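The language-grounded interpretability discussed above, where quantized indices map back to POS-filtered caption words as in SweetTok, can be sketched as a simple lookup. The vocabularies and the index-range split here are illustrative, not SweetTok's actual codebooks.

```python
def tokens_to_keywords(token_ids, spatial_vocab, temporal_vocab, n_spatial):
    """Map discrete code indices back to human-readable words.

    token_ids:      sequence of codebook indices.
    spatial_vocab:  words backing the spatial (noun/adjective) codes.
    temporal_vocab: words backing the temporal (verb/adverb) codes.
    n_spatial:      number of spatial codes; indices >= n_spatial are temporal.
    """
    words = []
    for idx in token_ids:
        if idx < n_spatial:
            words.append(spatial_vocab[idx])               # noun/adjective code
        else:
            words.append(temporal_vocab[idx - n_spatial])  # verb/adverb code
    return words
```

This kind of inverse mapping is what enables direct code-index prompting of LLMs for few-shot recognition without additional adapters.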
7. Representative Frameworks and Implementation Details
The following table collates salient statistics and configuration parameters for leading VST methods:
| Model | Tokenizer Principle | Codebook Size | Token Budget | Noted Benchmarks |
|---|---|---|---|---|
| E-ViLM | VQ, caption MLP decoder | 9,420 | 4 frames × patch grid | MSRVTT, TGIF, MSVD |
| LLaVA-Scissor | Graph SCC (unsupervised) | N/A | sub-10% retention | ActivityNet, NExT-QA, MVBench |
| TrajViT | Panoptic trajectories | N/A | scales with scene complexity | ActivityNet, VATEX, VideoQA 6-bench |
| Video-XL | KV interval summarization | N/A | $1/16$ of visual tokens | MLVU, VideoMME, MME, LongBench |
| SweetTok | DQAE + MLC language codebook | 1,280 | ~5× fewer than prior tokenizers | UCF-101, gFVD, few-shot LLMs |
Each approach is characterized by distinct tradeoffs in codebook granularity, supervision signals, and integration strategy, but all achieve substantial gains in context length, compute efficiency, and interpretability over legacy patch-based models.
For comprehensive architectural details, ablation studies, and pseudocode, see the cited papers: E-ViLM (Fang et al., 2023), LLaVA-Scissor (Sun et al., 27 Jun 2025), TrajViT (Zheng et al., 29 May 2025), LVLM-VAR (Peng et al., 6 Sep 2025), Video-XL (Shu et al., 2024), and SweetTok (Tan et al., 2024).