Video-to-Semantic-Tokens (VST)
- Video-to-Semantic-Tokens (VST) methods are algorithmic frameworks that convert high-dimensional video data into discrete, semantically rich token representations.
- VST methods utilize techniques like vector quantization, graph-based token merging, and trajectory tracking to ensure efficient, context-aware tokenization.
- By enabling significant token reduction with minimal accuracy loss, VST enhances video processing in large language models and multimodal systems.
Video-to-Semantic-Tokens (VST) refers to the set of algorithmic frameworks and architectural modules that convert continuous, high-dimensional raw video data into compact, discrete, semantically meaningful token representations. Unlike patch- or frame-level discretization, VST methods seek to bridge the information density mismatch between visual data and language modalities, enabling memory-efficient, scalable, and interpretable video reasoning within LLMs, video transformers, and multimodal systems.
1. Core Methodologies for Video-to-Semantic-Tokens
A diverse set of approaches has been proposed for realizing VST, unified by the objective of extracting discrete or compressed semantic units from video. The principal paradigms include:
- Semantic Vector-Quantization (VQ): E-ViLM utilizes a vector-quantized tokenizer, learning a codebook to quantize encoder outputs by nearest-neighbor assignment. Discretized patch embeddings are then supervised to predict caption words, integrating high-level semantics (Fang et al., 2023).
- Graph-Theoretic Token Merging: LLaVA-Scissor adopts unsupervised, training-free graph clustering by computing cosine similarities between dense tokens, forming non-overlapping semantic connected components at both spatial and temporal levels, and averaging component tokens to produce semantic representatives (Sun et al., 27 Jun 2025).
- Trajectory-Based Grounded Tokenization: TrajViT replaces patch-based tokens with tokens grounded in temporally coherent sub-object trajectories obtained via panoptic segmentation and tracking, yielding a semantic granularity that adapts to scene complexity rather than video duration (Zheng et al., 29 May 2025).
- Discrete Quantization in Learned Semantic Spaces: LVLM-VAR processes videos with a visual encoder backbone, applies temporal self-attention, and vector-quantizes the result into a codebook, yielding a discrete semantic token sequence for downstream reasoning and explanation via LVLMs (Peng et al., 6 Sep 2025).
- Summary-Token Compression in LLMs: Video-XL interleaves special Visual Summarization Tokens (VSTs) within intervals of visual tokens, dynamically summarizes key/value representations at each transformer layer, and discards raw visual K/Vs to enable hour-scale video context within memory constraints (Shu et al., 2024).
- Semantic-Aware Query Autoencoding: SweetTok employs a Decoupled Query AutoEncoder (DQAE) with learnable spatial and temporal queries, which are quantized via a motion-enhanced language codebook constructed from LLM-embedded video captions, enforcing both spatial and temporal semantic compression (Tan et al., 2024).
The following table summarizes representative VST paradigms:
| Approach | Compression Principle | Token Semantics |
|---|---|---|
| E-ViLM (Fang et al., 2023) | VQ over patch embeddings; MVM objective | VQ codebook, caption semantics |
| LLaVA-Scissor (Sun et al., 27 Jun 2025) | SCC clustering of dense tokens | Non-overlapping visual regions |
| TrajViT (Zheng et al., 29 May 2025) | Panoptic object trajectories | Object-centric tokens |
| LVLM-VAR (Peng et al., 6 Sep 2025) | Vector-quantization of context features | Action-centric tokens |
| Video-XL (Shu et al., 2024) | Interval-wise VST summarization | Interval-wise visual summary |
| SweetTok (Tan et al., 2024) | DQAE + Language-prior codebook | Language-grounded spatial+temporal tokens |
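Among the paradigms above, the training-free graph-theoretic merging of LLaVA-Scissor is simple enough to sketch directly. The following is a minimal illustration, not the paper's implementation: the threshold `tau` and the plain DFS component search are stand-ins for the tuned thresholds and SCC procedure described in the paper.

```python
import numpy as np

def merge_tokens_by_components(tokens: np.ndarray, tau: float = 0.8) -> np.ndarray:
    """Merge dense tokens into semantic representatives via a similarity graph.

    tokens: (N, D) array of token embeddings.
    Returns: (K, D) array, one averaged representative per connected component.
    """
    # Cosine similarity between all token pairs.
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T
    adj = sim >= tau  # threshold -> undirected adjacency matrix

    # Connected components via iterative DFS.
    n = len(tokens)
    labels = -np.ones(n, dtype=int)
    k = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]
        labels[start] = k
        while stack:
            i = stack.pop()
            for j in np.nonzero(adj[i])[0]:
                if labels[j] == -1:
                    labels[j] = k
                    stack.append(j)
        k += 1

    # One representative per component: the mean of its member tokens.
    return np.stack([tokens[labels == c].mean(axis=0) for c in range(k)])
```

Because the procedure is non-parametric, the retention ratio is controlled entirely by `tau`: raising it yields more, smaller components (higher retention), lowering it merges more aggressively.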
2. Semantic Token Construction and Quantization
Fundamental to VST methods is the conversion of high-dimensional video features into compact discrete representations:
- Vector-Quantized Tokenizers: The backbone encoder generates per-patch or per-segment features, which are quantized to the nearest codebook vector. E-ViLM uses a VQ codebook of size 9,420; SweetTok's codebooks are built from language-model-derived noun/verb embeddings (Fang et al., 2023, Tan et al., 2024). The VQ process serves both as compression and as an implicit semantic clustering step.
- Graph and Trajectory Assignment: Non-parametric methods (e.g., LLaVA-Scissor) compute token–token affinity graphs, extract spatial/temporal connected components, and represent each region by averaging member tokens. TrajViT instead traces panoptic object trajectories via segmentation and tracking, ensuring that each semantic token tracks an entity or sub-object across frames (Zheng et al., 29 May 2025).
- Summary Token Mechanisms: In Video-XL, summary VST tokens absorb key/value information for entire intervals, with memory-space collapse at every transformer layer, preserving critical contextual information under vast token reduction (Shu et al., 2024).
- Language-Guided Semantic Codebooks: SweetTok explicitly aligns quantizer codebooks with high-frequency caption words separated by part-of-speech (spatial: nouns/adjectives, temporal: verbs/adverbs), with a GCN projector mapping LLM-embeddings to the visual token space, enforcing cross-modal semantic alignment (Tan et al., 2024).
- Loss Objectives: Common formulations extend VQ-VAE style commitment and reconstruction losses. Examples include multi-label classification over caption words (E-ViLM), MVM cross-entropy over quantized tokens, proxy-code classification with auxiliary supervision (SweetTok), and CLIP-style contrastive learning (TrajViT).
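The nearest-neighbor codebook assignment at the heart of the VQ tokenizers above can be sketched in a few lines. Shapes and names here are illustrative; real tokenizers learn the codebook jointly with the encoder and use straight-through gradient estimation, which this sketch omits.

```python
import numpy as np

def vector_quantize(features: np.ndarray, codebook: np.ndarray):
    """Nearest-neighbor codebook assignment, the core of VQ tokenization.

    features: (N, D) continuous encoder outputs.
    codebook: (K, D) codebook vectors.
    Returns (indices, quantized, commitment_loss).
    """
    # Squared L2 distance from each feature to each codebook entry.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = d2.argmin(axis=1)        # discrete semantic token ids
    quantized = codebook[indices]      # the tokens passed downstream
    # VQ-VAE-style commitment term pulling encoder outputs toward their codes.
    commitment_loss = ((features - quantized) ** 2).mean()
    return indices, quantized, commitment_loss
```

The returned `indices` are what downstream models consume as discrete semantic tokens; the `commitment_loss` is one of the VQ-VAE-style objectives listed above.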
3. Compression, Efficiency, and Retention Analysis
VST methods are designed for orders-of-magnitude token reduction while preserving semantic completeness and fidelity:
- Retention Ratio and Thresholds: LLaVA-Scissor achieves sub-10% token retention via precision-controlled similarity thresholds in SCC clustering while maintaining accuracy on video-QA and comprehension tasks (Sun et al., 27 Jun 2025). Video-XL demonstrates strong task retention even at aggressive compression ratios on long-video understanding benchmarks (Shu et al., 2024).
- Complexity: The computational burden is dominated by pairwise similarity-graph construction (quadratic in the number of dense tokens for LLaVA-Scissor) or per-interval key/value summarization (Video-XL), but both allow massive FLOPs and memory savings in subsequent transformer layers, empirically reducing inference cost by upwards of 80%.
- Token Scaling Behavior: TrajViT's token count is invariant to video length, instead scaling with scene complexity, as object-based tokenization avoids the redundancy of patch-based spatio-temporal grids (Zheng et al., 29 May 2025). SweetTok achieves high effective compression ratios while maintaining generative fidelity in downstream tasks (Tan et al., 2024).
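The efficiency claims above follow from self-attention's quadratic scaling in sequence length: retaining a fraction r of tokens cuts pairwise-attention work by roughly r². A back-of-the-envelope helper (illustrative only; it ignores the linear-in-n MLP terms and the cost of the merging step itself):

```python
def attention_savings(n_tokens: int, retention: float) -> dict:
    """Estimate the effect of token retention on self-attention cost.

    Self-attention scales ~O(n^2), so keeping a fraction `retention`
    of tokens reduces pairwise-attention work by ~retention^2.
    """
    kept = int(n_tokens * retention)
    return {
        "tokens_kept": kept,
        "attention_cost_ratio": (kept / n_tokens) ** 2,
    }
```

At 10% retention, for example, the pairwise-attention term shrinks to about 1% of its original cost, which is why sub-10% retention translates into order-of-magnitude savings in the transformer layers downstream of the tokenizer.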
4. Integration into Large and Multimodal LLMs
VST modules are directly integrated into modern multimodal transformers and LLMs:
- Input to LVLMs: Standard practice forms a combined multimodal sequence by concatenating (or interleaving) semantic tokens with textual prompts or instructions, processed by a pretrained or LoRA-fine-tuned LVLM (e.g., Qwen2-7B, LLaVA-13B). Token embeddings are projected to match the LVLM's hidden dimension (e.g., $1024$) (Peng et al., 6 Sep 2025, Shu et al., 2024).
- Instruction Tuning: Video-XL applies joint instruction-tuning with curriculum learning over compression ratios and composite data curation (single-image/multi-image/synthetic video QA) to facilitate robust VST summarization and reasoning (Shu et al., 2024).
- Interpretable and Few-Shot Tasks: Discrete VSTs can be inversely mapped to human-parsable keywords (SweetTok), leveraged for few-shot recognition in LLMs without extra adapters. Direct code index prompting or class-ID conditioning is used in recognition or autoregressive generation downstream (Tan et al., 2024).
- Temporal and Semantic Consistency: Temporal self-attention, trajectory-based token grouping, and codebook semantic alignment encourage the resulting token streams to be both semantically interpretable and temporally stable, as seen in LVLM-VAR and TrajViT (Peng et al., 6 Sep 2025, Zheng et al., 29 May 2025).
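The integration pattern above reduces to a projection plus concatenation. The sketch below uses hypothetical shapes and a plain linear projector; real systems typically learn an MLP projector and may interleave visual and text spans rather than prepending.

```python
import numpy as np

def build_multimodal_sequence(visual_tokens: np.ndarray,
                              text_embeddings: np.ndarray,
                              projector: np.ndarray) -> np.ndarray:
    """Form the combined sequence an LVLM consumes.

    visual_tokens:   (V, Dv) semantic tokens from the video tokenizer.
    text_embeddings: (T, Dm) embedded prompt/instruction tokens.
    projector:       (Dv, Dm) linear map into the LLM's hidden size.
    Returns: (V + T, Dm) multimodal input sequence.
    """
    # Project visual semantic tokens into the LLM embedding space.
    projected = visual_tokens @ projector
    # Prepend the visual span to the text span (interleaving also possible).
    return np.concatenate([projected, text_embeddings], axis=0)
```

Because the video side has already been compressed to a few semantic tokens, V is small, and the combined sequence fits comfortably in the LLM's context window even for long videos.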
5. Quantitative Performance and Empirical Results
Empirical studies demonstrate that VST enables compact video modeling with minimal performance degradation:
- E-ViLM: Achieves competitive top-1 accuracy on MSRVTT-QA at a fraction of the model size and compute of state-of-the-art baselines, with high codebook utilization and robust generalization on retrieval/action recognition (Fang et al., 2023).
- LLaVA-Scissor: Retains near-baseline accuracy at low retention ratios across a range of video-QA and long-video understanding benchmarks, and outperforms attention-based or uniform-sampling baselines, especially under aggressive compression (Sun et al., 27 Jun 2025).
- TrajViT: Delivers consistent top-5 recall improvements on vid2txt retrieval despite substantial token reduction, along with average gains on VideoQA using VideoLLM backbones, accompanied by faster training and reduced inference vision FLOPs (Zheng et al., 29 May 2025).
- Video-XL: Retains most of the upper-bound accuracy under heavy compression on multi-hour video workloads; supports processing of roughly 2,000 frames on a single A100 (34 GB at 2,048 frames) (Shu et al., 2024).
- SweetTok: Improves reconstruction rFVD and generation gFVD on UCF-101 while using five times fewer tokens than prior methods; supports symbolic recognition tasks directly by mapping tokens to POS-filtered caption vocabulary entries (Tan et al., 2024).
6. Design Trade-offs and Future Directions
VST consolidates several major trends in efficient multimodal learning:
- Redundancy Exploitation: Token redundancy in video is substantial: little accuracy is lost even under aggressive token removal, revealing excessive spatio-temporal overlap in patch representations (Sun et al., 27 Jun 2025).
- Adaptive Granularity: Scene complexity, not video length, is the appropriate scaling dimension for semantic tokenization, as manifested most clearly in trajectory- and SCC-based approaches (Zheng et al., 29 May 2025, Sun et al., 27 Jun 2025).
- Semantic Alignment: Incorporation of linguistic priors—either through supervised caption alignment or codebook construction based on LLM token vocabularies—enables more interpretable and transferable token semantics (Tan et al., 2024).
- Downstream Flexibility: VST representations are compatible with both generative (autoregressive video synthesis) and discriminative (retrieval, VQA, recognition) workloads, and offer interpretability via explicit mapping back to sub-object, action, or linguistic entities.
Potential lines of future research include tighter coupling between semantic token formation and instruction-conditional pruning, integration with on-the-fly resampler modules for dynamic compute allocation, and domain adaptation to handle low-shot, cross-modal, or highly dynamic content.
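The language-grounded interpretability discussed above, where quantized indices map back to POS-filtered caption words as in SweetTok, can be sketched as a simple lookup. The vocabularies and the index-range split here are illustrative, not SweetTok's actual codebooks.

```python
def tokens_to_keywords(token_ids, spatial_vocab, temporal_vocab, n_spatial):
    """Map discrete code indices back to human-readable words.

    token_ids:      sequence of codebook indices.
    spatial_vocab:  words backing the spatial (noun/adjective) codes.
    temporal_vocab: words backing the temporal (verb/adverb) codes.
    n_spatial:      number of spatial codes; indices >= n_spatial are temporal.
    """
    words = []
    for idx in token_ids:
        if idx < n_spatial:
            words.append(spatial_vocab[idx])               # noun/adjective code
        else:
            words.append(temporal_vocab[idx - n_spatial])  # verb/adverb code
    return words
```

This kind of inverse mapping is what enables direct code-index prompting of LLMs for few-shot recognition without additional adapters.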
7. Representative Frameworks and Implementation Details
The following table collates salient statistics and configuration parameters for leading VST methods:
| Model | Tokenizer Principle | Codebook Size | Token Budget | Noted Benchmarks |
|---|---|---|---|---|
| E-ViLM | VQ, caption MLP decoder | 9,420 | 4 frames × patch grid | MSRVTT, TGIF, MSVD |
| LLaVA-Scissor | Graph SCC (unsupervised) | N/A | sub-10% retention | ActivityNet, NExT-QA, MVBench |
| TrajViT | Panoptic trajectories | N/A | scales with scene complexity | ActivityNet, VATEX, VideoQA 6-bench |
| Video-XL | KV interval summarization | N/A | $1/16$ of visual tokens | MLVU, VideoMME, MME, LongBench |
| SweetTok | DQAE + MLC language codebook | 1,280 | ~5× fewer than prior tokenizers | UCF-101, gFVD, few-shot LLMs |
Each approach is characterized by distinct tradeoffs in codebook granularity, supervision signals, and integration strategy, but all achieve substantial gains in context length, compute efficiency, and interpretability over legacy patch-based models.
For comprehensive architectural details, ablation studies, and pseudocode, see the cited papers: E-ViLM (Fang et al., 2023), LLaVA-Scissor (Sun et al., 27 Jun 2025), TrajViT (Zheng et al., 29 May 2025), LVLM-VAR (Peng et al., 6 Sep 2025), Video-XL (Shu et al., 2024), and SweetTok (Tan et al., 2024).