
Video Large Language Models

Updated 5 October 2025
  • Video Large Language Models (VLMs) are multimodal neural architectures that fuse video frames and text to support reasoning, captioning, and anomaly detection.
  • They employ methods like cross-attention fusion and progressive token compression to efficiently process temporal dynamics in long videos.
  • Applications span surveillance, transportation, and media analytics, leveraging open-vocabulary generalization and self-supervised training techniques.

Video LLMs (VLMs) are multimodal neural architectures that extend the capabilities of text-centric LLMs by tightly integrating video understanding. These systems jointly process temporally ordered visual inputs—such as frames or segments from video streams—alongside natural language, enabling open-ended reasoning, retrieval, captioning, anomaly detection, action recognition, and diverse query-driven analytics over video content. Modern VLMs synthesize advances in vision-language pretraining, efficient visual token compression, sophisticated temporal modeling, and question-aware selection, targeting both granular spatial phenomena and temporal dynamics across long-form or domain-specific videos.

1. Foundations and Model Architectures

At their core, Video LLMs comprise (1) a visual encoder that transforms frames, clips, or features (sometimes from multiple modalities, including optical flow and audio spectrograms) into high-dimensional embeddings; (2) alignment or “connector” modules that map visual representations into the token space of the LLM; and (3) an LLM, typically a parameter-rich transformer, that generates or interprets language conditioned on the visual context.

Several design patterns have emerged:

  • Early fusion (token concatenation): Visual and text tokens are concatenated at the LLM's input, enabling direct cross-modal context (Akter et al., 2 Jul 2025).
  • Late fusion (separate heads/branches): Visual and language features are independently encoded and combined via intermediate modules before reaching a reasoning head (Akter et al., 2 Jul 2025).
  • Cross-attention fusion: Language tokens attend over temporally-encoded visual features, supporting dynamic inter-modal reasoning and fine temporal alignment (Qian et al., 2022).
  • Progressive alignment: Hierarchies of token or segment-level encodings, sometimes with global and local summaries, preserve both fine and broad context (Weng et al., 4 Apr 2024).
  • Memory propagation and streaming: Historical clip-level memory representations are propagated and updated per video segment, supporting arbitrary-length input with constant memory and efficient retrieval for question answering (Qian et al., 25 May 2024).

The connector may implement a single linear projection for simplicity (e.g., Video-LLaVA baseline (Li et al., 2023)), a learned MLP, a Q-Former-like resampler, or hierarchical memory compressors.
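
The following minimal sketch illustrates a learned MLP connector combined with early-fusion token concatenation; the dimensions and tensors are illustrative assumptions rather than settings from any particular model.

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Map visual-encoder embeddings into the LLM's token embedding space."""
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (batch, num_visual_tokens, vis_dim) from a frozen visual encoder
        return self.proj(vis_tokens)  # (batch, num_visual_tokens, llm_dim)

# Early fusion: prepend the projected visual tokens to the embedded text tokens
# before the LLM. The tensors below are random placeholders; a real system would
# use the LLM's own embedding layer for the text side.
connector = MLPConnector()
visual = torch.randn(1, 256, 1024)   # 256 visual tokens for one clip
text = torch.randn(1, 32, 4096)      # 32 embedded prompt tokens
llm_input = torch.cat([connector(visual), text], dim=1)  # (1, 288, 4096)
```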

2. Temporal and Multimodal Representation

Temporal understanding is foundational for VLMs given the sequential and dynamic nature of video. Key mechanisms include:

  • Temporal Transformers: Stacked self-attention layers over per-frame or per-modality representations aggregate temporal dynamics (a minimal sketch follows this list). MOV, for instance, attaches L-layer temporal transformers to both the video and optical flow branches (Qian et al., 2022).
  • Explicit token merging: Hierarchical or bipartite matching merges similar tokens over local (segment) windows, dramatically reducing token count while concentrating salient temporal features (Weng et al., 4 Apr 2024).
  • Temporal encoders: Learnable spatio-temporal pooling or sequential Memory Modules (Token Turing Machines) reduce thousands of tokens into compact, order-preserving summaries (e.g., 32 tokens for an entire video (Ryoo et al., 21 Oct 2024)).
  • Adaptive/causal temporal attention: Causal attention in progressive encoding modules accumulates information across frames, ensuring only novel or evolving details are encoded in the current step (Yang et al., 12 Dec 2024).
  • Propagated memory: Streaming architectures maintain a condensed memory state per clip, iteratively refined as each segment is processed, with question-specific memory selection upon inference (Qian et al., 25 May 2024).
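
As a concrete illustration of the first mechanism above, the sketch below stacks a few self-attention layers over per-frame embeddings and pools them into a video-level summary; the depth, width, and 256-frame cap are assumptions, not values taken from MOV or any cited system.

```python
import torch
import torch.nn as nn

class TemporalTransformer(nn.Module):
    """Aggregate per-frame embeddings with a small stack of self-attention layers."""
    def __init__(self, dim: int = 768, depth: int = 4, heads: int = 8, max_frames: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # learned temporal position embeddings (assumes at most max_frames frames)
        self.pos = nn.Parameter(torch.zeros(1, max_frames, dim))

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, T, dim) per-frame embeddings from a frozen visual encoder
        x = frame_feats + self.pos[:, : frame_feats.size(1)]
        x = self.encoder(x)      # temporal self-attention across frames
        return x.mean(dim=1)     # pooled video-level summary, (batch, dim)
```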

Multimodal extensions—incorporating, e.g., optical flow or audio directly as input channels—have shown strong gains in both performance and generalization (Qian et al., 2022). Fusion is orchestrated via cross-attention, allowing one modality (e.g., video stream) to conditionally attend to another (optical flow/audio) and vice versa.
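
A minimal cross-attention block of this kind might look as follows; it simply lets query tokens from one modality attend over another, with dimensions chosen purely for illustration.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """One cross-attention block: queries from one modality attend over another."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, query_tokens: torch.Tensor, context_tokens: torch.Tensor) -> torch.Tensor:
        # query_tokens:   e.g. language or video-stream tokens, (batch, Nq, dim)
        # context_tokens: e.g. optical-flow or audio tokens,    (batch, Nc, dim)
        q = self.norm_q(query_tokens)
        kv = self.norm_kv(context_tokens)
        fused, _ = self.attn(q, kv, kv)
        return query_tokens + fused  # residual connection keeps the original stream
```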

3. Efficient Visual Token Compression and Query-Awareness

Efficiency is a critical concern as naive frame stacking and tokenization cause token counts and memory to scale prohibitively with longer videos or higher spatial resolution. Recent advances include:

  • Super image construction: Collapsing sequences of consecutive frames into a single N×N grid “super image” reduces the number of visual encoder passes by a factor of N^2, i.e., to 1/N^2 of the original (Nishimura et al., 2023); a small sketch of this construction follows the list.
  • PVC (Progressive Visual Token Compression): All images are standardized as (possibly static) videos, using adaptive compression (e.g., via PixelShuffle) and progressive causal attention to avoid redundancy in repeated or similar frames (Yang et al., 12 Dec 2024).
  • Query-attentive video compression: Salient features for long videos are aggregated into pseudo-image tokens using attention scores derived directly from the textual query (Wang et al., 9 Apr 2025). This ensures relevant temporal cues are retained within the token budget of the LLM.
  • Keyframe-oriented vision token pruning (KVTP): Frame-level relevance to a downstream query, predicted via a learned similarity function, dictates per-frame token retention via a soft, normalized distribution. This retains sparse but crucial visual cues with an 80% reduction in token usage and no degradation in performance (Liu et al., 13 Mar 2025).
  • Inverse transform sampling: At inference time, query-guided frame selection based on visual-text similarity outperforms uniform sampling, especially in noisy, multi-source scenarios (Liu et al., 27 Mar 2025).
  • Elastic compressors: SpecVLM adaptively selects among pruning, pooling, convolution, or resampler primitives per input, optimizing FLOPs and maintaining accuracy in speculative decoding scenarios (Huang et al., 15 Sep 2025).
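
To make the super-image idea concrete (see the first item above), the sketch below tiles groups of N×N consecutive frames into single images; it assumes the frame count is divisible by N^2 and ignores the padding and resizing a full pipeline would need.

```python
import torch

def to_super_images(frames: torch.Tensor, n: int) -> torch.Tensor:
    """Tile consecutive frames into N x N "super images".

    frames: (T, C, H, W), with T divisible by n*n (an assumption of this sketch).
    Returns (T // (n*n), C, n*H, n*W): each output packs n*n frames, so the visual
    encoder runs only 1/n^2 as many times.
    """
    t, c, h, w = frames.shape
    g = frames.view(t // (n * n), n, n, c, h, w)   # group frames into n x n grids
    g = g.permute(0, 3, 1, 4, 2, 5)                # (groups, C, n, H, n, W)
    return g.reshape(t // (n * n), c, n * h, n * w)

# Example: 16 frames at 224x224 become 4 super images at 448x448 (n=2).
supers = to_super_images(torch.randn(16, 3, 224, 224), n=2)
```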

4. Open-Vocabulary and Generalization

Open-vocabulary classification and retrieval, which require recognizing or retrieving concepts (or action/scene descriptions) not seen during training, are central to modern VLM design.

  • Pretrained image–text models (e.g., CLIP) serve as foundations, supporting zero-shot transfer via shared embedding spaces (Qian et al., 2022); a minimal zero-shot classification sketch follows this list. MOV and related schemes exploit cross-modal fusion and calibration to prevent overfitting to base classes and to better retain accuracy on novel categories (Qian et al., 2022).
  • Generalization to unseen action categories, complex composite predicates, or temporally extended/rare events has been demonstrated on curated zero-shot and long-video datasets (e.g., UCF101, HMDB51, Kinetics-700, VGGSound, MVBench) (Qian et al., 2022, Hong et al., 29 Aug 2024).
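
The sketch below shows the basic zero-shot recipe referenced in the first item above: encode sampled frames and class-name prompts with a pretrained CLIP model, pool the frame embeddings, and rank classes by cosine similarity. The checkpoint name and prompt template are common defaults, not choices taken from the cited papers.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def zero_shot_classify(frames, class_names):
    # frames: a list of PIL images sampled from the video; class_names: candidate labels
    prompts = [f"a video of {c}" for c in class_names]
    inputs = processor(text=prompts, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    video = img.mean(dim=0, keepdim=True)            # pool frame embeddings into one video embedding
    video = video / video.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (video @ txt.T).softmax(dim=-1)           # probabilities over class_names
```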

Query-attentive token selection and transformer-based aggregation enhance both base-class accuracy and novel-class transfer, particularly when fine-tuning only peripheral modules or alignment layers rather than the entire backbone (Wang et al., 9 Apr 2025).

Instruction-tuned self-training frameworks such as VideoSAVi synthesize internal questions and preference pairs, employing self-critique mechanisms (CLIP filtering, DPO loss) to enhance temporal and spatial reasoning without external human annotation (Kulkarni et al., 1 Dec 2024).

5. Training, Optimization, and Evaluation Protocols

Several distinctive training and evaluation approaches have been reported:

  • Lightweight/frozen backbone training: Only shallow modules (e.g., adapters, alignment layers, or MLPs) are tuned atop large pretrained backbone encoders, enabling rapid adaptation with moderate datasets (e.g., 10k video–text pairs) and strong performance on long video understanding benchmarks (Wang et al., 9 Apr 2025).
  • Self-aligned preference optimization: Direct Preference Optimization (DPO) with CLIP-based filtering aligns responses to visual content; training alternates between question/answer synthesis, self-evaluation, and correction (Kulkarni et al., 1 Dec 2024). The core DPO objective is sketched after this list.
  • Speculative decoding and online-logit distillation: Speedups of 2.5–2.9× are achieved by running a fast draft model to propose tokens and verifying them via a heavier target model. Online distillation aligns outputs at both the logit and intermediate feature levels (Huang et al., 15 Sep 2025).
  • Unified evaluation frameworks: Multi-task, multi-metric schemes blending LLM-based judging (e.g., GPT-3.5 as a surrogate human rater), retrieval, and classic natural language metrics assess video QA, captioning, retrieval, and action understanding (Li et al., 2023). These unified setups help set community benchmarks, e.g., MVBench, MLVU, Kinetics-700, VideoMME, VideoChatGPT-Bench.
  • Training-free and inference-optimized pipelines: Some VLMs adopt a modular, post-hoc approach to video understanding, generating and verifying dense captions with LLM/VLM ensembles and leveraging object detection as grounding (Wu et al., 22 Jul 2025).
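
The preference-optimization step mentioned above reduces, in its standard form, to the DPO objective below; the β value is a typical default, and the sequence log-probabilities are assumed to be computed elsewhere.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is the sequence log-probability of the preferred ("chosen") or
    dispreferred ("rejected") response under the trainable policy or the frozen
    reference model.
    """
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```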

6. Application Domains and Societal Impact

VLMs are deployed across a variety of domains, such as:

  • Surveillance and anomaly detection: Frame-wise captioning pipelines, prompt-driven LLM scoring, and effective token alignment modules (SETS/TETG) support rapid anomaly detection and localization without dedicated retraining (Zanella et al., 1 Apr 2024, Chen et al., 8 Aug 2025, Silva et al., 6 Jan 2025); a high-level sketch of such a pipeline follows the list.
  • Intelligent transportation systems: Crash detection leverages cross-modal fusion architectures and specialized datasets (DAD, BDD100K, CADP), with real-time reasoning and generative reporting (Akter et al., 2 Jul 2025).
  • Video analytics, retrieval, and summarization: Top-K frame retrieval systems employ synonym and discriminator augmentation, semantic-rich de-redundancy, and MAP/APS evaluation (Romero et al., 2023). Textual summaries offer searchable, minimally stored representations for exhaustive archives (Silva et al., 6 Jan 2025).
  • Content generation, summarization, and marketing: Structured narration frameworks provide time-aligned captions and factual context, facilitating downstream QA and content indexing for advertising (Wu et al., 22 Jul 2025).
  • Open-world robotics, assistive tech, science/medicine, educational multimedia: Few-shot video-instruction pairs and fine-grained reasoning extend VLM utility beyond trimmed, pre-labeled benchmarks (Li et al., 2023).
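
For the surveillance use case, a training-free caption-then-score pipeline can be sketched as follows; `caption_frame` and `score_with_llm` are hypothetical placeholders for a captioning VLM and a prompted LLM, and the prompt wording is illustrative only.

```python
# Hypothetical caption-then-score anomaly pipeline (not a specific published API).
ANOMALY_PROMPT = (
    "On a scale of 0 to 1, how anomalous is this scene for a surveillance video? "
    "Answer with a single number.\nScene: {caption}"
)

def score_video(frames, caption_frame, score_with_llm, threshold: float = 0.5):
    captions = [caption_frame(f) for f in frames]                 # frame-wise captions
    # assumes the LLM replies with a parseable number, per the prompt above
    scores = [float(score_with_llm(ANOMALY_PROMPT.format(caption=c))) for c in captions]
    # temporal localization: indices of frames whose score exceeds the threshold
    anomalous = [i for i, s in enumerate(scores) if s > threshold]
    return anomalous, scores
```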

7. Contemporary Challenges and Future Directions

Current limitations, open challenges, and promising trajectories for video LLMs include:

  • Fine-grained spatiotemporal alignment and grounding: While global or pooled representations are efficient, capturing subtle, temporally anchored details remains difficult, motivating advances like hierarchical token merging, streaming memory, effective token selection, and anomaly-aware classifiers (Weng et al., 4 Apr 2024, Qian et al., 25 May 2024, Chen et al., 8 Aug 2025).
  • Computational and memory bottlenecks: Quadratic scaling with token count necessitates continual development of more aggressive, input-adaptive compression (PVC, elastic compressors), as well as inference optimization (SpecVLM) (Yang et al., 12 Dec 2024, Huang et al., 15 Sep 2025).
  • Data scarcity and annotation: Rare events (e.g., crashes, anomalies) and cross-domain generalization require either efficient augmentation/simulation, synthetic video–text pairs, or fully self-supervised pipelines (Akter et al., 2 Jul 2025, Kulkarni et al., 1 Dec 2024).
  • Robustness, explainability, and real-time processing: Handling out-of-distribution content, minimizing hallucination, and achieving low-latency responses are active challenges (Akter et al., 2 Jul 2025, Silva et al., 6 Jan 2025, Wu et al., 22 Jul 2025).
  • Unification of image and video modalities: Joint frameworks that represent images as “static videos” and apply progressive, causal token encoding facilitate unified architectures and reduce redundancy (Yang et al., 12 Dec 2024).
  • Open-source contributions and scalable benchmarking: Repositories such as CogVLM2/GLM-4V (Hong et al., 29 Aug 2024) and standardized multi-task evaluation sets are enabling broader, reproducible progress.

A plausible implication is that continued advances in efficient token selection, self-supervised optimization, and compressive temporal modeling will catalyze deployment of VLMs for real-world analysis, retrieval, and interactive question answering over arbitrary-length and open-domain video data streams.
