Video Large Language Models Overview
- VLLMs are neural architectures that achieve unified spatio-temporal understanding by integrating visual, audio, and text modalities.
- They combine visual encoders, adapters, and large language models to achieve zero-shot and few-shot performance, with efficient token pruning and streaming mechanisms keeping inference tractable for long videos.
- Applications include advanced driver-assistance, anomaly detection, and multi-video collaborative reasoning, alongside ongoing work to improve factuality and scalability.
A Video LLM (VLLM) is a neural architecture designed for unified spatio-temporal understanding, perception, and language generation from video input. VLLMs integrate visual input (typically frame-wise image features) and often additional modalities such as audio, align them to the embedding space of an LLM, and decode open-ended text, answers, or structured outputs conditioned on both the video and user prompts. The field is characterized by rapidly advancing multimodal architectures, expanding benchmark suites, continual improvements in factual and temporal reasoning, and efforts to make inference computationally tractable even for long, complex videos.
1. Core Architectures and Design Paradigms
Most VLLMs are constructed around three key modules: a visual encoder, a visual-to-text connector (adapter), and an LLM. The visual encoder (often a frozen CLIP-ViT, ResNet, or similar) extracts features per frame or segment (Li et al., 2023, Luo et al., 2023, Khalil et al., 20 Apr 2025). The connector, such as a linear projection or Q-Former, maps these features to the LLM’s token embedding space; innovations here include cross-modal transformers, memory modules, and learnable adapters (Zhang et al., 2023, Luo et al., 2023). The LLM (e.g., Vicuna, LLaMA, Qwen) receives the visual tokens, potentially augmented by audio features or pre-computed graph tokens, and produces responses via autoregressive decoding (Zhang et al., 2023, Li et al., 2023, Khalil et al., 20 Apr 2025, He et al., 16 Sep 2025).
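As a concrete illustration, the following is a minimal PyTorch sketch of this generic three-module pipeline, assuming a frozen CLIP-style frame encoder, a simple linear connector, and a decoder-only LLM that accepts `inputs_embeds` (as Hugging Face causal LMs do); the class name, shapes, and dimensions are illustrative rather than taken from any cited system.

```python
import torch
import torch.nn as nn

class MinimalVideoLLM(nn.Module):
    """Generic encoder -> adapter -> LLM pipeline (illustrative sketch only)."""

    def __init__(self, vision_encoder, llm, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder.eval()   # frozen, e.g. a CLIP-ViT
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        self.projector = nn.Linear(vis_dim, llm_dim)  # the "connector"/adapter
        self.llm = llm                                # decoder-only LM (e.g. Vicuna-style)

    def forward(self, frames, text_embeds):
        # frames: (B, T, 3, H, W); text_embeds: (B, L, llm_dim)
        B, T = frames.shape[:2]
        with torch.no_grad():
            # assumed to return (B*T, N, vis_dim): N patch tokens per frame
            feats = self.vision_encoder(frames.flatten(0, 1))
        vis_tokens = self.projector(feats).view(B, T * feats.shape[1], -1)
        # Prepend projected visual tokens to the text embeddings and decode.
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```

Freezing the encoder and training only the connector (and, optionally, lightweight adapters in the LLM) is the common recipe behind the zero-shot behaviour described below; Q-Former-style connectors replace the linear projection with cross-attention from a small set of learnable queries.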
Temporal modeling is achieved through explicit cross-attention over temporal segments (Tan et al., 5 Apr 2024), hierarchical token merging (Weng et al., 4 Apr 2024), or specialized memory-streaming mechanisms for handling arbitrarily long videos (Qian et al., 25 May 2024). Modern approaches also address both short-term (seconds) and long-term (minutes) dependencies through hierarchical token structuring or keyframe-conditioned context propagation (Tan et al., 5 Apr 2024, Weng et al., 4 Apr 2024, Qian et al., 25 May 2024).
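A hedged sketch of one simple form of hierarchical token merging follows: adjacent frames are average-pooled level by level until a fixed token budget is met, so long videos are compressed before reaching the LLM. This is a generic illustration, not the specific merging rule of any cited model.

```python
import torch

def hierarchical_merge(frame_tokens: torch.Tensor, max_tokens: int) -> torch.Tensor:
    """Average-pool adjacent frames until the token budget is met.

    frame_tokens: (T, N, D) -- N spatial tokens per frame, T frames.
    Returns at most `max_tokens` tokens of dimension D.
    """
    tokens = frame_tokens
    while tokens.shape[0] * tokens.shape[1] > max_tokens and tokens.shape[0] > 1:
        T = tokens.shape[0] - tokens.shape[0] % 2
        pairs = tokens[:T].view(T // 2, 2, *tokens.shape[1:])
        merged = pairs.mean(dim=1)                 # merge neighbouring frames
        if tokens.shape[0] % 2:                    # carry over the odd last frame
            merged = torch.cat([merged, tokens[-1:]], dim=0)
        tokens = merged
    return tokens.flatten(0, 1)[:max_tokens]       # (<= max_tokens, D)
```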
Zero-shot and few-shot performance are central design targets, with many VLLMs trained solely on image/video-caption or instruction pairs and evaluated without task-specific fine-tuning (Li et al., 2023, Luo et al., 2023, Khalil et al., 20 Apr 2025, Weng et al., 4 Apr 2024).
2. Temporal and Multimodal Reasoning
VLLM temporal reasoning is evaluated on tasks ranging from static frame analysis to procedural or causal understanding across long video contexts (Cao et al., 24 Mar 2025). State-of-the-art models incorporate explicit temporal segmentation, memory propagation, and adaptive context selection strategies. For instance, Koala introduces two tokenizers that condition both local segment encoding and global video aggregation on learnable spatiotemporal queries, yielding significant gains in long-video QA (Tan et al., 5 Apr 2024).
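The general idea of conditioning aggregation on a small set of learnable queries can be sketched as below, in the spirit of Q-Former-style tokenizers; this is not the exact Koala formulation, and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class QueryTokenizer(nn.Module):
    """Learnable queries cross-attend over segment features to produce a small,
    fixed number of tokens per video segment (generic sketch)."""

    def __init__(self, num_queries=32, dim=768, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, segment_feats: torch.Tensor) -> torch.Tensor:
        # segment_feats: (B, S, dim) -- flattened spatio-temporal features of one segment
        B = segment_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        out, _ = self.attn(query=q, key=segment_feats, value=segment_feats)
        return self.norm(out + q)   # (B, num_queries, dim) condensed segment tokens
```

Because the output length is fixed by `num_queries`, per-segment token counts stay constant regardless of video length, which is what makes query-based aggregation attractive for long videos.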
Multimodality is further expanded by integrating audio encoders (e.g., ImageBind audio towers in Video-LLaMA) with parallel “audio Q-Formers” that project spectrogram features into the LLM space, enabling grounded audio-visual understanding (Zhang et al., 2023). Some models extend beyond vision and audio by encoding object category, spatial, and trajectory information as textual prompts to the LLM, yielding strong results on multimodal video understanding benchmarks (Ranasinghe et al., 25 Mar 2024).
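To illustrate the text-prompt route to multimodality, here is a hedged sketch of turning per-frame object detections into a textual context block for the LLM; the detection schema and wording are assumptions for illustration, not the cited method's format.

```python
def detections_to_prompt(detections):
    """Format object category, box, and trajectory info as plain text.

    detections: list of dicts like
      {"category": "car", "frame": 12, "box": (x1, y1, x2, y2)}
    Returns a string that can be prepended to the user question.
    """
    lines = []
    by_obj = {}
    for d in detections:
        by_obj.setdefault(d["category"], []).append(d)
    for cat, dets in by_obj.items():
        dets = sorted(dets, key=lambda d: d["frame"])
        first, last = dets[0], dets[-1]
        lines.append(
            f"- {cat}: visible in frames {first['frame']}-{last['frame']}, "
            f"moves from box {first['box']} to box {last['box']}"
        )
    return "Detected objects and trajectories:\n" + "\n".join(lines)
```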
Structured multi-video collaborative reasoning, in which a target video's spatio-temporal graph is fused with related videos via hierarchical frame-graph attention and cross-graph attention, has been shown to improve complex reasoning tasks with minimal increase in prompt length (He et al., 16 Sep 2025).
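The fusion step alone can be sketched as standard cross-attention in which the target video's graph tokens attend over tokens from related-video graphs; the hierarchical frame-graph design of the cited work is more involved, and this module is only an assumed simplification.

```python
import torch
import torch.nn as nn

class CrossGraphFusion(nn.Module):
    """Fuse related-video graph tokens into the target video's graph tokens
    with a single cross-attention layer (illustrative sketch)."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target_tokens, related_tokens):
        # target_tokens: (B, Nt, dim); related_tokens: (B, Nr, dim)
        fused, _ = self.attn(query=target_tokens,
                             key=related_tokens,
                             value=related_tokens)
        return self.norm(target_tokens + fused)   # residual keeps target structure
```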
3. Evaluation Methodologies and Benchmarks
Comprehensive evaluation has expanded beyond captioning and VQA to include action recognition, retrieval, segmentation, and factuality grounding, often using both “hard” metrics (e.g. accuracy, recall, F1) and human- or LLM-as-a-judge quality scores. VLM-Eval is a unified benchmark for captioning, QA, retrieval, and action recognition, providing both conventional metrics and GPT-based match scores that closely track human grading (Li et al., 2023).
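LLM-as-a-judge scoring can be sketched as below; the prompt wording and the `call_llm` helper are placeholders, not the VLM-Eval protocol itself.

```python
import json

JUDGE_PROMPT = """You are grading a video question-answering system.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Reply with a JSON object {{"match": true/false, "score": 0-5}}."""

def judge_answer(question, reference, prediction, call_llm):
    """Score one prediction with an external LLM judge.

    `call_llm` is a placeholder: any function that takes a prompt string
    and returns the judge model's text completion.
    """
    reply = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, prediction=prediction))
    try:
        verdict = json.loads(reply)
    except json.JSONDecodeError:
        verdict = {"match": False, "score": 0}   # conservative fallback
    return verdict
```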
The Video SimpleQA benchmark specifically targets factual grounding under objective, externally verified, temporally distributed queries, exposing substantial performance deficiencies in even the strongest current models (F₁ ≤ 54% for Gemini-1.5-Pro; open-source best: Qwen2.5-VL-72B, F₁ = 43.1%) and revealing overconfidence and temporal-reasoning failure modes (Cao et al., 24 Mar 2025). Retrieval-Augmented Generation (RAG) provides an absolute F₁ improvement of 7–10 points, at the cost of substantial inference-time compute (Cao et al., 24 Mar 2025).
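For readers unfamiliar with F-scores over abstaining models, a hedged sketch of one common formulation follows, assuming (SimpleQA-style) that each response is graded correct, incorrect, or not attempted; the exact Video SimpleQA formula may differ.

```python
def simpleqa_style_f1(labels):
    """labels: list of 'correct', 'incorrect', or 'not_attempted'.

    Precision = correct / attempted, Recall = correct / all questions,
    F1 = harmonic mean of the two, so abstaining is penalised less
    than answering incorrectly.
    """
    n = len(labels)
    correct = labels.count("correct")
    attempted = correct + labels.count("incorrect")
    precision = correct / attempted if attempted else 0.0
    recall = correct / n if n else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 4 correct, 3 incorrect, 3 not attempted -> P = 4/7, R = 0.4, F1 ~ 0.47
```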
Domain-specific evaluations, such as DVBench, probe safety- and reasoning-centric driving scenarios, demonstrating that VLLMs are not yet reliable for mission-critical tasks without targeted adaptation—no model exceeded 40% accuracy on the proposed GroupEval protocol before domain-specific fine-tuning (Zeng et al., 20 Apr 2025).
4. Efficiency and Scalability: Token Pruning and Streaming
Efficient inference is critical for scaling VLLMs to long videos. Several novel token pruning and streaming mechanisms have emerged (a generic sketch of attention-guided pruning follows the list):
- Attention-Debiased Token Pruning (AdaTP): Addresses the global and local biases in attention-based token selection, enabling models like LLaVA-OneVision-7B to reduce FLOPs to 27.3% of the baseline with no loss—and sometimes slight gains—in accuracy (Sun et al., 26 May 2025).
- Dynamic Compression (DyCoke): Coordinates temporal token merging across frames with dynamic per-step KV cache pruning. This ensures that only contextually relevant tokens are kept at each decoding step, yielding 1.4× speedup and up to 1.4× memory reduction, without retraining or accuracy loss (Tao et al., 22 Nov 2024).
- SHAllow-LayeR Pruning (ShaRP): Corrects for segment-local attention collapse, positional encoding bias, and token redundancy to allow extremely aggressive early-layer token pruning (up to 86% of tokens pruned with <3% accuracy drop), setting new baselines in speed-accuracy trade-off (Xia et al., 5 Dec 2025).
- Streaming Architectures: VideoStreaming and VideoLLM-online enable constant-token streaming processing for online and arbitrarily long videos, combining memory-propagated encoding and adaptive memory selection. These architectures support efficient, temporally-aligned, and real-time conversation over extended video input at modest computational cost (Qian et al., 25 May 2024, Chen et al., 17 Jun 2024).
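To make the pruning family concrete, here is a hedged sketch of the shared core idea: rank visual tokens by how much attention the text tokens pay to them and keep only the top-k. It is a generic illustration of attention-guided pruning, not the debiasing, merging, or cache logic of AdaTP, DyCoke, or ShaRP.

```python
import torch

def prune_visual_tokens(visual_tokens, attn_to_visual, keep_ratio=0.25):
    """Keep the visual tokens that receive the most attention from text tokens.

    visual_tokens:  (B, Nv, D)       projected visual tokens
    attn_to_visual: (B, H, Nt, Nv)   attention weights from Nt text tokens
                                     (over H heads) to the Nv visual tokens
    Returns pruned tokens of shape (B, k, D) plus the kept indices.
    """
    B, Nv, D = visual_tokens.shape
    k = max(1, int(Nv * keep_ratio))
    # Importance of a visual token = attention mass averaged over heads and queries.
    importance = attn_to_visual.mean(dim=(1, 2))            # (B, Nv)
    keep_idx = importance.topk(k, dim=-1).indices           # (B, k)
    keep_idx_sorted, _ = keep_idx.sort(dim=-1)               # preserve temporal order
    gathered = torch.gather(
        visual_tokens, 1, keep_idx_sorted.unsqueeze(-1).expand(-1, -1, D))
    return gathered, keep_idx_sorted
```

The methods above differ mainly in how the importance score is computed and debiased, at which layer pruning happens, and whether the pruned tokens are also evicted from the KV cache during decoding.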
5. Factuality, Answerability, and Model Limitations
Despite progress, factual adherence and uncertainty calibration are significant challenges. Video SimpleQA finds that current VLLMs are systematically overconfident, producing more incorrect than “not attempted” responses, especially for open-source models. Temporal reasoning, particularly on medium- and long-term queries, is a pronounced failure mode (F₁ drops from 63.8% on static to 42.1% on long-term) (Cao et al., 24 Mar 2025). Knowledge gaps—especially inability to map observed visual content to external world knowledge—are the dominant cause of factual error.
Recent work has introduced answerability alignment frameworks, equipping VLLMs with the ability to detect when a question is outside the informational scope of a video and to refuse to answer rather than hallucinate. Alignment is achieved by augmenting instruction data with synthetic unanswerable cases and fine-tuning via supervised cross-entropy or preference-optimization objectives, raising answerability F₁ from near zero to approximately 0.65 (Yoon et al., 7 Jul 2025).
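One way to synthesize such unanswerable cases is sketched below: questions drawn from unrelated videos are paired with a refusal target. The pairing heuristic and refusal wording are assumptions for illustration, not the cited method's exact recipe.

```python
import random

REFUSAL = "The video does not contain enough information to answer this question."

def make_answerability_data(samples, unanswerable_ratio=0.3, seed=0):
    """samples: list of dicts {"video": ..., "question": ..., "answer": ...}.

    Returns a mixed dataset in which a fraction of questions are swapped in
    from other videos and relabeled with a refusal answer, so the model
    learns when to decline rather than hallucinate.
    """
    rng = random.Random(seed)
    out = []
    for s in samples:
        others = [x for x in samples if x["video"] != s["video"]]
        if others and rng.random() < unanswerable_ratio:
            other = rng.choice(others)
            out.append({"video": s["video"],
                        "question": other["question"],   # likely off-topic here
                        "answer": REFUSAL})
        else:
            out.append(dict(s))
    rng.shuffle(out)
    return out
```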
Recommendations for improving VLLM factuality emphasize tight integration of retrieval during both pretraining and inference, learnable temporal memory banks for long-sequence reasoning, explicit confidence calibration, and comprehensive factuality evaluation that spans static, short-, and long-term temporal scopes (Cao et al., 24 Mar 2025).
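A lightweight way to operationalize the confidence-calibration recommendation is sketched below: compare the answer's length-normalized log-probability against a threshold and abstain below it. The scoring rule and threshold value are assumptions; a deployed system would tune and calibrate them on held-out data.

```python
import math

def should_answer(token_logprobs, threshold=-0.7):
    """Decide whether to emit the generated answer or output "not attempted".

    token_logprobs: per-token log-probabilities of the candidate answer.
    Returns (answer_flag, confidence); confidence is the geometric-mean
    token probability.
    """
    if not token_logprobs:
        return False, 0.0
    avg = sum(token_logprobs) / len(token_logprobs)
    return avg >= threshold, math.exp(avg)
```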
6. Applications, Domain Adaptation, and Advanced Tasks
VLLMs have demonstrated generalization to diverse domains. After minimal adaptation, standard models can support advanced driver-assistance (ADAS) and surveillance anomaly detection (Li et al., 2023, Lv et al., 11 Jan 2024): a few hundred in-domain video-instruction pairs suffice to adapt a general VLLM to new safety-critical tasks. For video anomaly detection, VLLMs with long-term context modules can not only localize anomalies but also generate detailed natural-language explanations, outperforming threshold-based methods by up to 4–5% AUC (Lv et al., 11 Jan 2024).
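Such small-scale adaptation is typically parameter-efficient; the sketch below uses LoRA adapters via the `peft` library, with model names, target modules, and hyperparameters as placeholders rather than a recipe from the cited works.

```python
# Assumes a Hugging Face-style video LLM whose language backbone exposes
# standard attention projection layers; all names here are placeholders.
from peft import LoraConfig, get_peft_model

def add_domain_adapters(model):
    """Wrap the language backbone with small LoRA adapters so that a few
    hundred in-domain instruction pairs can be fit without touching the
    frozen vision encoder or the full LLM weights."""
    config = LoraConfig(
        r=16,                                  # low-rank adapter dimension
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # typical attention projections
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, config)

# Training then proceeds with the usual supervised next-token loss over the
# few hundred (video, instruction, response) triples.
```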
Universal segmentation tasks are addressed with LLM-based systems such as HyperSeg, introducing hybrid entity recognition and fine-grained visual perceiver modules to handle open-vocabulary, referring, and complex reasoning segmentation across both images and videos (Wei et al., 26 Nov 2024).
Generalization is also demonstrated on robotics and multi-video reasoning, with multi-video collaborative frameworks leveraging spatio-temporal graphs and cross-graph attention to inject external knowledge from related videos, increasing accuracy by 1.6–3.9% on standard zero-shot QA benchmarks (He et al., 16 Sep 2025).
References:
- (Cao et al., 24 Mar 2025)
- (Khalil et al., 20 Apr 2025)
- (Li et al., 2023)
- (Tan et al., 5 Apr 2024)
- (Zhang et al., 2023)
- (Luo et al., 2023)
- (Sun et al., 26 May 2025)
- (Yoon et al., 7 Jul 2025)
- (Wei et al., 26 Nov 2024)
- (Ranasinghe et al., 25 Mar 2024)
- (Weng et al., 4 Apr 2024)
- (Zeng et al., 20 Apr 2025)
- (He et al., 16 Sep 2025)
- (Tao et al., 22 Nov 2024)
- (Qian et al., 25 May 2024)
- (Xia et al., 5 Dec 2025)
- (Chen et al., 2023)
- (Lv et al., 11 Jan 2024)
- (Chen et al., 17 Jun 2024)
- (Zhao et al., 2022)