Video MLLMs: Multimodal Video Understanding

Updated 14 December 2025
  • Video Multimodal LLMs are advanced models that fuse video, text, and audio to enable comprehensive temporal reasoning and multimodal understanding.
  • They leverage modular architectures with vision encoders, temporal processors, and modality adapters to achieve notable gains in tasks like video QA.
  • Efficient token compression and dynamic modality selection empower these models to handle long-form video data for applications in summarization, recommendation, and 3D scene analysis.

Video Multimodal LLMs (Video MLLMs) are foundation models that integrate video (spatiotemporal visual), textual, and often auditory signals to perform joint reasoning, perception, and generation across modalities. These architectures extend Multimodal LLMs (MLLMs) beyond static images or text, enabling video question answering, retrieval, summarization, temporal and spatial grounding, recommendation, open-vocabulary recognition, and 3D scene understanding. Video MLLMs combine frozen or fine-tuned vision encoders, modality adapters or token-compression modules, explicit or implicit temporal processors, and LLMs for sequence modeling, instruction following, and contextual inference. Their development and benchmarking are critically informed by the properties of multi-modal supervision, context length and granularity, modality balance, and evaluation of cross-modal integration.

1. Foundations and Motivation

Video understanding with MLLMs requires fusing spatiotemporal observation (frames, motion) with dense semantic and auditory context, surpassing the limits of static image or speech-only models. Standard MLLMs operating on images or text cannot model temporal relations such as causality, action order, and event transitions intrinsic to videos (Zhang et al., 2023). Early MLLM paradigms—BLIP-2, LLaVA, MiniGPT-4, Qwen-VL—demonstrated success on images but became inefficient or inadequate when naively stacked across video frames due to context-length bottlenecks and lack of explicit temporal modeling (Huang et al., 18 Apr 2024).

The motivation for Video MLLMs is twofold:

  • Data and Compute Efficiency: Processing all frames or modalities is computationally prohibitive, especially for long-form videos (minutes to hours). This necessitates efficient token compression, modality selection, and context management (Wang et al., 21 Jan 2025, Lan et al., 15 Oct 2024).
  • True Multimodal Reasoning: The majority of current video-centered benchmarks show strong modality bias, with most questions or tasks solvable from subtitles or short snippets, underscoring the need for models and datasets that require genuine multimodal integration (Park et al., 22 Aug 2024).

2. Core Architectures and Temporal Modeling

2.1 Modular Design and Token Flow

State-of-the-art Video MLLMs follow a modular pipeline (a minimal code sketch is given after the list):

  • Vision Encoder: Typically a large, frozen ViT variant (e.g., CLIP, EVA-CLIP, DINOv2).
  • Frame Sampling & Token Compression: Adaptive hierarchical compression (e.g., HiCo, STE, ToMe) reduces spatiotemporal token volume—crucial for long videos (Wang et al., 21 Jan 2025).
  • Temporal Module: Either implicit (LLM decoder infers temporal structure from flattened tokens) or explicit (temporal encoders—STE, memory caches, global-local pooling—directly aggregate over time) (Li et al., 28 Jan 2025, Lan et al., 15 Oct 2024).
  • Audio/Textual Streams: Unified with visual tokens prior to or during LLM context construction, often with cross-modal adapters (e.g., Q-Former, connector blocks) (Zhang et al., 2023).
  • LLM: LLaMA/Qwen-style decoder; cross-attends to multimodal token streams for generation or discrimination.
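
Below is a minimal, illustrative sketch of this token flow in PyTorch. The module sizes, the pooling-based compressor, and the single transformer layer standing in for the temporal module are assumptions for exposition, not any specific published architecture.

```python
# Minimal sketch of the modular Video MLLM token flow described above.
# All design choices here (hidden sizes, average-pooling compression, one
# transformer layer as the "temporal module") are illustrative assumptions.
import torch
import torch.nn as nn

class VideoMLLMSketch(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, tokens_per_frame=16):
        super().__init__()
        # Stand-in for a frozen ViT-style vision encoder (per-frame patch features).
        self.vision_encoder = nn.Linear(3 * 14 * 14, vis_dim)
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        # Token compression: pool each frame's patch tokens down to a few tokens.
        self.compress = nn.AdaptiveAvgPool1d(tokens_per_frame)
        # Explicit temporal module: a small transformer encoder over all frame tokens.
        self.temporal = nn.TransformerEncoderLayer(d_model=vis_dim, nhead=8,
                                                   batch_first=True)
        # Modality adapter projecting visual tokens into the LLM embedding space.
        self.projector = nn.Linear(vis_dim, llm_dim)

    def forward(self, frames):                          # (B, T, patches, 3*14*14)
        B, T, P, _ = frames.shape
        vis = self.vision_encoder(frames)               # (B, T, P, vis_dim)
        vis = vis.view(B * T, P, -1).transpose(1, 2)    # (B*T, vis_dim, P)
        vis = self.compress(vis).transpose(1, 2)        # (B*T, K, vis_dim)
        vis = vis.reshape(B, -1, vis.size(-1))          # flatten time: (B, T*K, vis_dim)
        vis = self.temporal(vis)                        # aggregate across time
        return self.projector(vis)                      # tokens for the LLM context

tokens = VideoMLLMSketch()(torch.randn(2, 8, 196, 3 * 14 * 14))
print(tokens.shape)  # torch.Size([2, 128, 4096]); prepended to the text tokens
```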

2.2 Temporal Reasoning and Compression

Explicit temporal modeling via stackable temporal encoders (STE), memory-enhanced compressors, or parameter-efficient adapters has demonstrated clear superiority over implicit schemes in Video MLLMs. STE architectures allow flexible adjustment of temporal receptive fields and compression ratios, leading to marked accuracy gains (e.g., +4.7% absolute across six video QA benchmarks) and robust performance even under severe token reduction (up to 87.5% compression) (Li et al., 28 Jan 2025). Memory-augmented pooling and multiscale transformers further capture both the short- and long-term relations required for action/event recognition and sequence grounding (Lan et al., 15 Oct 2024).
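
As a rough illustration of how stacking temporal stages trades receptive field against compression ratio, the sketch below composes attention over time with strided pooling. It is a conceptual stand-in under those assumptions, not the STE design of Li et al. (28 Jan 2025).

```python
# Hedged sketch of a "stackable" temporal encoder: each stage attends across
# time and then downsamples the temporal axis by `stride`, so stacking stages
# enlarges the receptive field while raising the overall compression ratio.
import torch
import torch.nn as nn

class TemporalStage(nn.Module):
    def __init__(self, dim: int, stride: int = 2, nhead: int = 8):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead,
                                               batch_first=True)
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)

    def forward(self, x):                                     # x: (batch, time, dim)
        x = self.attn(x)                                      # mix information across time
        return self.pool(x.transpose(1, 2)).transpose(1, 2)   # downsample the time axis

# Two stacked stages give 4x temporal compression: 32 frame tokens -> 8.
encoder = nn.Sequential(TemporalStage(512), TemporalStage(512))
print(encoder(torch.randn(1, 32, 512)).shape)  # torch.Size([1, 8, 512])
```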

3. Multimodal Integration and Modality Bias

The degree to which Video MLLMs achieve “multimodal” understanding is tightly coupled to the informativeness and balance of input modalities:

  • Modality Importance Score (MIS): A formal diagnostic that quantifies, for each modality (e.g., video, subtitles), how much its presence boosts task accuracy; a positive MIS indicates genuine added value (Park et al., 22 Aug 2024). On core video QA datasets (TVQA, LifeQA, AVQA), over 89% of questions were answerable from a single modality, and genuinely complementary (cross-modal) questions were rare (<2.4%). Permutation ablations show that current Video MLLMs often over-rely on subtitles or static frames and fail to integrate information effectively across modalities (the MIS idea is sketched after this list).
  • Dataset Design: Unimodal bias in benchmarks leads to spurious performance gains and limits progress in multimodal reasoning. MIS-derived filtering and question construction, as introduced by Park et al. (22 Aug 2024), are therefore crucial for future dataset and model development cycles.
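
A minimal sketch of the MIS idea follows, assuming only a black-box answering function and ground-truth answers; the exact formulation in Park et al. (22 Aug 2024) may differ.

```python
# Sketch of the Modality Importance Score idea: compare answer accuracy with
# and without a given modality. `answer_fn` is a hypothetical callable that
# runs the model on a question under a modality-availability mask.
from typing import Callable, Dict, List

def modality_importance(questions: List[dict],
                        answer_fn: Callable[[dict, Dict[str, bool]], str],
                        modality: str,
                        all_modalities: List[str]) -> float:
    """Positive score: the modality adds information beyond the others."""
    with_mod = {m: True for m in all_modalities}
    without_mod = {**with_mod, modality: False}
    acc_with = sum(answer_fn(q, with_mod) == q["answer"]
                   for q in questions) / len(questions)
    acc_without = sum(answer_fn(q, without_mod) == q["answer"]
                      for q in questions) / len(questions)
    return acc_with - acc_without
```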

4. Training Paradigms and Data Scaling

Video MLLMs benefit from progressive curriculum schedules, fine-grained instruction tuning, and careful data augmentation:

  • Pretraining and Instruction Tuning: Two-stage recipes are common: (1) large-scale pretraining for basic modality alignment; (2) high-quality instruction tuning on conversation-style, temporally aware datasets (Zhang et al., 2023, Wang et al., 21 Jan 2025). A minimal sketch of this recipe follows the list.
  • Text-to-Image/Frame Augmentation: Training with synthetic, temporally coherent text-to-image augmented samples bridges the modality gap and boosts data efficiency, enabling state-of-the-art performance even when using only ~15% of the real video annotation budget. This addresses the “low learning efficiency” phenomenon seen with naïve data scaling and boosts generalization to long-form video QA (Yin et al., 29 Nov 2024).
  • Reinforcement and Reward Shaping: Rule-based RL, and specifically Temporal Group Relative Policy Optimization (T-GRPO), creates an inductive bias for temporal reasoning and produces significant gains on spatial, causal, and temporal benchmarks (Feng et al., 27 Mar 2025).
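
A hedged sketch of the two-stage recipe referenced in the first bullet above; the model attributes (vision_encoder, adapter, llm) and single-epoch loops are illustrative placeholders rather than any specific training codebase.

```python
# Illustrative two-stage schedule: (1) modality alignment trains only the
# adapter; (2) instruction tuning additionally updates the LLM (often via
# LoRA in practice). The `model` and `train_one_epoch` interfaces are assumed.
def set_trainable(module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag

def two_stage_training(model, align_loader, instruct_loader, train_one_epoch):
    # Stage 1: large-scale alignment on caption-style video-text pairs.
    set_trainable(model.vision_encoder, False)
    set_trainable(model.llm, False)
    set_trainable(model.adapter, True)
    train_one_epoch(model, align_loader)

    # Stage 2: instruction tuning on conversation-style, temporally aware data.
    set_trainable(model.llm, True)
    train_one_epoch(model, instruct_loader)
```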

5. Evaluation, Benchmarks, and Failure Modes

Evaluation of Video MLLMs spans broad multi-modal QA, temporal grounding, summarization, open-vocabulary recognition, and more:

  • Comprehensive Benchmarks: Video-MME provides a full-spectrum, multi-domain testbed for both proprietary (Gemini 1.5 Pro, GPT-4o) and open-source (LLaVA-NeXT-Video, Video-LLaVA) models. Results reveal persistent gaps on long-form reasoning, arithmetic, event correlation, and modality-fusion tasks (Fu et al., 31 May 2024).
  • Temporal Grounding and Open-Vocabulary Recognition: VTG tasks require segment localization; models leveraging explicit temporal encoding, contrastive learning, and unified embeddings approach fine-tuned performance even in zero-shot settings (Wu et al., 7 Aug 2025). Open-vocabulary video emotion recognition leverages tri-modal (vision, audio, text) fusion and achieves superior recall over fixed-class baselines (Ge et al., 21 Aug 2024).
  • Attention Failures and Hallucinations: Frame selection via vision-language encoders (e.g., SigLIP, CLIP) is often brittle; encoder confidence correlates weakly with model accuracy, and naive or oracle sampling reveals large, unexploited accuracy gaps (Ok et al., 1 Sep 2025), as the baseline sketched after this list illustrates. Temporal hallucination assays (e.g., VidHalluc) show models confounding action, sequence, and scene changes; spatial saliency reweighting (DINO-HEAL) offers a lightweight, training-free mitigation (Li et al., 4 Dec 2024).
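
The brittle frame-selection baseline mentioned above can be written as query-conditioned cosine scoring over precomputed embeddings; the encoder producing the embeddings (e.g., a CLIP- or SigLIP-style model) is deliberately left abstract here.

```python
# Naive query-conditioned frame selection: score each frame embedding against
# the question embedding and keep the top-k frames (in temporal order). As
# noted above, such scores can correlate weakly with final answer accuracy.
import numpy as np

def select_frames(frame_embs: np.ndarray, text_emb: np.ndarray, k: int = 8):
    """frame_embs: (T, D) per-frame embeddings; text_emb: (D,) question embedding."""
    frame_embs = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    scores = frame_embs @ text_emb                 # cosine similarity per frame
    top_k = np.argsort(scores)[::-1][:k]           # highest-scoring frame indices
    return np.sort(top_k), scores                  # keep temporal order for the MLLM

indices, scores = select_frames(np.random.randn(64, 512), np.random.randn(512))
print(indices)
```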

6. Efficient Adaptation and Industrial Deployment

Resource efficiency and modularity are critical for scaling Video MLLMs:

  • Plug-and-Play Adaptation: Minimal-parameter plug-in modules, such as the temporal adapter in RED-VILLM, allow direct reuse of frozen image-LLM backbones and require only ~100K instruction-tuning samples, matching or surpassing the performance of heavyweight pipelines (Huang et al., 18 Apr 2024).
  • Token Compression/Pruning: Adaptive hierarchical compression (HiCo) and memory-enhanced modules permit processing of 6,000+ frames per context, supporting long-form video understanding with a manageable compute footprint (Wang et al., 21 Jan 2025, Lan et al., 15 Oct 2024); a toy compression sketch follows this list.
  • Modality Selection and Efficiency: Empirical trade-offs show that subtitles deliver strong accuracy gains at modest GPU cost, whereas full visual pipelines are optimal but expensive. Audio adds only marginal gains unless subtitles are absent or corrupted, guiding resource-aware deployment (Li et al., 14 Oct 2025).
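
A toy sketch of hierarchical token compression, assuming simple average pooling at the frame and clip levels; methods such as HiCo use more sophisticated, learned merging, so the ratios and merge rule below are placeholders only.

```python
# Two-level compression: merge patch tokens within each frame, then merge
# consecutive frames into clip-level tokens, so LLM context grows far more
# slowly than raw frame count. Pooling ratios here are arbitrary examples.
import torch

def hierarchical_compress(tokens: torch.Tensor,
                          per_frame_keep: int = 4,
                          frames_per_clip: int = 8) -> torch.Tensor:
    """tokens: (T, P, D) patch tokens -> (T // frames_per_clip, per_frame_keep, D)."""
    T, P, D = tokens.shape
    # Level 1: pool each frame's P patch tokens down to `per_frame_keep` tokens.
    frame_tokens = tokens.view(T, per_frame_keep, P // per_frame_keep, D).mean(dim=2)
    # Level 2: average groups of consecutive frames into clip-level tokens.
    return frame_tokens.view(T // frames_per_clip, frames_per_clip,
                             per_frame_keep, D).mean(dim=1)

out = hierarchical_compress(torch.randn(64, 196, 1024))
print(out.shape)  # torch.Size([8, 4, 1024]): 64*196 patch tokens -> 32 tokens
```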

7. Applications, Future Challenges, and Directions

Video MLLMs are deployed in a range of applications:

  • Video Summarization: Frame-level captioning and importance scoring by frozen LLMs, followed by attention-based global aggregation, outperform visual baselines and generate semantically richer, user-aligned video summaries (Lee et al., 15 Apr 2025); see the sketch after this list.
  • Personalized Recommendation: Injection of high-level MLLM-generated summaries into recommender pipelines yields substantial improvements in hit rates and nDCG over metadata or low-level features (Nadai et al., 13 Aug 2025).
  • 3D Scene Reasoning: Video-based MLLMs with cross-task adapters and metric depth heads enable 3D question answering, dense captioning, and spatial grounding—without dependence on explicit 3D input data (Chen et al., 29 Sep 2025).
  • Video Coding and Compression: MLLM-based cross-modality coding (CMVC) disentangles content and motion for ultra-low bitrate semantic or perceptually realistic video reconstruction, surpassing traditional codecs in several regimes (Zhang et al., 15 Aug 2024).
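
As a rough sketch of the summarization recipe in the first bullet above, the callables caption_frame and score_importance below are hypothetical stand-ins for the frozen MLLM captioner and the LLM-based importance scorer.

```python
# Toy summarization pipeline: caption every frame, have an LLM assign each
# caption an importance score in [0, 1], and keep the top-scoring frames
# (restored to temporal order) as the summary.
from typing import Callable, List, Sequence

def summarize_video(frames: Sequence,
                    caption_frame: Callable[[object], str],
                    score_importance: Callable[[List[str]], List[float]],
                    budget: int = 5) -> List[int]:
    captions = [caption_frame(f) for f in frames]        # frame-level captions
    scores = score_importance(captions)                  # LLM-judged importance
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])                       # indices of summary frames
```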

Persistent Challenges:

  • Explicit long-range temporal dependencies and hierarchical event modeling (causal, compositional).
  • Scalable evaluation benchmarks probing cross-modal and chain-of-thought reasoning.
  • Rich multi-modal pretraining data and open, high-quality video QA/grounding corpora.
  • Efficient modality scheduling, dynamic token routing, and adaptive inference for real-time or edge deployment.

Continued advances in temporal modeling, modality-balanced datasets, and resource-efficient adaptation are expected to drive the next generation of Video MLLMs across increasingly multimodal, dynamic, and contextually rich real-world tasks (Park et al., 22 Aug 2024, Wang et al., 21 Jan 2025, Li et al., 28 Jan 2025, Lan et al., 15 Oct 2024, Li et al., 14 Oct 2025, Li et al., 4 Dec 2024, Fu et al., 31 May 2024, Yin et al., 29 Nov 2024, Huang et al., 18 Apr 2024, Lee et al., 15 Apr 2025, Nadai et al., 13 Aug 2025, Chen et al., 29 Sep 2025, Ok et al., 1 Sep 2025, Zhang et al., 2023, Feng et al., 27 Mar 2025, Wu et al., 7 Aug 2025, Ge et al., 21 Aug 2024).
