Video Multi-Modal LLMs
- Video Multi-Modal LLMs are transformer-based models that integrate video, audio, and language, enabling unified spatiotemporal analysis and cross-modal understanding.
- They employ retrofit and end-to-end architectures with techniques such as temporal attention and token compression to efficiently process long, complex video sequences.
- Key applications include video captioning, action recognition, anomaly detection, and summarization; open challenges center on scalability and efficient multimodal fusion.
Video Multi-Modal LLMs (MLLMs) are a class of neural architectures that unify natural language processing with video (and often audio and image) understanding through large-scale transformer-based models. Unlike uni-modal LLMs or even image-text MLLMs, video MLLMs must reason over long, high-complexity spatiotemporal sequences and integrate multimodal cues—including dynamic vision, language, and often audio—under strict computational constraints. This encyclopedia entry synthesizes the architectural principles, paradigms, technical advances, evaluation standards, and ongoing challenges for video MLLMs, referencing developments up to late 2025.
1. Foundational Architectures and Modalities
Video MLLMs generalize image–text MLLMs by introducing the temporal video dimension and (often) audio as additional modalities. The dominant architectural strategies fall into two broad classes: retrofit and end-to-end models (Carolan et al., 28 Mar 2024).
- Retrofit Approaches: Attach a video encoder (e.g., ViT or 3D CNN) to a frozen LLM. Each frame or short clip is embedded by the vision backbone, optionally temporally aggregated (e.g., with transformers or pooling), then projected to match the LLM embedding space. Cross-modal connectors (typically lightweight cross-attention modules such as Q-Former, as in BLIP-2 or MiniGPT-4) enable the LLM to attend to visual tokens. Vision tokens may be injected as “soft prompts” to the LLM’s input (Carolan et al., 28 Mar 2024, Zhang et al., 2023, Lyu et al., 2023).
- End-to-End Models: Jointly train all parameters from patch inputs to text outputs, flattening spatial and temporal tokens across video frames and interleaving them with linguistic tokens. Temporal self-attention and joint cross-modal heads are distributed through the transformer stack, with 2D+1D positional embeddings (spatial patches + time) (Carolan et al., 28 Mar 2024, Shi et al., 14 Apr 2025).
A general structure across both families:
- Visual Encoder: Frozen or fine-tuned ViT, CLIP, SigLIP, or UMT backbones process frames or video chunks.
- Temporal Modeling: Self-attention or modular adapters (e.g., temporal Q-Former, 3D Conv + ViT, pooled token merging, or slow-fast token pathways) capture dependencies across time (Shi et al., 2 Apr 2025, Shi et al., 14 Apr 2025).
- Fusion & Output: Cross-modal attention aligns visual tokens with the language backbone; outputs include answers, captions, summaries, or temporal spans.
Many recent models integrate audio by analogous frozen encoders (e.g., ImageBind or Whisper for audio/mel-spectrograms) and separate Q-Former adapters, with embeddings projected into LLM space for unified token-level fusion (Zhang et al., 2023, Lyu et al., 2023).
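A minimal PyTorch sketch of this connector pattern, assuming patch features from a frozen vision encoder and illustrative dimensions (module and parameter names are hypothetical, not those of BLIP-2 or any specific model):

```python
import torch
import torch.nn as nn

class QFormerStyleConnector(nn.Module):
    """Sketch of a retrofit connector: learnable queries cross-attend to frozen
    frame features and are projected into the LLM embedding space, yielding a
    fixed number of "soft prompt" tokens per video regardless of frame count."""
    def __init__(self, vis_dim=1024, llm_dim=4096, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, vis_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(vis_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, frame_feats):
        # frame_feats: (B, T*P, vis_dim) patch features from a frozen vision backbone
        q = self.queries.unsqueeze(0).expand(frame_feats.shape[0], -1, -1)
        q, _ = self.cross_attn(q, frame_feats, frame_feats)  # queries attend to visual tokens
        return self.proj(q)                                   # (B, n_queries, llm_dim) soft prompts

# Usage: the resulting soft prompts are prepended to the text embeddings of the frozen LLM.
connector = QFormerStyleConnector()
soft_prompts = connector(torch.randn(2, 8 * 256, 1024))  # e.g., 8 frames x 256 patches
print(soft_prompts.shape)                                 # torch.Size([2, 32, 4096])
```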
2. Temporal Modeling, Token Compression, and Scalability
Video MLLMs must contend with attention costs that grow quadratically in total token count (frames × patches × modalities), which quickly becomes prohibitive for long video sequences. This challenge has led to a diverse set of strategies:
- Temporal Self-Attention: Full attention across all frame-patch tokens costs O((Tn)²) for T frames of n patches each, rapidly becoming infeasible for long videos. Hierarchical pooling, sparse attention, or local windowing are therefore necessary (Carolan et al., 28 Mar 2024).
- Chunked Representations: Mavors encodes video via intra-chunk 3D convolution + ViT, then applies an inter-chunk transformer with rotary positional encoding to maintain temporal fidelity across adjacent segments, yielding state-of-the-art video captioning performance with tractable cost (Shi et al., 14 Apr 2025).
- Token Compression: Hybrid-level, instruction-injected compression (HICom) injects downstream query information to highlight instruction-relevant tokens; local (block-wise) and global (learnable token) branches conduct attention-based reduction, achieving up to 79% fewer tokens for similar or better task performance (Liu et al., 20 Mar 2025).
- Slow-Fast Token Pathways: A slow-fast design feeds a fixed budget of highly compressed “fast” tokens to self-attention for a coarse overview, while permitting linearly scalable cross-attention from text tokens to uncompressed “slow” visual tokens, thus maintaining spatial detail as temporal context grows (Shi et al., 2 Apr 2025); see the sketch after this list.
- Plug-and-Play Temporal Adapters: Methods such as RED-VILLM leverage pre-aligned image LLMs by inserting lightweight, plug-and-play temporal modules into image–token fusion, enabling efficient extension to video with minimal new parameters and rapid convergence (Huang et al., 18 Apr 2024).
- Foveated Input: GazeLLM leverages eye-tracking to crop high-resolution video inputs to gaze-centered regions, reducing pixel and token counts by 10–100× while preserving or improving comprehension versus full-frame input—particularly relevant for first-person video domains (Rekimoto, 31 Mar 2025).
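The slow-fast pathway above can be sketched as follows in PyTorch, under assumed shapes (the pooling choice and module names are illustrative, not the published architecture): heavily pooled “fast” tokens join the text in self-attention, while the text cross-attends to the full set of uncompressed “slow” visual tokens.

```python
import torch
import torch.nn as nn

class SlowFastFusion(nn.Module):
    """Sketch of a slow-fast token pathway (assumed shapes, illustrative design).

    Fast path: visual tokens are heavily pooled and concatenated with text tokens
    for ordinary self-attention (cheap, coarse temporal overview).
    Slow path: text tokens cross-attend to the full, uncompressed visual tokens,
    which scales linearly in the number of visual tokens.
    """
    def __init__(self, dim=768, n_heads=8, fast_tokens_per_frame=4):
        super().__init__()
        self.fast_pool = nn.AdaptiveAvgPool1d(fast_tokens_per_frame)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, vis, txt):
        # vis: (B, T, P, D) frame-patch tokens; txt: (B, L, D) text tokens
        B, T, P, D = vis.shape
        # Fast path: pool P patches per frame down to a handful of tokens.
        fast = self.fast_pool(vis.reshape(B * T, P, D).transpose(1, 2)).transpose(1, 2)
        seq = torch.cat([fast.reshape(B, -1, D), txt], dim=1)  # (B, T*fast + L, D)
        seq, _ = self.self_attn(seq, seq, seq)                  # joint self-attention on the short sequence
        txt_ctx = seq[:, -txt.shape[1]:]                        # updated text tokens
        # Slow path: text queries attend to all uncompressed visual tokens.
        slow = vis.reshape(B, T * P, D)
        out, _ = self.cross_attn(txt_ctx, slow, slow)           # linear in T*P for fixed text length
        return out

# 64 frames x 256 patches = 16,384 visual tokens for full self-attention, versus
# 64 x 4 = 256 fast tokens plus cross-attention to the rest in this sketch.
fusion = SlowFastFusion()
print(fusion(torch.randn(1, 64, 256, 768), torch.randn(1, 32, 768)).shape)  # torch.Size([1, 32, 768])
```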
Models also explore plug-and-play frame selection and modal “compression-before-fusion” (e.g., textual, optical flow, or action graph representations) to maintain information saliency at ultra-low bitrate or token count (Zhang et al., 15 Aug 2024).
3. Multimodal Training Paradigms, Grounding, and Instruction Tuning
Video MLLMs employ a multi-stage optimization protocol combining large-scale vision–language (and audio) alignment with instruction tuning:
- Pre-training: Contrastive video–text objectives (InfoNCE) and cross-entropy captioning losses on massive paired corpora (e.g., WebVid-2M, ActivityNet, COIN, HowTo100M) yield shared cross-modal embedding spaces (Carolan et al., 28 Mar 2024, Zhang et al., 2023, Huang et al., 18 Apr 2024); a minimal loss sketch follows this list.
- Masked/Labeled Frame Modeling: Masked patch prediction/labeling and masked frame modeling losses improve fine-grained spatial–temporal representations (Carolan et al., 28 Mar 2024). Temporal Q-Former and temporal adapters enable explicit grounding of motion.
- Instruction Tuning: Downstream task-specific data (QA, summarization, retrieval) drives instruction tuning under autoregressive cross-entropy objectives, often on multi-million-sample datasets of mixed video/QA, captioning, and reasoning pairs (Huang et al., 18 Apr 2024, Shi et al., 14 Apr 2025, Shi et al., 2 Apr 2025, Zhang et al., 2023).
- Compositional or Modular Training: RED-VILLM freezes base image and alignment modules, tuning only the temporal adapters and instruction head for sample efficiency (Huang et al., 18 Apr 2024). Macaw-LLM demonstrates unified, end-to-end alignment without explicit contrastive pre-training (Lyu et al., 2023).
- Chain-of-Thought and Reflective Chaining: For non-standard tasks (e.g., anomaly detection, multi-step event reasoning), structured prompting (chain-of-thought, taxonomy-guided rules, LLM self-reflection) substantially boosts performance over zero-shot or vanilla prompting (Zhao et al., 15 Jun 2025).
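As referenced above, a minimal sketch of the symmetric InfoNCE objective used for contrastive video–text pre-training, assuming pooled per-clip and per-caption embeddings (the function name and temperature value are illustrative):

```python
import torch
import torch.nn.functional as F

def video_text_infonce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (video, text) embeddings:
    matched pairs on the diagonal are positives, all other in-batch pairs are negatives."""
    v = F.normalize(video_emb, dim=-1)                 # (B, D)
    t = F.normalize(text_emb, dim=-1)                  # (B, D)
    logits = v @ t.T / temperature                     # (B, B) scaled cosine similarities
    targets = torch.arange(v.shape[0], device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Usage with pooled clip and caption embeddings of matching dimension.
loss = video_text_infonce(torch.randn(16, 512), torch.randn(16, 512))
```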
Temporal grounding (moment retrieval) tasks are often addressed by interleaving per-frame timestamp tokens with visual tokens; state-of-the-art models such as UniTime and Chrono adaptively manage token scaling and leverage explicit timestamped queries for precise segment localization (Meinardus et al., 26 Jun 2024, Li et al., 23 Jun 2025).
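A toy illustration of this timestamp-interleaving idea, using a hypothetical token format rather than the scheme of UniTime or Chrono:

```python
def interleave_timestamps(frame_tokens, fps=1.0):
    """Prefix each frame's visual tokens with a textual timestamp marker so the
    LLM can emit explicit start/end times when localizing a queried moment."""
    sequence = []
    for i, tokens in enumerate(frame_tokens):
        sequence.append(f"<t={i / fps:.1f}s>")  # hypothetical timestamp token
        sequence.extend(tokens)                  # placeholder visual tokens for frame i
    return sequence

# Example: 3 frames, each reduced to 2 placeholder visual tokens.
seq = interleave_timestamps([["<v0a>", "<v0b>"], ["<v1a>", "<v1b>"], ["<v2a>", "<v2b>"]])
# ['<t=0.0s>', '<v0a>', '<v0b>', '<t=1.0s>', '<v1a>', '<v1b>', '<t=2.0s>', '<v2a>', '<v2b>']
```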
4. Evaluation Benchmarks and Performance Metrics
Evaluation of video MLLMs spans a range of domains and task formulations. Representative benchmarks include:
| Benchmark | Domain(s) | Principal Metric(s) | Reference |
|---|---|---|---|
| Video-MME | General, 6 domains, 11s–1h | Multiple-choice accuracy | (Fu et al., 31 May 2024) |
| MVBench | 20 temporal video tasks | Multiple-choice accuracy | (Li et al., 2023) |
| LongVideoBench | 15s–1h, multi-modal QA | MC accuracy, recall@1 | (Li et al., 14 Oct 2025) |
| DREAM-1K | Video captioning | Generative metrics (human) | (Shi et al., 14 Apr 2025) |
| EgoSchema, QVHighlights | Egocentric, highlight | Top-1 accuracy, recall@1 | (Ranasinghe et al., 25 Mar 2024) |
| EPIC-KITCHENS-100-MQA | Egocentric action | Action classification | (Ye et al., 24 Mar 2025) |
| SmartHome-Bench | Video anomaly detection | Accuracy, F1, AUC | (Zhao et al., 15 Jun 2025) |
Multimodal integration and efficiency are assessed by comparing accuracy against memory/latency trade-offs (e.g., frames vs. subtitles vs. audio), with subtitles often yielding high accuracy at far lower resource consumption than visual tokens (Li et al., 14 Oct 2025).
Top-performing models on these benchmarks include Mavors for spatiotemporal fidelity (Shi et al., 14 Apr 2025), Slow-Fast MLLM for scalable context (Shi et al., 2 Apr 2025), and RED-VILLM for efficiency (Huang et al., 18 Apr 2024). On Video-MME, Gemini 1.5 Pro (closed-source) reaches 75.7% accuracy on frames alone, outperforming open-source models that peak near 66.2% (Fu et al., 31 May 2024).
5. Specialized Paradigms and Applications
Video Compression by MLLMs
Cross-Modality Video Coding (CMVC) reframes video compression in terms of a semantic “what” (spatial content) and a temporal “how” (motion), each represented compactly in the semantic spaces of MLLMs (Zhang et al., 15 Aug 2024). Two principal encoding–decoding modes (sketched in code after the results below):
- Text-Text-to-Video (TT2V): Both content and motion are represented as text or embeddings, yielding ultra-low bitrates that preserve semantics but sacrifice fine texture.
- Image-Text-to-Video (IT2V): Keyframes as compressed images plus motion as text, enabling higher perceptual fidelity at slightly higher bitrate. A LoRA-tuned diffusion interpolation engine enables smooth motion synthesis in IT2V.
Experimental results show 10–100× lower bitrate than pixel-level codecs (VVC/x264/x265) for comparable perceptual/semantic quality, with competitive performance across DISTS, SSIM, LPIPS, and FID (Zhang et al., 15 Aug 2024).
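A rough data-structure view of the two modes (field names and types are illustrative, not the CMVC bitstream format):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CMVCStylePayload:
    """Illustrative payload for the two encoding modes described above."""
    content_text: str                             # semantic "what": scene/content description
    motion_text: str                              # temporal "how": motion description
    keyframes_jpeg: Optional[List[bytes]] = None  # present only in IT2V mode

def encode_tt2v(content_caption: str, motion_caption: str) -> CMVCStylePayload:
    # Ultra-low bitrate: only text is transmitted; the decoder regenerates video from it.
    return CMVCStylePayload(content_caption, motion_caption)

def encode_it2v(content_caption: str, motion_caption: str, keyframes: List[bytes]) -> CMVCStylePayload:
    # Higher fidelity: compressed keyframes anchor appearance, and a diffusion model
    # interpolates motion between them at the decoder.
    return CMVCStylePayload(content_caption, motion_caption, keyframes)
```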
Action Recognition
Multi-modal video MLLMs are currently challenged by hard-negative queries with visually and semantically similar distractors (Ye et al., 24 Mar 2025). Recent advances incorporate adversarial distractor mining, auxiliary supervision on action tokens (verb/noun), explicit temporal detection heads, and memory/context from preceding actions. Fine-tuned LLaVAction-7B achieves >71% on EPIC-KITCHENS-100-MQA (hard negatives), surpassing GPT-4o by 21 points (Ye et al., 24 Mar 2025).
Summarization and Narration
VideoNarrator presents a modular, training-free pipeline for dense video captioning, using off-the-shelf MLLMs as generator, context provider, and verifier. Tightly timestamped chunk-wise summaries support narrative fidelity and serve as effective downstream cues for QA and content indexing (Wu et al., 22 Jul 2025). LLM-based Video Summarization (LLMVS) transforms frame-wise visual features into language captions, scores importance with a frozen LLM using local context, and globally aggregates these via self-attention, producing state-of-the-art semantic summaries on SumMe and TVSum (Lee et al., 15 Apr 2025).
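A simplified sketch of the local importance-scoring step in this style of pipeline, omitting the global self-attention aggregation and using a stand-in scoring function (names and the selection heuristic are illustrative, not the LLMVS implementation):

```python
import heapq
from typing import Callable, List

def summarize_by_llm_scores(
    frame_captions: List[str],
    score_fn: Callable[[List[str]], float],  # assumed: importance of the center frame given local context
    window: int = 2,
    top_k: int = 5,
) -> List[int]:
    """Score each frame caption with its local neighborhood as context and
    return the indices of the highest-scoring frames as the summary."""
    scored = []
    for i in range(len(frame_captions)):
        ctx = frame_captions[max(0, i - window): i + window + 1]
        scored.append((score_fn(ctx), i))
    return sorted(i for _, i in heapq.nlargest(top_k, scored))

# Toy usage with a stand-in scorer; a real system would query an LLM here.
captions = [f"frame {i}" for i in range(20)]
print(summarize_by_llm_scores(captions, score_fn=lambda ctx: len(" ".join(ctx)), top_k=3))
```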
Anomaly Detection and Domain Adaptation
SmartHome-Bench establishes a taxonomy-driven benchmark for anomaly detection in household video. A Taxonomy-Driven Reflective LLM Chain (TRLC) introduces self-correcting prompt chains based on explicit anomaly rules, achieving up to a 35% relative gain in accuracy over zero-shot baseline and handling ambiguous “vague abnormal” scenarios more robustly (Zhao et al., 15 Jun 2025).
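A schematic of such a self-correcting prompt chain, with illustrative prompts and an assumed text-in/text-out `llm` callable (not the exact TRLC prompts):

```python
def reflective_anomaly_chain(llm, video_summary: str, taxonomy_rules: str) -> str:
    """Two-step reflective chain: an initial rule-grounded judgment, followed by a
    self-reflection pass that checks and possibly revises that judgment."""
    initial = llm(
        f"Rules:\n{taxonomy_rules}\n\nVideo description:\n{video_summary}\n\n"
        "Classify the video as NORMAL or ABNORMAL and cite the rule that applies."
    )
    revised = llm(
        f"Rules:\n{taxonomy_rules}\n\nVideo description:\n{video_summary}\n\n"
        f"Previous answer:\n{initial}\n\n"
        "Reflect: does the cited rule really apply? If not, correct the classification."
    )
    return revised

# llm can be any text-in/text-out interface, e.g. llm = lambda prompt: client.generate(prompt)
```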
6. Current Challenges and Research Directions
Open research challenges for video MLLMs include:
- Video–Text Alignment at Scale: Maintaining fine-grained spatial–temporal grounding, especially with compressed or instruction-conditioned tokens, remains a significant hurdle (Zou et al., 27 Sep 2024, Liu et al., 20 Mar 2025, Shi et al., 2 Apr 2025).
- Long Video and Event Reasoning: Explicitly modeling hierarchical events, long-term dependencies, and between-event reasoning requires advances in memory architectures (hierarchical, memory-bank, episodic retrieval) and temporal chunking (Zou et al., 27 Sep 2024, Shi et al., 14 Apr 2025).
- Multimodal Integration: Efficient fusion of video, audio, subtitles, and structured cues (transcripts, action graphs) is crucial but under-addressed. Subtitles provide exceptional efficiency–accuracy trade-offs, yet robust audio integration remains limited in most open-source models (Li et al., 14 Oct 2025).
- Handling Hallucination and Factual Consistency: Video MLLMs are susceptible to hallucinations, context omission, and grounding errors, especially in rare or ambiguous scenarios (Zhao et al., 15 Jun 2025, Ranasinghe et al., 25 Mar 2024).
- Evaluation Practices: Gold-standard, temporally dense metrics and large-scale benchmarks for both generative and discriminative tasks are needed to track progress across realistic, unconstrained video settings (Shi et al., 14 Apr 2025, Fu et al., 31 May 2024).
- Efficiency and Scalability: Token economy remains an essential aspect—progress has been made with instruction-conditioned compression (HICom), chunked/slow-fast pathways, and foveal/gaze-based cropping, but scaling to hour-long, multi-modal streams is a continuing limitation (Liu et al., 20 Mar 2025, Rekimoto, 31 Mar 2025).
- Domain and Data Adaptation: Modular pipelines and plug-and-play adapters facilitate efficient domain specialization, but effective transfer to low-resource or highly specialized video (e.g., industrial, medical, sensor-fused) is still nascent (Huang et al., 18 Apr 2024).
7. Future Directions
Active research in video MLLMs is advancing several directions:
- Hierarchical and Adaptive Memory: Explicit modeling of frames → events → scenes → long-term context, supporting real-time streaming and live event tracking (Zou et al., 27 Sep 2024, Shi et al., 14 Apr 2025).
- Token and Compression Innovation: Adaptive, loss-aware token merging; dynamic instruction-conditioned gating; hybrid slow-fast or multi-pathway architectures (Shi et al., 2 Apr 2025, Liu et al., 20 Mar 2025).
- Multimodal, Cross-Domain Extension: Incorporating richer sensor inputs (e.g., depth, IMU, event streams), unified pipelines for text, video, audio, and vision (Lyu et al., 2023).
- Instruction and Prompt Optimization: Task-specific, chain-of-thought, and self-reflective prompting for robustness, explainability, and error correction (Zhao et al., 15 Jun 2025).
- Ethical and Safety Frameworks: Addressing privacy, bias, and misuse—including deepfake analysis and personal data—by integrating explicit policy modules and audit trails (Carolan et al., 28 Mar 2024).
- Benchmark and Dataset Expansion: Creation of hour-long, richly annotated datasets and rigorous, multi-facet evaluation protocols to push limits on context, abstraction, and generalization (Fu et al., 31 May 2024, Zou et al., 27 Sep 2024).
Video MLLMs now underpin domains as diverse as generative compression, fine-grained action understanding, video QA, long-form summarization, anomaly detection, and cross-modal event reasoning. Their evolution is being propelled by architectural ingenuity, dataset growth, and explicit focus on the computational and multimodal realities of modern video data.