Video Foundation Models
- Video foundation models are large-scale, general-purpose architectures that extract and encode spatiotemporal features from video data, enabling robust transfer across diverse tasks.
- They leverage self-supervised, cross-modal, and masked modeling strategies to learn from massive unlabelled video datasets, achieving high performance in action recognition, retrieval, and captioning.
- By integrating image, video, audio, and text modalities through transformer-based and hybrid architectures, VFMs enhance both generative and discriminative capabilities.
Video Foundation Models (VFMs) are large-scale, general-purpose models designed to acquire spatiotemporal representations from video data that can be robustly transferred across a wide range of video understanding, generative, and multimodal tasks. They extend the paradigm of foundation models pioneered in language and vision domains to encompass the unique temporal and multimodal complexities of the video modality, leveraging massive datasets and advanced self-supervised or cross-modal learning strategies.
1. Definition and Taxonomy of Video Foundation Models
VFMs are constructed to extract and encode generic, highly transferable features from videos—often without task-specific labels—so that they can serve as reusable backbones for diverse downstream tasks, including action recognition, video retrieval, temporal and spatiotemporal localization, video question answering, video captioning, generation, and more (Madan et al., 6 May 2024). The current literature categorizes VFMs into three principal types:
Category | Core Principle | Example Methods |
---|---|---|
Image-based VFMs | Adapt (inflate) pretrained image models | VideoCoCa, CLIP-ViP |
Video-based VFMs | Train directly with video architectures | VideoMAE, ST-MAE |
Universal Foundation Models | Jointly handle image, video, and other modalities | InternVideo2, Flamingo |
Image-based models exploit strong pretraining from vast image datasets; video-based models specialize in learning temporal structure de novo; universal models integrate multiple modalities, including audio and text, within a single architecture (Madan et al., 6 May 2024, Wang et al., 22 Mar 2024).
2. Training Strategies and Architectural Paradigms
Pretraining Objectives and Masking:
Pretraining VFMs generally relies on masked modeling (for example, masked video modeling as in VideoMAE or ST-MAE), cross-modal contrastive alignment (e.g., video-text, audio-video), or a hybrid of generative and discriminative objectives (Wang et al., 22 Mar 2024, Liu et al., 2023).
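Cross-modal contrastive alignment is commonly implemented as a symmetric InfoNCE loss over a batch of paired embeddings. The sketch below is a minimal, dependency-free illustration of that idea, not the objective of any specific model cited here. The function names and the temperature value of 0.07 are assumptions chosen for the example.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over paired (video_i, text_i) embeddings.

    Row i of each input is assumed to describe the same clip; every other
    row in the batch acts as a negative. The temperature is illustrative.
    """
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]
    # Cosine similarities scaled by the temperature: sim[i][j] = v_i . t_j / tau
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    sim_transposed = [list(col) for col in zip(*sim)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (_diagonal_cross_entropy(sim) + _diagonal_cross_entropy(sim_transposed))
```

The loss is small when each video embedding is closest to its own caption and grows when pairs are mismatched, which is the behavior the alignment objective rewards.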
One example is the selective masking approach of “Unmasked Teacher” (UMT): most low-semantic tokens in a video clip are masked out, and only the high-importance tokens, as determined by attention scores from a pretrained image foundation model (e.g., CLIP), are aligned with the teacher’s features using a mean squared error loss (Li et al., 2023). Because the student processes only the retained tokens, this reduces compute and focuses learning on high-level semantics.
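The selection-then-alignment idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the UMT implementation: it keeps the top-scoring tokens deterministically (the paper's exact selection rule may differ), and the names `select_unmasked`, `alignment_mse`, and the default `keep_ratio` are hypothetical.

```python
def select_unmasked(attn_scores, keep_ratio=0.2):
    """Return sorted indices of the highest-scoring tokens; all others are masked.

    attn_scores: one importance score per token, assumed to come from a
    frozen teacher's attention. keep_ratio is an illustrative value.
    """
    n_keep = max(1, int(len(attn_scores) * keep_ratio))
    ranked = sorted(range(len(attn_scores)), key=lambda i: attn_scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def alignment_mse(student_feats, teacher_feats, kept):
    """Mean squared error between student and teacher features on kept tokens.

    student_feats: one feature vector per kept token (the student never
    sees masked tokens). teacher_feats: one feature vector per token.
    """
    total, count = 0.0, 0
    for student_vec, token_idx in zip(student_feats, kept):
        for s, t in zip(student_vec, teacher_feats[token_idx]):
            total += (s - t) ** 2
            count += 1
    return total / count
```

Since the student only receives the kept tokens, its sequence length (and therefore attention cost) shrinks with `keep_ratio`, which is where the compute saving comes from.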
Backbone Architectures:
Transformer-based models dominate, often employing 3D tokenization, spatiotemporal attention, and hierarchical pooling (e.g., multi-layer attention pooling in VideoGLUE (Yuan et al., 2023) and InternVideo2 (Wang et al., 22 Mar 2024)). Alternative architectures include vector-quantized autoencoders for object-centric learning (VQ-VFM-OCL (Zhao et al., 27 Feb 2025)) and hybrid designs that integrate CNNs and Transformers.
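To make 3D tokenization concrete, the following sketch computes how many tokens a clip yields when it is cut into non-overlapping spatiotemporal tubelets, and how many remain visible under a masking ratio. The 2×16×16 tubelet size and the 0.9 masking ratio are illustrative assumptions, and the function names are hypothetical.

```python
def tubelet_grid(frames, height, width, tubelet=(2, 16, 16)):
    """Return the (time, rows, cols) token grid for non-overlapping tubelets."""
    t, p_h, p_w = tubelet
    if frames % t or height % p_h or width % p_w:
        raise ValueError("clip dimensions must be divisible by the tubelet size")
    return frames // t, height // p_h, width // p_w


def num_tokens(frames, height, width, tubelet=(2, 16, 16)):
    """Total number of tokens produced by 3D tokenization of one clip."""
    grid_t, grid_h, grid_w = tubelet_grid(frames, height, width, tubelet)
    return grid_t * grid_h * grid_w


def visible_tokens(n_tokens, mask_ratio):
    """Tokens left for the encoder after masking a fraction of them."""
    return n_tokens - int(n_tokens * mask_ratio)
```

For a 16-frame clip at 224×224 resolution, this gives an 8 × 14 × 14 grid, i.e. 1568 tokens. Because self-attention cost grows quadratically with sequence length, masking most of these tokens during pretraining cuts encoder cost substantially.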
Multimodal and Cross-modal Fusion:
Universal Foundation Models, such as InternVideo2, unify video, image, text, and audio through contrastive alignment, masked prediction, and next-token language modeling (Wang et al., 22 Mar 2024). Fusion mechanisms include joint attention, prompt-tuning, and, in some work, architecture-level merging (e.g., SAM-CLIP, which fuses CLIP and SAM into a single transformer for efficient multi-task operation (Wang et al., 2023)).
3. Evaluation Benchmarks and Methodologies
Evaluating VFMs requires capturing both appearance (static spatial patterns) and motion (temporal dynamics). Holistic frameworks, such as VideoGLUE (Yuan et al., 2023) and TWLV-I (Lee et al., 21 Aug 2024), assess core capabilities through action recognition, temporal action localization, and spatiotemporal localization.
Recent benchmarks, such as VideoEval (Li et al., 9 Jul 2024), extend standard evaluation by introducing:
- VidTAB: Measuring task adaptation under few-shot or domain-sensitive conditions.
- VidEB: Evaluating the “raw” representation power for downstream tasks without fine-tuning.
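Evaluating representations without fine-tuning typically means freezing the model, extracting embeddings, and scoring them with a training-free procedure. The sketch below shows one generic option, nearest-neighbor recall@1 under cosine similarity. It is not VidEB's protocol; the function names and the assumption that embeddings are precomputed lists of floats are choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_1(query_emb, query_labels, gallery_emb, gallery_labels):
    """Fraction of queries whose nearest gallery item shares their label.

    All embeddings are assumed to come from a frozen backbone; no
    parameters are trained during this evaluation.
    """
    hits = 0
    for query, label in zip(query_emb, query_labels):
        nearest = max(range(len(gallery_emb)), key=lambda j: cosine(query, gallery_emb[j]))
        hits += gallery_labels[nearest] == label
    return hits / len(query_emb)
```

Because nothing is trained, differences in this score reflect the embedding space itself rather than the quality of a downstream head or the fine-tuning recipe.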
AVA-Bench (Mai et al., 10 Jun 2025) evaluates “atomic visual abilities” (AVAs), such as localization, counting, orientation, depth, and object and scene categorization, by disentangling evaluation into 14 primitive skills. This enables ability fingerprinting and a principled choice of model based on task demands.
Zero-shot transfer, robustness against perturbations (Schiappa et al., 2023), and efficiency metrics—such as the VideoGLUE Score (VGS), which incorporates both accuracy and computational cost—are also critical for model comparison (Yuan et al., 2023).
4. Model Performance and Empirical Observations
Empirical studies and large-scale surveys have yielded several insights:
- Image-based foundation models, when adapted with techniques like adapters or prompt-tuning, can outperform video-native models on many standard video tasks, likely benefiting from large-scale image pretraining (Madan et al., 6 May 2024).
- Video-based models maintain an advantage on motion-intensive or temporally complex benchmarks (e.g., SSv2, Charades, ActivityNet), and they perform better than image-based models when adaptation budgets are tightly constrained (Yuan et al., 2023).
- Universal or multimodal models (InternVideo2, Flamingo) achieve the strongest results in semantic-heavy tasks (retrieval, QA, captioning) due to well-aligned multimodal representations (Wang et al., 22 Mar 2024, Madan et al., 6 May 2024).
- Advanced training strategies combining masked modeling, contrastive objectives, and next-token language prediction (e.g., InternVideo2) result in significant gains in both task generalization and sample efficiency (Wang et al., 22 Mar 2024).
- Selective masking and teacher-student distillation, as in UMT, offer rapid convergence and a favorable trade-off between compute and task transfer, with ViT-L/16 achieving 90.6% top-1 accuracy on Kinetics-400 after 6 days of pretraining on 32 A100 GPUs (Li et al., 2023).
- Plug-and-play approaches for multi-modal video summarization that avoid dense embedding alignment (by directly passing concatenated unimodal textual outputs into an LLM) can match or surpass fully fine-tuned baselines at a fraction of the computational cost (Samel et al., 9 Oct 2024).
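The plug-and-play idea in the last item, passing concatenated unimodal text outputs to an LLM instead of aligning dense embeddings, reduces largely to prompt assembly. The following is a minimal sketch under stated assumptions: the captions and transcript are presumed to come from separate off-the-shelf models, the prompt layout is invented for illustration, and the LLM call itself is deliberately omitted.

```python
def build_summary_prompt(frame_captions, transcript,
                         instruction="Summarize the video in two sentences."):
    """Concatenate unimodal text outputs into a single LLM prompt.

    frame_captions: list of (timestamp_seconds, caption) pairs, assumed to
    be produced by an image or video captioner.
    transcript: speech-to-text output for the same clip (may be empty).
    """
    lines = [instruction, "", "Visual descriptions:"]
    lines += [f"[{timestamp:.1f}s] {caption}" for timestamp, caption in frame_captions]
    lines += ["", "Speech transcript:", transcript.strip() or "(no speech detected)"]
    return "\n".join(lines)
```

Since every component communicates through text, any captioner, speech recognizer, or LLM can be swapped without retraining, which is what keeps the computational cost low.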
5. Robustness, Generalization, and Limitations
Systematic analyses have underscored several shortcomings:
- VFMs exhibit susceptibility to real-world perturbations such as compression artifacts and blur, with significant performance drops in both image and video segmentation tasks. Multimodal models sometimes offer greater robustness in zero-shot scenarios, but trade-offs with base performance exist (Schiappa et al., 2023).
- Many models still underperform on cross-view fine-grained action recognition or fail to generalize under domain shift, especially in settings involving egocentric perspectives or industrial tasks (Ponbagavathi et al., 22 Jul 2024, Wu et al., 28 Jul 2024). Temporal fusion strategies, such as attention-based pooling across tokens, can mitigate but not fully resolve such issues.
- VideoEval (Li et al., 9 Jul 2024) demonstrates that simply scaling up video data does not guarantee improved generalization; combining multiple pretraining paradigms is more effective for adaptation.
- Even with open-vocabulary and zero-shot capabilities, current VFMs' generalization to previously unseen domains and tasks remains limited, especially in scenarios where only a few annotations are available. Addressing these limitations requires more flexible adaptation techniques and enhanced architectural fusion.
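Attention-based temporal fusion, mentioned in the list above as a partial mitigation, replaces uniform averaging of per-frame features with a learned weighting. The sketch below shows single-query attention pooling in plain Python; it is a generic illustration rather than the method of any cited paper, and the `query` vector stands in for a parameter that would be learned in practice.

```python
import math


def attention_pool(frame_feats, query):
    """Pool per-frame features into one vector using softmax attention.

    frame_feats: list of feature vectors, one per frame (or token).
    query: vector of the same dimension; assumed to be learned in practice.
    Returns (pooled_vector, attention_weights).
    """
    dim = len(query)
    # Scaled dot-product score of each frame against the query.
    scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim) for feat in frame_feats]
    score_max = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - score_max) for s in scores]
    normalizer = sum(exps)
    weights = [e / normalizer for e in exps]
    # Weighted sum of frame features.
    pooled = [sum(w * feat[k] for w, feat in zip(weights, frame_feats)) for k in range(dim)]
    return pooled, weights
```

With an all-zero query the weights are uniform and the result equals mean pooling; a trained query can instead emphasize the frames that carry the discriminative motion, which is why this form of fusion helps on fine-grained actions.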
6. Applications and Advanced Capabilities
VFMs enable a wide array of applications, including:
- Action and activity recognition, temporal and spatiotemporal localization, video-text retrieval, captioning, and dialogue (Wang et al., 22 Mar 2024).
- Object-centric analysis and instance segmentation for complex scenes, with VQ-VFM-OCL demonstrating improved object discovery, recognition, and downstream prediction due to shared, quantized representations (Zhao et al., 27 Feb 2025).
- Efficient, open-ended video summarization and multimodal instruction following using few-shot or greedy prompt optimization, exploiting plug-and-play model components (Samel et al., 9 Oct 2024).
- Robust semantic video compression frameworks, in which off-the-shelf VFM features are used to align compressed representations so that task-relevant quality is preserved at reduced bitrates (Tian et al., 18 Sep 2024).
- Video event reasoning and causal prediction, facilitated by advanced fusion architectures combining VFM perception with world-knowledge reasoning from LLMs (Dubois et al., 8 Jul 2025).
- Compositional video generation aligned to LLM-derived spatial–temporal layouts, supported by test-time optimization and parametric memory for compositional knowledge transfer (Qu et al., 9 Oct 2025).
7. Future Directions
The continued evolution of VFMs is expected to focus on:
- Developing models that efficiently unify generative and discriminative paradigms, moving toward “generalist” video systems that support both open-ended generation and robust understanding (Liu et al., 2023).
- Designing more effective multimodal fusion and alignment modules, as well as parameter-efficient adaptation methods (e.g., prompt basis, adapters) for domain and task transfer (Wu et al., 28 Jul 2024, Wang et al., 2023).
- Extending robustness and transfer by systematically probing and improving atomic visual abilities with benchmarks such as AVA-Bench, and by exploring advanced compositional and cognitive reasoning capabilities (Mai et al., 10 Jun 2025, Dubois et al., 8 Jul 2025).
- Addressing practical bottlenecks in data curation, distributed training, and scaling, leveraging techniques for efficient data handling and parallelism as exemplified by the NeMo pipeline (Patel et al., 17 Mar 2025).
- Deepening research into viewpoint invariance, causal reasoning, and long-form video understanding, and reducing computational overhead for edge and real-time deployment (Madan et al., 6 May 2024, Ponbagavathi et al., 22 Jul 2024).
The current trajectory emphasizes not only raw modeling power but also the design of scalable, adaptable, and diagnostically transparent video systems suitable for real-world deployment and rapid innovation across domains.