InternVideo2: Scalable Video Foundation Model
- InternVideo2 is a large-scale video foundation model characterized by a progressive multi-stage training paradigm that integrates masked video modeling, crossmodal learning, and next-token prediction.
- It employs a ViT backbone with attention pooling and lightweight multimodal encoders for audio and text, scaling up to 6 billion parameters for robust spatiotemporal video understanding.
- Its engineered multimodal data pipeline uses semantic segmentation and clip-text alignment to achieve state-of-the-art performance on diverse video, audio, and retrieval benchmarks.
InternVideo2 is a family of large-scale video foundation models (ViFMs) designed for state-of-the-art performance in multimodal video understanding, encompassing recognition, video-text alignment, dialogue, and audio-visual reasoning tasks. InternVideo2 is distinguished by progressive multi-stage training that unifies masked video modeling, crossmodal contrastive learning, and next-token prediction, paired with large model scaling (up to 6 billion parameters) and rigorously engineered spatiotemporal-consistent, multimodal data pipelines (Wang et al., 2024).
1. Architectural Overview and Scaling
InternVideo2 models are based on Vision Transformer (ViT) backbones that sparsely sample eight frames per video and decompose frames into grids of 3D-patch tokens. These patches are embedded with learned positional encodings and processed via 48 transformer blocks. Key architectural features include:
- Attention Pooling: After the final transformer block, spatiotemporal tokens are aggregated into a 768-dimensional global feature via an attention-pooling head.
- Distillation Projection Heads: During stage 1, extra projection layers attached to the final transformer blocks align the model's features with those of the teacher models; these layers are discarded after training.
- Lightweight Multimodal Encoders: Audio (90M params) and text (340M params) encoders are interfaced with the video backbone for contrastive, matching, and masked language modeling objectives during stage 2.
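The attention-pooling step listed above can be illustrated with a minimal sketch. It assumes a single query vector and omits the key/value projections and multiple heads that a real pooling head would normally have; the function names and toy inputs are hypothetical and are not taken from the InternVideo2 code.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(tokens, query):
    """Pool N token vectors into one vector using a single query vector.

    Simplification (assumption): no key/value projections, one head.
    """
    dim = len(query)
    scale = 1.0 / math.sqrt(dim)
    # Scaled dot-product score between the query and every token.
    scores = [scale * sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    weights = softmax(scores)
    # Weighted sum of the tokens gives the global feature.
    pooled = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return pooled, weights

# Toy input: three 2-dimensional tokens and an all-zero query,
# so every token receives the same attention weight.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 0.0]
pooled, weights = attention_pool(tokens, query)
```

With the all-zero query the pooling reduces to a plain mean over tokens; a trained query would instead emphasize the tokens most relevant to the global representation.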
Scaling from 1B to 6B parameters is accomplished by matching depth/width settings from CoCa-1B and InternViT-6B, with additional distillation modules to stabilize early-stage learning (Wang et al., 2024).
2. Progressive Training Paradigm and Mathematical Objectives
InternVideo2 employs a three-stage curriculum designed to maximize the diversity and alignment of spatiotemporal, semantic, and crossmodal representations:
Stage 1: Masked Video Token Reconstruction
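- Loss:
$$\mathcal{L}_{\text{s1}} = \frac{1}{Z} \sum_{p} \Big( \alpha_1 \,\big\| f^{V}(\mathbf{x}^{p}_{M}) - h(\mathbf{x}^{p}) \big\|^2 + \alpha_2 \,\big\| f^{V}(\mathbf{x}^{p}_{M}) - g(\mathbf{x}^{p}) \big\|^2 \Big)$$
where $f^{V}$, $h$, and $g$ denote InternVideo2, InternVL-6B, and VideoMAEv2-g encoders, respectively, over each unmasked token $p$. Here $\mathbf{x}_{M}$ is the masked input seen by InternVideo2, $\alpha_1$ and $\alpha_2$ weight the two teachers, and $Z$ is a normalization factor.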
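As a rough illustration of the Stage 1 objective described above, the sketch below averages the squared distance between student tokens and the tokens of two teachers. The weights `alpha_a` and `alpha_b`, the normalization by token count, and all names are assumptions made for the example, not the authors' implementation.

```python
def masked_distill_loss(student, teacher_a, teacher_b, alpha_a=1.0, alpha_b=1.0):
    """Mean weighted squared distance between student and two teachers.

    Each argument is a list of token vectors covering the same (unmasked)
    token positions. Normalizing by the number of tokens is an assumption.
    """
    assert len(student) == len(teacher_a) == len(teacher_b)

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    total = 0.0
    for s, ta, tb in zip(student, teacher_a, teacher_b):
        total += alpha_a * sq_dist(s, ta) + alpha_b * sq_dist(s, tb)
    return total / len(student)

# Toy tokens: the student agrees with each teacher on one token only.
student = [[1.0, 0.0], [0.0, 1.0]]
teacher_a = [[1.0, 0.0], [0.0, 0.0]]
teacher_b = [[0.0, 0.0], [0.0, 1.0]]
loss_equal = masked_distill_loss(student, teacher_a, teacher_b)
loss_downweighted = masked_distill_loss(student, teacher_a, teacher_b, alpha_b=0.5)
```

Lowering one teacher's weight reduces the penalty for disagreeing with that teacher, which is how the balance between the two sources of supervision could be tuned.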
Stage 2: Video–Audio–Speech–Text Alignment
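- Loss:
$$\mathcal{L}_{\text{s2}} = \mathcal{L}_{\text{CON}} + \mathcal{L}_{\text{MAC}} + \mathcal{L}_{\text{MLM}}$$
with $\mathcal{L}_{\text{CON}}$ the InfoNCE contrastive loss, $\mathcal{L}_{\text{MAC}}$ the crossmodal matching loss, and $\mathcal{L}_{\text{MLM}}$ the standard masked language modeling loss.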
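The contrastive term of Stage 2 can be sketched as a symmetric InfoNCE loss over a batch similarity matrix. This is a generic formulation under stated assumptions: the temperature value, the symmetric averaging, and the function names are illustrative, and the matching and masked language modeling terms are not shown.

```python
import math

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity between video i and text j; matched pairs
    sit on the diagonal. The temperature default is an assumption.
    """
    n = len(sim)

    def diag_cross_entropy(rows):
        # Cross-entropy of each row's softmax against its diagonal entry.
        loss = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            loss += log_z - logits[i]
        return loss / n

    # Transpose to obtain the text-to-video direction.
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    return 0.5 * (diag_cross_entropy(sim) + diag_cross_entropy(cols))

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[0.0, 1.0], [1.0, 0.0]])
uninformative = info_nce([[0.0, 0.0], [0.0, 0.0]])
```

The loss is small when matched pairs dominate their rows and columns, large when mismatched pairs score higher, and equals the log of the batch size when all similarities are identical.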
Stage 3: Next-Token Prediction
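- Negative log-likelihood on generative tasks:
$$\mathcal{L}_{\text{s3}} = -\sum_{t=1}^{T} \log p_\theta\big(y_t \mid y_{<t}, \mathbf{v}\big)$$
where $y_t$ is the $t$-th target text token, $y_{<t}$ are the preceding tokens, and $\mathbf{v}$ is the video input that conditions the prediction.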
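A minimal sketch of the next-token negative log-likelihood follows. It assumes the model's per-step predictions are already available as probability tables and averages over target tokens (summing instead would differ only by a constant factor); the names and toy values are hypothetical.

```python
import math

def next_token_nll(step_probs, targets):
    """Average negative log-likelihood of the target tokens.

    step_probs[t] maps each candidate token to the probability predicted
    at step t, given the earlier tokens and the video context.
    """
    assert len(step_probs) == len(targets)
    nll = 0.0
    for probs, tok in zip(step_probs, targets):
        nll -= math.log(probs[tok])
    return nll / len(targets)

# Two decoding steps over a two-token vocabulary.
step_probs = [{"a": 0.5, "b": 0.5}, {"a": 0.25, "b": 0.75}]
loss = next_token_nll(step_probs, ["a", "a"])
```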
Training datasets scale from K-Mash2M (2M unlabeled videos) through 110M multimodal web clips and 2M instruction-tuning examples, each stage engineered for modality and context diversity (Wang et al., 2024).
3. Data Pipeline and Spatiotemporal-Consistent Alignment
Effective video-language foundation modeling in InternVideo2 is enabled by meticulous multimodal data curation, emphasizing:
- Semantic Spatiotemporal Segmentation: Shot boundaries are detected automatically with AutoShot (threshold 0.5) so that every clip is temporally and semantically coherent; clips are then filtered to an informative duration of 2–30 s.
- Multimodal Captioning: Visual (InternVid), audio (BEATs+Q-former), and speech (WhisperV2-large) captions are generated and fused by an LLM (Vicuna-1.5). Filtering and translation are then applied, resulting in video–audio–speech (VAS) caption pairs with high CLIP similarity.
- Clip-Text Alignment: The best 60M video–VAS caption pairs are selected for crossmodal alignment, improving transfer to text-based tasks.
This pipeline yields robust spatiotemporal–semantic correspondences, mitigating the underspecified alignment typical of single-modality or non-coherent segmentation approaches.
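The curation steps above can be approximated with a short sketch: cut a video at high shot-boundary scores, keep clips of 2–30 s, and retain the best-aligned clip-caption pairs. The per-frame score input, the function names, and the toy data are assumptions; the actual AutoShot interface and the authors' filtering code are not shown in the source.

```python
def split_into_clips(boundary_scores, fps, threshold=0.5):
    """Cut a video at frames whose shot-boundary score exceeds the threshold.

    boundary_scores holds one score per frame (assumed detector output).
    Returns a list of (start_sec, end_sec) clips.
    """
    cuts = [0] + [i for i, s in enumerate(boundary_scores) if i > 0 and s > threshold]
    cuts.append(len(boundary_scores))
    return [(a / fps, b / fps) for a, b in zip(cuts, cuts[1:]) if b > a]

def keep_informative(clips, min_sec=2.0, max_sec=30.0):
    # Keep only clips whose duration falls inside the 2-30 s window.
    return [(s, e) for s, e in clips if min_sec <= e - s <= max_sec]

def top_k_by_similarity(pairs, k):
    """pairs: list of (clip_id, caption, similarity); keep the k best-aligned."""
    return sorted(pairs, key=lambda p: p[2], reverse=True)[:k]

# A 40-frame video at 1 fps with detected boundaries at frames 1 and 6.
scores = [0.0] * 40
scores[1], scores[6] = 0.9, 0.7
clips = split_into_clips(scores, fps=1)
kept = keep_informative(clips)

pairs = [("c1", "a dog barks", 0.31), ("c2", "music plays", 0.12), ("c3", "a man speaks", 0.27)]
best = top_k_by_similarity(pairs, k=2)
```

In this toy run the 1 s and 34 s clips are discarded and only the 5 s clip survives, mirroring the duration filter described above.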
4. Empirical Performance and Benchmark Coverage
InternVideo2 demonstrates leading performance on over 70 video, audio, and video–language tasks, including:
- Video Recognition: Fine-tuned top-1 action accuracy on Kinetics-400 (16 frames) reaches 92.1% (vs. CoCa-g 88.9%); zero-shot K400 = 72.7%, surpassing CLIP and EVA-CLIP-E.
- Video-Text Retrieval: Zero-shot R@1 on MSR-VTT achieves 55.9% (vs. UMT-L 40.7%, VideoPrism-g 39.7%).
- Video Captioning: Zero-shot CIDEr on MSR-VTT is 43.5 (vs. Flamingo-3B 40.1).
- Video-Centric Dialogue: MVBench overall 60.9% (vs. prior best 51.1%); MoVQA = 41.0% (vs. VideoChat 33.6%).
- Audio Tasks: AudioCaps retrieval R@1 = 52.0%; ESC-50 classification accuracy = 98.6%.
- Video Grounding: QVHighlights score of 56.45% (vs. CLIP+SlowFast 48.38%) (Wang et al., 2024).
Ablations confirm significant gains from model scaling, teacher/model synergy, audio modality integration (+3.1 pts R@1 on MSR-VTT), and VAS caption fusion over unimodal alternatives.
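For readers unfamiliar with the retrieval metric reported above, the following sketch computes Recall@k from a query-by-candidate similarity matrix. It assumes one ground-truth video per text query, placed at the same index; the matrix values are made up for illustration and are unrelated to the reported results.

```python
def recall_at_k(sim, k=1):
    """Fraction of queries whose ground-truth item ranks in the top k.

    sim[i][j] is the similarity of text query i to video j; the correct
    video for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)

sim = [
    [0.9, 0.1, 0.0],  # query 0: correct video ranked first
    [0.2, 0.3, 0.8],  # query 1: correct video ranked second
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```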
5. Downstream Exploitation: InternVideo2 as a Universal Backbone
The transferability and representational quality of InternVideo2 are illustrated by its adoption in state-of-the-art downstream frameworks:
FDDet (Temporal Action Detection) (Zhu et al., 1 Apr 2025):
- Uses InternVideo2-6B as a frozen backbone for dense clip embedding extraction.
- Frequency-aware adaptive decoupling in feature space (FGAAD) splits and reweights low/high temporal frequencies, enhancing atomic action cues while filtering noisy semantics.
- The TCAR relation network fuses global (bidirectional state-space) and local (dilated temporal/channel) patterns for refined boundary localization.
- Achieves mAP@0.3–0.7 = 74.4% on THUMOS14 (vs. prior SOTA 69.3%), and 42.4% on ActivityNet-1.3.
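The idea of splitting a temporal feature sequence into low- and high-frequency parts and reweighting them can be illustrated with a deliberately simple stand-in: a moving-average low-pass filter and its residual. This is not the FGAAD module; the filter choice, the fixed weights, and the single-channel input are all assumptions made to keep the example small.

```python
def moving_average(x, window=3):
    """Centered moving average, shortened at the edges (a simple low-pass)."""
    half = window // 2
    out = []
    for t in range(len(x)):
        lo, hi = max(0, t - half), min(len(x), t + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out

def reweight_frequencies(x, w_low=1.0, w_high=1.0, window=3):
    """Split x into low/high temporal components and recombine with weights."""
    low = moving_average(x, window)
    high = [a - b for a, b in zip(x, low)]  # residual carries fast changes
    return [w_low * l + w_high * h for l, h in zip(low, high)]

# One feature channel over five time steps with a brief spike.
x = [0.0, 0.0, 3.0, 0.0, 0.0]
unchanged = reweight_frequencies(x)            # equal weights reproduce x
smoothed = reweight_frequencies(x, w_high=0.0)  # keeps only the slow trend
```

Raising `w_high` would sharpen short, abrupt changes such as action boundaries, while lowering it suppresses them; a learned, input-dependent weighting would replace the constants used here.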
Sparse-Dense Side-Tuner (SDST, for Video Grounding) (Pujol-Perich et al., 10 Jul 2025):
- Plugs a frozen InternVideo2-1B backbone into an anchor-free side-tuner with Reference-based Deformable Self-Attention (RDSA).
- Pools and projects multi-layer intermediate features (layers 37–40) to achieve temporal and context adaptation.
- Matches or exceeds full fine-tuning performance with only 4.1M tunable parameters (vs. 147M for end-to-end), gaining up to 9 points mAP over CLIP-based alternatives on QVHighlights, TACoS, and Charades-STA.
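Pooling and projecting intermediate features from several frozen backbone layers can be sketched as below. The mean over layers and the plain linear projection are assumptions chosen for simplicity; the source does not specify how SDST combines the layers, and the names and shapes here are hypothetical.

```python
def pool_layers(layer_feats):
    """Average per-frame features across several backbone layers.

    layer_feats[l][t] is the feature vector of frame t taken from layer l.
    """
    n_layers = len(layer_feats)
    n_frames = len(layer_feats[0])
    dim = len(layer_feats[0][0])
    return [
        [sum(layer_feats[l][t][d] for l in range(n_layers)) / n_layers for d in range(dim)]
        for t in range(n_frames)
    ]

def project(frames, weight):
    """Apply a linear map; weight has shape (out_dim, in_dim)."""
    return [[sum(w * x for w, x in zip(row, f)) for row in weight] for f in frames]

# Two layers, two frames, feature dimension 2 (toy sizes).
layer_feats = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 2.0], [1.0, 0.0]],
]
pooled = pool_layers(layer_feats)
side_input = project(pooled, weight=[[1.0, 1.0]])
```

Only the projection (and whatever side network consumes its output) would carry trainable parameters in such a setup, since the backbone features themselves come from frozen layers.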
6. Limitations and Prospects for Advancement
InternVideo2 does not introduce fundamentally novel architectural components but validates the transformative effect of massive scaling, progressive multi-objective training, and spatiotemporal-consistent, multimodal data engineering. Persisting limitations include:
- Reliance on sparsely sampled, fixed-resolution frames, which can fail to capture ultra-fine motion.
- High computational cost due to sequential training and large-scale data curation.
- Inherited biases from web-scale data and teacher networks.
- Lack of explicit mechanisms for adaptive sequence length or arbitrary video resolution (Wang et al., 2024).
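The first limitation can be made concrete with a sketch of fixed sparse sampling. It assumes frames are taken at the centers of equal-length segments, which is one common scheme; the source states only that eight frames are sparsely sampled, so the exact strategy and the function name are assumptions.

```python
def sample_frame_indices(num_frames, num_samples=8):
    """Pick the center frame of each of num_samples equal-length segments."""
    if num_frames <= 0:
        raise ValueError("video has no frames")
    seg = num_frames / num_samples
    return [min(num_frames - 1, int(seg * (i + 0.5))) for i in range(num_samples)]

short_clip = sample_frame_indices(80)    # 80-frame clip
long_clip = sample_frame_indices(7200)   # e.g. 4 minutes at 30 fps
gap_short = short_clip[1] - short_clip[0]
gap_long = long_clip[1] - long_clip[0]
```

With a fixed budget of eight frames, the gap between sampled frames grows linearly with video length (10 frames versus 900 frames in this toy comparison), which is why fast or brief motion in long videos can fall between samples.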
Identified directions for future work include adaptive frame sampling, more efficient scheduling of training objectives, de-biasing strategies, and integrating InternVideo2 encoders into unified multimodal LLMs for direct reasoning over variable-length video sequences.
7. Significance in the Foundation Model Landscape
InternVideo2 establishes that an orchestrated integration of self-supervised and weakly supervised objectives, at scale and with co-designed data pipelines, delivers a highly transferable and reasoning-capable video foundation model. Its representations set new state-of-the-art results on direct benchmarks and also serve as strong frozen backbone features in downstream architectures, supporting advances in temporal action detection, video grounding, and multimodal retrieval (Wang et al., 2024, Zhu et al., 1 Apr 2025, Pujol-Perich et al., 10 Jul 2025). The model's progressive curriculum and robust feature transfer further set a precedent for the design of future generalist video foundation models.