Vidi2 Model: Unified Multimodal Video Understanding
- Vidi2 is a large multimodal model that advances video understanding through fine-grained temporal retrieval, spatio-temporal grounding, and video QA.
- It employs a unified Gemma-3 backbone with specialized visual, audio, and text encoders to process long videos and complex multimodal queries.
- The model demonstrates significant performance improvements on benchmarks like VUE-STG, achieving over 15-point gains in vIoU compared to prior systems.
Vidi2 is a large multimodal model designed for comprehensive video understanding and generation, with capabilities spanning fine-grained temporal retrieval, spatio-temporal grounding (STG), and video question answering (QA). Developed as the successor to the original Vidi model, Vidi2 expands both model capacity (scaling to 12B parameters) and task coverage with novel architecture choices, upgraded training pipelines, and new benchmarks for holistic video reasoning tasks (Team et al., 24 Nov 2025).
1. Model Architecture
Vidi2 is built around a unified architecture employing a Gemma-3 backbone, extended to handle long video and multimodal inputs. The core architecture consists of three primary encoding streams:
- Visual Encoder: A Vision Transformer (ViT-type) network tokenizes each frame into patch embeddings followed by an adaptive token-compression module that selects salient tokens along the temporal axis, maintaining a tractable sequence length for long-context videos.
- Audio Encoder: A 1D-CNN combined with a Transformer extracts per-second audio tokens from the input stream.
- Text Encoder: Free-form text queries are embedded using Gemma-3’s LLM text embedding layer, producing word-piece tokens.
These encoded streams are concatenated and processed as a joint token sequence in the Gemma-3 Transformer, where mutual self-attention and cross-attention enable multimodal interaction without requiring separate fusion heads.
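Below is a minimal sketch of how the three encoding streams could be merged into a single token sequence for the Gemma-3 backbone. Module names, hidden sizes, and the token-compression strategy are illustrative assumptions, not the released Vidi2 implementation.

```python
import torch
import torch.nn as nn

# Illustrative dimensions; the actual Vidi2 configuration is not public at this level of detail.
D_MODEL = 3840          # hypothetical Gemma-3 hidden size
TOKENS_PER_FRAME = 64   # hypothetical number of patch tokens kept per frame after compression


class AdaptiveTokenCompressor(nn.Module):
    """Keeps the top-k most salient patch tokens per frame; a stand-in for the
    adaptive token-compression module described above."""

    def __init__(self, dim: int, keep: int = TOKENS_PER_FRAME):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.keep = keep

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (frames, patches, dim)
        saliency = self.score(patch_tokens).squeeze(-1)              # (frames, patches)
        idx = saliency.topk(self.keep, dim=-1).indices               # (frames, keep)
        idx = idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))
        return patch_tokens.gather(1, idx)                           # (frames, keep, dim)


def build_joint_sequence(frame_patches, audio_feats, text_ids, text_embed, compressor):
    """Concatenate visual, audio, and text tokens into one joint backbone input."""
    vis = compressor(frame_patches)                   # (frames, keep, D_MODEL)
    vis = vis.flatten(0, 1)                           # (frames * keep, D_MODEL)
    txt = text_embed(text_ids)                        # (query_len, D_MODEL)
    return torch.cat([vis, audio_feats, txt], dim=0)  # joint sequence for the Gemma-3 Transformer
```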
The backbone outputs serve three prediction modules:
- Temporal Retrieval Head: A linear layer applied to per-token features, outputting logits for per-second start/end predictions.
- Spatio-Temporal Grounding (STG) Head: An auto-regressive LLM head generating sequences of (timestamp, x₀, y₀, x₁, y₁) tokens, effectively describing bounding-box tubes for each timepoint.
- Video QA Head: An LLM-generation head producing open-form natural language answers (Team et al., 24 Nov 2025).
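A schematic of the first two prediction modules on top of the shared backbone features follows; the video QA head is ordinary LLM text generation and is omitted. The head shapes and the (timestamp, x₀, y₀, x₁, y₁) token layout follow the description above, while class names and decoding details are assumptions.

```python
import torch
import torch.nn as nn


class TemporalRetrievalHead(nn.Module):
    """Per-token linear layer producing start/end logits at 1-second granularity."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 2)  # [start_logit, end_logit] per second

    def forward(self, per_second_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(per_second_feats)  # (seconds, 2)


def decode_stg_tube(lm_generate, prompt_ids):
    """STG head: the backbone autoregressively emits (t, x0, y0, x1, y1) tuples,
    one per second inside the predicted interval. `lm_generate` is a stand-in
    for the LLM's generation call, returning already-detokenized integer values."""
    values = [int(v) for v in lm_generate(prompt_ids)]
    # Hypothetical decoding: group every 5 generated values into one (t, box) tuple.
    return [tuple(values[i:i + 5]) for i in range(0, len(values) - 4, 5)]
```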
2. Spatio-Temporal Grounding (STG)
Given a video $V$ and a textual query $Q$, Vidi2’s STG functionality predicts:
- A time interval $[t_s, t_e]$,
- A spatio-temporal tube of bounding boxes $b_t = (x_0, y_0, x_1, y_1)$ for each $t \in [t_s, t_e]$.
Time is discretized at 1 fps, and box predictions are made for each temporal step within $[t_s, t_e]$. Supervised fine-tuning employs a sequence modeling loss:
$$\mathcal{L}_{\text{seq}} = -\sum_{i} \log p_\theta\!\left(y_i \mid y_{<i}, V, Q\right),$$
where $y = (y_1, \dots, y_N)$ is the target token sequence encoding the interval and per-second boxes. To further encourage accuracy in spatial grounding, two auxiliary losses may be added:
- Box Regression: $\mathcal{L}_{\text{box}} = \frac{1}{T}\sum_{t} \left\| \hat{b}_t - b_t \right\|_1$
- IoU Penalty: $\mathcal{L}_{\text{IoU}} = \frac{1}{T}\sum_{t} \left(1 - \mathrm{IoU}(\hat{b}_t, b_t)\right)$
Total loss for STG combines all terms:
$$\mathcal{L}_{\text{STG}} = \mathcal{L}_{\text{seq}} + \lambda_{\text{box}}\, \mathcal{L}_{\text{box}} + \lambda_{\text{IoU}}\, \mathcal{L}_{\text{IoU}}.$$
This comprehensive objective enables Vidi2 to learn to temporally and spatially localize queried content, supporting applications such as character tracking, event segmentation, and intelligent reframing (Team et al., 24 Nov 2025).
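The following sketch shows how the combined STG objective above could be computed in practice. The loss weights, tensor layouts, and helper names are assumptions for illustration, not values reported for Vidi2.

```python
import torch
import torch.nn.functional as F


def box_iou(pred, gt):
    """IoU between axis-aligned boxes given as (x0, y0, x1, y1) tensors of shape (T, 4)."""
    x0 = torch.maximum(pred[:, 0], gt[:, 0])
    y0 = torch.maximum(pred[:, 1], gt[:, 1])
    x1 = torch.minimum(pred[:, 2], gt[:, 2])
    y1 = torch.minimum(pred[:, 3], gt[:, 3])
    inter = (x1 - x0).clamp(min=0) * (y1 - y0).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_p + area_g - inter + 1e-6)


def stg_loss(logits, target_ids, pred_boxes, gt_boxes, lam_box=1.0, lam_iou=1.0):
    """Combined STG objective: token-level cross-entropy (sequence modeling loss)
    plus the auxiliary L1 box regression and IoU penalty terms.
    The weights lam_box / lam_iou are placeholder assumptions."""
    l_seq = F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))
    l_box = F.l1_loss(pred_boxes, gt_boxes)
    l_iou = (1.0 - box_iou(pred_boxes, gt_boxes)).mean()
    return l_seq + lam_box * l_box + lam_iou * l_iou
```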
3. Benchmarks and Evaluation Metrics
Vidi2 was evaluated on two new or newly refined video understanding benchmarks specifically targeting its advanced retrieval and grounding abilities:
VUE-STG Benchmark
- Scope: 982 video clips (204.8 hours), spanning durations from 10 seconds to 30 minutes.
- Queries: 1,600 human-verified noun-phrase queries.
- Annotations: Per-second bounding-box tubes curated by human annotators.
- Metrics:
- vIoU (spatio-temporal IoU, averaged over the temporal union of predicted and ground-truth intervals)
- tIoU (temporal intersection-over-union)
- vIoU-Int (spatio-temporal IoU, averaged over the temporal intersection)
VUE-TR-V2 Benchmark
- Scope: 1,600 clips (310.7 hours), with a balanced length distribution including ultra-long videos.
- Queries: 1,600 free-form, user-style sentences (covering vision, audio, or multimodal content).
- Metric: AUC of precision/recall/IoU curves, with IoU as the main ranking criterion.
Refined Metric Definitions (with sets $S_p$ and $S_g$ — the predicted and ground-truth time intervals — $S_u = S_p \cup S_g$, $S_i = S_p \cap S_g$):
- $\mathrm{vIoU} = \frac{1}{|S_u|}\sum_{t \in S_u} \mathrm{IoU}_t$, $\quad \mathrm{vIoU\text{-}Int} = \frac{1}{|S_i|}\sum_{t \in S_i} \mathrm{IoU}_t$, $\quad \mathrm{tIoU} = |S_i| / |S_u|$
- $\mathrm{IoU}_t$ is set to $\mathrm{IoU}(\hat{b}_t, b_t)$ if $t \in S_i$, $0$ otherwise.
These metrics allow for fine-grained evaluation of both temporal localization and spatial/temporal object tracking (Team et al., 24 Nov 2025).
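A minimal sketch of these metrics under the refined definitions above is given below; the per-second dictionary representation of tubes and the helper names are assumptions for illustration.

```python
def vue_stg_metrics(pred_tube, gt_tube):
    """Compute tIoU, vIoU, and vIoU-Int for one query.

    pred_tube / gt_tube: dict mapping integer second -> (x0, y0, x1, y1).
    Per-second box IoU counts as 0 outside the temporal intersection;
    vIoU normalizes by the temporal union, vIoU-Int by the intersection.
    """
    s_p, s_g = set(pred_tube), set(gt_tube)
    s_u, s_i = s_p | s_g, s_p & s_g

    def iou(a, b):
        x0, y0 = max(a[0], b[0]), max(a[1], b[1])
        x1, y1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    per_sec = [iou(pred_tube[t], gt_tube[t]) for t in s_i]
    t_iou = len(s_i) / len(s_u) if s_u else 0.0
    v_iou = sum(per_sec) / len(s_u) if s_u else 0.0
    v_iou_int = sum(per_sec) / len(s_i) if s_i else 0.0
    return {"tIoU": t_iou, "vIoU": v_iou, "vIoU-Int": v_iou_int}
```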
4. Training Paradigm and Data
Vidi2 leverages large-scale pretraining and fine-tuning pipelines:
- Pretraining Data: Video-text pairs scraped from the web, mixing authentic (noisy, diverse) and synthetic data.
- STG Supervision: Millions of synthetic video tubes derived from image-level grounding datasets (with simple temporal augmentation), plus a smaller, K-scale set of real human-annotated video–tube pairs.
- Temporal Retrieval: Expanded temporal retrieval data to longer clips and audio-focused queries.
- Video QA: Multiple-choice and open-ended datasets (LVBench, LongVideoBench, VideoMME).
- Optimization: Cross-entropy loss on output tokens, regression losses for box coordinates, and AdamW with multi-stage learning-rate schedules.
This data strategy supports both STG generalization and natural language understanding for retrieval and QA without sacrificing model scaling (Team et al., 24 Nov 2025).
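A hedged sketch of the optimization setup follows: AdamW with a multi-stage learning-rate schedule (here, linear warmup followed by cosine decay). The hyperparameters (peak LR, warmup length, weight decay, betas) are placeholders, not values reported for Vidi2.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import SequentialLR, LinearLR, CosineAnnealingLR


def make_optimizer(model, peak_lr=1e-5, warmup_steps=1_000, total_steps=100_000):
    """AdamW plus a two-stage schedule: linear warmup, then cosine decay."""
    opt = AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1, betas=(0.9, 0.95))
    warmup = LinearLR(opt, start_factor=0.01, total_iters=warmup_steps)
    decay = CosineAnnealingLR(opt, T_max=total_steps - warmup_steps)
    sched = SequentialLR(opt, schedulers=[warmup, decay], milestones=[warmup_steps])
    return opt, sched
```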
5. Comparative Performance Analysis
Vidi2 demonstrates strong, quantifiable improvements over prior proprietary and open models:
| Benchmark | Vidi2 | Gemini 3 Pro | GPT-5 |
|---|---|---|---|
| VUE-STG (vIoU / vIoU-Int / tIoU) | 32.57 / 60.30 / 61.43 | 16.59 / 33.64 / 38.68 | 13.01 / 18.47 / 64.66 |
| VUE-TR-V2 (AUC: IoU / P / R) | 48.75 / 62.45 / 64.93 | 37.58 / 48.61 / 56.30 | 17.15 / 29.64 / 26.63 |
| LVBench (Acc.) | 45.8 | 78.7 | — |
| LongVideoBench (Acc.) | 57.1 | 84.3 | — |
| VideoMME (Acc.) | 63.5 | — | — |
Vidi2 achieves a >15-point absolute improvement in vIoU on STG and an 11-point IoU gain on temporal retrieval relative to proprietary systems. On open-ended video QA, Vidi2’s accuracy is on par with Qwen2.5-VL-7B and outperforms most open-source models of similar scale, though it still trails the best proprietary LLMs on language-heavy benchmarks—a plausible implication is that Vidi2’s primary strengths lie in video-centric retrieval and grounding (Team et al., 24 Nov 2025).
6. Applications and Implications
Vidi2’s unified multimodal pipeline and strong empirical performance enable a range of applications:
- Editing and Understanding: Plot or character comprehension, automatic multi-view switching, intelligent cropping, and shot composition based on content grounding.
- Retrieval and Search: High-precision, long-context temporal localization and multimodal Q&A over raw video.
- Benchmarking and Evaluation: Vidi2’s introduction of VUE-STG with refined metrics addresses prior problems with short-context and poorly annotated datasets, supporting more realistic evaluation of spatio-temporal localization systems.
Vidi2 establishes a strong baseline for subsequent development of large, general video models with joint video/text/audio understanding, supporting both generative and discriminative tasks on long-form, real-world video data (Team et al., 24 Nov 2025).