Omni Video-Text Models

Updated 9 December 2025
  • Omni video-text models are unified systems that integrate video, text, audio, and visual cues into a single framework to support tasks like retrieval, captioning, and real-time processing.
  • They employ techniques such as streaming transformers and retrieval-augmented generation to synchronize multi-modal inputs and enable fine- and coarse-grained adaptivity in complex video understanding.
  • These models demonstrate significant performance gains on benchmark tasks by leveraging hierarchical evidence retrieval and cross-modal fusion for actionable, context-aware outputs.

Omni video-text models are unified computational systems designed to process, ground, generate, or retrieve information across arbitrary combinations of video, text, and often other modalities such as audio or images. They target long-form, streaming, or multi-reference video understanding, cross-modal retrieval, controllable video generation, and fine-grained spatio-temporal reasoning, relying on architectures and training protocols that fuse multiple multimodal input sources into coherent outputs suitable for open-domain queries and downstream applications (Xue et al., 16 Jun 2025, Wang et al., 26 Mar 2024, Han et al., 4 Dec 2025, Xu et al., 26 Mar 2025).

1. Fundamental Principles and Definitions

Omni video-text models unify the processing of video sequences and textual data, often in conjunction with audio and other modalities, within a single end-to-end framework. Core tenets include:

  • Unified Input/Output Space: Tasks ranging from classification and retrieval to question answering and open-ended captioning are recast as conditional token generation or embedding alignment over a multimodal, often jointly-trained, architecture (Wang et al., 26 Mar 2024, Wang et al., 2022).
  • Streaming and Real-time Processing: Many omni models process video and text as a continuous stream (not batch/offline), incrementally building internal temporal state and generating responses without access to future frames, supporting applications in proactive surveillance, real-time dialogue, and continuous monitoring (Xu et al., 26 Mar 2025, Wang et al., 29 Mar 2025).
  • Cross-modal Fusion: Inputs from heterogeneous modalities (frames, audio, OCR, ASR, captions, knowledge graphs) are aligned into a shared representational space via attention, specialized pooling, contrastive learning, or other multimodal fusion mechanisms (Liu et al., 6 Feb 2025, Xu et al., 3 Oct 2025, Chen et al., 2023); a minimal contrastive-alignment sketch follows this list.
  • Fine- and Coarse-grained Adaptivity: Retrieval and grounding mechanisms adapt to query complexity, supporting hierarchical evidence extraction and context-aware generation (Xue et al., 16 Jun 2025, Heo et al., 14 Jan 2025, Lin et al., 4 Nov 2025).
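
To make the shared-embedding-space idea concrete, the following is a minimal sketch of symmetric contrastive (InfoNCE) alignment between paired video and text embeddings, in the spirit of the contrastive objectives cited above; the encoders are abstracted away, and the tensor shapes, temperature, and function names are illustrative assumptions rather than any cited model's implementation.

```python
import torch
import torch.nn.functional as F

def infonce_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss aligning paired video/text embeddings.

    video_emb, text_emb: (batch, dim) tensors from modality-specific encoders.
    Matching pairs share a row index; all other rows act as in-batch negatives.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (batch, batch) cosine-similarity logits
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)    # video -> text direction
    loss_t2v = F.cross_entropy(logits.T, targets)  # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)

# Toy usage with random stand-ins for encoder outputs.
video_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(infonce_alignment_loss(video_emb, text_emb))
```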

2. Representative Architectures and Retrieval-Augmented Methods

Recent omni video-text models instantiate a diverse range of architectural patterns:

  • Retrieval-Augmented Generation (RAG) with Adaptive Granularity: AdaVideoRAG (Xue et al., 16 Jun 2025) introduces an intent classifier to dynamically route queries to different retrieval depths (no retrieval, naive cross-modal retrieval, or graph-enhanced retrieval) over an omni-knowledge index, fusing text, visual, and semantic graph evidence to optimize resource efficiency and answer accuracy (a schematic routing sketch follows this list).
  • Encoder-Decoder with Universal Tokenization: OmniVid (Wang et al., 26 Mar 2024) and OmniVL (Wang et al., 2022) process all tasks as multimodal token sequence generation, supporting action recognition, captioning, QA, and tracking via unified time and box token vocabularies.
  • Multi-Branch Specialized Fusions: HumanOmni (Zhao et al., 25 Jan 2025) employs three parallel visual branches (face, body, interaction) with instruction-driven adaptive fusion for fine discrimination in human-centric video understanding, coupled with audio integration via Whisper-derived representations.
  • Unified Bi-Encoder Retrieval: Omni-Embed-Nemotron (Xu et al., 3 Oct 2025) bridges text, video, audio, and image retrieval with a single neural bi-encoder trained via contrastive InfoNCE loss, projecting all modalities into a shared embedding space for dot-product similarity search.
  • Streaming Transformer Mechanisms: Qwen2.5-Omni (Xu et al., 26 Mar 2025) introduces a time-aligned multimodal rotary position embedding (TMRoPE) for synchronizing block-wise processed video and audio tokens, and utilizes a Thinker–Talker split to jointly produce text and speech outputs in real-time.
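
The following is a schematic sketch of the adaptive-granularity routing idea described for AdaVideoRAG above: a query is classified into one of three intent levels, and only the retrieval depth that level requires is exercised. The Intent levels, classifier interface, and index search API are hypothetical placeholders, not the paper's actual components.

```python
from enum import Enum

class Intent(Enum):
    SIMPLE = 1  # answerable by the video-LLM alone, no retrieval
    NAIVE = 2   # single-hop: retrieve caption/ASR/OCR/visual evidence
    GRAPH = 3   # multi-entity reasoning: also retrieve knowledge-graph evidence

class _StubIndex:
    """Hypothetical index interface: search(query, k) -> list of evidence strings."""
    def __init__(self, items):
        self.items = items
    def search(self, query, k=5):
        return self.items[:k]

def answer_query(query, video_llm, classifier, text_index, graph_index):
    """Route a query to the cheapest retrieval depth its predicted intent requires."""
    intent = classifier(query)
    evidence = []
    if intent in (Intent.NAIVE, Intent.GRAPH):
        evidence += text_index.search(query, k=5)   # cross-modal text/visual evidence
    if intent is Intent.GRAPH:
        evidence += graph_index.search(query, k=5)  # entity/relation evidence
    prompt = "\n".join(evidence + [f"Question: {query}"])
    return video_llm(prompt)

# Toy usage with stub components standing in for real models and indexes.
print(answer_query(
    "Who hands the key to whom after the chase?",
    video_llm=lambda prompt: "[video-LLM answer conditioned on retrieved evidence]",
    classifier=lambda q: Intent.GRAPH,
    text_index=_StubIndex(["clip 12 caption: a man passes a small object to a woman"]),
    graph_index=_StubIndex(["(man) -[hands key to]-> (woman)"]),
))
```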

3. Hierarchical and Cross-Modal Indexing in Long-Video Understanding

Omni video-text frameworks for long or complex videos integrate advanced knowledge-indexing and evidence retrieval strategies:

  • Omni-Knowledge Indexing: AdaVideoRAG (Xue et al., 16 Jun 2025) splits long videos into uniform clips, producing parallel databases:
    • D_C: Caption database from frame captions via MiniCPM-V.
    • D_A: ASR database from FastWhisper.
    • D_O: OCR-extracted scene text via EASYOCR.
    • D_V: Visual embeddings via ImageBind.
    • G: Semantic knowledge graph via BGE-M3, containing entities and relations extracted from pooled text tokens.
  • Hierarchical Evidence Retrieval: Query intent (L1–L3) governs which index levels are accessed. For simple queries, direct model inference suffices; for complex multi-entity reasoning, graph-based and joint-modality retrieval are activated. After candidate extraction via cosine similarity, evidence is temporally ordered and injected into a large multimodal LLM for answer synthesis (a minimal sketch of this step appears at the end of this section).

This structure enhances both efficiency and answer quality, demonstrated by substantial Win-Rate and accuracy gains on HiVU and Video-MME benchmarks: up to +39.8% absolute on multiple-choice accuracy and +38.84% on open-ended Win-Rate, with module-specific ablations isolating the contributions of graph retrieval and auxiliary text grounding (Xue et al., 16 Jun 2025).
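
The following is a minimal sketch of the candidate-extraction step described above: cosine similarity against clip-level index embeddings, temporal re-ordering of the hits, and injection into a prompt for the answering model. The array shapes and helper names are illustrative assumptions, not the AdaVideoRAG implementation.

```python
import numpy as np

def retrieve_evidence(query_emb, clip_embs, clip_texts, clip_times, top_k=5):
    """Select top-k clips by cosine similarity, then re-order them temporally.

    query_emb: (dim,) embedding of the query.
    clip_embs: (num_clips, dim) embeddings from a clip-level index (captions/ASR/OCR/visual).
    clip_texts: per-clip evidence strings; clip_times: per-clip start times (seconds).
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    sims = c @ q                                    # cosine similarity per clip
    top = np.argsort(-sims)[:top_k]                 # most similar clips
    top = sorted(top, key=lambda i: clip_times[i])  # restore temporal order
    return [f"[t={clip_times[i]:.0f}s] {clip_texts[i]}" for i in top]

def build_prompt(query, evidence):
    """Inject temporally ordered evidence into a prompt for the multimodal LLM."""
    return "\n".join(["Evidence:"] + evidence + [f"Question: {query}"])

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
clips = rng.normal(size=(20, 64))
print(build_prompt("What happens after the goal?",
                   retrieve_evidence(rng.normal(size=64), clips,
                                     [f"caption {i}" for i in range(20)],
                                     [10.0 * i for i in range(20)])))
```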

4. Unified, Generative, and Controllable Video-Text Processing

Omni models target general-purpose video understanding, grounding, and generation by formulating diverse tasks as unified sequence generation or token classification problems:

  • OmniTube and Multi-Object Grounding: OmniSTVG (Yao et al., 13 Mar 2025) enables simultaneous localization of multiple query-referenced objects using an end-to-end Transformer (OmniTube), with spatial and temporal query generation tailored to the text structure. Evaluation on BOSTVG (10,018 videos, 287 object classes) validates significant m_tIoU gains for dense multi-object video grounding (a sketch of the temporal-IoU metric follows this list).
  • Text/Video Interleaving for Fine Control: TV2TV (Han et al., 4 Dec 2025) interleaves text and video chunk sequences, leveraging a mixture-of-transformers backbone to alternately "plan in words" (via a Llama-initialized text tower) and generate video (via a flow-matching video tower). This architecture supports fine-grained user interventions (e.g., mid-sequence text edits trigger behavior changes in the rendered video) and achieves measurable improvements in controllability and prompt alignment on both synthetic (CS:GO) and real-world (sports) datasets.
  • Instruction-Driven Editing and Generation: OmniV2V (Liang et al., 2 Jun 2025) unifies diverse video generation/editing paradigms (mask-guided editing, pose-driven animation, object transformation) through dynamic content manipulation injection modules atop a diffusion transformer, with LLaVA-based visual-text instruction embedding and cross-modal 3D rotary position encoding for precise reference image–video alignment.
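
For context on the m_tIoU numbers cited above, the following is a minimal sketch of temporal IoU between predicted and ground-truth segments, averaged over queries; the exact evaluation protocol of BOSTVG may differ, so treat this as an illustrative approximation.

```python
def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_tiou(predictions, ground_truths):
    """Average temporal IoU over all (query, segment) pairs."""
    scores = [temporal_iou(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores) if scores else 0.0

# Toy usage: two queries on the same video.
preds = [(3.0, 9.0), (12.0, 20.0)]
gts = [(4.0, 10.0), (15.0, 18.0)]
print(f"m_tIoU = {mean_tiou(preds, gts):.3f}")
```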

5. Streaming, Proactive, and Interactive Reasoning

Modern omni video-text models extend their operation to interactive, streaming, and proactive settings:

  • Block-wise Streaming and Synchronization: Qwen2.5-Omni (Xu et al., 26 Mar 2025) and M4 (Wang et al., 29 Mar 2025) process video and audio as block-wise, time-interleaved embeddings with causal masking. TMRoPE ensures token alignment across modalities at matched time indices, critical for real-time instruction following and low-latency, synchronized speech and text outputs (a schematic time-alignment sketch follows this list).
  • Proactive Alerting and Turn-Taking: The M4 multiplexing framework (Wang et al., 29 Mar 2025) (benchmarked on OmniMMI) includes highlight-spot attention triggers for alerting, interruption detection via reciprocal perplexity thresholding, and parallel decoding mechanisms for multi-turn dialogue, setting new accuracy and precision benchmarks for streaming scenario tasks such as proactive alerting (Precision 31.6%, IoU 13.9%) and robust turn-taking.
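
The following is a schematic sketch of the time-alignment idea behind block-wise interleaving: audio and video tokens falling in the same time block are emitted together, and tokens with the same timestamp receive the same temporal position index regardless of modality. The block size, token rates, and index scheme are illustrative assumptions, not the exact TMRoPE formulation.

```python
def interleave_with_time_ids(video_times, audio_times, block_sec=2.0, time_res=0.04):
    """Interleave video/audio tokens block-by-block and assign time-aligned position IDs.

    video_times / audio_times: per-token timestamps (seconds) for each modality.
    Returns a list of (modality, timestamp, temporal_position_id) in streaming order.
    """
    horizon = max(video_times[-1] if video_times else 0.0,
                  audio_times[-1] if audio_times else 0.0)
    stream, block_start = [], 0.0
    while block_start <= horizon:
        block_end = block_start + block_sec
        for name, times in (("video", video_times), ("audio", audio_times)):
            for t in times:
                if block_start <= t < block_end:
                    # Tokens at the same timestamp get the same temporal ID in
                    # either modality, keeping the two streams synchronized.
                    stream.append((name, t, int(round(t / time_res))))
        block_start = block_end
    return stream

# Toy usage: 2 fps video tokens and 12.5 Hz audio tokens over 4 seconds.
video_times = [i * 0.5 for i in range(8)]
audio_times = [i * 0.08 for i in range(50)]
for tok in interleave_with_time_ids(video_times, audio_times)[:6]:
    print(tok)
```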

6. Benchmarks, Datasets, and Evaluation Protocols

Omni video-text research is propelled by comprehensive benchmarks and large-scale multi-modal datasets. Those recurring in the work surveyed above include HiVU and Video-MME for long-video question answering (Xue et al., 16 Jun 2025), BOSTVG for dense multi-object spatio-temporal grounding (Yao et al., 13 Mar 2025), OmniMMI for streaming and proactive interaction (Wang et al., 29 Mar 2025), and the VAST-27M corpus for omni-modal pretraining (Chen et al., 2023).

7. Future Directions and Limitations

Several future research frontiers and constraints are articulated across current omni video-text work:

  • Granularity and Efficiency: Current retrieval-augmentation and intent-classification schemes provide only coarse (three-level) adaptivity; extending to finer-grained or continuous retrieval strategies, as well as to adjustable latency/quality trade-offs, remains open (Xue et al., 16 Jun 2025).
  • Extensibility and Real-Time Indexing: Incorporation of truly streaming RAG, online index updating, and expanded application to interactive multi-turn dialogue, summarization, or tutoring are noted as high-priority extensions (Xue et al., 16 Jun 2025, Xu et al., 26 Mar 2025).
  • Multimodal Alignment and Memory: Fusion strategies and temporal/contextual encoding are bottlenecks for long context and audio-visual reasoning, with memory-efficient transformers and learned cross-modal attention as promising directions (Wang et al., 29 Mar 2025, Xu et al., 3 Oct 2025).
  • Domain-Specific Adaptation and Bias: Domain-specific fine-tuning (e.g., HumanOmni for human-centric scenes (Zhao et al., 25 Jan 2025)) closes gaps not filled by generic omni models, though the risk of modality or corpus bias (e.g., in VAST-27M (Chen et al., 2023)) remains significant.
  • Interpretability and Coordination: Agent-based frameworks coordinating multiple foundation sub-models at inference time offer adaptability and interpretability but incur higher latency and coordination complexity (Lin et al., 4 Nov 2025).

Omni video-text models thus embody a multimodal, scalable, and extensible paradigm for video grounding, understanding, retrieval, and generation, laying the groundwork for more general, proactive, and context-aware machine perception and interaction.
