
Vidi: Large Multimodal Models for Video Understanding and Editing (2504.15681v2)

Published 22 Apr 2025 in cs.CV

Abstract: Humans naturally share information with those they are connected to, and video has become one of the dominant mediums for communication and expression on the Internet. To support the creation of high-quality large-scale video content, a modern pipeline requires a comprehensive understanding of both the raw input materials (e.g., the unedited footage captured by cameras) and the editing components (e.g., visual effects). In video editing scenarios, models must process multiple modalities (e.g., vision, audio, text) with strong background knowledge and handle flexible input lengths (e.g., hour-long raw videos), which poses significant challenges for traditional models. In this report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing scenarios. The first release focuses on temporal retrieval, i.e., identifying the time ranges within the input videos corresponding to a given text query, which plays a critical role in intelligent editing. The model is capable of processing hour-long videos with strong temporal understanding capability, e.g., retrieving time ranges for specific queries. To support a comprehensive evaluation in real-world scenarios, we also present the VUE-TR benchmark, which introduces five key advancements: 1) Video duration: significantly longer than videos in existing temporal retrieval datasets; 2) Audio support: includes audio-based queries; 3) Query format: diverse query lengths/formats; 4) Annotation quality: ground-truth time ranges are manually annotated; 5) Evaluation metric: a refined IoU metric to support evaluation over multiple time ranges. Remarkably, Vidi significantly outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the temporal retrieval task, indicating its superiority in video editing scenarios.

Summary

An Expert Review of "Vidi: Large Multimodal Models for Video Understanding and Editing"

The paper introduces Vidi, a family of large multimodal models for video understanding and editing over long-form content. The first release targets temporal retrieval: identifying the time ranges within a long video that correspond to a given text query. This addresses a significant bottleneck in video editing, particularly on mobile devices, where locating and composing relevant segments is labor-intensive and technically demanding.
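
Concretely, the task maps a query over a (potentially hour-long) video to a set of time ranges. The structure below is only an illustration of that input/output shape; the field names are hypothetical and do not reflect Vidi's actual interface.

```python
# Illustrative shape of a temporal retrieval request/response;
# field names are hypothetical, not taken from the paper.
request = {
    "video": "raw_footage.mp4",          # potentially hour-long input
    "query": "the speaker demonstrates the product on stage",
}
response = {
    "time_ranges": [                      # possibly multiple matches
        {"start": 754.2, "end": 801.9},   # seconds
        {"start": 1630.0, "end": 1668.5},
    ],
}
```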

Vidi's Contributions and Benchmark

Central to the paper is the introduction of the VUE-TR benchmark, which encompasses five critical advancements tailored for real-world applicability:

  1. Video Duration: The benchmark supports videos ranging from 20 seconds to over an hour, far surpassing the span supported by existing datasets.
  2. Audio Support: It includes audio-based queries, acknowledging the integral role audio plays in video comprehension and temporal localization.
  3. Query Format: The benchmark accommodates queries of varying lengths/formats: keywords, phrases, and sentences.
  4. Annotation Quality: Manually annotated ground-truth time ranges ensure high accuracy.
  5. Evaluation Metric: A refined IoU metric is proposed to evaluate predictions that span multiple time ranges (see the sketch after this list).
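
The paper's exact metric formulation is not reproduced in this summary; the sketch below is a minimal version assuming the refined IoU reduces to intersection-over-union computed over the unions of predicted and ground-truth time ranges. Function names are illustrative, not from the paper.

```python
def merge(ranges):
    """Merge overlapping (start, end) intervals, in seconds."""
    merged = []
    for s, e in sorted(ranges):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged

def total_length(ranges):
    return sum(e - s for s, e in ranges)

def intersection_length(pred, gt):
    """Total overlap between two merged interval lists."""
    return sum(max(0.0, min(pe, ge) - max(ps, gs))
               for ps, pe in pred for gs, ge in gt)

def multi_range_iou(pred_ranges, gt_ranges):
    """IoU over the unions of (possibly multiple) time ranges."""
    pred, gt = merge(pred_ranges), merge(gt_ranges)
    inter = intersection_length(pred, gt)
    union = total_length(pred) + total_length(gt) - inter
    return inter / union if union > 0 else 0.0

# Example: two predicted segments vs. one ground-truth segment
print(multi_range_iou([(10, 30), (45, 60)], [(12, 55)]))  # -> 0.56
```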

On this task, Vidi outperforms prominent proprietary models such as GPT-4o and Gemini, underscoring its ability to process hour-long videos with complex multimodal inputs.

Model Architecture and Training

Vidi's architecture uses decomposed attention mechanisms to efficiently manage multimodal inputs from text, vision, and audio, which reduces computational complexity and lets the model process long-form video content without sacrificing performance. Training proceeds through three stages of multimodal alignment (adapter training, synthetic data training, and real video training), yielding a model tuned for practical and diverse video editing applications.
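
The summary does not detail Vidi's actual decomposition, so the PyTorch sketch below only illustrates the general idea rather than the paper's architecture: one joint attention over all concatenated tokens is replaced by per-modality self-attention plus query-to-video cross-attention, avoiding the quadratic cost of attending over the full concatenation. All module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class DecomposedAttentionBlock(nn.Module):
    """Illustrative only: per-modality self-attention followed by
    text-to-(vision, audio) cross-attention, instead of one joint
    attention over the concatenation of all tokens."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.ModuleDict({
            m: nn.MultiheadAttention(dim, heads, batch_first=True)
            for m in ("text", "vision", "audio")
        })
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, vision, audio):
        feats = {"text": text, "vision": vision, "audio": audio}
        # 1) Each modality attends only within itself.
        for m in feats:
            attn_out, _ = self.self_attn[m](feats[m], feats[m], feats[m])
            feats[m] = feats[m] + attn_out
        # 2) Text tokens (the query) cross-attend to video tokens.
        av = torch.cat([feats["vision"], feats["audio"]], dim=1)
        cross_out, _ = self.cross_attn(feats["text"], av, av)
        return feats["text"] + cross_out, feats["vision"], feats["audio"]

# Toy usage: 16 text tokens, 1024 vision tokens, 256 audio tokens
blk = DecomposedAttentionBlock()
t = torch.randn(1, 16, 512)
v = torch.randn(1, 1024, 512)
a = torch.randn(1, 256, 512)
t, v, a = blk(t, v, a)
```

Under this decomposition the attention cost scales with the sum of squared per-modality lengths plus a query-to-video term, rather than with the square of the total token count.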

Implications of Temporal Retrieval

Temporal retrieval is vital for modern video production processes. By addressing this task, Vidi aids users in automating tedious workflows like identifying relevant segments from extensive footage, ultimately facilitating advanced video creation systems. Future developments in this area could expand Vidi’s applications to broader video understanding tasks and interactive editing scenarios, further bridging the gap between AI models and consumer-ready video editing solutions.

Evaluation Metrics and Results

The paper details a metric that supports evaluation when a query maps to multiple time ranges, a step beyond the conventional single-range IoU used in prior temporal grounding work. On the VUE-TR benchmark, Vidi consistently outperforms the compared models across categories, achieving higher precision and recall, and its handling of both visual and auditory inputs further supports its suitability for real-world scenarios.
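
Precision and recall over time ranges are not defined in this summary; one common convention, assumed here, measures them as overlap duration relative to predicted and ground-truth durations respectively, reusing the interval helpers from the IoU sketch above.

```python
def multi_range_precision_recall(pred_ranges, gt_ranges):
    """Assumed convention: precision = overlap / predicted duration,
    recall = overlap / ground-truth duration."""
    pred, gt = merge(pred_ranges), merge(gt_ranges)
    inter = intersection_length(pred, gt)
    precision = inter / total_length(pred) if pred else 0.0
    recall = inter / total_length(gt) if gt else 0.0
    return precision, recall

print(multi_range_precision_recall([(10, 30), (45, 60)], [(12, 55)]))
# -> (0.8, 0.651...)
```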

In conclusion, Vidi represents a significant development in multimodal models geared toward video understanding and editing. Its approach to temporal retrieval and the accompanying VUE-TR benchmark support tangible advances in video processing applications; the contributions are framed as practical rather than groundbreaking, but they lay a foundation for future work aimed at improving user experiences in content creation.
