REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

Published 24 May 2025 in cs.CV and cs.AI | (2505.18880v1)

Abstract: Short videos are an effective tool for promoting content and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot "quote" from the input videos, i.e., insert short video clips in their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows an LLM to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned LLM, and then uses a novel retrieval model to replace each quote placeholder with the video clip that best supports the narrative, selected from a pool of candidate quotable video clips. We examine the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, we show that our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in teaser generation.

Authors (9)

Summary

An Insight into REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

The paper "REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing" addresses a gap in video summarization: turning long documentary footage into concise, engaging short videos. Extractive methods tend to produce disjointed output that lacks a coherent narrative, while abstractive methods cannot quote clips from the original video. The paper proposes a hybrid approach that combines the strengths of both.

The proposed REGen framework follows a retrieval-embedded generation methodology. A large language model (LLM) first generates a coherent narrative script that contains placeholders where video quotes should appear. A quotation retrieval model then replaces each placeholder with the segment of the input video that best fits the surrounding context. This two-step process aims to produce short videos that stay narratively coherent while quoting multimodal source material directly.
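
The two-step flow described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `<QUOTE>` placeholder token, the `fill_placeholders` helper, and the word-overlap scorer, which stands in for the learned retrieval model.

```python
import re
from typing import Callable, Dict, List

PLACEHOLDER = "<QUOTE>"  # hypothetical marker; the paper's actual token may differ


def word_overlap(context: str, clip_text: str) -> float:
    """Toy relevance score (Jaccard word overlap); a stand-in for a learned retriever."""
    a = set(re.findall(r"\w+", context.lower()))
    b = set(re.findall(r"\w+", clip_text.lower()))
    return len(a & b) / (len(a | b) or 1)


def fill_placeholders(
    script: str,
    candidates: List[str],
    score: Callable[[str, str], float] = word_overlap,
) -> List[Dict[str, str]]:
    """Step 2: replace each placeholder with the best-scoring unused candidate clip."""
    parts = script.split(PLACEHOLDER)
    remaining = list(candidates)
    timeline: List[Dict[str, str]] = []
    for i, narration in enumerate(parts):
        narration = narration.strip()
        if narration:
            timeline.append({"type": "narration", "text": narration})
        # A placeholder follows every part except the last one.
        if i < len(parts) - 1 and remaining:
            best = max(remaining, key=lambda c: score(narration, c))
            remaining.remove(best)  # do not reuse the same clip twice
            timeline.append({"type": "quote", "text": best})
    return timeline


# Step 1 would be an LLM call; here the generated script is hard-coded.
script = "The reef is dying. <QUOTE> Scientists race to respond. <QUOTE>"
candidates = [
    "We lost half of the reef in a decade.",
    "Scientists here work day and night.",
]
timeline = fill_placeholders(script, candidates)
for item in timeline:
    print(item["type"], "|", item["text"])
```

Keeping generation and retrieval separate means the scorer can be swapped out without touching the script generator, which is the main design point this sketch is meant to convey.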

The paper's core innovation is the integration of LLMs with a multimodal retrieval system, which allows video quotes to be inserted into the narrative where they are meaningful. In the generation phase, fine-tuned models of the LLaMA architecture produce scripts with quotation markers placed at suitable points. A custom retrieval module, built on an encoder-decoder framework, then aligns textual content with visual clips and selects the video segment that best fits the context of each placeholder.
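
As a rough illustration of selecting a clip by contextual fit, the sketch below ranks candidates by cosine similarity between a context vector and clip vectors. This is a deliberately simpler technique than the encoder-decoder retrieval module described above, swapped in for clarity. The shared embedding space, the toy vectors, and the names `cosine` and `retrieve_clip` are all assumptions, not the paper's implementation.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_clip(query_vec: Sequence[float], clip_vecs: Dict[str, Sequence[float]]) -> str:
    """Return the id of the clip whose vector is most similar to the query."""
    return max(clip_vecs, key=lambda clip_id: cosine(query_vec, clip_vecs[clip_id]))


# Toy vectors standing in for embeddings of the narration context and of each
# candidate clip. A real system would compute these with trained encoders.
context_vec = [0.9, 0.1, 0.0]
clip_vecs = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_b": [0.0, 1.0, 0.0],
    "clip_c": [0.0, 0.0, 1.0],
}
best_clip = retrieve_clip(context_vec, clip_vecs)
print(best_clip)
```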

Experiments use the DocumentaryNet dataset, supplemented with transcripts and speaker diarization produced by WhisperX, which makes it possible to identify quotable interview segments. In objective evaluations, the REGen system performs well on metrics such as quotation density index and quote coverage rate. Narrative coherence and alignment, measured with G-Eval scores, are also assessed favorably.

Beyond immediate improvements to video editing workflows, the paper contributes to retrieval-augmented generation (RAG) for multimodal content by offering a scalable way to balance factual grounding with narrative storytelling in media outputs. The research points to possible extensions into broader domains, such as educational content creation and journalism, where it could make information more accessible and engaging.

Future research could optimize REGen's retrieval system across more diverse datasets, refine the LLM integration for better narrative alignment, and extend the approach to genres beyond documentary footage. More broadly, the work reflects a shift in AI-assisted multimedia production toward generative approaches that are more context-aware and that incorporate source content directly. With its emphasis on factual alignment, REGen makes a compelling case for this more nuanced approach to AI-driven video editing.