- The paper presents a novel method that edits talking-head videos by aligning transcript changes with corresponding viseme sequences and 3D head model parameters.
- It employs dynamic programming for optimal viseme selection and a recurrent neural network to generate smooth, photorealistic video transitions.
- User studies indicate that edited videos are perceived as real 59.6% of the time, demonstrating the method's potential for cost-effective post-production editing.
Text-based Editing of Talking-head Video: An Expert Overview
The paper "Text-based Editing of Talking-head Video" by Fried et al. introduces a novel approach for editing pre-existing talking-head videos using a text-based interface. This innovative method allows editors to modify the spoken content by manipulating the corresponding transcripts, thereby synthesizing video that reflects these edits while maintaining a cohesive audio-visual presentation.
The proposed system first aligns the transcript with the input video at the phoneme level and registers a 3D parametric head model to the footage. Once preprocessing is complete, the system identifies viseme sequences in the video that correspond to the edits specified in the transcript. It then re-times these sequences to match the durations required by the edits and blends the head model parameters to produce smooth transitions. Finally, a neural rendering network synthesizes photorealistic output, yielding seamless edits even for challenging mid-sentence transitions.
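To make the re-timing and blending step concrete, the sketch below shows one way per-frame head-model parameters from a retrieved segment could be cross-faded with the surrounding original frames so that the splice points do not produce visible jumps. This is a minimal illustration under assumed array shapes and a simple linear ramp, not the authors' implementation.

```python
import numpy as np

def blend_parameters(original, edited, transition_frames=5):
    """Cross-fade per-frame 3D head-model parameters (e.g., expression
    coefficients) from `original` into `edited` and back again, so the
    boundaries of the edit stay smooth.

    original, edited: arrays of shape (num_frames, num_params), assumed
    to be already re-timed to the same length, with
    num_frames >= 2 * transition_frames.
    """
    num_frames = original.shape[0]
    blended = edited.copy()

    # Linear ramp from 0 to 1 over the transition window.
    ramp = np.linspace(0.0, 1.0, transition_frames)[:, None]

    # Fade in: the start of the edit begins on the original parameters.
    blended[:transition_frames] = (
        (1.0 - ramp) * original[:transition_frames]
        + ramp * edited[:transition_frames]
    )
    # Fade out: the end of the edit returns to the original parameters.
    blended[num_frames - transition_frames:] = (
        ramp * original[num_frames - transition_frames:]
        + (1.0 - ramp) * edited[num_frames - transition_frames:]
    )
    return blended

# Example: 30 frames of 64-dimensional expression coefficients.
rng = np.random.default_rng(0)
orig = rng.normal(size=(30, 64))
edit = rng.normal(size=(30, 64))
smooth = blend_parameters(orig, edit)
```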
A pivotal strength of this method is its ability to produce convincing video edits directly from text input, without manual dubbing or extensive re-shooting, both of which are typically costly and time-consuming. Specifically, the technique allows for the addition, removal, and rearrangement of text segments, including word insertion and full sentence synthesis. The paper details the technical innovations underlying this capability, notably the dynamic programming-based strategy for selecting viseme sequences and the recurrent video generation network that converts a composite of real and synthetic imagery into seamless photorealistic frames.
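As a rough picture of the dynamic-programming search, the sketch below treats segment selection as a Viterbi-style optimization: each target viseme must be covered by a candidate segment from the source video, and the objective trades a label-matching cost against a penalty for seams between segments that are not consecutive in the source. The `Candidate` representation, cost values, and seam penalty are assumptions made for illustration; the paper's full objective is richer than this toy version.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    viseme: str        # viseme label of this source segment
    source_index: int  # position of the segment in the source video

def select_segments(target, candidates, seam_penalty=1.0):
    """Pick one candidate segment per target viseme, minimizing label
    mismatch cost plus a penalty for every seam (a jump between source
    segments that are not consecutive in the original video)."""
    n = len(target)
    INF = float("inf")
    # dp[i][j] = (best cost of covering target[:i+1] ending in candidate j,
    #             backpointer to the candidate chosen at step i-1)
    dp = [[(INF, -1)] * len(candidates) for _ in range(n)]

    def match_cost(viseme, cand):
        return 0.0 if cand.viseme == viseme else 2.0  # assumed mismatch cost

    for j, cand in enumerate(candidates):
        dp[0][j] = (match_cost(target[0], cand), -1)

    for i in range(1, n):
        for j, cand in enumerate(candidates):
            base = match_cost(target[i], cand)
            best = (INF, -1)
            for k, prev in enumerate(candidates):
                seam = 0.0 if cand.source_index == prev.source_index + 1 else seam_penalty
                cost = dp[i - 1][k][0] + seam + base
                if cost < best[0]:
                    best = (cost, k)
            dp[i][j] = best

    # Backtrack from the cheapest final state.
    j = min(range(len(candidates)), key=lambda j: dp[n - 1][j][0])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = dp[i][j][1]
        path.append(j)
    path.reverse()
    return [candidates[j] for j in path]

# Example: stitch the viseme sequence for an inserted word from a tiny library.
library = [Candidate("p", 0), Candidate("a", 1), Candidate("t", 2), Candidate("a", 7)]
print(select_segments(["p", "a", "t"], library))
```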
The authors present several compelling results, including a user study in which the edited videos were perceived as real 59.6% of the time. While not matching the 80.6% realism rating for unedited videos, this performance is significant, illustrating the practical potential of the approach. The methodological advances demonstrated are particularly relevant for applications in post-production editing, content creation, and adaptive media production. Moreover, these results suggest that the capability to create localized edits, as well as the potential for translating spoken dialogue across languages, could significantly enhance storytelling and educational content adaptation.
However, the method has limitations, including its dependence on a substantial amount of source video to perform adequately and difficulty synthesizing the lower-face region when it is occluded during an edit. Future research directions may include making the synthesis more robust to smaller amounts of data and integrating the system with advanced audio synthesis tools to improve speech-video synchronization. There is also a need for safeguards against potential misuse in falsifying content, underscoring the ethical responsibilities associated with media synthesis technologies.
Overall, this paper marks a notable contribution to the field of computational video editing and verbal-visual content synthesis. As machine learning and neural rendering techniques continue to evolve, systems like the one presented will undoubtedly see further refinements and broader application across a range of sectors. The work of Fried et al. lays a solid foundation for continued exploration into AI-driven video editing, highlighting both the creative opportunities and ethical considerations that come with these advancements.