- The paper introduces a novel generative pipeline that compresses talking-head videos into text and reconstructs them using advanced deep learning models.
- It achieves bitrates two to three orders of magnitude lower than those of conventional codecs while maintaining comparable perceptual quality.
- The method integrates Speech-to-Text, Text-to-Speech, and lip-sync models to deliver efficient, adaptive multimedia transmission.
Ultra-Low Bitrate Compression of Talking-Head Videos via Text: An Overview of Txt2Vid
The paper "Txt2Vid: Ultra-Low Bitrate Compression of Talking-Head Videos via Text" explores a novel approach for compressing video content, particularly focusing on talking-head videos commonly seen in video conferencing. The paper introduces a generative compression pipeline, Txt2Vid, that reduces video to text and reconstructs it using advanced deep learning models for voice cloning and lip-syncing. This approach demonstrates significant reductions in data rates while maintaining user perception of quality.
The authors tackle a pressing issue: the immense data consumption of video streaming, exacerbated by the surge in conferencing during the COVID-19 pandemic. Traditional video codecs such as H.264 and AV1 optimize pixel-level fidelity, which does not necessarily align with subjective Quality-of-Experience (QoE). The proposed Txt2Vid pipeline offers a perceptual alternative: it transmits only the pertinent information, namely the spoken content, and reconstructs audiovisual content that viewers perceive as equivalent in quality.
Methodology and Contributions
The Txt2Vid process involves several innovative steps:
- Text Extraction and Compression: The spoken content of the original video is transcribed into text using Speech-to-Text (STT), and the resulting transcript is further compressed.
- Text-to-Speech (TTS) and Lip-Sync: At the receiving end, the text is converted back into audio by a voice-cloning TTS model (Resemble), which is then lip-synced against a stored driving video using Wav2Lip.
- Generative Decoding: The decoder relies on context shared once ahead of time (a short driving video of the speaker and a trained voice profile) to reconstruct realistic video from the received text alone; a minimal sketch of the pipeline appears after this list.
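As a rough illustration, the pipeline's overall shape might look like the Python sketch below. The function names and the use of zlib for text compression are assumptions for illustration; the paper's actual implementation wraps an STT engine, the Resemble voice-cloning service, and Wav2Lip.

```python
import zlib

# Placeholder stand-ins for the actual models. In a real implementation each
# would wrap an off-the-shelf system: an STT engine for transcription, a
# voice-cloning TTS service (Resemble in the paper), and Wav2Lip for lip-sync.
def speech_to_text(video_path: str) -> str:
    raise NotImplementedError("wrap an STT engine here")

def text_to_speech(text: str, voice_profile: str) -> bytes:
    raise NotImplementedError("wrap a voice-cloning TTS model here")

def lip_sync(driving_video_path: str, audio: bytes) -> str:
    raise NotImplementedError("wrap a lip-sync model here; returns output video path")

# --- Encoder (sender) ------------------------------------------------------
def encode(video_path: str) -> bytes:
    """Reduce a talking-head video to a losslessly compressed transcript."""
    transcript = speech_to_text(video_path)
    return zlib.compress(transcript.encode("utf-8"))

# --- Decoder (receiver) ----------------------------------------------------
def decode(payload: bytes, voice_profile: str, driving_video_path: str) -> str:
    """Reconstruct audiovisual content from text plus previously shared context."""
    transcript = zlib.decompress(payload).decode("utf-8")
    audio = text_to_speech(transcript, voice_profile)   # clone the speaker's voice
    return lip_sync(driving_video_path, audio)          # animate the stored driving video
```

Note that the voice profile and driving video are shared once, ahead of time; per utterance, only the compressed text crosses the channel.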
This method achieves bitrates two to three orders of magnitude lower than those of state-of-the-art audio-video codecs such as AAC and H.264/AV1. Subjective studies with human participants indicated that viewers found Txt2Vid's generative outputs comparable in QoE to, and often preferable over, heavily compressed versions of the original video.
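To see why text lands so far below codec bitrates, consider a back-of-the-envelope estimate (the rates below are illustrative assumptions, not measurements from the paper):

```python
# Back-of-the-envelope bitrate comparison (illustrative numbers only).
words_per_minute = 150                 # typical conversational speaking rate
bits_per_word = 6 * 8                  # ~5 chars + space per word, 8 bits/char
text_bps = words_per_minute * bits_per_word / 60       # ~120 bps, uncompressed
compressed_text_bps = text_bps / 2                     # ~2x from lossless compression

video_call_bps = 100_000               # a modest video-conference bitrate (~100 kbps)

print(f"text:  ~{compressed_text_bps:.0f} bps")        # ~60 bps
print(f"video: ~{video_call_bps} bps")
print(f"ratio: ~{video_call_bps / compressed_text_bps:,.0f}x")  # ~3 orders of magnitude
```

Even against a modest conference stream, transcript text comes in thousands of times smaller, consistent with the orders-of-magnitude gap reported in the paper.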
Implications and Future Developments
Practically, Txt2Vid could substantially reduce the bandwidth needed for video transmission, potentially democratizing access to video technology in low-bandwidth regions. Theoretically, it departs from conventional video compression by optimizing for perceived quality rather than pixel-level fidelity.
This work also opens up intriguing applications in adaptive communication, where available bandwidth varies, allowing the fidelity of the audiovisual experience to scale with the channel; a sketch of such mode switching follows below. The approach could serve environments such as remote terrain or even extraterrestrial habitats, and storing educational content as text and reconstructing it on demand could enable personalized learning.
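As a hedged sketch of what such bandwidth-adaptive behavior could look like (the thresholds and mode names are purely illustrative, not a protocol defined in the paper):

```python
def choose_mode(available_bps: float) -> str:
    """Pick a transmission mode for the current link capacity (illustrative)."""
    if available_bps >= 300_000:
        return "full_video"   # conventional audio-video codec
    if available_bps >= 20_000:
        return "audio_only"   # send compressed audio; lip-sync at the receiver
    return "text"             # Txt2Vid mode: transcript only; TTS + lip-sync at receiver

for bps in (1_000_000, 50_000, 500):
    print(f"{bps:>9,} bps -> {choose_mode(bps)}")
```

The key design point is graceful degradation: as the channel shrinks, the sender drops progressively more raw signal and leans harder on generative reconstruction at the receiver.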
While the promise of Txt2Vid is significant, practical challenges and ethical considerations remain: computational cost, latency, and the potential for misuse of voice cloning and lip-syncing. Future work will need to make the pipeline fast enough for real-time use and transmit additional metadata to convey non-verbal cues in the reconstructed video.
Despite these limitations relative to current computational capabilities, Txt2Vid demonstrates the scope for reimagining video conferencing and other multimedia delivery, hinting at a broader paradigm shift powered by advances in generative AI.