
Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis (2411.17690v1)

Published 26 Nov 2024 in cs.MM, cs.CV, cs.SD, and eess.AS

Abstract: In this paper, we propose a new task -- generating speech from videos of people and their transcripts (VTTS) -- to motivate new techniques for multimodal speech generation. This task generalizes the task of generating speech from cropped lip videos, and is also more complicated than the task of generating generic audio clips (e.g., dog barking) from videos and text. Multilingual versions of the task could lead to new techniques for cross-lingual dubbing. We also present a decoder-only multimodal model for this task, which we call Visatronic. This model embeds vision, text and speech directly into the common subspace of a transformer model and uses an autoregressive loss to learn a generative model of discretized mel-spectrograms conditioned on speaker videos and transcripts of their speech. By embedding all modalities into a common subspace, Visatronic can achieve improved results over models that use only text or video as input. Further, it presents a much simpler approach for multimodal speech generation compared to prevailing approaches which rely on lip-detectors and complicated architectures to fuse modalities while producing better results. Since the model is flexible enough to accommodate different ways of ordering inputs as a sequence, we carefully explore different strategies to better understand the best way to propagate information to the generative steps. To facilitate further research on VTTS, we will release (i) our code, (ii) clean transcriptions for the large-scale VoxCeleb2 dataset, and (iii) a standardized evaluation protocol for VTTS incorporating both objective and subjective metrics.

Authors (6)
  1. Akshita Gupta (14 papers)
  2. Tatiana Likhomanenko (41 papers)
  3. Karren Dai Yang (3 papers)
  4. Richard He Bai (2 papers)
  5. Zakaria Aldeneh (20 papers)
  6. Navdeep Jaitly (67 papers)

Summary

A Technical Overview of "Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis"

The paper, "Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis," proposes a method for synthesizing speech from videos of speakers together with transcripts of their speech. The authors frame this as a new task, video-text-to-speech (VTTS), which requires fusing multimodal inputs to produce natural, time-aligned speech. The proposed framework, Visatronic, embeds video, text, and speech in the common representation space of a decoder-only transformer and generates discretized mel-spectrograms autoregressively. This work advances the understanding of multimodal interactions and points to promising directions for future research.

Task Definition and Motivation

The primary task addressed by the authors is generating speech from video of a speaker together with the corresponding textual transcript (VTTS). This task extends beyond traditional text-to-speech (TTS) by incorporating lip movements and other visual cues directly into the speech generation process, eliminating the need for separate lip-detection pipelines. The formulation is general and can potentially extend to multilingual and cross-lingual settings, such as dubbing videos into other languages, thereby broadening the toolset for automatic speech processing.

Model Architecture

Visatronic employs a decoder-only multimodal transformer that integrates video, text, and speech within a single learning framework. The approach leverages the temporal synchrony of these inputs to generate discretized mel-spectrograms, which serve as the intermediate representation from which speech is synthesized.
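
To make the discretized mel-spectrogram representation concrete, the following minimal sketch clips log-mel filterbank values and quantizes each channel into a fixed number of uniform bins, which is the general idea behind dMel-style discretization; the bin count, value range, and function names here are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def discretize_mel(log_mel, num_bins=16, min_val=-10.0, max_val=2.0):
    """Quantize a log-mel spectrogram (frames x channels) into integer bins.

    Illustrative sketch only: the value range and number of bins are
    assumptions and may differ from the paper's dMel configuration.
    """
    clipped = np.clip(log_mel, min_val, max_val)
    scaled = (clipped - min_val) / (max_val - min_val)           # map to [0, 1]
    return np.minimum((scaled * num_bins).astype(int), num_bins - 1)

def undiscretize_mel(tokens, num_bins=16, min_val=-10.0, max_val=2.0):
    """Map integer bins back to approximate log-mel values (bin centers)."""
    centers = (tokens + 0.5) / num_bins
    return centers * (max_val - min_val) + min_val
```

Because each mel channel is quantized independently, a frame becomes a small vector of integer codes, which is what lets the transformer treat speech generation as next-token prediction with a cross-entropy objective.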

Key methodological elements include:

  • Unified Multimodal Embedding: Each input modality (video, text, and speech) is mapped into a shared embedding space. Video frames are tokenized with a VQ-VAE encoder, text is tokenized at the character level, and speech is represented as mel-spectrograms whose values are discretized into bins using dMel.
  • Autoregressive Learning: The model learns the distribution of the discretized mel-spectrograms conditioned on video and text through an autoregressive process, optimizing a cross-entropy loss on the discretized speech representations (a minimal sketch follows this list).
  • Input Ordering Strategies: The authors experiment with different ways of ordering and temporally aligning the video, text, and speech streams within the input sequence to determine how best to propagate information to the generative steps.
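
As referenced in the list above, here is a minimal PyTorch-style sketch of the unified-embedding and autoregressive ideas: video codes, character tokens, and discretized speech tokens are embedded into one model dimension, concatenated into a single causally masked sequence, and the speech positions are trained with a cross-entropy loss. The module names, vocabulary sizes, single-token-per-frame speech simplification, and the [video, text, speech] ordering are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalDecoderSketch(nn.Module):
    """Decoder-only sketch: embed video, text, and speech tokens into a shared
    space and model the discretized speech autoregressively (illustrative)."""

    def __init__(self, d_model=512, video_vocab=1024, text_vocab=256,
                 speech_vocab=1024, n_layers=6, n_heads=8):
        super().__init__()
        self.video_emb = nn.Embedding(video_vocab, d_model)    # VQ-VAE codes
        self.text_emb = nn.Embedding(text_vocab, d_model)      # characters
        self.speech_emb = nn.Embedding(speech_vocab, d_model)  # discretized speech
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # causal mask applied below
        self.speech_head = nn.Linear(d_model, speech_vocab)

    def forward(self, video_tokens, text_tokens, speech_tokens):
        # One possible ordering: [video, text, speech]; the paper explores several.
        x = torch.cat([self.video_emb(video_tokens),
                       self.text_emb(text_tokens),
                       self.speech_emb(speech_tokens)], dim=1)
        seq_len = x.size(1)
        # Additive causal mask: -inf above the diagonal blocks future positions.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.backbone(x, mask=causal)
        # Each speech token is predicted from the position directly before it.
        n_speech = speech_tokens.size(1)
        logits = self.speech_head(h[:, -n_speech - 1:-1, :])
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               speech_tokens.reshape(-1))
```

Note that dMel discretizes every mel channel separately, so a faithful implementation would model a vector of codes per frame; the single speech vocabulary here is a simplification to keep the sketch short.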

Experimental Results

The Visatronic model was benchmarked on LRS3 and VoxCeleb2 datasets using both subjective and objective evaluation metrics:

  • Word Error Rate (WER): In terms of transcription accuracy, Visatronic surpasses traditional TTS models and contemporary video-to-speech systems that rely solely on visual input by a notable margin (a WER sketch follows this list).
  • TimeSync: A novel metric used to measure temporal alignment between generated and reference speech, demonstrating that the inclusion of video inputs leads to improved synchrony.
  • Human Evaluation: Subjective tests of intelligibility, naturalness, and synchronization confirm that Visatronic produces fluid, coherent speech, surpassing a TTS baseline on these perceptual measures.
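
For reference on the WER metric mentioned in the list above, the sketch below computes word error rate as the word-level Levenshtein distance between a reference transcript and a hypothesis (for example, an ASR transcription of generated speech), normalized by the reference length; the example strings are invented for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference
    words, computed with a dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substituted word in a four-word reference gives 25% WER.
print(word_error_rate("the cat sat down", "the cat sat sown"))  # 0.25
```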

Implications and Future Directions

Visatronic's use of a unified multimodal space for speech synthesis offers insight into how different sensory modalities can be integrated to improve generative models. The model's flexible architecture and ability to handle multiple input types pave the way for enhancements in application areas such as automated dubbing and assistive technologies for the hearing impaired.

Furthermore, the authors highlight potential for extending this research into multilingual domains, which could lead to breakthroughs in real-time cross-lingual communication tools. The release of clean transcriptions for VoxCeleb2 and the standardized VTTS evaluation protocol are expected to drive further advancements and refinements in this burgeoning field.

By merging audiovisual and textual information in a single generative framework, this paper makes a significant contribution to speech processing and multimodal artificial intelligence, setting a precedent for future research.
