
VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over (2110.03342v3)

Published 7 Oct 2021 in eess.AS and cs.CL

Abstract: In this paper, we formulate a novel task to synthesize speech in sync with a silent pre-recorded video, denoted as automatic voice over (AVO). Unlike traditional speech synthesis, AVO seeks to generate not only human-sounding speech, but also perfect lip-speech synchronization. A natural solution to AVO is to condition the speech rendering on the temporal progression of the lip sequence in the video. We propose a novel text-to-speech model conditioned on visual input, named VisualTTS, for accurate lip-speech synchronization. VisualTTS adopts two novel mechanisms: 1) textual-visual attention and 2) a visual fusion strategy during acoustic decoding, both of which contribute to forming an accurate alignment between the input text content and the lip motion in the input lip sequence. Experimental results show that VisualTTS achieves accurate lip-speech synchronization and outperforms all baseline systems.

Authors (5)
  1. Junchen Lu
  2. Berrak Sisman
  3. Rui Liu
  4. Mingyang Zhang
  5. Haizhou Li
Citations (14)
