Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
119 tokens/sec
GPT-4o
56 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech (2302.13700v1)

Published 27 Feb 2023 in cs.LG, cs.CV, cs.SD, and eess.AS

Abstract: The goal of this work is zero-shot text-to-speech synthesis, with speaking styles and voices learnt from facial characteristics. Inspired by the natural fact that people can imagine the voice of someone when they look at his or her face, we introduce a face-styled diffusion text-to-speech (TTS) model within a unified framework learnt from visible attributes, called Face-TTS. This is the first time that face images are used as a condition to train a TTS model. We jointly train cross-model biometrics and TTS models to preserve speaker identity between face images and generated speech segments. We also propose a speaker feature binding loss to enforce the similarity of the generated and the ground truth speech segments in speaker embedding space. Since the biometric information is extracted directly from the face image, our method does not require extra fine-tuning steps to generate speech from unseen and unheard speakers. We train and evaluate the model on the LRS3 dataset, an in-the-wild audio-visual corpus containing background noise and diverse speaking styles. The project page is https://facetts.github.io.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (3)
  1. Jiyoung Lee (42 papers)
  2. Joon Son Chung (106 papers)
  3. Soo-Whan Chung (21 papers)
Citations (19)

Summary

We haven't generated a summary for this paper yet.

Github Logo Streamline Icon: https://streamlinehq.com
X Twitter Logo Streamline Icon: https://streamlinehq.com

Tweets