StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing (2402.12636v3)
Abstract: Given a script, the challenge in Movie Dubbing (Visual Voice Cloning, V2C) is to generate speech that aligns well with the video in both time and emotion, based on the tone of a reference audio track. Existing state-of-the-art V2C models break the phonemes in the script according to the divisions between video frames, which solves the temporal alignment problem but leads to incomplete phoneme pronunciation and poor identity stability. To address this problem, we propose StyleDubber, which switches dubbing learning from the frame level to the phoneme level. It contains three main components: (1) a multimodal style adaptor operating at the phoneme level to learn pronunciation style from the reference audio and generate intermediate representations informed by the facial emotion presented in the video; (2) an utterance-level style learning module, which guides both the mel-spectrogram decoding and the refining processes from the intermediate embeddings to improve the overall style expression; and (3) a phoneme-guided lip aligner to maintain lip sync. Extensive experiments on two primary benchmarks, V2C and GRID, demonstrate the favorable performance of the proposed method compared to the current state-of-the-art. The code will be made available at https://github.com/GalaxyCong/StyleDubber.
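The three-component design described in the abstract can be pictured as a single forward pass: phoneme-level style adaptation, phoneme-guided lip alignment with duration-based length regulation, and utterance-conditioned mel decoding. Below is a minimal, self-contained PyTorch sketch of that flow; every module name, dimension, and architectural choice (cross-attention, a GRU decoder) is an illustrative assumption, not the authors' implementation — see the linked repository for the real code.

```python
import torch
import torch.nn as nn

class MultimodalStyleAdaptor(nn.Module):
    """Component 1 (hypothetical layout): phoneme embeddings attend to
    reference-audio frames for pronunciation style and to facial-emotion
    features for emotional colouring, at the phoneme level."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.face_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, phonemes, ref_audio, face_emotion):
        style, _ = self.audio_attn(phonemes, ref_audio, ref_audio)
        emotion, _ = self.face_attn(phonemes, face_emotion, face_emotion)
        return self.norm(phonemes + style + emotion)

class StyleDubberSketch(nn.Module):
    """End-to-end flow: adaptor -> phoneme-guided lip alignment ->
    duration-based length regulation -> utterance-conditioned mel decoding.
    Batch size 1 only, to keep the length regulator trivial."""
    def __init__(self, n_phonemes=80, d_model=256, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.adaptor = MultimodalStyleAdaptor(d_model)
        # Component 3: phonemes attend to lip-motion features to stay in sync.
        self.lip_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.duration = nn.Linear(d_model, 1)   # predicted frames per phoneme
        # Component 2: a pooled utterance-level style vector steers decoding.
        self.utt_proj = nn.Linear(d_model, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.mel_out = nn.Linear(d_model, n_mels)

    def forward(self, phoneme_ids, ref_audio, face_emotion, lip_motion):
        x = self.adaptor(self.phoneme_emb(phoneme_ids), ref_audio, face_emotion)
        lip_ctx, _ = self.lip_attn(x, lip_motion, lip_motion)
        x = x + lip_ctx
        # Length regulation: expand each phoneme by its predicted duration,
        # so temporal alignment is enforced per phoneme, not per video frame.
        dur = torch.clamp(self.duration(x).squeeze(-1).round().long(), min=1)
        frames = torch.repeat_interleave(x[0], dur[0], dim=0).unsqueeze(0)
        utt_style = self.utt_proj(ref_audio.mean(dim=1, keepdim=True))
        out, _ = self.decoder(frames + utt_style)
        return self.mel_out(out)        # (1, T_mel, n_mels) mel-spectrogram

# Toy shapes: 12 phonemes, 50 reference-audio frames, 40 video frames.
model = StyleDubberSketch()
mel = model(torch.randint(0, 80, (1, 12)),   # phoneme ids
            torch.randn(1, 50, 256),         # reference-audio features
            torch.randn(1, 40, 256),         # facial-emotion features
            torch.randn(1, 40, 256))         # lip-motion features
print(mel.shape)
```

The point the sketch makes concrete is the abstract's central claim: style and timing are resolved on the phoneme axis (cross-attention and duration prediction both operate per phoneme), while a single pooled utterance-level vector conditions every decoded frame to keep the overall style expression coherent.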
Authors: Gaoxiang Cong, Yuankai Qi, Liang Li, Amin Beheshti, Zhedong Zhang, Anton van den Hengel, Ming-Hsuan Yang, Chenggang Yan, Qingming Huang