TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation (2312.15197v1)
Abstract: Direct speech-to-speech translation achieves high-quality results by introducing discrete units obtained from self-supervised learning, circumventing the delays and cascading errors that come with model cascading. However, talking head translation, which converts audio-visual speech (i.e., talking head video) from one language into another, still faces several challenges that audio-only speech does not: (1) Existing methods invariably rely on cascading, synthesizing via both audio and text, which incurs delays and compounds errors. (2) Talking head translation has a limited set of reference frames; if the generated translation is longer than the original speech, the video sequence must be padded by repeating frames, producing jarring transitions. In this work, we propose **TransFace**, a model for talking head translation that directly translates audio-visual speech into audio-visual speech in another language. It consists of a speech-to-unit translation model that converts audio speech into discrete units, and a unit-based audio-visual speech synthesizer, Unit2Lip, that re-synthesizes synchronized audio-visual speech from those units in parallel. We further introduce a Bounded Duration Predictor that guarantees isometric talking head translation and prevents duplicated reference frames. Experiments show that Unit2Lip significantly improves synchronization (1.601 and 0.982 on LSE-C for the original and generated audio speech, respectively) and boosts inference speed by a factor of 4.35 on LRS2. TransFace also achieves strong BLEU scores of 61.93 and 47.55 for Es-En and Fr-En on LRS3-T, with 100% isochronous translations.
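The "discrete units" the abstract builds on are typically obtained by clustering self-supervised speech features and collapsing repeated cluster IDs. The sketch below illustrates that general pipeline in Python; the torchaudio HuBERT checkpoint, the pre-fitted k-means model, and the deduplication step are assumptions about the common technique, not the paper's exact configuration.

```python
# Illustrative unit-extraction sketch (assumed pipeline, not the paper's exact setup):
# self-supervised features -> k-means cluster IDs -> collapse repeated IDs.
import torch
import torchaudio
from sklearn.cluster import KMeans

def extract_units(wav_path: str, kmeans: KMeans) -> list[int]:
    wav, sr = torchaudio.load(wav_path)
    wav = wav.mean(0, keepdim=True)             # downmix to mono if needed
    # HuBERT-style encoders expect 16 kHz audio.
    wav = torchaudio.functional.resample(wav, sr, 16000)

    bundle = torchaudio.pipelines.HUBERT_BASE   # assumed stand-in encoder
    model = bundle.get_model().eval()
    with torch.no_grad():
        features, _ = model.extract_features(wav)
    frames = features[-1].squeeze(0)            # (T, D) frame-level features

    # Quantize each frame to its nearest centroid; `kmeans` must be fitted
    # beforehand (a few hundred clusters is common in unit-based S2ST work).
    unit_ids = kmeans.predict(frames.numpy())

    # Collapse consecutive duplicates so each unit appears once per run.
    deduped = [int(unit_ids[0])]
    for u in unit_ids[1:]:
        if int(u) != deduped[-1]:
            deduped.append(int(u))
    return deduped
```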
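The Bounded Duration Predictor is described only at a high level here. The sketch below shows one plausible reading of "isometric translation": predicted per-unit durations are rescaled and bounded so the expanded unit sequence exactly fills the source video's frame budget, leaving no gap that would force repeated reference frames. The function name, bounding rule, and rounding repair are all hypothetical.

```python
# Hypothetical "bounded duration" sketch: rescale predicted per-unit durations
# so the synthesized video exactly fills the source frame budget (isometric),
# avoiding the duplicated reference frames the abstract warns about.
import torch

def bounded_durations(log_durs: torch.Tensor, total_frames: int,
                      min_dur: int = 1) -> torch.Tensor:
    durs = torch.exp(log_durs)                     # positive raw durations
    durs = durs * total_frames / durs.sum()        # enforce the frame budget
    durs = durs.clamp(min=min_dur).round().long()  # integer frames per unit

    # Repair rounding drift one frame at a time so the sum is exact.
    diff = total_frames - int(durs.sum())
    while diff != 0:
        idx = int(durs.argmin()) if diff > 0 else int(durs.argmax())
        step = 1 if diff > 0 else -1
        if durs[idx] + step < min_dur:             # every unit already at
            break                                  # min_dur: budget infeasible
        durs[idx] += step
        diff -= step
    return durs
```

Under this reading, the rescaling step is what enforces the isometric constraint, while the per-unit floor keeps every unit visible for at least one frame; the synthesizer can then expand each unit by its duration without running past the available reference frames.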
Authors: Xize Cheng, Rongjie Huang, Linjun Li, Tao Jin, Zehan Wang, Aoxiong Yin, Minglei Li, Xinyu Duan, Zhou Zhao, Changpeng Yang