
VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing (2211.16934v2)

Published 30 Nov 2022 in cs.CL, cs.AI, cs.LG, cs.MM, and eess.AS

Abstract: Video dubbing aims to translate the original speech in a film or television program into speech in a target language, which can be achieved with a cascaded system consisting of speech recognition, machine translation, and speech synthesis. To ensure that the translated speech is well aligned with the corresponding video, the length/duration of the translated speech should be as close as possible to that of the original speech, which requires strict length control. Previous works usually control the number of words or characters generated by the machine translation model to be similar to the source sentence, without considering the isochronicity of speech, as the speech duration of words/characters varies across languages. In this paper, we propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation in order to match the length of the source and target speech. Specifically, we control the speech length of the generated sentence by guiding the prediction of each word with duration information, including the speech duration of the word itself as well as how much duration is left for the remaining words. We design experiments on four language directions (German -> English, Spanish -> English, Chinese <-> English), and the results show that the proposed method achieves better length control on the generated speech than baseline methods. To make up for the lack of real-world datasets, we also construct a real-world test set collected from films to provide a comprehensive evaluation of the video dubbing task.
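The core idea in the abstract — guiding each prediction step with the token's own speech duration and the duration budget left for the rest of the sentence — can be illustrated with a toy sketch. This is not the authors' implementation; the vocabulary, per-token durations, and penalty weight below are all hypothetical stand-ins (in a real system, durations would come from a TTS duration predictor):

```python
# Hypothetical per-token speech durations in seconds (a real system would
# obtain these from a TTS duration predictor, e.g. FastSpeech 2's).
TOKEN_DURATION = {"the": 0.15, "big": 0.25, "dog": 0.30, "barked": 0.45,
                  "loudly": 0.50, "<eos>": 0.0}

def rescore(base_scores, remaining):
    """Bias model scores so tokens that fit the remaining duration budget
    are preferred; tokens that overshoot the budget are softly penalized."""
    rescored = {}
    for tok, score in base_scores.items():
        overshoot = max(0.0, TOKEN_DURATION[tok] - remaining)
        rescored[tok] = score - 10.0 * overshoot  # 10.0 is an arbitrary weight
    return rescored

def duration_controlled_decode(step_scores, budget):
    """Greedy decoding: at each step, pick the best duration-rescored token
    and subtract its speech duration from the remaining budget."""
    remaining, output = budget, []
    for base_scores in step_scores:  # one dict of model scores per step
        scores = rescore(base_scores, remaining)
        tok = max(scores, key=scores.get)
        if tok == "<eos>":
            break
        output.append(tok)
        remaining -= TOKEN_DURATION[tok]
    return output, remaining
```

With a 1.0-second budget, a long token whose duration exceeds the remaining budget gets penalized, so the decoder prefers to stop early — mirroring the paper's goal of keeping the target speech duration close to the source's.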

References (27)
  1. MuST-C: A multilingual corpus for end-to-end speech translation. Computer Speech & Language, 66: 101155.
  2. Duration Modeling of Neural TTS for Automatic Dubbing. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8037–8041. IEEE.
  3. From Speech-to-Speech Translation to Automatic Dubbing. In Proceedings of the 17th International Conference on Spoken Language Translation, 257–264. Online: Association for Computational Linguistics.
  4. Evaluating and Optimizing Prosodic Alignment for Automatic Dubbing. In INTERSPEECH, 1481–1485.
  5. CVSS Corpus and Massively Multilingual Speech-to-Speech Translation. CoRR, abs/2201.03713.
  6. Kruspe, A. M. 2015. Training Phoneme Models for Singing with "Songified" Speech Data. In ISMIR, 336–342.
  7. Machine translation verbosity control for automatic dubbing. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7538–7542. IEEE.
  8. Controlling the Output Length of Neural Machine Translation. In IWSLT. Association for Computational Linguistics.
  9. Isometric MT: Neural Machine Translation for Automatic Dubbing. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6242–6246. IEEE.
  10. Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi. In INTERSPEECH, 498–502. ISCA.
  11. Prosodic Phrase Alignment for Machine Dubbing. In Kubin, G.; and Kacic, Z., eds., Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, 4215–4219. ISCA.
  12. Bleu: a Method for Automatic Evaluation of Machine Translation. In ACL, 311–318. ACL.
  13. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. In ICLR. OpenReview.net.
  14. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
  15. Intra-Sentential Speaking Rate Control in Neural Text-To-Speech for Automatic Dubbing. In Interspeech, 3151–3155.
  16. Positional Encoding to Control Output Sequence Length. In NAACL-HLT (1), 3999–4004. Association for Computational Linguistics.
  17. Isochrony-Aware Neural Machine Translation for Automatic Dubbing. In Ko, H.; and Hansen, J. H. L., eds., Interspeech 2022, 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 18-22 September 2022, 1776–1780. ISCA.
  18. NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality. arXiv preprint arXiv:2205.04421.
  19. A survey on neural speech synthesis. arXiv preprint arXiv:2106.15561.
  20. Attention is All you Need. In NIPS, 5998–6008.
  21. Attention is All you Need. In Guyon, I.; von Luxburg, U.; Bengio, S.; Wallach, H. M.; Fergus, R.; Vishwanathan, S. V. N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 5998–6008.
  22. Improvements to prosodic alignment for automatic dubbing. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7543–7574. IEEE.
  23. Prosodic Alignment for off-screen automatic dubbing. arXiv preprint arXiv:2204.02530.
  24. Wahlster, W. 2013. Verbmobil: foundations of speech-to-speech translation. Springer Science & Business Media.
  25. CoVoST 2 and Massively Multilingual Speech Translation. In Hermansky, H.; Cernocký, H.; Burget, L.; Lamel, L.; Scharenborg, O.; and Motlícek, P., eds., Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, 2247–2251. ISCA.
  26. AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios. In Ko, H.; and Hansen, J. H. L., eds., Interspeech 2022, 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 18-22 September 2022, 2568–2572. ISCA.
  27. Automatic speech recognition, volume 1. Springer.
Authors (10)
  1. Yihan Wu (44 papers)
  2. Junliang Guo (39 papers)
  3. Xu Tan (164 papers)
  4. Chen Zhang (403 papers)
  5. Bohan Li (88 papers)
  6. Ruihua Song (48 papers)
  7. Lei He (121 papers)
  8. Sheng Zhao (75 papers)
  9. Arul Menezes (15 papers)
  10. Jiang Bian (229 papers)
Citations (14)
