ANIM-400K: A Large-Scale Dataset for Automated End-To-End Dubbing of Video
Abstract: The Internet's wealth of content, with up to 60% published in English, contrasts starkly with the global population, where only 18.8% are English speakers and just 5.1% consider it their native language, leading to disparities in online information access. Unfortunately, automated dubbing of video (replacing the audio track of a video with a translated alternative) remains a complex and challenging task, because dubbing pipelines require precise timing, facial movement synchronization, and prosody matching. While end-to-end dubbing offers a solution, data scarcity continues to impede the progress of both end-to-end and pipeline-based methods. In this work, we introduce Anim-400K, a comprehensive dataset of over 425K aligned animated video segments in Japanese and English that supports various video-related tasks, including automated dubbing, simultaneous translation, guided video summarization, and genre/theme/style classification. Our dataset is made publicly available for research purposes at https://github.com/davidmchan/Anim400K.