ANIM-400K: A Large-Scale Dataset for Automated End-To-End Dubbing of Video

Published 10 Jan 2024 in eess.AS, cs.CL, cs.CV, and cs.SD | (2401.05314v1)

Abstract: The Internet's wealth of content, with up to 60% published in English, starkly contrasts the global population, where only 18.8% are English speakers, and just 5.1% consider it their native language, leading to disparities in online information access. Unfortunately, automated processes for dubbing of video - replacing the audio track of a video with a translated alternative - remains a complex and challenging task due to pipelines, necessitating precise timing, facial movement synchronization, and prosody matching. While end-to-end dubbing offers a solution, data scarcity continues to impede the progress of both end-to-end and pipeline-based methods. In this work, we introduce Anim-400K, a comprehensive dataset of over 425K aligned animated video segments in Japanese and English supporting various video-related tasks, including automated dubbing, simultaneous translation, guided video summarization, and genre/theme/style classification. Our dataset is made publicly available for research purposes at https://github.com/davidmchan/Anim400K.

Summary

  • The paper presents a large-scale dataset with over 425K aligned video segments that addresses the data scarcity in automated dubbing research.
  • It utilizes ASR tools and speaker diarization to accurately align multimodal elements, enhancing synchronization in Japanese and English videos.
  • The dataset supports diverse tasks including video summarization, character identification, and genre classification, fostering advancements in multimedia processing.

Anim-400K: A Dataset for Automated Dubbing and Beyond

The paper "ANIM-400K: A Large-Scale Dataset for Automated End-To-End Dubbing of Video" contributes Anim-400K, a dataset for multimedia translation and video processing. It targets the main bottleneck in automated dubbing research: the scarcity of data needed to develop robust and nuanced dubbing systems.

Automated dubbing of video content, especially end to end, requires synchronizing translated audio with the timing, facial movements, and prosody of the original content. Progress has been hindered by a lack of large, aligned datasets suitable for training and evaluating deep learning models. Anim-400K addresses this with over 425,000 aligned video segments in Japanese and English, making it significantly larger than existing datasets.

Overview of the Anim-400K Dataset

Anim-400K sets itself apart by providing an unprecedented volume of data compared to previous datasets such as the Heroes corpus and IWSLT test sets. The dataset's scale makes it a powerful tool for developing end-to-end dubbing systems capable of capturing nuances in speaker performance and synchronizing multimodal elements in video content more effectively. The dataset includes not only the aligned audio clips but also metadata that supports numerous secondary tasks, enhancing its utility across different research areas.

Data Collection and Annotation

The dataset was compiled by scraping publicly accessible dubbed anime videos from online platforms, capturing high-quality audio and video tracks in both English and Japanese. Moreover, extensive metadata accompanying the episodes and characters enriches the dataset, facilitating research in character identification, genre classification, and video summarization.

Audio is clipped with a top-down approach, which keeps segments broadly aligned with the video while tolerating minor noise. ASR tools and speaker diarization then supply speaker identification, which is essential for handling multi-speaker scenes in dubbing.
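The clipping and speaker-labeling steps can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that both language tracks share the video's timeline, that an ASR tool has already produced timed segments, and that a diarization tool has already produced speaker turns. The names `Segment`, `Turn`, `clip`, `dominant_speaker`, and `build_pairs` are hypothetical, and audio is represented as a plain list of samples to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class Segment:
    """One ASR segment (hypothetical structure): times in seconds."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """One diarization turn (hypothetical structure): times in seconds."""
    start: float
    end: float
    speaker: str


def clip(samples: Sequence[float], start: float, end: float, sample_rate: int) -> Sequence[float]:
    """Cut the [start, end) window out of a track, clamped to the track length."""
    lo = max(0, int(round(start * sample_rate)))
    hi = min(len(samples), int(round(end * sample_rate)))
    return samples[lo:hi]


def dominant_speaker(seg: Segment, turns: List[Turn]) -> Optional[str]:
    """Return the speaker whose turn overlaps the segment the most, if any."""
    best, best_overlap = None, 0.0
    for turn in turns:
        overlap = min(seg.end, turn.end) - max(seg.start, turn.start)
        if overlap > best_overlap:
            best, best_overlap = turn.speaker, overlap
    return best


def build_pairs(
    segments: List[Segment],
    turns: List[Turn],
    en_audio: Sequence[float],
    ja_audio: Sequence[float],
    sample_rate: int,
) -> List[Dict[str, object]]:
    """Clip the same time window from both tracks (assumes a shared timeline)."""
    pairs = []
    for seg in segments:
        pairs.append({
            "text": seg.text,
            "speaker": dominant_speaker(seg, turns),
            "en": clip(en_audio, seg.start, seg.end, sample_rate),
            "ja": clip(ja_audio, seg.start, seg.end, sample_rate),
        })
    return pairs
```

Cutting both tracks at the same timestamps trades exact word-level alignment for simplicity: a clip may include a little speech from a neighboring line, which is one plausible reading of the "minor noise" mentioned above.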

Supported Tasks and Implications

Beyond its primary application in automated dubbing, Anim-400K supports various secondary research tasks. The metadata provides a foundation for:

  1. Video Summarization: Human-generated episode summaries assist in evaluating automated video summarization models.
  2. Character Identification: Detailed character metadata and imagery aid research in visual analysis and character recognition.
  3. Genre and Theme Classification: Genre and theme labels allow for genre-based research and recommendation system advancements.
  4. Video Quality Analysis: Collected user ratings at both show and episode levels provide a basis for exploring video quality assessment metrics.
  5. Simultaneous Translation: The dataset acts as a resource for simultaneous translation tasks, especially beneficial for Japanese to English translation research.
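As an illustration of how the genre and theme labels could feed a classification model, the sketch below turns per-show label lists into multi-hot target vectors. The label values and the helper names `build_vocab` and `multi_hot` are hypothetical and do not reflect the dataset's actual schema.

```python
from typing import Dict, List


def build_vocab(label_lists: List[List[str]]) -> Dict[str, int]:
    """Map every distinct label to a stable index (alphabetical order)."""
    labels = sorted({label for labels in label_lists for label in labels})
    return {label: i for i, label in enumerate(labels)}


def multi_hot(labels: List[str], vocab: Dict[str, int]) -> List[int]:
    """Encode one show's labels as a 0/1 vector; unknown labels are ignored."""
    vec = [0] * len(vocab)
    for label in labels:
        if label in vocab:
            vec[vocab[label]] = 1
    return vec


# Hypothetical genre labels for three shows.
shows = [["action", "comedy"], ["drama"], ["comedy", "drama"]]
vocab = build_vocab(shows)
targets = [multi_hot(labels, vocab) for labels in shows]
```

A multi-hot target is used here because a show can carry several genres at once, so the task is multi-label rather than single-class.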

Limitations and Ethical Considerations

While Anim-400K presents a remarkable asset to the field, it also raises ethical considerations. The dataset's focus on anime might lead to cultural and genre biases in systems trained solely on this content. Furthermore, the automated systems drawing from this dataset must navigate the complexities of maintaining cultural sensitivity, high translation quality, and ethical compliance with user privacy and copyright laws.

The paper underlines the necessity for ongoing refinement and ethical oversight in the development of dubbing systems to ensure that advancements do not come at the cost of cultural insensitivity or diminished translation fidelity.

Conclusion

In sum, Anim-400K stands as a pivotal dataset for advancing automated dubbing methodologies and supporting diverse multimedia tasks. Its extensive scale, coupled with rich metadata, lays the groundwork for significant progress in video translation and processing technology, provided that ethical and practical challenges are diligently addressed in future research endeavors.
