BigVideo: A Large-scale Video Subtitle Translation Dataset for Multimodal Machine Translation (2305.18326v3)

Published 23 May 2023 in cs.CV and cs.AI

Abstract: We present BigVideo, a large-scale video subtitle translation dataset, to facilitate the study of multimodal machine translation. Compared with the widely used How2 and VaTeX datasets, BigVideo is more than 10 times larger, consisting of 4.5 million sentence pairs and 9,981 hours of video. We also introduce two deliberately designed test sets to verify the necessity of visual information: Ambiguous, which contains ambiguous words, and Unambiguous, in which the text context alone is sufficient for translation. To better model the common semantics shared across texts and videos, we introduce a contrastive learning method in the cross-modal encoder. Extensive experiments on BigVideo show that: a) visual information consistently improves the NMT model in terms of BLEU, BLEURT, and COMET on both the Ambiguous and Unambiguous test sets; b) visual information helps disambiguation compared with a strong text-only baseline, as measured by terminology-targeted scores and human evaluation. The dataset and our implementations are available at https://github.com/DeepLearnXMU/BigVideo-VMT.
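
The contrastive objective mentioned in the abstract aligns each subtitle with its paired video clip in the shared cross-modal encoder space. Below is a minimal, illustrative sketch of an InfoNCE-style cross-modal contrastive loss in PyTorch; the function name, feature shapes, and temperature value are assumptions for illustration only, not the authors' exact implementation (see the linked repository for that).

```python
# Minimal sketch of an InfoNCE-style cross-modal contrastive loss.
# Illustrative only: variable names, dimensions, and the temperature value
# are assumptions, not the paper's exact implementation.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(text_feats: torch.Tensor,
                                 video_feats: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Pull each sentence embedding toward its paired video-clip embedding
    and push it away from the other clips in the batch, and vice versa."""
    # L2-normalize so the dot product is a cosine similarity.
    text_feats = F.normalize(text_feats, dim=-1)
    video_feats = F.normalize(video_feats, dim=-1)

    # Pairwise similarities: entry (i, j) compares sentence i with clip j.
    logits = text_feats @ video_feats.t() / temperature

    # The matching pair for row i sits in column i.
    targets = torch.arange(text_feats.size(0), device=text_feats.device)

    # Symmetric loss over the text-to-video and video-to-text directions.
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2v + loss_v2t) / 2

if __name__ == "__main__":
    # Random features standing in for text and video encoder outputs.
    text = torch.randn(8, 512)
    video = torch.randn(8, 512)
    print(cross_modal_contrastive_loss(text, video).item())
```

The symmetric two-direction formulation follows the common practice in n-pair and InfoNCE objectives cited in the paper's references; the actual model applies the loss inside its cross-modal encoder rather than on raw encoder outputs as sketched here.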

References (50)
  1. On the evaluation of machine translation for terminology consistency. arXiv preprint arXiv:2106.11891.
  2. Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics.
  3. Cross-lingual visual pre-training for multimodal machine translation. arXiv preprint arXiv:2101.10044.
  4. Probing the need for visual context in multimodal machine translation. In Proc. of NAACL.
  5. João Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the kinetics dataset. In Proc. of CVPR.
  6. A simple framework for contrastive learning of visual representations. In Proc. of ICML.
  7. XNLI: Evaluating cross-lingual sentence representations. In Proc. of EMNLP.
  8. Mixed precision training of convolutional neural networks using integer operations. In Proc. of ICLR.
  9. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. of ICLR.
  10. Multi30K: Multilingual English-German image descriptions. In Proc. of the 5th Workshop on Vision and Language.
  11. Desmond Elliott and Ákos Kádár. 2017. Imagination improves multimodal translation. In Proc. of IJCNLP.
  12. SlowFast networks for video recognition. In Proc. of ICCV.
  13. Cross-lingual visual verb sense disambiguation. In Proc. of ACL.
  14. Video-guided machine translation with spatial hierarchical attention network. In Proc. of ACL: Student Research Workshop.
  15. Deep residual learning for image recognition. In Proc. of CVPR.
  16. Keyframe segmentation and positional encoding for video-guided machine translation challenge 2020. arXiv preprint arXiv:2006.12799.
  17. Multimodal pivots for image caption translation. In Proc. of ACL.
  18. Exploring better text image translation with multimodal codebook. In Proc. of ACL.
  19. On vision features in multimodal machine translation. In Proc. of ACL.
  20. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In Proc. of AAAI.
  21. Vision matters when it should: Sanity checking multimodal machine translation models. In Proc. of EMNLP.
  22. VISA: An ambiguous subtitles dataset for visual scene-aware machine translation. In Proc. of LREC.
  23. Dynamic context-guided capsule network for multimodal machine translation. In Proc. of ACMMM.
  24. Learning to localize actions from moments. In Proc. of ECCV.
  25. Is the elephant flying? Resolving ambiguities in text-to-image generative models. arXiv preprint arXiv:2211.12503.
  26. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In Proc. of ICCV.
  27. Revisiting round-trip translation for quality estimation. In Proc. of EACL.
  28. fairseq: A fast, extensible toolkit for sequence modeling. In Proc. of NAACL.
  29. BLEU: a method for automatic evaluation of machine translation. In Proc. of ACL.
  30. Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proc. of WMT.
  31. ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966.
  32. COMET: A neural framework for MT evaluation. In Proc. of EMNLP.
  33. Faster R-CNN: towards real-time object detection with region proposal networks. In Proc. of NIPS.
  34. How2: A large-scale dataset for multimodal language understanding. arXiv preprint arXiv:1811.00347.
  35. BLEURT: Learning robust metrics for text generation. In Proc. of ACL.
  36. A study of translation edit rate with targeted human annotation. In Proc. of AMTA.
  37. Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. In Proc. of NIPS.
  38. Multi-modal neural machine translation with deep semantic interactions. Inf. Sci.
  39. Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proc. of LREC.
  40. Attention is all you need. In Proc. of NIPS.
  41. Feng Wang and Huaping Liu. 2021. Understanding the behaviour of contrastive loss. In Proc. of CVPR.
  42. MultiSubs: A large-scale multimodal and multilingual dataset. In Proc. of LREC.
  43. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proc. of ICCV.
  44. Good for misconceived reasons: An empirical revisiting on the need for visual context in multimodal machine translation. In Proc. of ACL.
  45. XGPT: Cross-modal generative pre-training for image captioning. In Proc. of NLPCC.
  46. XLNet: Generalized autoregressive pretraining for language understanding. In Proc. of NIPS.
  47. Why videos do not guide translations in video-guided machine translation? An empirical evaluation of video-guided machine translation dataset. J. Inf. Process.
  48. Shaowei Yao and Xiaojun Wan. 2020. Multimodal transformer for multimodal machine translation. In Proc. of ACL.
  49. A novel graph-based multi-modal fusion encoder for neural machine translation. In Proc. of ACL.
  50. Beyond triplet: Leveraging the most data for multimodal machine translation. arXiv preprint arXiv:2212.10313.
Authors (9)
  1. Liyan Kang (2 papers)
  2. Luyang Huang (8 papers)
  3. Ningxin Peng (5 papers)
  4. Peihao Zhu (15 papers)
  5. Zewei Sun (15 papers)
  6. Shanbo Cheng (23 papers)
  7. Mingxuan Wang (83 papers)
  8. Degen Huang (8 papers)
  9. Jinsong Su (96 papers)
Citations (7)