MultiVENT: Multilingual Videos of Events with Aligned Natural Text (2307.03153v1)

Published 6 Jul 2023 in cs.IR, cs.CV, and cs.MM

Abstract: Everyday news coverage has shifted from traditional broadcasts towards a wide range of presentation formats such as first-hand, unedited video footage. Datasets that reflect the diverse array of multimodal, multilingual news sources available online could be used to teach models to benefit from this shift, but existing news video datasets focus on traditional news broadcasts produced for English-speaking audiences. We address this limitation by constructing MultiVENT, a dataset of multilingual, event-centric videos grounded in text documents across five target languages. MultiVENT includes both news broadcast videos and non-professional event footage, which we use to analyze the state of online news videos and how they can be leveraged to build robust, factually accurate models. Finally, we provide a model for complex, multilingual video retrieval to serve as a baseline for information retrieval using MultiVENT.
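The abstract does not detail the baseline retrieval model, but multilingual video retrieval of this kind is typically framed as ranking videos by the similarity of their embeddings to a text-query embedding in a shared cross-modal space (as in CLIP-style models). The sketch below is a hypothetical illustration of that ranking step, not the paper's actual method; the `retrieve` function and the toy embeddings are invented for the example.

```python
import numpy as np

def retrieve(query_emb, video_embs, k=3):
    """Rank videos by cosine similarity to a text query embedding.

    query_emb: (d,) embedding of the query text.
    video_embs: (n, d) embeddings of the candidate videos.
    Returns indices of the top-k videos, best match first.
    """
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    scores = v @ q  # cosine similarity of each video to the query
    return np.argsort(-scores)[:k]

# Toy example: four videos in a 3-dimensional embedding space.
videos = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.7, 0.7, 0.0],
                   [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.2, 0.0])
print(retrieve(query, videos, k=2))  # video 0 ranks first
```

In a real system the embeddings would come from pretrained text and video encoders, and retrieval over large collections would use an approximate-nearest-neighbor index rather than a full matrix product.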
