Multi-modal News Understanding with Professionally Labelled Videos (ReutersViLNews) (2401.12419v1)
Abstract: While progress has been made in video-language understanding, current state-of-the-art algorithms remain limited in their ability to understand videos at high levels of abstraction, such as news-oriented videos. Humans, in contrast, readily amalgamate information from video and language to infer information beyond what is visually observable in the pixels. When watching a news story, for example, the context of the event can play as large a role in understanding the story as the event itself. Towards designing this ability into algorithms, we present a large-scale analysis of an in-house dataset collected by the Reuters News Agency, the Reuters Video-Language News (ReutersViLNews) dataset, which focuses on high-level video-language understanding with an emphasis on long-form news. ReutersViLNews consists of long-form news videos collected and labeled by news industry professionals over several years and contains prominent news reporting from around the world. Each video covers a single story and includes action shots of the event itself, interviews with people associated with the event, footage from nearby areas, and more. The videos span seven subject categories: disaster, finance, entertainment, health, politics, sports, and miscellaneous, and each is annotated at levels ranging from high to low: title caption, visual video description, high-level story description, keywords, and location. We first analyze the dataset statistics of ReutersViLNews in comparison with previous datasets. We then benchmark state-of-the-art approaches on four different video-language tasks. The results suggest that news-oriented videos pose a substantial challenge for current video-language understanding algorithms, and we conclude by suggesting future directions for approaches to solving the ReutersViLNews dataset.
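To make the annotation structure described above concrete, the sketch below shows one plausible way a single ReutersViLNews record could be represented in code. The field names, the `load_records` helper, and the JSON-lines file layout are illustrative assumptions for this summary, not the dataset's published schema.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical record layout: one entry per long-form news video, mirroring
# the annotation types named in the abstract (title caption, visual video
# description, high-level story description, keywords, location, and one of
# the seven subject categories). Field names are assumptions, not the
# dataset's actual schema.
@dataclass
class NewsVideoRecord:
    video_id: str
    category: str              # e.g. "disaster", "finance", ..., "miscellaneous"
    title: str                 # short title caption
    visual_description: str    # low-level description of what is shown on screen
    story_description: str     # high-level summary of the underlying news story
    keywords: List[str]
    location: str


def load_records(path: str) -> List[NewsVideoRecord]:
    """Load annotation records from an assumed JSON-lines file."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            records.append(NewsVideoRecord(**json.loads(line)))
    return records


if __name__ == "__main__":
    # Hypothetical usage: print a few records from an assumed annotation file.
    for rec in load_records("reutersvilnews_annotations.jsonl")[:3]:
        print(rec.video_id, rec.category, rec.title)
```

Such a record-per-video layout would let the four benchmarked video-language tasks (e.g., captioning or retrieval) draw their targets from different annotation fields of the same entry.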