
Multi-modal News Understanding with Professionally Labelled Videos (ReutersViLNews) (2401.12419v1)

Published 23 Jan 2024 in cs.CV

Abstract: While progress has been made in video-language understanding, current state-of-the-art algorithms remain limited in their ability to understand videos at high levels of abstraction, such as news-oriented videos. Humans, in contrast, easily amalgamate information from video and language to infer information beyond what is visually observable in the pixels. Watching a news story is one example: the context of the event can play as large a role in understanding the story as the event itself. Toward designing this ability into algorithms, we present a large-scale analysis of an in-house dataset collected by the Reuters News Agency, the Reuters Video-Language News (ReutersViLNews) dataset, which focuses on high-level video-language understanding with an emphasis on long-form news. ReutersViLNews consists of long-form news videos collected and labelled by news-industry professionals over several years and contains prominent news reporting from around the world. Each video covers a single story and contains action shots of the actual event, interviews with people associated with the event, footage from nearby areas, and more. The dataset spans seven subject categories (disaster, finance, entertainment, health, politics, sports, and miscellaneous), with annotations ranging from high level to low level: title caption, visual video description, high-level story description, keywords, and location. We first analyze the dataset statistics of ReutersViLNews in comparison to previous datasets, then benchmark state-of-the-art approaches on four video-language tasks. The results suggest that news-oriented videos pose a substantial challenge for current video-language understanding algorithms, and we conclude with future directions for designing approaches to solve ReutersViLNews.
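The annotation structure described in the abstract (seven subject categories; title caption, visual description, story description, keywords, and location per video) can be sketched as a simple record type. This is a minimal illustrative sketch, not the dataset's actual schema: the class and field names are assumptions made for clarity.

```python
from dataclasses import dataclass, field

# The seven subject categories listed in the abstract.
CATEGORIES = {
    "disaster", "finance", "entertainment", "health",
    "politics", "sports", "miscellaneous",
}

@dataclass
class NewsVideoAnnotation:
    """One labelled video record (field names are illustrative,
    not the actual ReutersViLNews schema)."""
    video_id: str
    category: str                 # one of CATEGORIES
    title_caption: str            # short headline-style caption
    visual_description: str       # low-level description of what is shown
    story_description: str        # high-level summary of the news story
    keywords: list[str] = field(default_factory=list)
    location: str = ""

    def __post_init__(self) -> None:
        # Reject categories outside the seven listed in the paper.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
```

Such a record makes the high-level/low-level split explicit: `story_description` requires inference beyond the pixels, while `visual_description` stays close to what is directly observable.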
