
VidChapters-7M: Video Chapters at Scale (2309.13952v1)

Published 25 Sep 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Segmenting long videos into chapters enables users to quickly navigate to the information of their interest. This important topic has been understudied due to the lack of publicly released datasets. To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online in a scalable manner by scraping user-annotated chapters and hence without any additional manual annotation. We introduce the following three tasks based on this data. First, the video chapter generation task consists of temporally segmenting the video and generating a chapter title for each segment. To further dissect the problem, we also define two variants of this task: video chapter generation given ground-truth boundaries, which requires generating a chapter title given an annotated video segment, and video chapter grounding, which requires temporally localizing a chapter given its annotated title. We benchmark both simple baselines and state-of-the-art video-language models for these three tasks. We also show that pretraining on VidChapters-7M transfers well to dense video captioning tasks in both zero-shot and finetuning settings, largely improving the state of the art on the YouCook2 and ViTT benchmarks. Finally, our experiments reveal that downstream performance scales well with the size of the pretraining dataset. Our dataset, code, and models are publicly available at https://antoyang.github.io/vidchapters.html.

Authors (5)
  1. Antoine Yang (12 papers)
  2. Arsha Nagrani (62 papers)
  3. Ivan Laptev (99 papers)
  4. Josef Sivic (78 papers)
  5. Cordelia Schmid (206 papers)
Citations (19)

Summary

  • The paper presents VidChapters-7M, a dataset of 817K user-chaptered videos containing over 7M chapters in total, built to support research on video chapter generation.
  • The benchmarked models combine visual and transcribed speech cues to segment videos and generate coherent chapter titles.
  • Pretraining on the dataset boosts dense video captioning performance and opens new research avenues in multimodal learning and video understanding.

An Expert Overview of "VidChapters-7M: Video Chapters at Scale"

The paper "VidChapters-7M: Video Chapters at Scale" introduces a large-scale dataset for segmenting long videos into user-style chapters. The work addresses a growing need for methods that enable efficient navigation and content discovery within lengthy video material, a task that has so far been hampered by the scarcity of publicly available datasets.

Dataset Description

VidChapters-7M is a large-scale dataset comprising 817K user-chaptered videos and over 7 million chapters in total. The data were automatically curated by scraping user-annotated chapters from online videos, without any manual labeling. The dataset spans diverse video categories (including instructional, review, and music compilation videos) and provides multiple modalities, such as speech transcripts alongside the chapter annotations.
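
To make the structure of the data concrete, below is a minimal sketch of what a single annotation record might look like and how it could be parsed in Python. All field names (`video_id`, `chapters`, `asr`) and values are illustrative assumptions, not the dataset's released schema; the project webpage documents the actual format.

```python
# Illustrative sketch only: field names and values are assumptions,
# not the released schema of VidChapters-7M.
example_record = {
    "video_id": "abc123",            # hypothetical video identifier
    "duration": 1260.0,              # video length in seconds
    "chapters": [
        {"start": 0.0,    "end": 95.5,   "title": "Intro"},
        {"start": 95.5,   "end": 410.2,  "title": "Unboxing"},
        {"start": 410.2,  "end": 1260.0, "title": "Final review"},
    ],
    "asr": [                         # transcribed speech segments
        {"start": 1.2, "end": 4.8, "text": "Hey everyone, welcome back."},
    ],
}

def chapter_tuples(record):
    """Return (start_sec, end_sec, title) tuples for one video record."""
    return [(c["start"], c["end"], c["title"]) for c in record["chapters"]]

print(chapter_tuples(example_record))
```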

Key Contributions and Approach

The authors define three tasks on the VidChapters-7M dataset; a schematic sketch of the task interfaces follows the list:

  1. Video Chapter Generation: This task involves temporally segmenting a video and generating a chapter title for each proposed segment, benchmarked with both simple baselines and state-of-the-art video-language models. Notably, it combines multiple modalities, requiring algorithms to synthesize visual, speech, and potentially audio cues into coherent chapter annotations.
  2. Video Chapter Generation with Given Ground-Truth Boundaries: This variant isolates the requirement of generating appropriate titles for video segments, assuming predefined temporal boundaries, thus allowing a focused study of the language modeling aspect.
  3. Video Chapter Grounding: Here, the challenge is to temporally localize a chapter given its annotated title, which necessitates precise navigation through visual and potentially audio-visual information.
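
As referenced above, the following is a schematic Python sketch of the three task interfaces, together with the temporal IoU measure commonly used to score localization tasks such as chapter grounding. Function names and signatures are hypothetical and do not correspond to the paper's released benchmark code.

```python
# Schematic task interfaces; all names and signatures are illustrative.
from typing import List, Tuple

Segment = Tuple[float, float]        # (start_sec, end_sec)
Chapter = Tuple[float, float, str]   # (start_sec, end_sec, title)

def chapter_generation(video_features, transcript) -> List[Chapter]:
    """Task 1: jointly segment the video and generate a title per segment."""
    raise NotImplementedError        # e.g. a two-stage or sequence-to-sequence model

def titles_given_boundaries(video_features, transcript,
                            segments: List[Segment]) -> List[str]:
    """Task 2: generate a title for each ground-truth segment."""
    raise NotImplementedError        # isolates the captioning sub-problem

def chapter_grounding(video_features, transcript, title: str) -> Segment:
    """Task 3: temporally localize the segment described by a chapter title."""
    raise NotImplementedError        # moment-retrieval-style localization

def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Temporal IoU, used e.g. for recall at IoU thresholds in grounding."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```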

Experimental Results and Insights

Benchmarking experiments show that state-of-the-art models such as Vid2Seq improve markedly when pretrained on VidChapters-7M. This pretraining transfers well to dense video captioning, both in zero-shot and finetuned settings, improving the state of the art on established benchmarks such as YouCook2 and ViTT. Notably, downstream performance scales well with the size of the pretraining dataset.
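
Vid2Seq casts dense captioning (and, by extension, chapter generation) as generating a single token sequence in which timestamps are discretized into special time tokens interleaved with text. The sketch below illustrates that idea in simplified form; the bin count and the `<time_k>` token format are illustrative choices, not the model's actual configuration.

```python
# Simplified illustration of a Vid2Seq-style target sequence: timestamps are
# quantized into discrete time tokens and interleaved with chapter titles.
# The bin count (100) and the "<time_k>" format are illustrative choices.
N_BINS = 100

def time_token(t_sec: float, duration: float) -> str:
    """Map an absolute timestamp to one of N_BINS relative time tokens."""
    k = min(N_BINS - 1, int(t_sec / duration * N_BINS))
    return f"<time_{k}>"

def chapters_to_sequence(chapters, duration: float) -> str:
    """Serialize (start, end, title) chapters into a single target string."""
    parts = []
    for start, end, title in chapters:
        parts += [time_token(start, duration), time_token(end, duration), title]
    return " ".join(parts)

chapters = [(0.0, 95.5, "Intro"), (95.5, 410.2, "Unboxing")]
print(chapters_to_sequence(chapters, duration=1260.0))
# -> <time_0> <time_7> Intro <time_7> <time_32> Unboxing
```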

Theoretical and Practical Implications

Practically, this dataset enables more context-aware video processing systems, allowing researchers and industry practitioners to build applications that require sophisticated video navigation, such as educational platforms or content recommendation systems. Theoretically, VidChapters-7M poses new challenges and opportunities in multi-modal learning and video understanding, raising open questions about cross-modal learning efficiency and representation.

Future Directions

This work suggests several future research avenues. Beyond advancing video chapter generation itself, the dataset can lay the groundwork for investigating how visual, auditory, and textual features can be more seamlessly integrated within AI models. It can also serve as a pretraining foundation for a broader array of video understanding tasks, from automatic summarization and content moderation to video-centric question answering.

In conclusion, "VidChapters-7M" provides pivotal infrastructure for large-scale and diverse video chaptering and points to a promising direction for advancing multi-modal AI systems. As the dataset and its associated benchmarks gain traction, they hold substantial potential to impact both academic research and the practical deployment of intelligent video systems.
