VidChapters-7M: Video Chapters at Scale (2309.13952v1)
Abstract: Segmenting long videos into chapters enables users to quickly navigate to the information of their interest. This important topic has been understudied due to the lack of publicly released datasets. To address this issue, we present VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters in total. VidChapters-7M is automatically created from videos online in a scalable manner by scraping user-annotated chapters, and hence without any additional manual annotation. We introduce the following three tasks based on this data. First, the video chapter generation task consists of temporally segmenting the video and generating a chapter title for each segment. To further dissect the problem, we also define two variants of this task: video chapter generation given ground-truth boundaries, which requires generating a chapter title given an annotated video segment, and video chapter grounding, which requires temporally localizing a chapter given its annotated title. We benchmark both simple baselines and state-of-the-art video-language models for these three tasks. We also show that pretraining on VidChapters-7M transfers well to dense video captioning tasks in both zero-shot and finetuning settings, largely improving the state of the art on the YouCook2 and ViTT benchmarks. Finally, our experiments reveal that downstream performance scales well with the size of the pretraining dataset. Our dataset, code, and models are publicly available at https://antoyang.github.io/vidchapters.html.
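To make the three task definitions concrete, here is a minimal Python sketch of how a chaptered video and the input/output of each task could be represented. The `Chapter` and `ChapteredVideo` types and the function names are illustrative assumptions, not the released dataset schema or code; each function simply echoes the ground-truth annotations to show what a model is expected to predict.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical data layout for one user-chaptered video; the actual
# VidChapters-7M release may store annotations differently.
@dataclass
class Chapter:
    start: float   # chapter start time (seconds)
    end: float     # chapter end time (seconds)
    title: str     # user-written chapter title

@dataclass
class ChapteredVideo:
    video_id: str
    duration: float          # total video length (seconds)
    chapters: List[Chapter]  # user-annotated chapters, in temporal order

# Task 1 -- video chapter generation: given only the video, predict
# (start, end, title) triples. Echoing the annotations illustrates the
# target output format.
def chapter_generation_target(v: ChapteredVideo) -> List[Tuple[float, float, str]]:
    return [(c.start, c.end, c.title) for c in v.chapters]

# Task 2 -- chapter generation given ground-truth boundaries: given an
# annotated segment, produce only its title.
def title_target(v: ChapteredVideo, start: float, end: float) -> str:
    for c in v.chapters:
        if c.start == start and c.end == end:
            return c.title
    raise KeyError("segment not annotated")

# Task 3 -- video chapter grounding: given an annotated title, localize
# its temporal extent.
def grounding_target(v: ChapteredVideo, title: str) -> Tuple[float, float]:
    for c in v.chapters:
        if c.title == title:
            return (c.start, c.end)
    raise KeyError("title not annotated")

if __name__ == "__main__":
    video = ChapteredVideo(
        video_id="abc123",
        duration=600.0,
        chapters=[
            Chapter(0.0, 95.0, "Intro"),
            Chapter(95.0, 410.0, "Assembling the frame"),
            Chapter(410.0, 600.0, "Final adjustments"),
        ],
    )
    print(chapter_generation_target(video))        # Task 1 targets
    print(title_target(video, 95.0, 410.0))        # Task 2 target
    print(grounding_target(video, "Intro"))        # Task 3 target
```

Note how the two variants decompose the full task: Task 2 isolates title generation by fixing the segmentation, while Task 3 isolates temporal localization by fixing the title.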
Authors: Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid