Scaling Up Video Summarization Pretraining with Large Language Models (2404.03398v1)
Abstract: Long-form video content constitutes a significant portion of internet traffic, making automated video summarization an essential research problem. However, existing video summarization datasets are notably limited in size, which constrains how well state-of-the-art methods generalize. Our work aims to overcome this limitation by capitalizing on the abundance of long-form videos with dense speech-to-video alignment and the remarkable capabilities of recent large language models (LLMs) in summarizing long text. We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset using LLMs as Oracle summarizers. By leveraging the generated dataset, we analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them. To facilitate further research in the field, our work also presents a new benchmark dataset that contains 1200 long videos, each with high-quality summaries annotated by professionals. Extensive experiments clearly indicate that our proposed approach sets a new state of the art in video summarization across several benchmarks.