
Scaling Up Video Summarization Pretraining with Large Language Models (2404.03398v1)

Published 4 Apr 2024 in cs.CV

Abstract: Long-form video content constitutes a significant portion of internet traffic, making automated video summarization an essential research problem. However, existing video summarization datasets are notably limited in their size, constraining the effectiveness of state-of-the-art methods for generalization. Our work aims to overcome this limitation by capitalizing on the abundance of long-form videos with dense speech-to-video alignment and the remarkable capabilities of recent LLMs in summarizing long text. We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset using LLMs as Oracle summarizers. By leveraging the generated dataset, we analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them. To facilitate further research in the field, our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals. Extensive experiments clearly indicate that our proposed approach sets a new state-of-the-art in video summarization across several benchmarks.
