"Previously on ..." From Recaps to Story Summarization (2405.11487v1)

Published 19 May 2024 in cs.CV

Abstract: We introduce multimodal story summarization by leveraging TV episode recaps - short video sequences interweaving key story moments from previous episodes to bring viewers up to speed. We propose PlotSnap, a dataset featuring two crime thriller TV shows with rich recaps and long episodes of 40 minutes. Story summarization labels are unlocked by matching recap shots to corresponding sub-stories in the episode. We propose a hierarchical model TaleSumm that processes entire episodes by creating compact shot and dialog representations, and predicts importance scores for each video shot and dialog utterance by enabling interactions between local story groups. Unlike traditional summarization, our method extracts multiple plot points from long videos. We present a thorough evaluation on story summarization, including promising cross-series generalization. TaleSumm also shows good results on classic video summarization benchmarks.

Authors (3)
  1. Aditya Kumar Singh (4 papers)
  2. Dhruv Srivastava (6 papers)
  3. Makarand Tapaswi (41 papers)

Summary

Multimodal Story Summarization through TV Show Recaps

The paper "'Previously on …' From Recaps to Story Summarization" presents a novel approach to story summarization in multimedia by leveraging TV series recaps. The authors introduce PlotSnap, a dataset built from two crime thriller TV series, which harnesses recaps as a source of story summarization labels. The work extends traditional summarization by incorporating multimodal inputs and processing entire episodes to generate video-text story summaries rather than single-modality abstractions.

Synopsis of Contributions

The authors make the following notable contributions:

  1. PlotSnap Dataset: By focusing on crime thrillers with rich narrative structures and engaging recaps, namely the TV series "24" and "Prison Break", the dataset fills a significant gap in existing resources by supporting multimodal story summarization.
  2. TaleSumm Model: This hierarchical model processes multimodal data by creating compact shot and dialog representations and predicting importance scores for each video shot and dialog utterance. The design supports interactions within local story groups as well as across the entire episode, capturing both local context and narrative significance (a minimal sketch of this design follows the list).
  3. Shot-Matching Algorithm: The paper introduces an algorithm that aligns recap shots with the episode sub-stories they were drawn from, turning recaps from simple memory aids into a source of story-summarization supervision.
  4. Cross-Series Generalization: TaleSumm demonstrates promising generalization beyond the series it was trained on, and it also performs well on classic video summarization benchmarks across varying narratives and genres.
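To make the hierarchical design more concrete, below is a minimal PyTorch sketch of a two-level importance-scoring model in the spirit of TaleSumm: a local encoder compresses each shot (or dialog utterance) into a single token, and a global encoder lets those tokens interact before a linear head predicts per-unit importance. The feature dimensions, mean pooling, group size, and block-diagonal attention mask are illustrative assumptions, not the authors' actual configuration.

```python
import torch
import torch.nn as nn

class HierarchicalSummarizer(nn.Module):
    """Two-level sketch: local encoding per shot/utterance, global interaction across the episode."""

    def __init__(self, feat_dim=768, d_model=256, n_heads=4, group_size=20):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        # Level 1: contextualize the frames of a shot (or words of an utterance), then pool to one token.
        self.local_enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Level 2: let shot/dialog tokens interact across the episode.
        self.global_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        self.score_head = nn.Linear(d_model, 1)  # per-unit importance score
        self.group_size = group_size

    def forward(self, unit_feats):
        # unit_feats: list of (num_frames_or_words, feat_dim) tensors,
        # one per video shot or dialog utterance, in temporal order.
        tokens = [self.local_enc(self.proj(f).unsqueeze(0)).mean(dim=1) for f in unit_feats]
        seq = torch.cat(tokens, dim=0).unsqueeze(0)  # (1, num_units, d_model)

        # Block-diagonal mask restricts attention to local story groups
        # (a simplification of the paper's local/global interaction scheme).
        n = seq.size(1)
        mask = torch.full((n, n), float("-inf"))
        for start in range(0, n, self.group_size):
            end = min(start + self.group_size, n)
            mask[start:end, start:end] = 0.0

        ctx = self.global_enc(seq, mask=mask)
        return torch.sigmoid(self.score_head(ctx)).squeeze(-1)  # (1, num_units) scores in [0, 1]

# Toy usage: 60 "shots", each represented by 16 frame features of dimension 768.
model = HierarchicalSummarizer()
units = [torch.randn(16, 768) for _ in range(60)]
print(model(units).shape)  # torch.Size([1, 60])
```

The block-diagonal mask is just one simple way to emulate the "local story group" interactions described above; the actual model may combine local and episode-level attention differently.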

Empirical Evaluation

The model's performance is evaluated on the novel PlotSnap dataset as well as on traditional benchmarks such as SumMe and TVSum. TaleSumm surpasses state-of-the-art methods at producing multimodal summaries, achieving high Average Precision scores for both video-shot and dialog predictions. These results support the central idea of using recap-derived labels as supervision for story summarization.
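As an illustration of how such shot- and dialog-level predictions can be scored, the snippet below computes Average Precision against binary "included in the recap-derived summary" labels; the arrays are toy values, not numbers from the paper.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# 1 = the shot/utterance belongs to the recap-derived story summary (toy labels).
ground_truth = np.array([0, 1, 0, 0, 1, 1, 0, 0, 1, 0])
# Model-predicted importance scores for the same ten units (toy values).
predicted = np.array([0.1, 0.8, 0.3, 0.2, 0.7, 0.9, 0.4, 0.1, 0.6, 0.2])

ap = average_precision_score(ground_truth, predicted)
print(f"Average Precision: {ap:.3f}")
```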

Implications and Future Directions

Theoretical Implications: The paper shifts how story summarization can be framed computationally. By treating video-text summarization as a multimodal interaction problem, it challenges existing models that predominantly operate within a single modality.

Practical Implications: For industries such as content streaming and video archiving, the proposed framework can optimize user engagement and enhance retrieval of story content through succinct, context-rich summaries.

Speculation on Future Developments: Future research could explore recaps in genres beyond crime thrillers. Integration with long-form video understanding frameworks may further improve summarization, and, given the significant strides in LLMs, coupling language models such as GPT or BERT with video-based frameworks could yield even richer, context-aware summaries.

Conclusion

The approach delineated in this paper underscores the potential of TV recaps beyond their role as narrative aids. The introduction of PlotSnap and the TaleSumm model sets a precedent for multimodal processing in storytelling contexts and provides a robust framework for future research that aligns multimedia processing with modern AI methodologies.
