A Modular Approach for Multimodal Summarization of TV Shows
Abstract: In this paper we address the task of summarizing television shows, which touches key areas in AI research: complex reasoning, multiple modalities, and long narratives. We present a modular approach in which separate components perform specialized sub-tasks, which, we argue, affords greater flexibility than end-to-end methods. Our modules detect scene boundaries, reorder scenes so as to minimize the number of cuts between different events, convert visual information to text, summarize the dialogue in each scene, and fuse the scene summaries into a final summary for the entire episode. We also present a new metric, PRISMA (Precision and Recall EvaluatIon of Summary FActs), which measures both the precision and recall of generated summaries by decomposing them into atomic facts. Tested on the recently released SummScreen3D dataset, our method produces higher-quality summaries than comparison models, as measured by ROUGE and our new fact-based metric, and as assessed by human evaluators.
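The fact-based evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: in PRISMA, decomposing a summary into atomic facts and checking whether a fact is supported are performed by learned models, whereas here both are toy stand-ins (facts are sentences, and "supported" means a case-insensitive exact match). All function names are hypothetical.

```python
def extract_facts(summary: str) -> list[str]:
    """Placeholder fact decomposition: treat each sentence as one atomic fact.
    (The actual metric uses a model to split summaries into atomic facts.)"""
    return [s.strip() for s in summary.split(".") if s.strip()]


def is_supported(fact: str, facts: list[str]) -> bool:
    """Placeholder support check: exact case-insensitive match.
    (The actual metric uses a model-based entailment judgment.)"""
    return fact.lower() in (f.lower() for f in facts)


def fact_precision_recall(generated: str, reference: str) -> dict:
    """Precision = fraction of generated facts supported by the reference;
    recall = fraction of reference facts supported by the generated summary."""
    gen_facts = extract_facts(generated)
    ref_facts = extract_facts(reference)
    precision = sum(is_supported(f, ref_facts) for f in gen_facts) / len(gen_facts)
    recall = sum(is_supported(f, gen_facts) for f in ref_facts) / len(ref_facts)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, if the generated summary shares one of its two facts with a two-fact reference, both precision and recall come out to 0.5.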