Fusing Multimodal Signals on Hyper-complex Space for Extreme Abstractive Text Summarization (TL;DR) of Scientific Contents (2306.13968v1)
Abstract: The realm of scientific text summarization has experienced remarkable progress due to the availability of annotated brief summaries and ample data. However, the utilization of multiple input modalities, such as videos and audio, has yet to be thoroughly explored. At present, scientific multimodal-input-based text summarization systems tend to employ longer target summaries, such as abstracts, leading to underwhelming performance in the task of text summarization. In this paper, we address a novel task of extreme abstractive text summarization (aka TL;DR generation) by leveraging multiple input modalities. To this end, we introduce mTLDR, a first-of-its-kind dataset for the aforementioned task, comprising videos, audio, and text, along with both author-composed summaries and expert-annotated summaries. The mTLDR dataset contains a total of 4,182 instances collected from various academic conference proceedings, such as ICLR, ACL, and CVPR. Subsequently, we present mTLDRgen, an encoder-decoder-based model that employs a novel dual-fused hyper-complex Transformer combined with a Wasserstein Riemannian Encoder Transformer to dexterously capture the intricacies between different modalities in a hyper-complex latent geometric space. The hyper-complex Transformer captures the intrinsic properties between the modalities, while the Wasserstein Riemannian Encoder Transformer captures the latent structure of the modalities in the latent space geometry, thereby enabling the model to produce diverse sentences. mTLDRgen outperforms 20 baselines on mTLDR as well as another non-scientific dataset (How2) across three ROUGE-based evaluation measures. Furthermore, based on the qualitative metrics, BERTScore and FEQA, and human evaluations, we demonstrate that the summaries generated by mTLDRgen are fluent and congruent with the original source material.
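To make the hyper-complex fusion idea more concrete, the sketch below shows a minimal parameterized hypercomplex multiplication (PHM) linear layer in PyTorch, in the spirit of the "Beyond Fully-Connected Layers with Quaternions" reference listed below, applied to a toy fusion of text, audio, and video features. The class name `PHMLinear`, the feature sizes, and the concatenation-based fusion wiring are illustrative assumptions, not the authors' mTLDRgen implementation.

```python
# Minimal sketch of a parameterized hypercomplex multiplication (PHM) layer,
# assuming PyTorch. Illustrative only -- not the mTLDRgen code.
import torch
import torch.nn as nn


class PHMLinear(nn.Module):
    """Linear layer whose weight is a sum of Kronecker products,
    using roughly 1/n of the parameters of a dense nn.Linear."""

    def __init__(self, in_features: int, out_features: int, n: int = 4):
        super().__init__()
        assert in_features % n == 0 and out_features % n == 0
        self.n = n
        # "Rule" matrices generalizing the quaternion multiplication table.
        self.A = nn.Parameter(torch.randn(n, n, n) * 0.05)
        # Per-component weight blocks.
        self.S = nn.Parameter(
            torch.randn(n, out_features // n, in_features // n) * 0.05
        )
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W = sum_i kron(A_i, S_i), with shape (out_features, in_features).
        W = sum(torch.kron(self.A[i], self.S[i]) for i in range(self.n))
        return x @ W.T + self.bias


if __name__ == "__main__":
    # Toy fusion: project concatenated text/audio/video features into a shared space.
    text, audio, video = torch.randn(2, 768), torch.randn(2, 128), torch.randn(2, 128)
    fuse = PHMLinear(in_features=768 + 128 + 128, out_features=512, n=4)
    fused = fuse(torch.cat([text, audio, video], dim=-1))
    print(fused.shape)  # torch.Size([2, 512])
```

A plain concatenate-and-project fusion would need a full `in_features x out_features` weight matrix; the PHM parameterization shares structure across modality blocks while cutting the parameter count by roughly a factor of n, which is the property the hyper-complex space is exploited for here.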
- See, hear, read: Leveraging multimodality with guided attention for abstractive text summarization. Knowledge-Based Systems 227 (2021), 107152. https://doi.org/10.1016/j.knosys.2021.107152
- Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1409.0473
- Marco Baroni. 2016. Grounding distributional semantics in the visual world. Language and Linguistics Compass 10, 1 (2016), 3–13.
- Longformer: The Long-Document Transformer. arXiv:2004.05150 (2020).
- TLDR: Extreme Summarization of Scientific Documents. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 4766–4777. https://doi.org/10.18653/v1/2020.findings-emnlp.428
- Jaime Carbonell and Jade Goldstein. 2017. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR Forum 51, 2 (Aug. 2017), 209–210. https://doi.org/10.1145/3130348.3130369
- Jingqiang Chen and Hai Zhuge. 2018. Abstractive text-image summarization using multi-modal attentional hierarchical RNN. In Proceedings of the 2018 Conference on EMNLP. 4046–4056.
- Jianpeng Cheng and Mirella Lapata. 2016. Neural Summarization by Extracting Sentences and Words. In Proceedings of the 54th Annual Meeting of the ACL (Volume 1: Long Papers). ACL, Berlin, Germany, 484–494. https://doi.org/10.18653/v1/P16-1046
- Neural Abstractive Summarization with Structural Attention. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, Christian Bessiere (Ed.). IJCAI, 3716–3722. https://doi.org/10.24963/ijcai.2020/514 Main track.
- Monotonic alignments for summarization. Knowledge-Based Systems 192 (2020), 105363. https://doi.org/10.1016/j.knosys.2019.105363
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the ACL: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
- Aniqa Dilawari and Muhammad Usman Ghani Khan. 2019. ASoVS: Abstractive Summarization of Video Sequences. IEEE Access 7 (2019), 29253–29263. https://doi.org/10.1109/ACCESS.2019.2902507
- G. Erkan and D. R. Radev. 2004. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research 22 (Dec 2004), 457–479. https://doi.org/10.1613/jair.1523
- Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model. In Proceedings of the 57th Annual Meeting of the ACL. ACL, Florence, Italy, 1074–1084. https://doi.org/10.18653/v1/P19-1102
- Bottom-Up Abstractive Summarization. In Proceedings of the 2018 Conference on EMNLP. ACL, Brussels, Belgium, 4098–4109. https://doi.org/10.18653/v1/D18-1443
- A dataset for telling the stories of social media videos. In Proceedings of the 2018 Conference on EMNLP. 968–974.
- Multimodal data as a means to understand the learning experience. International Journal of Information Management 48 (2019), 108–119. https://doi.org/10.1016/j.ijinfomgt.2019.02.003
- The ICSI Summarization System at TAC 2008. In Proceedings of the Text Analysis Conference (TAC).
- Audio Summarization with Audio Features and Probability Distribution Divergence. ArXiv abs/2001.07098 (2020).
- Automated News Summarization Using Transformers. In Sustainable Advanced Computing, Sagaya Aurelia, Somashekhar S. Hiremath, Karthikeyan Subramanian, and Saroj Kr. Biswas (Eds.). Springer Singapore, Singapore, 249–259.
- Teaching machines to read and comprehend. In Advances in neural information processing systems. 1693–1701.
- Vladimir Iashin and Esa Rahtu. 2020. Multi-modal Dense Video Captioning. In Proceedings of the IEEE/CVF Conference on CVPR Workshops. 958–959.
- James M. Joyce. 2011. Kullback-Leibler Divergence. Springer Berlin Heidelberg, Berlin, Heidelberg, 720–722. https://doi.org/10.1007/978-3-642-04898-2_327
- The Kinetics human action video dataset. CoRR (2017).
- Douwe Kiela. 2017. Deep embodiment: grounding semantics in perceptual modalities. Technical Report. University of Cambridge, Computer Laboratory. 11–129 pages.
- Adapting the Neural Encoder-Decoder Framework from Single to Multi-Document Summarization. In Proceedings of the 2018 Conference on EMNLP. ACL, Brussels, Belgium, 4131–4141. https://doi.org/10.18653/v1/D18-1446
- BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
- Multi-modal Sentence Summarization with Modality Attention and Image Filtering. In Proceedings of the Twenty-Seventh IJCAI-18. IJCAI, 4152–4158. https://doi.org/10.24963/ijcai.2018/577
- Read, watch, listen, and summarize: Multi-modal summarization for asynchronous text, image, audio and video. IEEE Transactions on Knowledge and Data Engineering 31, 5 (2018), 996–1009.
- Multimedia news summarization in search. ACM Transactions on Intelligent Systems and Technology (TIST) 7, 3 (2016), 1–20.
- Jindřich Libovický and Jindřich Helcl. 2017. Attention Strategies for Multi-Source Sequence-to-Sequence Learning. In Proceedings of the 55th Annual Meeting of the ACL (Volume 2: Short Papers). ACL, Vancouver, Canada, 196–202. https://doi.org/10.18653/v1/P17-2031
- Unsupervised Extractive Text Summarization with Distance-Augmented Sentence Graphs. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, Canada) (SIGIR ’21). Association for Computing Machinery, New York, NY, USA, 2313–2317. https://doi.org/10.1145/3404835.3463111
- Multistage Fusion with Forget Gate for Multimodal Summarization in Open-Domain Videos. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 1834–1845. https://doi.org/10.18653/v1/2020.emnlp-main.144
- SibNet: Sibling Convolutional Encoder for Video Captioning. In Proceedings of the 26th ACM International Conference on Multimedia (Seoul, Republic of Korea) (MM ’18). ACM, New York, NY, USA, 1425–1434. https://doi.org/10.1145/3240508.3240667
- Yang Liu. 2019. Fine-tune BERT for extractive summarization. arXiv preprint arXiv:1903.10318 (2019).
- S2ORC: The Semantic Scholar Open Research Corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 4969–4983. https://doi.org/10.18653/v1/2020.acl-main.447
- CiteSum: Citation Text-guided Scientific Extreme Summarization and Domain Adaptation with Limited Supervision. https://doi.org/10.48550/ARXIV.2205.06207
- Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing Order into Text. In Proceedings of the 2004 Conference on EMNLP. ACL, Barcelona, Spain, 404–411. https://www.aclweb.org/anthology/W04-3252
- Text-guided Attention Model for Image Captioning. In AAAI. 4233–4239.
- SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (San Francisco, California, USA) (AAAI’17). AAAI Press, 3075–3081.
- Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. ACL, Berlin, Germany, 280–290. https://doi.org/10.18653/v1/K16-1028
- Multimodal Abstractive Summarization for How2 Videos. In Proceedings of the 57th Annual Meeting of the ACL. ACL, Florence, Italy, 6587–6596. https://doi.org/10.18653/v1/P19-1659
- MHMS: Multimodal Hierarchical Multimedia Summarization. https://doi.org/10.48550/ARXIV.2204.03734
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67. http://jmlr.org/papers/v21/20-074.html
- How2: A Large-scale Dataset For Multimodal Language Understanding. In Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL). NeurIPS, 26.1–26.12. http://arxiv.org/abs/1811.00347
- Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). ACL, Vancouver, Canada, 1073–1083. https://doi.org/10.18653/v1/P17-1099
- Leveraging multimodal information for event summarization and concept-level sentiment analysis. Knowledge-Based Systems 108 (2016), 102–109.
- Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning. Knowledge-Based Systems (2020), 105920.
- Video captioning with boundary-aware hierarchical language decoding and joint video prediction. Neurocomputing 417 (2020), 347–356.
- Structure-Infused Copy Mechanisms for Abstractive Summarization. In Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, Santa Fe, New Mexico, USA, 1717–1729. https://www.aclweb.org/anthology/C18-1146
- NewsStories: Illustrating Articles with Visual Summaries. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI (Tel Aviv, Israel). Springer-Verlag, Berlin, Heidelberg, 644–661. https://doi.org/10.1007/978-3-031-20059-5_37
- Yeah right: Sarcasm recognition for spoken dialogue systems. 1838–1841.
- Minimax Estimation of Maximum Mean Discrepancy with Radial Kernels. In Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2016/file/5055cbf43fac3f7e2336b27310f0b9ef-Paper.pdf
- Attention is all you need. In Advances in neural information processing systems. 5998–6008.
- TL;DR: Mining Reddit to Learn Automatic Summarization. In Workshop on New Frontiers in Summarization at EMNLP 2017, Giuseppe Carenini, Jackie Chi Kit Cheung, Fei Liu, and Lu Wang (Eds.). Association for Computational Linguistics, 59–63. https://doi.org/10.18653/v1/W17-4508
- Prince Zizhuang Wang and William Yang Wang. 2019. Riemannian Normalizing Flow on Variational Wasserstein Autoencoder for Text Modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 284–294. https://doi.org/10.18653/v1/N19-1025
- Title Generation for user generated videos. In European conference on computer vision. Springer, 609–625.
- Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with 1/n Parameters. https://doi.org/10.48550/ARXIV.2102.08597
- PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning. PMLR, 11328–11339.
- UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation. https://doi.org/10.48550/ARXIV.2109.05812
- ICAF: Iterative Contrastive Alignment Framework for Multimodal Abstractive Summarization. https://doi.org/10.48550/ARXIV.2108.05123
- Hao Zheng and Mirella Lapata. 2019. Sentence centrality revisited for unsupervised summarization. arXiv preprint arXiv:1906.03508 (2019).
- Towards Automatic Learning of Procedures From Web Instructional Videos. In AAAI Conference on Artificial Intelligence. 7590–7598. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17344
- MSMO: Multimodal Summarization with Multimodal Output. In Proceedings of the 2018 Conference on EMNLP. ACL, Brussels, Belgium, 4154–4164. https://doi.org/10.18653/v1/D18-1448
- Authors: Yash Kumar Atri, Vikram Goyal, Tanmoy Chakraborty