D$^2$TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization (2305.12767v2)

Published 22 May 2023 in cs.CV, cs.AI, and cs.CL

Abstract: The many-to-many multimodal summarization (M$^3$S) task aims to generate summaries in any language from document inputs in any language together with the corresponding image sequence; it essentially comprises the multimodal monolingual summarization (MMS) and multimodal cross-lingual summarization (MXLS) tasks. Although much work has been devoted to either MMS or MXLS, and both have received increasing attention in recent years, little research addresses the M$^3$S task itself. Moreover, existing studies mainly focus on 1) using MMS to enhance MXLS via knowledge distillation without considering MMS performance, or 2) improving MMS models by filtering summary-unrelated visual features through implicit learning or explicit but complex training objectives. In this paper, we first introduce the general and practical M$^3$S task. We then propose a dual knowledge distillation and target-oriented vision modeling framework for M$^3$S. Specifically, the dual knowledge distillation method ensures that knowledge is transferred between MMS and MXLS in both directions, so the two tasks mutually promote each other. To provide target-oriented visual features, a simple yet effective target-oriented contrastive objective is designed to discard irrelevant visual information. Extensive experiments in the many-to-many setting show the effectiveness of the proposed approach. Additionally, we will contribute a many-to-many multimodal summarization (M$^3$Sum) dataset.
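
The two training signals sketched in the abstract can be illustrated with a short, self-contained example. This is not the authors' implementation: the function names, tensor shapes, the symmetric-KL form of the dual distillation, and the InfoNCE-style form of the target-oriented contrastive objective are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F


def dual_kd_loss(mms_logits, mxls_logits, temperature=1.0):
    """Dual (bidirectional) distillation sketch: each decoder's token
    distribution is distilled into the other, with the teacher side
    detached in each direction."""
    log_p_mms = F.log_softmax(mms_logits / temperature, dim=-1)
    log_p_mxls = F.log_softmax(mxls_logits / temperature, dim=-1)
    # MMS teaches MXLS ...
    kd_to_mxls = F.kl_div(log_p_mxls, log_p_mms.exp().detach(), reduction="batchmean")
    # ... and MXLS teaches MMS.
    kd_to_mms = F.kl_div(log_p_mms, log_p_mxls.exp().detach(), reduction="batchmean")
    return (kd_to_mxls + kd_to_mms) * temperature ** 2


def target_oriented_contrastive_loss(summary_emb, image_embs, pos_mask, tau=0.07):
    """InfoNCE-style sketch of a target-oriented contrastive objective:
    pull the target-summary embedding toward summary-relevant image
    features and away from the rest.

    summary_emb: (B, d)    pooled embedding of the target summary
    image_embs:  (B, N, d) features of the N images in the input sequence
    pos_mask:    (B, N)    1 for summary-relevant images, 0 otherwise
    """
    summary_emb = F.normalize(summary_emb, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    pos_mask = pos_mask.float()
    sims = torch.einsum("bd,bnd->bn", summary_emb, image_embs) / tau   # (B, N)
    log_prob = sims - torch.logsumexp(sims, dim=-1, keepdim=True)      # log-softmax over images
    pos_log_prob = (log_prob * pos_mask).sum(-1) / pos_mask.sum(-1).clamp(min=1.0)
    return -pos_log_prob.mean()
```

In a full training loop, terms like these would presumably be added, with weighting hyperparameters, to the standard cross-entropy summarization losses of the MMS and MXLS directions.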
