
CFSum: A Coarse-to-Fine Contribution Network for Multimodal Summarization

Published 6 Jul 2023 in cs.CL and cs.CV (arXiv:2307.02716v1)

Abstract: Multimodal summarization typically suffers from an unclear contribution of the visual modality. Existing approaches focus on designing fusion methods for different modalities while ignoring the conditions under which visual input is actually useful. We therefore propose CFSum, a novel Coarse-to-Fine contribution network for multimodal Summarization that accounts for the varying contributions of images. First, to eliminate interference from useless images, a pre-filter module discards them. Second, to make accurate use of useful images, two visual complement modules operate at the word level and the phrase level: image contributions are computed and used to guide the attention over both the textual and visual modalities. Experimental results show that CFSum significantly outperforms multiple strong baselines on the standard benchmark. Further analysis verifies that useful images can even help generate non-visual words that are only implicitly represented in the image.
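The two-stage pipeline described above — a coarse pre-filter that discards low-value images, followed by contribution scores that guide cross-modal attention — can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's implementation: the function names (`prefilter`, `contribution_guided_attention`), the sigmoid similarity score, and the plain dot-product attention are assumptions standing in for the learned modules the abstract describes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prefilter(text_vec, img_vecs, threshold=0.5):
    """Coarse stage: score each image by a sigmoid-squashed similarity
    to the text representation and discard images below the threshold."""
    scores = 1.0 / (1.0 + np.exp(-(img_vecs @ text_vec)))
    keep = scores >= threshold
    return img_vecs[keep], scores[keep]

def contribution_guided_attention(text_states, img_vecs, contributions):
    """Fine stage: text-to-image cross-attention whose logits are scaled
    by each surviving image's contribution score, so low-contribution
    images receive proportionally less attention."""
    logits = (text_states @ img_vecs.T) * contributions  # (tokens, imgs)
    attn = softmax(logits, axis=-1)
    return attn @ img_vecs  # one visual complement vector per token

# Toy example: image 0 aligns with the text, image 1 is irrelevant,
# image 2 is weakly aligned.
text_vec = np.ones(4)
img_vecs = np.array([[1.0, 1.0, 1.0, 1.0],
                     [-1.0, -1.0, -1.0, -1.0],
                     [0.5, 0.5, 0.5, 0.5]])
kept, contrib = prefilter(text_vec, img_vecs)  # image 1 is dropped
text_states = np.ones((2, 4))
complement = contribution_guided_attention(text_states, kept, contrib)
print(kept.shape, complement.shape)  # (2, 4) (2, 4)
```

In the actual model the similarity scores and attention would be learned end-to-end; the sketch only shows how a contribution score can both gate images out entirely (coarse) and modulate how much attention the survivors receive (fine).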

