Exploiting Pseudo Image Captions for Multimodal Summarization (2305.05496v2)

Published 9 May 2023 in cs.CL

Abstract: Cross-modal contrastive learning in vision-language pretraining (VLP) faces the challenge of (partial) false negatives. In this paper, we study this problem from the perspective of Mutual Information (MI) optimization. It is well known that the InfoNCE loss used in contrastive learning maximizes a lower bound on the MI between anchors and their positives; we further prove theoretically that the MI involving negatives also matters when such noise is common. Guided by a more general lower bound suitable for optimization, we propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, which more accurately optimizes the MI between an image/text anchor and its negative texts/images instead of improperly minimizing it. Our method performs competitively on four downstream cross-modal tasks and systematically balances the beneficial and harmful effects of (partial) false negative samples under theoretical guidance.
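
To make the abstract's mechanism concrete, here is a minimal, assumption-laden sketch (not the paper's actual objective): instead of one-hot InfoNCE targets that push every in-batch negative toward zero probability, the contrastive targets are softened by a refined cross-modal similarity matrix, so that MI with (partial) false negatives is not improperly minimized. The function name `similarity_regulated_info_nce`, its arguments, and the `sim_targets` matrix are introduced here purely for illustration.

```python
import torch
import torch.nn.functional as F

def similarity_regulated_info_nce(img_emb, txt_emb, sim_targets, temperature=0.07):
    """Illustrative sketch of a similarity-regulated InfoNCE loss.

    img_emb, txt_emb : (N, D) L2-normalized image / text embeddings of a batch.
    sim_targets      : (N, N) refined cross-modal similarity matrix, assumed
                       row-normalized and roughly symmetric; it reduces to the
                       identity when all negatives are true negatives.
    """
    # Pairwise image-to-text scores, scaled by temperature.
    logits = img_emb @ txt_emb.t() / temperature

    # Image anchors over text candidates, and text anchors over image candidates.
    log_probs_i2t = F.log_softmax(logits, dim=1)
    log_probs_t2i = F.log_softmax(logits.t(), dim=1)

    # Cross-entropy against soft targets instead of one-hot labels:
    # (partial) false negatives receive probability mass rather than being
    # pushed to zero.
    loss_i2t = -(sim_targets * log_probs_i2t).sum(dim=1).mean()
    loss_t2i = -(sim_targets.t() * log_probs_t2i).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

In a setup like the one the abstract describes, `sim_targets` would be refined progressively during training (for example, from the model's own cross-modal similarity estimates) rather than fixed in advance; that refinement schedule is an assumption of this sketch, not a detail stated in the abstract.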

