EAMA: Entity-Aware Multimodal Alignment Based Approach for News Image Captioning (2402.19404v5)
Abstract: News image captioning requires a model to generate an informative, entity-rich caption from a news image and its associated news article. Current multimodal large language models (MLLMs) still have limitations in handling entity information in this task. In addition, generating high-quality news image captions requires a trade-off between the sufficiency and the conciseness of the textual input. To explore the potential of MLLMs and address these problems, we propose EAMA, an Entity-Aware Multimodal Alignment based approach for news image captioning. Our approach first aligns the MLLM on the News Image Captioning task together with two extra alignment tasks: the Entity-Aware Sentence Selection task and the Entity Selection task. The aligned MLLM then uses the additional entity-related information it extracts itself to supplement the textual input when generating news image captions. Our approach achieves better results than all previous models on two mainstream news image captioning datasets.
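The inference flow described in the abstract, where the aligned model first extracts entity-related information and then captions with that information added to the textual input, could look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' implementation: the `generate` callable, the prompt wording, the function names, and the character-based truncation limit are all hypothetical and do not come from the paper.

```python
from typing import Any, Callable

# Hypothetical model interface (an assumption, not from the paper):
# takes an image and a text prompt, returns generated text.
Generate = Callable[[Any, str], str]


def build_selection_prompt(article: str) -> str:
    """Prompt asking the model for entity-related sentences (wording is illustrative)."""
    return (
        "Select the sentences from the article that mention entities "
        "relevant to the image.\nArticle: " + article
    )


def build_caption_prompt(context: str, supplement: str) -> str:
    """Prompt that combines a shortened article with the model's own selection."""
    return (
        "Write a news caption for the image.\n"
        "Article (truncated): " + context + "\n"
        "Entity-related sentences: " + supplement
    )


def caption_with_self_supplement(
    generate: Generate,
    image: Any,
    article: str,
    max_article_chars: int = 2000,  # assumed limit, chosen only for illustration
) -> str:
    # Step 1: the aligned model extracts entity-related sentences from the full article.
    supplement = generate(image, build_selection_prompt(article)).strip()

    # Step 2: keep the main textual input short (conciseness) and add the
    # extracted sentences back in (sufficiency) before captioning.
    context = article[:max_article_chars]
    return generate(image, build_caption_prompt(context, supplement)).strip()
```

Passing the model as a plain callable keeps the sketch independent of any particular MLLM library; a real system would replace the character-based truncation with a token budget.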
Authors:
- Junzhe Zhang
- Huixuan Zhang
- Xiaojun Wan
- Xunjian Yin