Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion (2306.11593v1)
Abstract: State-of-the-art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training. This dataset contains annotations provided by human annotators, who typically produce captions averaging around ten tokens. However, this brevity makes it difficult to capture complex scenes and convey detailed information. Furthermore, captioning models tend to be biased towards the "average" caption, which captures only the most general aspects of an image. What would happen if we could automatically generate longer, more detailed captions? Would these captions, evaluated by humans, be more or less representative of the image content than the original MS-COCO captions? In this paper, we present a novel approach to these challenges by showing how captions generated by different SoTA models can be effectively fused into richer captions. Our method leverages existing models from the literature and requires no additional training. Instead, it uses an image-text-based metric to rank the captions generated by SoTA models for a given image; the top two captions are then fused using an LLM. Experimental results demonstrate the effectiveness of our approach: the captions generated by our model show higher consistency with human judgment when evaluated on the MS-COCO test set. By combining the strengths of several SoTA models, our method improves the quality and descriptiveness of image captions, bridging the gap between automated systems and the rich, informative nature of human-generated descriptions. This advance opens up new possibilities for generating captions better suited to training both vision-language and captioning models.
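As a minimal illustration of the pipeline described in the abstract, the sketch below ranks candidate captions for an image with a CLIP-style image-text score and then assembles a fusion prompt for an LLM. The model checkpoint, the prompt wording, and the helper functions are assumptions made for illustration, not the exact configuration used by the authors.

```python
# Minimal sketch: rank candidate captions with CLIP, then build an LLM fusion prompt.
# Assumes the transformers, torch, and Pillow packages are installed; the checkpoint
# name, prompt text, and example captions are illustrative, not the paper's setup.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def rank_captions(image: Image.Image, captions: list[str]) -> list[tuple[str, float]]:
    """Score each candidate caption against the image and sort by similarity."""
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds one image-text similarity score per candidate caption.
    scores = outputs.logits_per_image.squeeze(0).tolist()
    return sorted(zip(captions, scores), key=lambda pair: pair[1], reverse=True)


def fusion_prompt(caption_a: str, caption_b: str) -> str:
    """Build a prompt asking an LLM to merge the two best captions into one richer caption."""
    return (
        "Fuse the two image captions below into a single, more detailed caption. "
        "Keep all information that is supported by both captions and do not invent details.\n"
        f"1. {caption_a}\n2. {caption_b}\nFused caption:"
    )


if __name__ == "__main__":
    image = Image.open("example.jpg")  # any local image
    candidates = [  # captions that would come from different SoTA captioning models
        "a man riding a skateboard down a ramp",
        "a skateboarder performing a trick at an outdoor skate park",
        "a person on a skateboard",
    ]
    ranked = rank_captions(image, candidates)
    top_two = [caption for caption, _ in ranked[:2]]
    print(fusion_prompt(*top_two))  # send this prompt to an LLM to obtain the fused caption
```

The ranking step stands in for the image-text-based metric described in the abstract; any caption-quality scorer could be substituted, and the fusion step only requires an LLM capable of following the merge instruction.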
Authors: Simone Bianco, Luigi Celona, Marco Donzella, Paolo Napoletano