
The Solution for the CVPR2024 NICE Image Captioning Challenge (2404.12739v2)

Published 19 Apr 2024 in cs.CV

Abstract: This report presents our solution to Topic 1, Zero-shot Image Captioning, of the 2024 NICE Challenge: New frontiers for zero-shot Image Captioning Evaluation. In contrast to the NICE 2023 datasets, this challenge uses new human annotations that differ significantly in caption style and content. We therefore enhance image captions through retrieval augmentation and caption grading. At the data level, we use high-quality captions generated by image captioning models as training data to bridge the gap in text style. At the model level, we employ OFA, a large-scale vision-language pre-training model based on handcrafted templates, to perform the image captioning task. We then propose a caption-level grading strategy for the high-quality captions generated by the image captioning models and integrate it, together with the retrieval augmentation strategy, into the template, compelling the model to generate higher-quality, better-matched, and semantically richer captions conditioned on the retrieval-augmented prompts. Our approach achieves a CIDEr score of 234.11.
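The pipeline the abstract describes — retrieve similar captions, grade candidate captions against them, and fold the retrieved captions into a handcrafted prompt template — can be sketched as follows. This is purely illustrative: the toy bag-of-words cosine similarity stands in for the learned image-text encoder the paper would use, the grading threshold is arbitrary, and the template wording is hypothetical (the actual OFA template is not given in the abstract).

```python
from collections import Counter
from math import sqrt

def bow_cosine(a: str, b: str) -> float:
    """Toy similarity: bag-of-words cosine. A stand-in for the learned
    embedding similarity used for retrieval in the paper."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(c * c for c in va.values()))
    nb = sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_captions(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Retrieval augmentation: top-k corpus captions most similar to the query."""
    return sorted(corpus, key=lambda c: bow_cosine(query, c), reverse=True)[:k]

def grade_captions(candidates: list[str], references: list[str],
                   threshold: float = 0.3) -> list[str]:
    """Caption grading: keep only candidates sufficiently similar to the
    retrieved references (threshold is a hypothetical choice)."""
    def score(c: str) -> float:
        return max(bow_cosine(c, r) for r in references)
    return [c for c in candidates if score(c) >= threshold]

def build_prompt(retrieved: list[str]) -> str:
    """Handcrafted template in the spirit of OFA-style prompting
    (hypothetical wording)."""
    hints = "; ".join(retrieved)
    return f"Similar captions: {hints}. What does the image describe?"
```

In this sketch, `retrieve_captions` and `grade_captions` play the roles of the retrieval augmentation and caption-level grading strategies, and `build_prompt` shows how the retrieved captions could be injected into the model's template before generation.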

Authors (4)
  1. Longfei Huang (5 papers)
  2. Shupeng Zhong (4 papers)
  3. Xiangyu Wu (40 papers)
  4. Ruoxuan Li (6 papers)
