EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension (2311.15879v2)

Published 27 Nov 2023 in cs.CV

Abstract: LLMs-based image captioning has the capability of describing objects not explicitly observed in training data; yet novel objects occur frequently, necessitating the requirement of sustaining up-to-date object knowledge for open-world comprehension. Instead of relying on large amounts of data and/or scaling up network parameters, we introduce a highly effective retrieval-augmented image captioning method that prompts LLMs with object names retrieved from External Visual–Name Memory (EVCap). We build ever-changing object knowledge memory using objects' visuals and names, enabling us to (i) update the memory at a minimal cost and (ii) effortlessly augment LLMs with retrieved object names by utilizing a lightweight and fast-to-train model. Our model, which was trained only on the COCO dataset, can adapt to out-of-domain without requiring additional fine-tuning or re-training. Our experiments conducted on benchmarks and synthetic commonsense-violating data show that EVCap, with only 3.97M trainable parameters, exhibits superior performance compared to other methods based on frozen pre-trained LLMs. Its performance is also competitive to specialist SOTAs that require extensive training.

Retrieval-Augmented Image Captioning with External Visual–Name Memory

Image captioning (IC) has advanced substantially through the use of LLMs, which enable rich descriptions of images learned from extensive datasets. However, the static nature and high computational demands of such models make it difficult to adapt to the novel objects that frequently emerge in open-world settings. The paper introduces EVCap, a method that improves the adaptability and open-world comprehension of image captioning systems without requiring expansive datasets or extensive computational resources.

Overview of EVCap

EVCap is a retrieval-augmented approach that relies on a compact yet effective external visual-name memory to keep object knowledge up to date. The model integrates a lightweight, easily expandable external memory consisting of visual features paired with their corresponding object names. This structure lets the model retrieve relevant object names and use them as prompts for a frozen pre-trained LLM decoder when generating captions.
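The retrieval step can be illustrated with a minimal sketch. The snippet below assumes the memory keys are L2-normalized visual feature vectors and that retrieval is a top-K cosine-similarity lookup; the function names (build_memory, retrieve_names), the feature dimension, and the use of plain NumPy instead of an approximate-nearest-neighbor index are illustrative assumptions, not details of the paper's implementation.

```python
import numpy as np

def build_memory(visual_features: np.ndarray, object_names: list[str]) -> dict:
    """Store (visual feature, object name) pairs; keys are normalized once at build time."""
    keys = visual_features / np.linalg.norm(visual_features, axis=1, keepdims=True)
    return {"keys": keys, "values": list(object_names)}

def retrieve_names(memory: dict, query_features: np.ndarray, top_k: int = 10) -> list[str]:
    """Return object names whose visual keys best match the image query features."""
    q = query_features / np.linalg.norm(query_features, axis=1, keepdims=True)
    sims = q @ memory["keys"].T                      # (num_queries, memory_size)
    top = np.unique(np.argsort(-sims, axis=1)[:, :top_k])
    return [memory["values"][i] for i in top]

# Toy example: a memory of three objects and a single query feature.
mem = build_memory(np.random.randn(3, 768), ["zebra", "kayak", "accordion"])
print(retrieve_names(mem, np.random.randn(1, 768), top_k=2))
```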

Key Components

The EVCap architecture is built from the following components (a minimal sketch of the fusion and prompting step follows the list):

  • External Visual-Name Memory: This memory stores visual features as keys and object names as values, allowing efficient retrieval of object names relevant to an input image.
  • Image Encoding Module: Using a frozen vision encoder augmented with trainable image query tokens, EVCap extracts the visual features that drive object-name retrieval from the memory.
  • Attentive Fusion Module: This module performs cross-attention between retrieved object names and visual features to refine the captioning process, mitigating the incorporation of redundant or irrelevant information.
  • Frozen LLM Decoder: EVCap employs a frozen Vicuna-13B model that takes the fused prompt of object names and visual features and generates the final caption.
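To make the interaction between these components concrete, the following sketch shows how retrieved object-name embeddings could be refined by cross-attending to the image query tokens and then concatenated into the prompt fed to the frozen LLM. The module layout, dimensions, and single-attention-layer design are assumptions for illustration and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    """Cross-attention: retrieved-name embeddings attend to image query tokens."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, name_embeds: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # Queries: embeddings of retrieved object names; keys/values: image query tokens.
        fused, _ = self.cross_attn(name_embeds, image_tokens, image_tokens)
        return fused

# Toy shapes: 10 retrieved-name tokens, 32 image query tokens, hidden size 768.
fusion = AttentiveFusion()
name_embeds = torch.randn(1, 10, 768)
image_tokens = torch.randn(1, 32, 768)
prompt_embeds = torch.cat([image_tokens, fusion(name_embeds, image_tokens)], dim=1)
print(prompt_embeds.shape)  # torch.Size([1, 42, 768]) -> prefix for the frozen LLM decoder
```

In a setup like this, only the image query tokens and the fusion module would carry trainable weights, which is consistent with the small trainable-parameter budget reported for EVCap while the vision encoder and LLM stay frozen.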

Experimental Results

EVCap performs strongly across standard IC benchmarks including COCO, NoCaps, and Flickr30K, with consistent improvements in CIDEr scores. Notably, it achieves competitive results with only 3.97M trainable parameters, a testament to its efficiency compared to state-of-the-art models that require considerably larger computational resources. The evaluations show that EVCap handles both in-domain and out-of-domain data competently, underscoring its robustness in diverse settings.

Moreover, evaluation on commonsense-violating images from the WHOOPS dataset affirms EVCap's adaptability. When the external memory is updated with data from WHOOPS, the model shows notable improvement in handling novel, unconventional scenarios, reflecting its extensibility and practical applicability.
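The low cost of such an update follows from the memory design: adding coverage for new objects amounts to appending (visual feature, object name) pairs, with no gradient updates to any network weights. The helper below uses the same toy memory format as the earlier retrieval sketch and is, again, an illustrative assumption rather than the released code.

```python
import numpy as np

def update_memory(memory: dict, new_features: np.ndarray, new_names: list[str]) -> dict:
    """Append new (visual feature, object name) entries, e.g. objects from WHOOPS images."""
    new_keys = new_features / np.linalg.norm(new_features, axis=1, keepdims=True)
    memory["keys"] = np.vstack([memory["keys"], new_keys])
    memory["values"] = memory["values"] + list(new_names)
    return memory

# Toy memory in the same {keys, values} format as the retrieval sketch above.
memory = {"keys": np.random.randn(3, 768), "values": ["zebra", "kayak", "accordion"]}
memory = update_memory(memory, np.random.randn(2, 768), ["unicycle", "theremin"])
print(len(memory["values"]))  # 5 entries, no retraining involved
```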

Implications and Future Directions

EVCAP stands as a seminal contribution toward sustainable and scalable image captioning solutions adaptable to ever-evolving real-world scenarios. The minimal cost of memory updates and adaptability without retraining provide a paradigm shift in the economic feasibility of maintaining up-to-date object knowledge. This is critical for deploying image captioning technologies in dynamic domains such as autonomous driving and real-time analytics.

The paper opens avenues for further exploration in retrieval-augmented methodologies, highlighting potential integrations with object detection systems to enhance the completeness of image descriptions. Future research could also examine the application of this approach across other multimodal tasks, potentially redefining how external memory is utilized for understanding complex image and text relationships in LLMs.

In conclusion, EVCap presents a sophisticated yet resource-efficient framework for image captioning that balances precision and adaptability, both essential for advancing open-world AI comprehension.

Authors (4)
  1. Jiaxuan Li
  2. Duc Minh Vo
  3. Akihiro Sugimoto
  4. Hideki Nakayama