UniRAG: Universal Retrieval Augmentation for Multi-Modal Large Language Models (2405.10311v2)

Published 16 May 2024 in cs.IR

Abstract: Recently, Multi-Modal (MM) LLMs have unlocked many complex use-cases that require MM understanding (e.g., image captioning or visual question answering) and MM generation (e.g., text-guided image generation or editing) capabilities. To further improve the output fidelity of MM-LLMs we introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to prompts as few-shot examples during inference. Unlike the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation results on the MSCOCO dataset with common entities show that both proprietary models like GPT-4o and Gemini-Pro and smaller open-source models like LLaVA, LaVIT, and Emu2 significantly enhance their generation quality when their input prompts are augmented with relevant information retrieved by MM retrievers like UniIR models.

The paper proposes a model-agnostic retrieval augmentation architecture for multi-modal LLMs (MM-LLMs). The approach, termed UniRAG, integrates a two-stage workflow where an external multi-modal retriever first identifies contextually relevant candidates and then appends these as in-context examples to the prompt provided to the generator. This modular design can be seamlessly applied to different downstream tasks, including image captioning (image-to-text) and text-to-image generation.
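
The two-stage flow can be pictured as a thin, model-agnostic wrapper around any retriever/generator pair. The sketch below is illustrative only; the class, flat text prompt template, and type signatures are assumptions, not the paper's actual code:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class UniRAGPipeline:
    """Two-stage wrapper: retrieve relevant examples, then prompt the generator.

    `retrieve(query, k)` returns (example_input, example_output) pairs and
    `generate(prompt)` calls the underlying MM-LLM; both are injected,
    mirroring the plug-and-play design described above. Names and the flat
    text prompt format are illustrative assumptions.
    """
    retrieve: Callable[[str, int], List[Tuple[str, str]]]
    generate: Callable[[str], str]

    def __call__(self, query: str, k: int = 1) -> str:
        examples = self.retrieve(query, k)  # Stage 1: MM retriever
        shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
        prompt = f"{shots}\n\nInput: {query}\nOutput:"  # Stage 2: few-shot prompt
        return self.generate(prompt)
```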

The methodology is structured around two primary components:

  • Retriever Stage:

The system employs UniIR models configured with CLIP Score Fusion (CLIP-SF) and BLIP Feature Fusion (BLIP-FF), which are instruction-tuned on diverse multi-modal datasets. Specifically, these retrievers combine text and image representations either by a weighted sum of encoder outputs (CLIP-SF) or by utilizing cross-attention layers to fuse features (BLIP-FF). For a query in one modality, the retrievers extract the top-k candidates from a global multi-modal database using Faiss for efficient dot-product similarity computation. In image captioning, for example, an input image retrieves candidate captions; for image generation, an input text query retrieves candidate images.
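
As an illustration of the retrieval step, the snippet below runs an exact dot-product top-k search with Faiss over pre-computed embeddings; the random embeddings and index choice stand in for the UniIR encoder outputs and whatever index configuration the authors actually use:

```python
import faiss
import numpy as np

# Stand-ins for UniIR encoder outputs: candidate-pool and query embeddings
# (the real system embeds MSCOCO captions/images with CLIP-SF or BLIP-FF).
dim = 768
rng = np.random.default_rng(0)
cand_emb = rng.standard_normal((10_000, dim)).astype("float32")
query_emb = rng.standard_normal((4, dim)).astype("float32")

index = faiss.IndexFlatIP(dim)   # exact inner-product (dot-product) search
index.add(cand_emb)              # register the multi-modal candidate pool

k = 5
scores, ids = index.search(query_emb, k)  # top-k candidate indices per query
print(ids[0])  # e.g., indices of captions retrieved for the first image query
```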

  • Generator Stage:
    • For caption generation, models such as LLaVA (13B), GPT-4o, and Gemini-Pro are employed (a sketch of the few-shot prompt assembly follows this list).
    • For text-to-image generation, LaVIT (LLaMA-7B backbone) and Emu2-Gen (LLaMA-33B backbone) are used.
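
For image captioning, the retrieved image-caption pairs become in-context examples in the generator prompt. The helper below assembles such a prompt as a chat-style message list; the message schema and instruction wording are assumptions for illustration, not the paper's exact template:

```python
def build_caption_prompt(retrieved, query_image_path):
    """Assemble a few-shot captioning prompt from retrieved (image_path, caption) pairs.

    The chat-message schema below is an illustrative assumption; each MM-LLM
    (LLaVA, GPT-4o, Gemini-Pro) has its own prompt and image-passing convention.
    """
    instruction = "Describe this image in one sentence."
    messages = []
    for image_path, caption in retrieved:
        # Each retrieved pair becomes one in-context example: image + caption.
        messages.append({"role": "user",
                         "content": [{"type": "image", "path": image_path},
                                     {"type": "text", "text": instruction}]})
        messages.append({"role": "assistant", "content": caption})
    # Final turn: the query image, left for the model to caption.
    messages.append({"role": "user",
                     "content": [{"type": "image", "path": query_image_path},
                                 {"type": "text", "text": instruction}]})
    return messages
```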

The experimental results systematically compare the baseline performance (k = 0) with retrieval-augmented (few-shot) performance. For instance, adding just one retrieved caption to the prompt yields an improvement of approximately 10 SPICE points on the MSCOCO image captioning task relative to the zero-shot baseline. In the text-to-image generation task, augmentation reduces FID by roughly 30 points (lower FID indicates better image quality).
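
A sketch of how such a comparison could be scripted, treating the pipeline, evaluation queries, references, and metric as injected placeholders (none of these names come from the paper):

```python
def sweep_k(pipeline, eval_queries, references, score_fn, ks=(0, 1, 5, 10)):
    """Score zero-shot (k = 0) versus retrieval-augmented prompting for several k.

    `pipeline(query, k=...)` is a UniRAG-style wrapper and `score_fn(outputs,
    references)` a task metric such as SPICE or FID; all names are placeholders.
    """
    results = {}
    for k in ks:
        outputs = [pipeline(query, k=k) for query in eval_queries]
        results[k] = score_fn(outputs, references)
    return results
```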

Several insights are reported regarding in-context example selection:

  • Sensitivity to the Number of Examples:

While a single retrieved example consistently boosts generation quality, increasing k beyond one can lead to performance degradation in certain configurations, particularly for smaller models. This suggests that the relevance of retrieved candidates is crucial; introducing too many less-relevant examples can confuse the generative process.

  • Retriever Model Comparison:

Although the baseline performance of the CLIP-SF retriever is generally superior to BLIP-FF, the effectiveness gap narrows when applied within the retrieval augmentation framework. This observation implies that even retrievers with modest baseline performance can contribute positively when their outputs are integrated as in-context examples.

  • Model-Specific Findings:

The experiments further reveal that while models like LLaVA perform relatively well in zero-shot settings, proprietary models such as GPT-4o and Gemini-Pro are able to leverage additional retrieved examples to further improve output fidelity. In text-to-image generation, the results also demonstrate that Emu2-Gen, with its larger LLaMA-33B backbone, outperforms the smaller LaVIT model; both benefit from a single retrieved example, although excessive augmentation can again be detrimental.

Caption generation is evaluated with n-gram overlap measures (BLEU-1 through BLEU-4, ROUGE, CIDEr) and the semantic metric SPICE, while image generation is evaluated with FID, CLIP Score, and Inception Score. Quantitative evaluation on the MSCOCO dataset shows that the retrieval-augmented prompt setup significantly improves output quality over the zero-shot baseline across these metrics. Notably, the paper emphasizes that retrieval augmentation not only refines the handling of uncommon entities but also improves generation quality for common entities, which challenges the conventional expectation that retrieval techniques mainly aid in processing rare content.
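
As a small, self-contained example of the caption-side evaluation, the snippet below computes corpus-level BLEU-1 through BLEU-4 with NLTK on toy data; the paper's full evaluation additionally covers ROUGE, CIDEr, SPICE, and the image-generation metrics:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Toy data: one hypothesis caption and one list of reference captions,
# pre-tokenized. Real evaluations use the full MSCOCO annotation set.
references = [[["a", "dog", "runs", "on", "the", "beach"]]]
hypotheses = [["a", "dog", "is", "running", "along", "the", "beach"]]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    # BLEU-n uses uniform weights over 1..n-grams and zero weight beyond n.
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```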

The paper concludes with discussions on future work, including:

  • Extending the evaluation to out-of-domain retrieval settings to assess the robustness of the approach when the candidate pool is not from the same distribution as the target task.
  • Investigating alternative prompt templates and relevance-based candidate selection mechanisms to optimize the number and quality of in-context examples dynamically.

Overall, the work presents a comprehensive evaluation of combining retrieval augmentation with MM-LLMs, yielding consistent improvements in multi-modal generation tasks and providing valuable insights into the interaction between retrieval quality and generative performance.

References (32)
  1. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.
  2. SPICE: Semantic propositional image caption evaluation. arXiv preprint arXiv:1607.08822.
  3. Improving language models by retrieving from trillions of tokens. arXiv preprint arXiv:2112.04426.
  4. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
  5. Re-Imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491.
  6. Microsoft COCO Captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  7. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807.
  8. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909.
  9. CLIPScore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718.
  10. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30.
  11. Unified language-vision pretraining with dynamic discrete visual tokenization. arXiv preprint arXiv:2309.04669.
  12. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547.
  13. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.
  14. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086.
  15. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  16. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744.
  17. Visual instruction tuning. arXiv preprint arXiv:2304.08485.
  18. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  19. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  20. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.
  21. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092.
  22. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29.
  23. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286.
  24. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  25. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  26. CIDEr: Consensus-based image description evaluation. arXiv preprint arXiv:1411.5726.
  27. UniIR: Training and benchmarking universal multimodal information retrievers. arXiv preprint arXiv:2311.17136.
  28. Retrieval-augmented multimodal language modeling. arXiv preprint arXiv:2211.12561.
  29. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789.
  30. Scaling autoregressive multi-modal models: Pretraining and instruction tuning. arXiv preprint arXiv:2309.02591.
  31. MM-LLMs: Recent advances in multimodal large language models. arXiv preprint arXiv:2401.13601.
  32. Retrieving multimodal information for augmented generation: A survey. arXiv preprint arXiv:2303.10868.
Authors (4)
  1. Sahel Sharifymoghaddam
  2. Shivani Upadhyay
  3. Wenhu Chen
  4. Jimmy Lin