The paper proposes a model-agnostic retrieval augmentation architecture for multi-modal LLMs (MM-LLMs). The approach, termed UniRAG, follows a two-stage workflow: an external multi-modal retriever first identifies contextually relevant candidates, which are then appended as in-context examples to the prompt given to the generator. This modular design can be seamlessly applied to different downstream tasks, including image captioning (image-to-text) and text-to-image generation.
The methodology is structured around two primary components:
- Retriever Stage:
The system employs UniIR models configured with CLIP Score Fusion (CLIP-SF) and BLIP Feature Fusion (BLIP-FF), which are instruction-tuned on diverse multi-modal datasets. These retrievers combine text and image representations either at the score level, via a weighted sum (CLIP-SF), or at the feature level, via cross-attention layers (BLIP-FF). For a query in one modality, the retrievers extract the top-k candidates from a global multi-modal candidate pool, using Faiss for efficient dot-product similarity search. In image captioning, for example, an input image retrieves candidate captions; for image generation, an input text query retrieves candidate images.
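A minimal sketch of this retrieval step is shown below. It assumes fused query/candidate embeddings are already available; `encode_fused`, `DIM`, and the example captions are placeholders, not the UniIR API or the paper's data.

```python
import numpy as np
import faiss

DIM = 768  # embedding width; placeholder value, not taken from the paper

def encode_fused(image=None, text=None) -> np.ndarray:
    """Stand-in for a UniIR retriever (CLIP-SF or BLIP-FF) that maps an
    (image, text) query or candidate to a single fused embedding."""
    return np.random.rand(DIM).astype("float32")

# Index the candidate pool once (here: caption candidates for image captioning).
candidate_captions = ["a dog catching a frisbee", "a red bus on a city street"]
cand_vecs = np.stack([encode_fused(text=c) for c in candidate_captions])
index = faiss.IndexFlatIP(DIM)          # exact dot-product (inner-product) search
index.add(cand_vecs)

# Retrieve the top-k candidates for an image query and hand them to the generator.
query_vec = encode_fused(image="query.jpg")[None, :]
scores, ids = index.search(query_vec, 1)
retrieved = [candidate_captions[i] for i in ids[0]]
```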
- Generator Stage:
- For caption generation, models such as LLaVA (13B), GPT-4, and Gemini-Pro are employed.
- For text-to-image generation, LaVIT (LLaMA-7B backbone) and Emu2-Gen (LLaMA-33B backbone) are used.
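The sketch below shows how the retrieved candidates might be folded into the generator's prompt as in-context examples for the captioning direction. The template wording and the helper name `build_caption_prompt` are illustrative; the paper's exact prompt format is not reproduced here.

```python
def build_caption_prompt(retrieved_pairs, query_image_token="<image>"):
    """Assemble a few-shot captioning prompt for an MM-LLM generator.

    retrieved_pairs: list of (image_token, caption) in-context examples drawn
    from the retriever stage.
    """
    lines = []
    for img_tok, caption in retrieved_pairs:
        lines.append(f"Image: {img_tok}\nCaption: {caption}")
    # The query image comes last, with an empty caption slot for the model to fill.
    lines.append(f"Image: {query_image_token}\nCaption:")
    return "\n\n".join(lines)

# One-shot prompt: a single retrieved example plus the query image.
prompt = build_caption_prompt([("<image_1>", "a dog catching a frisbee")])
```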
The experimental results systematically compare the baseline zero-shot performance with retrieval-augmented (few-shot) performance. For instance, adding just one retrieved caption to the prompt yields an improvement of approximately 10 SPICE points on the MSCOCO image captioning task relative to the zero-shot baseline. In the text-to-image generation task, augmentation reduces FID by roughly 30 points (lower FID indicates better image quality).
Several insights are reported regarding in-context example selection:
- Sensitivity to the Number of Examples:
While a single retrieved example consistently boosts generation quality, increasing the number of examples beyond one can degrade performance in certain configurations, particularly for smaller models. This suggests that the relevance of retrieved candidates is crucial: introducing too many less-relevant examples can confuse the generative process.
- Retriever Model Comparison:
Although the CLIP-SF retriever generally outperforms BLIP-FF on standalone retrieval, the effectiveness gap narrows when their outputs are used within the retrieval augmentation framework. This implies that even retrievers with modest standalone performance can contribute positively when their results are integrated as in-context examples.
- Model-Specific Findings:
The experiments further reveal that while models like LLaVA perform relatively well in zero-shot settings, proprietary models such as GPT-4 and Gemini-Pro continue to improve output fidelity as additional retrieved examples are provided. In text-to-image generation, the results also show that Emu2-Gen, with its larger LLaMA-33B backbone, outperforms the smaller LaVIT model, and both benefit from a single retrieved example, although excessive augmentation can again be detrimental.
Performance is measured with n-gram overlap metrics (BLEU-1 to BLEU-4, ROUGE, CIDEr) and the semantics-oriented SPICE metric for caption generation, and with FID, CLIP Score, and Inception Score for image generation. For example, quantitative evaluation on the MSCOCO dataset shows that the retrieval-augmented prompt setup significantly improves output quality over the zero-shot baseline across these metrics. Notably, the paper emphasizes that retrieval augmentation not only refines the handling of uncommon entities but also improves generation quality for common entities, challenging the conventional expectation that retrieval techniques mainly aid in processing rare content.
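For the captioning metrics, a minimal scoring sketch using the pycocoevalcap package (a common implementation of the MSCOCO caption metrics) is shown below; the paper does not state which toolkit it uses, and the example captions are made up, so treat this purely as an illustration of how BLEU and CIDEr are computed.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# One list of reference captions and one generated caption per image id.
refs = {"img_1": ["a dog catches a frisbee in the park"]}
hyps = {"img_1": ["a dog catching a frisbee"]}

bleu_scores, _ = Bleu(4).compute_score(refs, hyps)   # BLEU-1 .. BLEU-4
cider_score, _ = Cider().compute_score(refs, hyps)
print(bleu_scores, cider_score)
```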
The paper concludes with discussions on future work, including:
- Extending the evaluation to out-of-domain retrieval settings to assess the robustness of the approach when the candidate pool is not from the same distribution as the target task.
- Investigating alternative prompt templates and relevance-based candidate selection mechanisms to dynamically optimize the number and quality of in-context examples (see the sketch below).
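One way such a relevance-based cutoff could look is sketched here; the function, its parameters, and the threshold values are hypothetical and not part of the paper.

```python
def select_in_context_examples(candidates, scores, max_k=5, min_score=0.3):
    """Hypothetical relevance-based selection: keep at most max_k retrieved
    candidates, dropping any whose retrieval score falls below min_score so
    that weak matches never enter the prompt. Threshold values are illustrative."""
    ranked = sorted(zip(candidates, scores), key=lambda cs: -cs[1])
    return [c for c, s in ranked if s >= min_score][:max_k]

# Example: only the first candidate clears the threshold, so a one-shot prompt is built.
examples = select_in_context_examples(["cap A", "cap B"], [0.82, 0.21])
```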
Overall, the work presents a comprehensive evaluation of combining retrieval augmentation with MM-LLMs, yielding consistent improvements in multi-modal generation tasks and providing valuable insights into the interaction between retrieval quality and generative performance.