Reverse Image Retrieval Augmentation for Multimodal LLMs
Overview
This paper addresses a notable limitation of current multimodal LLMs (MLLMs), including models in the GPT-4 family: they struggle with knowledge-intensive tasks. The authors propose Reverse Image Retrieval (RIR) augmented generation, a simple strategy that supplies MLLMs with web-scale reverse image search results as additional context. This approach substantially improves performance on knowledge-intensive visual question answering (VQA) tasks.
Key Contributions
The primary contributions of the paper are as follows:
- RIR Augmentation: The paper introduces RIR as a method to augment MLLMs by providing additional visual and textual cues from web-scale reverse image searches.
- Performance Gains: Experimental results show substantial improvements on knowledge-intensive VQA: GPT-4V improves by 37-43%, GPT-4 Turbo by 25-27%, and GPT-4o by 18-20%.
- Surprising Finding: RIR often works by helping the model access its own world knowledge rather than by supplying the answer directly from the web.
- Evaluation and Analysis: The paper analyzes the scenarios in which RIR helps or hurts and validates its findings with human evaluation.
- Insights into MLLMs: The research highlights that MLLMs possess more world knowledge than they can typically access, and RIR helps bridge this gap.
Methodology
Reverse Image Retrieval (RIR) Pipeline
The RIR pipeline uses a browser-based API to perform a reverse image search on the query image. A screenshot of the results page, containing multiple related images and their captions, is then passed to the MLLM as additional context. Despite its simplicity, this strategy injects multimodal cues from the web directly into the MLLM's input; a sketch of the pipeline follows.
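To make the pipeline concrete, here is a minimal Python sketch. It assumes the query image is reachable at a public URL, uses Google Lens's upload-by-URL page as the reverse-search backend, and feeds the screenshot to GPT-4o through the OpenAI API; the paper's exact endpoint, prompt wording, and wait logic are not specified here, so treat these as illustrative choices.

```python
import base64

from openai import OpenAI
from playwright.sync_api import sync_playwright


def reverse_image_search_screenshot(image_url: str) -> bytes:
    """Open a reverse image search for `image_url` and screenshot the results."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1600})
        # Assumption: Google Lens's upload-by-URL page as the search backend.
        page.goto(f"https://lens.google.com/uploadbyurl?url={image_url}")
        page.wait_for_timeout(5000)  # crude wait for results to render
        shot = page.screenshot()
        browser.close()
    return shot


def answer_with_rir(client: OpenAI, image_url: str, question: str) -> str:
    """Ask the MLLM a question about the image, with RIR results as context."""
    rir_png = reverse_image_search_screenshot(image_url)
    rir_b64 = base64.b64encode(rir_png).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Question about the first image: {question}\n"
                         "The second image is a screenshot of reverse image "
                         "search results for it; use it as context."},
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{rir_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

In a production setting the fixed timeout would be replaced by waiting on a results-page selector, and the screenshot could be cropped to the results panel to save context tokens.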
Experimental Setup
Experiments were conducted on two datasets: INFOSEEK, which covers diverse world-knowledge questions about visual entities, and SnakeCLEF, which requires identifying snake species from images. Evaluation used GPT-as-judge Accuracy and Answer-in-prediction Recall for INFOSEEK, and Binomial-EM, Genus-EM, Binomial-Recall, and Genus-Recall for SnakeCLEF; sketches of the simpler metrics follow.
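As an illustration, below are minimal sketches of Answer-in-prediction Recall and the exact-match scores used for SnakeCLEF. The paper's exact normalization and alias handling may differ; these are assumptions for clarity.

```python
def answer_in_prediction_recall(prediction: str, gold_answers: list[str]) -> float:
    """1.0 if any gold answer string occurs in the prediction, else 0.0."""
    pred = prediction.lower()
    return float(any(ans.lower() in pred for ans in gold_answers))


def binomial_exact_match(prediction: str, gold_binomial: str) -> float:
    """Exact match on the full scientific (binomial) name, e.g. 'Naja naja'."""
    return float(prediction.strip().lower() == gold_binomial.strip().lower())


def genus_exact_match(prediction: str, gold_binomial: str) -> float:
    """Exact match on the genus alone (first token of the binomial name)."""
    pred = prediction.strip()
    pred_genus = pred.split()[0].lower() if pred else ""
    return float(pred_genus == gold_binomial.strip().split()[0].lower())
```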
Results
The results demonstrate that RIR consistently improves MLLM performance across models and datasets. For example, GPT-4V's GPT-as-judge Accuracy on INFOSEEK rose by 42.55%, and GPT-4o's Binomial-EM score on SnakeCLEF more than doubled. These improvements underscore RIR's potential to enhance MLLM capabilities in knowledge-intensive tasks.
Analysis
Access to World Knowledge
A key finding is that RIR often does not supply the answer outright; instead, it helps align the visual query with the model's latent knowledge. The authors illustrate this with an INFOSEEK probe: when questions were rephrased as text-only queries that name the ground-truth (oracle) entity, the models answered correctly, showing that MLLMs possess the necessary factual knowledge but struggle to access it from visual prompts alone.
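A minimal sketch of such an oracle-entity probe, assuming an OpenAI-style client and a simple prompt template (the paper's exact rephrasing procedure may differ):

```python
from openai import OpenAI


def text_only_oracle_probe(client: OpenAI, question: str, oracle_entity: str) -> str:
    """Ask the VQA question text-only, naming the entity the image would show.

    `oracle_entity` is the ground-truth entity label from the dataset; the
    prompt template here is an illustrative assumption, not the paper's.
    """
    prompt = f"The image in question shows {oracle_entity}. {question}"
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Comparing accuracy on these text-only probes against accuracy on the original image-grounded questions isolates knowledge access from knowledge possession.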
Impact on Long-tail Concepts
The paper also shows that RIR is particularly beneficial for long-tail concepts and objects that are underrepresented in training data. This was evidenced by a higher improvement rate on less common entities, using Google search result counts as a proxy for entity popularity.
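One way to reproduce this kind of analysis is to bucket questions by the entity's search result count and compare accuracy with and without RIR in each bucket. A hypothetical sketch, where both the record format and the quartile bucketing are assumptions:

```python
from statistics import mean


def accuracy_gain_by_popularity(records, n_buckets: int = 4) -> list[float]:
    """Per-bucket accuracy gain from RIR, ordered least to most popular.

    `records` is a non-empty list of (search_result_count, baseline_correct,
    rir_correct) triples, with correctness given as 0/1 or bool.
    """
    ranked = sorted(records, key=lambda r: r[0])
    size = -(-len(ranked) // n_buckets)  # ceiling division
    gains = []
    for start in range(0, len(ranked), size):
        bucket = ranked[start:start + size]
        gains.append(mean(r[2] for r in bucket) - mean(r[1] for r in bucket))
    return gains
```

A larger gain in the low-count buckets than in the high-count buckets would mirror the paper's long-tail finding.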
Implications and Future Directions
Practical Implications
Practically, RIR can be integrated into existing MLLM-based applications with little engineering overhead to provide more accurate responses in knowledge-intensive domains. This could benefit areas such as medical diagnosis, rare species identification, and other specialized fields.
Theoretical Implications
Theoretically, the findings suggest that while MLLMs possess extensive world knowledge, a significant challenge lies in accessing and leveraging this knowledge effectively. RIR serves as a tool to bridge this gap, opening new avenues for enhancing multimodal understanding in LLMs.
Future Developments
Future work could explore more sophisticated RIR techniques, such as fine-grained parsing of search results and integration with browsing agents. Additionally, understanding how integrated multimodal training from scratch compares to the current fusion of pre-trained backbones could provide deeper insights into optimizing MLLM performance.
Conclusion
This paper makes notable strides in enhancing MLLMs for knowledge-intensive tasks through RIR augmentation. The findings reveal that RIR not only augments knowledge but also better aligns visual queries with the model's latent knowledge, marking a significant step forward in the practical and theoretical development of MLLMs.