
VISA: Retrieval Augmented Generation with Visual Source Attribution (2412.14457v1)

Published 19 Dec 2024 in cs.IR

Abstract: Generation with source attribution is important for enhancing the verifiability of retrieval-augmented generation (RAG) systems. However, existing approaches in RAG primarily link generated content to document-level references, making it challenging for users to locate evidence among multiple content-rich retrieved documents. To address this challenge, we propose Retrieval-Augmented Generation with Visual Source Attribution (VISA), a novel approach that combines answer generation with visual source attribution. Leveraging large vision-language models (VLMs), VISA identifies the evidence and highlights the exact regions that support the generated answers with bounding boxes in the retrieved document screenshots. To evaluate its effectiveness, we curated two datasets: Wiki-VISA, based on crawled Wikipedia webpage screenshots, and Paper-VISA, derived from PubLayNet and tailored to the medical domain. Experimental results demonstrate the effectiveness of VISA for visual source attribution on documents' original look, as well as highlighting the challenges for improvement. Code, data, and model checkpoints will be released.

Summary

  • The paper introduces VISA, a framework that enhances RAG by offering fine-grained visual evidence attribution using bounding boxes.
  • The methodology employs vision-language models to extract specific image regions, achieving bounding box accuracies up to 68.2% in medical documents.
  • Experimental results confirm that integrating visual source attribution improves transparency and user trust in AI-generated answers.

Overview of VISA: Retrieval Augmented Generation with Visual Source Attribution

The paper "VISA: Retrieval Augmented Generation with Visual Source Attribution" introduces an approach to improving the verifiability of retrieval-augmented generation (RAG) systems through visual source attribution. The conventional RAG paradigm cites sources only at the document level, which leaves users to sift through multiple content-rich documents to locate the precise supporting evidence. To address this, the authors propose VISA, a framework that leverages large vision-language models (VLMs) to facilitate evidence identification by highlighting, with bounding boxes, the specific regions of document screenshots that support the generated answers.

Methodology

VISA integrates visual source attribution into the RAG pipeline by employing VLMs to process retrieved document images in response to user queries. The VLMs not only generate answers but also specify the relevant regions within the document images, returning these as bounding boxes. This approach effectively bridges the gap between answer generation and verifiable source attribution at a finer granularity, surpassing existing text-based approaches that are limited to document-level referencing.
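To make the pipeline concrete, the sketch below shows how a VISA-style system might parse a bounding-box attribution out of a VLM's text output. The `<box>(x1,y1),(x2,y2)</box>` serialization is a hypothetical example; the actual token format is model-specific and not specified in the summary above.

```python
import re
from typing import Optional, Tuple

# Hypothetical serialization: the VLM emits its answer followed by an
# evidence region as "<box>(x1,y1),(x2,y2)</box>" in screenshot pixel
# coordinates. Real models use their own (often special-token) formats.
BOX_PATTERN = re.compile(r"<box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>")

def parse_box(model_output: str) -> Optional[Tuple[int, int, int, int]]:
    """Extract one (x1, y1, x2, y2) evidence box from VLM output, if present."""
    m = BOX_PATTERN.search(model_output)
    if m is None:
        return None
    x1, y1, x2, y2 = map(int, m.groups())
    return (x1, y1, x2, y2)

# Example: an answer plus an attributed evidence region on the screenshot.
output = "The capital of France is Paris. <box>(120,340),(880,520)</box>"
print(parse_box(output))  # -> (120, 340, 880, 520)
```

The parsed box can then be drawn onto the retrieved document screenshot so the user sees exactly which region supports the answer.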

Two datasets are curated to evaluate VISA's effectiveness—Wiki-VISA, derived from Wikipedia page screenshots, and Paper-VISA, based on medical domain documents from PubLayNet. The datasets exhibit varied challenges, such as handling multi-page documents and multimodal content.

Experimental Findings

VISA demonstrates notable gains in visual source attribution accuracy. The empirical analysis shows that models such as Qwen2-VL-72B, when fine-tuned on these datasets, perform appreciably better than zero-shot prompting, achieving bounding-box accuracies of up to 54.2% on Wiki-VISA and 68.2% on Paper-VISA in the single-candidate setting. In the more realistic multi-candidate setting, where the model must also discern the correct source among several retrieved documents, bounding-box accuracy reaches 41.6%.
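Bounding-box accuracy of this kind is typically computed by scoring a predicted box against the gold evidence box via intersection-over-union (IoU) and counting a prediction as correct above some threshold. The sketch below assumes an IoU threshold of 0.5; the summary above does not state the exact criterion the paper uses.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def box_accuracy(preds: Sequence[Box], golds: Sequence[Box],
                 thresh: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the gold box meets the threshold."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, golds))
    return hits / len(golds)

# Two overlapping 10x10 boxes share a 5x10 strip: IoU = 50 / 150 = 1/3.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # -> 0.3333...
```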

VISA's performance depends on document complexity: accuracy is higher on first-page paragraphs than on multi-page or tabular content. The results highlight the potential of visual grounding to enhance user trust and the utility of RAG systems by enabling more intuitive evidence verification.

Implications and Future Directions

Practically, VISA presents a substantial contribution to the development of transparent and reliable information retrieval systems, enhancing user interaction by allowing visual verification of AI-generated content. Theoretically, it opens a new dimension in the interplay between VLMs and RAG systems, suggesting that future work might focus on improving the generalizability of the models across diverse document types and addressing the challenges of long documents and complex content structures.

Moreover, VISA could catalyze developments in domains where document verification is paramount, such as medical diagnostics, scientific research, and legal documentation. Future avenues may also include expanding the dataset diversity to better simulate real-world document complexities and improving model architectures to inherently support varied content modalities, potentially integrating textual with visual data.

In conclusion, VISA sets a foundational framework for embedding visual source attribution in AI-driven query answering systems, enhancing the granularity and transparency of evidence attribution, and paving the way for further innovations in AI verifiability and user trust.
