- The paper introduces VISA, a framework that enhances RAG by offering fine-grained visual evidence attribution using bounding boxes.
- The methodology employs vision-language models to localize the specific image regions that support an answer, achieving bounding box accuracies of up to 68.2% on medical documents.
- Experimental results confirm that integrating visual source attribution improves transparency and user trust in AI-generated answers.
Overview of VISA: Retrieval Augmented Generation with Visual Source Attribution
The paper "VISA: Retrieval Augmented Generation with Visual Source Attribution" introduces a sophisticated approach to improving the verifiability of retrieval-augmented generation (RAG) systems through visual source attribution. The conventional RAG paradigm primarily cites sources at the document level, which often poses challenges in locating precise evidence due to cognitive overload. To address this, the authors propose VISA, a framework that leverages large vision-LLMs (VLMs) to facilitate evidence identification by highlighting specific regions on document images that support generated answers using bounding boxes.
Methodology
VISA integrates visual source attribution into the RAG pipeline by employing VLMs to process retrieved document images in response to user queries. The VLMs not only generate answers but also specify the relevant regions within the document images, returning these as bounding boxes. This approach effectively bridges the gap between answer generation and verifiable source attribution at a finer granularity, surpassing existing text-based approaches that are limited to document-level referencing.
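To make this step concrete, the sketch below shows one plausible way such an attribution pipeline could be wired together. The prompt wording, the `<box>x1,y1,x2,y2</box>` output convention, and the stubbed `run_vlm` call are illustrative assumptions for this summary, not the paper's exact interface.

```python
import re
from typing import Optional, Tuple

# Assumed output convention: the VLM answers the query and appends the evidence
# region as "<box>x1,y1,x2,y2</box>" in image pixel coordinates.
BOX_PATTERN = re.compile(r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>")

def build_prompt(query: str) -> str:
    """Compose an instruction asking for an answer plus its supporting region."""
    return (
        f"Question: {query}\n"
        "Answer the question using the attached document image, then give the "
        "bounding box of the region that supports your answer as <box>x1,y1,x2,y2</box>."
    )

def run_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a real vision-language model call (e.g. a fine-tuned VLM).
    Returns a canned response here so the parsing logic below is runnable."""
    return "Paris is the capital of France. <box>120,340,580,410</box>"

def parse_attribution(response: str) -> Tuple[str, Optional[Tuple[int, ...]]]:
    """Split the model response into the answer text and the evidence bounding box."""
    match = BOX_PATTERN.search(response)
    box = tuple(int(v) for v in match.groups()) if match else None
    answer = BOX_PATTERN.sub("", response).strip()
    return answer, box

if __name__ == "__main__":
    prompt = build_prompt("What is the capital of France?")
    answer, box = parse_attribution(run_vlm("page_1.png", prompt))
    print(answer)  # generated answer shown to the user
    print(box)     # (x1, y1, x2, y2) region to highlight on the document image
```

A real system would replace `run_vlm` with an actual VLM inference call and render the returned box as a highlight over the retrieved page image.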
Two datasets are curated to evaluate VISA's effectiveness: Wiki-VISA, derived from Wikipedia page screenshots, and Paper-VISA, built from medical-domain documents in PubLayNet. The two pose different challenges, including multi-page documents and multimodal content.
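The released data schema is not reproduced here, but each example can be thought of as a query paired with one or more rendered page images and a gold evidence box. The field names and values below are illustrative assumptions, not the datasets' actual format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AttributionExample:
    """Illustrative record layout (field names are assumptions, not the released schema)."""
    query: str                                # user question
    answer: str                               # reference short answer
    page_images: List[str]                    # paths to rendered document page screenshots
    evidence_page: int                        # index of the page containing the supporting region
    evidence_box: Tuple[int, int, int, int]   # gold bounding box (x1, y1, x2, y2) in pixels

# Hypothetical Wiki-VISA-style example spanning two page screenshots.
example = AttributionExample(
    query="In which year was the organization founded?",
    answer="1952",
    page_images=["wiki_page_1.png", "wiki_page_2.png"],
    evidence_page=0,
    evidence_box=(90, 410, 520, 470),
)
```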
Experimental Findings
VISA delivers clear gains in visual source attribution accuracy. The empirical analysis shows that models such as Qwen2-VL-72B, when fine-tuned on these datasets, perform appreciably better than zero-shot prompting, achieving bounding box accuracies of up to 54.2% on Wiki-VISA and 68.2% on Paper-VISA in the single-candidate setting. In the more realistic multi-candidate setting, where the model must also discern the correct source among several retrieved documents, it reaches a bounding box accuracy of 41.6% in the full multi-candidate setup.
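The reported numbers are bounding box accuracies. A common way to score a predicted box against the gold box is an intersection-over-union (IoU) threshold, as in the sketch below; the 0.5 threshold is an assumption for illustration, not necessarily the paper's exact criterion.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def box_accuracy(preds: Sequence[Optional[Box]], golds: Sequence[Box], thresh: float = 0.5) -> float:
    """Fraction of examples whose predicted box overlaps the gold box above the threshold."""
    hits = sum(1 for p, g in zip(preds, golds) if p is not None and iou(p, g) >= thresh)
    return hits / len(golds)

# Example: one correct localization out of two predictions gives 0.5 accuracy.
print(box_accuracy([(10, 10, 110, 60), None], [(12, 8, 105, 58), (0, 0, 50, 50)]))
```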
VISA's accuracy depends on document complexity: it localizes evidence in first-page paragraphs more reliably than in multi-page documents or tabular content. The results highlight the potential of visual grounding to strengthen user trust and make RAG systems more useful by enabling intuitive evidence verification.
Implications and Future Directions
Practically, VISA contributes to the development of transparent and reliable information retrieval systems, improving user interaction by allowing visual verification of AI-generated content. Theoretically, it opens a new dimension in the interplay between VLMs and RAG systems; future work might focus on improving model generalizability across diverse document types and on the challenges posed by long documents and complex content structures.
Moreover, VISA could catalyze developments in domains where document verification is paramount, such as medical diagnostics, scientific research, and legal documentation. Future avenues may also include expanding the dataset diversity to better simulate real-world document complexities and improving model architectures to inherently support varied content modalities, potentially integrating textual with visual data.
In conclusion, VISA sets a foundational framework for embedding visual source attribution in AI-driven query answering systems, enhancing the granularity and transparency of evidence attribution, and paving the way for further innovations in AI verifiability and user trust.