UniversalRAG: A Robust RAG Framework for Heterogeneous Modality Integration
The paper introduces UniversalRAG, a Retrieval-Augmented Generation (RAG) framework designed to address a key limitation of traditional RAG systems: their restriction to a single-modality, typically text-only, corpus. RAG, known for improving the factual accuracy of LLMs by grounding them in external knowledge, is extended here to operate across a diverse range of modalities including text, images, and videos. The approach is motivated by the observation that user queries vary widely, and frameworks limited to a single modality or a single level of granularity cannot serve them all adequately.
UniversalRAG incorporates a modality-aware routing mechanism that selects the most relevant modality-specific corpus for each query. This design sidesteps the "modality gap" that arises when all modalities are forced into a single unified representation space, where retrieval becomes biased toward items sharing the query's modality: the paper's experiments show, for instance, that text queries preferentially retrieve text even when relevant visual evidence exists. Beyond modality, UniversalRAG also organizes each corpus by granularity, allowing the framework to match the scope of retrieved content to the complexity of the query.
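To make the routing concrete, the following is a minimal sketch of how such a modality-aware dispatch might be wired up. The route labels, retriever interfaces, and fallback behavior are illustrative assumptions, not the paper's exact implementation:

```python
from typing import Callable, Dict, List

# Candidate routes: "none" skips retrieval entirely, and text/video are
# split by granularity (paragraph vs. document, clip vs. full video),
# mirroring the corpus organization described in the paper.
ROUTES = {"none", "paragraph", "document", "image", "clip", "video"}

def route_query(query: str, classify: Callable[[str], str]) -> str:
    """Pick a corpus for the query. `classify` can be any router:
    a prompted LLM (training-free) or a trained classifier."""
    route = classify(query)
    return route if route in ROUTES else "paragraph"  # fallback is an assumption

def answer(query: str,
           retrievers: Dict[str, Callable[[str], List[str]]],
           generate: Callable[[str, List[str]], str],
           classify: Callable[[str], str]) -> str:
    """Route the query, retrieve from the chosen corpus, then generate."""
    route = route_query(query, classify)
    # "none" means the model answers from parametric knowledge alone.
    context = [] if route == "none" else retrievers[route](query)
    return generate(query, context)
```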
UniversalRAG's architecture is validated across eight benchmarks, where it outperforms both modality-specific baselines (text-only, image-only, and video-only retrieval) and a unified baseline that aggregates all data into a single multimodal embedding space. The results show that UniversalRAG consistently achieves the highest average scores, underscoring its robust retrieval performance across diverse query types.
To route queries across modalities and granularities, the framework supports both training-free and trained routing methodologies. The training-free router prompts a pre-trained LLM to classify each query into a retrieval category without additional training, while trained routers are learned from a training dataset constructed by exploiting the modality biases of existing benchmarks (e.g., queries drawn from a video QA benchmark are labeled with the video route). The trained routers achieve higher routing accuracy on in-domain datasets, whereas the training-free router generalizes better to out-of-domain datasets, suggesting a trade-off between precision and adaptability.
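The two router variants could be sketched as follows; the `llm` text-completion callable, the `embed` function, the prompt wording, and the logistic-regression stand-in are all assumptions for illustration, not the paper's specific choices:

```python
from sklearn.linear_model import LogisticRegression

ROUTES = {"none", "paragraph", "document", "image", "clip", "video"}

# Training-free variant: zero-shot classification via a prompted LLM.
ROUTER_PROMPT = (
    "Decide which corpus best answers the query. Reply with exactly one "
    "word from: none, paragraph, document, image, clip, video.\n"
    "Query: {query}\nAnswer:"
)

def training_free_route(query: str, llm) -> str:
    """`llm` is any text-in/text-out callable (an assumption, not a fixed API)."""
    label = llm(ROUTER_PROMPT.format(query=query)).strip().lower()
    return label if label in ROUTES else "paragraph"

# Trained variant: a lightweight classifier over query embeddings, fit on
# (query, route) pairs derived from each benchmark's source modality.
def fit_trained_router(queries, labels, embed):
    """`embed` maps a query string to a fixed-size vector (an assumption)."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit([embed(q) for q in queries], labels)
    return lambda q: clf.predict([embed(q)])[0]
```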
UniversalRAG also underscores the importance of retrieval granularity. By segmenting the text and video corpora into multiple granularities (e.g., paragraphs versus full documents, clips versus full videos), the framework retrieves context whose scope matches the specific needs of the query, improving both retrieval precision and generation quality.
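A minimal sketch of such multi-granularity indexing is shown below; the paragraph-splitting rule and the 60-second clip length are illustrative assumptions rather than the paper's settings:

```python
def index_text(doc_id: str, text: str) -> list:
    """Index one document at two granularities: the full document for
    broad queries and individual paragraphs for pinpoint facts."""
    entries = [{"id": doc_id, "granularity": "document", "content": text}]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    for i, para in enumerate(paragraphs):
        entries.append({"id": f"{doc_id}#p{i}",
                        "granularity": "paragraph", "content": para})
    return entries

def index_video(video_id: str, duration_s: float,
                clip_len_s: float = 60.0) -> list:
    """Index one video as a whole plus fixed-length clips."""
    entries = [{"id": video_id, "granularity": "video",
                "span": (0.0, duration_s)}]
    start = 0.0
    while start < duration_s:
        end = min(start + clip_len_s, duration_s)
        entries.append({"id": f"{video_id}@{int(start)}s",
                        "granularity": "clip", "span": (start, end)})
        start = end
    return entries
```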
This work makes a substantial conceptual and practical contribution to multimodal AI. It equips LLMs with stronger factual grounding by leveraging heterogeneous external knowledge sources, and it points the way toward more nuanced and precise information retrieval systems. Introducing modality and granularity awareness into RAG opens new possibilities for AI systems capable of robust multimodal reasoning, with potential applications in complex real-world environments.
Moving forward, the implications of UniversalRAG could be far-reaching. The approach sets a precedent for flexible retrieval strategies that adapt dynamically to user queries across different contexts and modalities. Such adaptability improves the user experience in interactive AI systems and broadens the range of applications where AI can be deployed reliably. Future work could integrate additional modalities, such as audio or real-time data streams, or explore finer granularities, further extending the applicability and robustness of retrieval-augmented systems across diverse domains.