
Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications (2410.21943v1)

Published 29 Oct 2024 in cs.CL and cs.AI

Abstract: LLMs have demonstrated impressive capabilities in answering questions, but they lack domain-specific knowledge and are prone to hallucinations. Retrieval Augmented Generation (RAG) is one approach to address these challenges, while multimodal models are emerging as promising AI assistants for processing both text and images. In this paper we describe a series of experiments aimed at determining how to best integrate multimodal models into RAG systems for the industrial domain. The purpose of the experiments is to determine whether including images alongside text from documents within the industrial domain increases RAG performance and to find the optimal configuration for such a multimodal RAG system. Our experiments include two approaches for image processing and retrieval, as well as two LLMs (GPT4-Vision and LLaVA) for answer synthesis. These image processing strategies involve the use of multimodal embeddings and the generation of textual summaries from images. We evaluate our experiments with an LLM-as-a-Judge approach. Our results reveal that multimodal RAG can outperform single-modality RAG settings, although image retrieval poses a greater challenge than text retrieval. Additionally, leveraging textual summaries from images presents a more promising approach compared to the use of multimodal embeddings, providing more opportunities for future advancements.


Summary

  • The paper finds that adding images to text can improve RAG performance on industrial documents, although image retrieval proves harder than text retrieval.
  • It compares two image processing strategies, multimodal embeddings with CLIP and textual summaries generated from images, and evaluates them with metrics such as Answer Correctness and Faithfulness.
  • Of the two MLLMs tested for answer synthesis, GPT-4V outperforms LLaVA on Answer Correctness and Text Faithfulness.

Multimodal Retrieval Augmented Generation for Industrial Applications

The paper "Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications" explores the integration of multimodal LLMs (MLLMs) into Retrieval Augmented Generation (RAG) systems, specifically tailored for the industrial domain. The primary concern addressed is whether incorporating images alongside text enhances the RAG system's ability to comprehend and respond to domain-specific questions.

Overview

LLMs have proven capabilities in general language tasks, yet their effectiveness diminishes when applied to domain-specific scenarios due to limitations like hallucinations and inadequate domain-specific knowledge. Retrieval Augmented Generation (RAG) systems, which combine document retrieval with generative models, are highlighted as a method to mitigate these limitations by incorporating domain-specific data.
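The retrieve-then-generate pattern described above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the bag-of-words `embed` function stands in for a learned text encoder, the sample chunks are invented, and the final prompt would be sent to a generative model that is omitted here.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (assumption: a real system uses a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble the prompt that would be passed to the generative model."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only the context below.\nContext:\n{joined}\nQuestion: {query}"


# Invented stand-ins for chunks of an industrial manual.
CHUNKS = [
    "To reset the controller, hold the reset button for five seconds.",
    "The conveyor belt must be lubricated every 200 operating hours.",
    "Error code E12 indicates an overheated motor.",
]

question = "How do I reset the controller?"
top = retrieve(question, CHUNKS, k=1)
prompt = build_prompt(question, top)
```

Grounding the prompt in retrieved chunks is what lets the model answer from domain documents rather than from its parametric knowledge alone.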

The paper investigates multimodal RAG in industrial contexts, where textual and visual data appear side by side, for example in manuals and technical documentation. The authors ran a series of experiments with two image processing and retrieval strategies and two MLLMs (GPT4-Vision and LLaVA) for answer synthesis, aiming to identify the best configuration for a multimodal RAG system in industrial settings.

Key Experimental Insights

Four aspects of the method and results stand out:

  1. Image Processing Strategies: Two strategies were employed for handling image data: multimodal embeddings using CLIP and textual summaries generated from images via MLLMs. The paper compares these strategies in the context of performance within RAG pipelines.
  2. Evaluation Methodology: The configurations were evaluated with an LLM-as-a-Judge approach that benchmarks both the retrieval and the generation components of the system. Key metrics included Answer Correctness, Answer Relevancy, and several measures of Faithfulness and Context Relevance.
  3. Results and Observations: Multimodal RAG systems, which integrate both text and images, generally outperform single-modality systems, especially when they use image summaries rather than multimodal embeddings. However, image retrieval remains harder than text retrieval.
  4. Model Comparisons: GPT-4V consistently outperformed LLaVA in terms of Answer Correctness and Text Faithfulness. The paper reports nuanced insights into the capabilities of the MLLMs when handling multiple images, showcasing benefits in contextual comprehension.
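The summary-based strategy from point 1 can be illustrated with a small sketch. It assumes each image has already been described in text by an MLLM; the summaries below are hand-written stand-ins, the file names are hypothetical, and the bag-of-words similarity replaces a real text encoder. In the multimodal-embedding alternative, the query vector would instead be compared directly against image vectors in a shared embedding space.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class ImageEntry:
    path: str     # hypothetical image file name
    summary: str  # text an MLLM would generate; hand-written here


def bow(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for a text embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_images(query: str, entries: list[ImageEntry], k: int = 1) -> list[ImageEntry]:
    """Rank images by similarity between the query and each image's text summary."""
    q = bow(query)
    ranked = sorted(entries, key=lambda e: cosine(q, bow(e.summary)), reverse=True)
    return ranked[:k]


ENTRIES = [
    ImageEntry("fig_wiring.png", "Wiring diagram showing the motor connected to terminal block X1."),
    ImageEntry("fig_panel.png", "Photo of the control panel with the emergency stop button highlighted."),
]

hits = retrieve_images("Where is the emergency stop button?", ENTRIES)
```

Because the summaries are plain text, images can be indexed and searched with the same machinery as the document text, which is one plausible reason this route is easier to improve than a separate image embedding space.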

Implications and Future Directions

The implications of this research extend both theoretically and practically. The findings contribute to the growing body of knowledge surrounding multimodal RAG, emphasizing its potential in enhancing domain-specific AI applications. Practically, this has significance in industries reliant on the convergence of textual and visual documentation, such as manufacturing and engineering, where AI systems need to interpret and synthesize multimodal data effectively.

For future research, the paper suggests efforts to refine image retrieval processes further, potentially through domain-specific adaptations and fine-tuning methodologies. The authors propose exploring the integration of RAG systems with fine-tuned MLLMs dedicated to industrial tasks, which could yield improvements in both accuracy and applicability.

Overall, the paper makes a case for integrating multimodal data into RAG systems while documenting both the benefits and the remaining challenges, chiefly image retrieval. Its conclusions rest on experimental evaluation across image processing strategies and answer-synthesis models, and they point toward more robust domain-specific AI tools for industrial applications.
