
Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers (2404.07220v2)

Published 22 Mar 2024 in cs.IR, cs.AI, and cs.CL

Abstract: Retrieval-Augmented Generation (RAG) is a prevalent approach to infuse a private knowledge base of documents with Large Language Models (LLMs) to build Generative Q&A (Question-Answering) systems. However, RAG accuracy becomes increasingly challenging as the corpus of documents scales up, with Retrievers playing an outsized role in the overall RAG accuracy by extracting the most relevant document from the corpus to provide context to the LLM. In this paper, we propose the 'Blended RAG' method of leveraging semantic search techniques, such as Dense Vector indexes and Sparse Encoder indexes, blended with hybrid query strategies. Our study achieves better retrieval results and sets new benchmarks for IR (Information Retrieval) datasets like NQ and TREC-COVID. We further extend such a 'Blended Retriever' to the RAG system to demonstrate far superior results on Generative Q&A datasets like SQuAD, even surpassing fine-tuning performance.


The paper "Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers" by Kunal Sawarkar, Abhilasha Mangal, and Shivam Raj Solanki addresses the limitations of conventional Retrieval-Augmented Generation (RAG) systems as the corpus of documents scales. The authors propose the 'Blended RAG' method, which leverages semantic search techniques and hybrid query strategies to enhance the accuracy of document retrieval within RAG systems.

Introduction

RAG combines generative models with retrievers that sift through external knowledge bases to supply contextually relevant information to the generation component. The accuracy of a RAG system therefore hinges predominantly on the effectiveness of its retriever. Traditional methods, which rely heavily on keyword and similarity-based searches, often fall short on large and complex datasets. This paper proposes a more nuanced approach that blends dense vector indexes and sparse encoder indexes with hybrid query strategies.
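At its core, this retrieve-then-generate loop is simple. The sketch below illustrates it; the `index.search` and `llm.generate` interfaces are illustrative placeholders, not APIs from the paper.

```python
# Minimal retrieve-then-generate (RAG) loop. The `index` and `llm`
# objects are hypothetical stand-ins for a retriever and a generator.
def answer(question: str, index, llm, top_k: int = 5) -> str:
    # Retriever: fetch the top-k passages most relevant to the question.
    passages = index.search(question, k=top_k)
    # Ground the prompt in the retrieved context.
    context = "\n\n".join(p.text for p in passages)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # Generator: the LLM answers from the supplied context.
    return llm.generate(prompt)
```

Everything downstream depends on `passages` actually containing the answer, which is why the paper concentrates its effort on the retrieval step.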

Related Work

Historically, the BM25 algorithm has been central to information retrieval (IR), combining Term Frequency (TF), Inverse Document Frequency (IDF), and document length to compute relevance scores. More recently, dense vector models, searched with k-nearest-neighbor (KNN) algorithms, have shown superiority in capturing deep semantic relationships, while sparse encoder-based models offer an efficient alternative with better precision in high-dimensional data representation.
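For reference, the standard BM25 score of a document D for a query Q (a textbook formulation, not restated in the paper) is:

```latex
\mathrm{score}(D, Q) =
  \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```

where f(q_i, D) is the frequency of term q_i in D, |D| is the document length, avgdl is the average document length in the corpus, and k_1 and b are tuning parameters (commonly k_1 ≈ 1.2 and b ≈ 0.75).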

Limitations of Current RAG Systems

Current RAG systems struggle with accuracy due to over-reliance on keyword and similarity-based searches. As Table 1 of the paper shows, metrics such as NDCG@10 and F1 expose the limits of retriever accuracy on standard benchmarks. Efforts to improve RAG systems typically focus on fine-tuning the generator, yet this does not address the critical issue of retrieving the most relevant documents.
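NDCG@10 rewards rankings that place highly relevant documents near the top. A minimal sketch of the linear-gain variant follows; the paper does not specify which DCG formulation it uses, and an exponential-gain variant (2^rel - 1) is also common:

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """NDCG@k for one query.

    `relevances[i]` is the judged relevance grade of the document
    ranked at position i; the ideal ranking sorts grades descending.
    """
    def dcg(rels):
        # Linear gain with a log2 position discount (rank 0 -> log2(2) = 1).
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))

    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0
```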

Methodology: Blended Retrievers

The authors explored three distinct search strategies: BM25 for keyword-based search, KNN for dense vector search, and the Elastic Learned Sparse Encoder (ELSER) for sparse encoder-based semantic search. Figure 1 in the paper outlines a systematic evaluation across these indices, with hybrid queries categorized into cross fields, most fields, best fields, and phrase prefix types.
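These four query types correspond to the `type` parameter of Elasticsearch's `multi_match` query. A sketch of how such a query body might be built follows; the field names are assumptions for illustration, not taken from the paper:

```python
# Build an Elasticsearch multi_match query body for one of the hybrid
# query types discussed above. Field names ("title", "text") are
# hypothetical examples, not from the paper.
def hybrid_query(question: str, match_type: str) -> dict:
    assert match_type in {"cross_fields", "most_fields",
                          "best_fields", "phrase_prefix"}
    return {
        "query": {
            "multi_match": {
                "query": question,
                "type": match_type,
                "fields": ["title", "text"],
            }
        }
    }
```

In Elasticsearch 8.x, dense vector and ELSER retrieval are expressed with the analogous `knn` and `text_expansion` clauses of the same search API; blending combines such semantic clauses with lexical queries like the one above.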

Experimentation and Results

Retriever Evaluation

The evaluation focused on top-k retrieval accuracy across the NQ, TREC-COVID, SQuAD, and HotPotQA datasets. Notably, the Sparse Encoder with Best Fields hybrid query consistently outperformed the other combinations: on the NQ dataset it achieved 88.77% retrieval accuracy, and on TREC-COVID it reached 98% top-10 retrieval accuracy for highly relevant documents.
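Top-k retrieval accuracy here means the fraction of queries for which at least one relevant document appears in the first k results. A simple sketch of that computation (the paper's exact protocol may differ):

```python
def top_k_accuracy(results: list[list[str]],
                   gold: list[set[str]],
                   k: int = 10) -> float:
    """Fraction of queries whose top-k retrieved IDs hit a gold document.

    `results[i]` is the ranked doc-ID list returned for query i;
    `gold[i]` is the set of relevant doc IDs for that query.
    """
    hits = sum(
        any(doc_id in gold_ids for doc_id in ranked[:k])
        for ranked, gold_ids in zip(results, gold)
    )
    return hits / len(results)
```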

RAG System Evaluation

Using FLAN-T5-XXL as the generator, the Blended RAG system surpassed previous benchmarks without dataset-specific fine-tuning. On the SQuAD dataset it achieved an F1 score of 68.4% and an Exact Match (EM) score of 57.63%. On the NQ dataset it reached an EM score of 42.63, outperforming other models by significant margins.
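EM and F1 follow the standard SQuAD evaluation: answers are normalized (lowercased, punctuation and articles stripped), then compared exactly (EM) or by token overlap (F1). A condensed sketch of the official script's logic:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def f1_score(pred: str, gold: str) -> float:
    pred_toks = normalize(pred).split()
    gold_toks = normalize(gold).split()
    # Token-level overlap counted with multiplicity.
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```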

Implications and Future Work

The main implication of this research is the demonstrated efficacy of integrating advanced retrievers with hybrid query formulations over simply scaling LLMs. This approach suggests potential improvements in various applications, from enterprise search systems to conversational AI.

Future research could explore more intricate hybrid queries and evaluate the Blended RAG system across additional datasets. Additionally, the development of better metrics beyond NDCG@10 and F1, which more closely align with human judgment, remains an open area.

Conclusion

The Blended RAG approach significantly improves both retriever accuracy and end-to-end RAG accuracy by integrating semantic search with hybrid queries. The method establishes a new standard on IR benchmarks, underscoring the importance of sophisticated retrievers in enhancing generative Q&A systems.

The paper lays a strong foundation for further explorations into synergizing dense and sparse indexing approaches, setting the stage for improvements in retrieval-augmented generative architectures that can extend across diverse informational contexts.
