RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models (2308.10633v2)
Abstract: Retrieval-augmented LLMs (R-LLMs) combine pre-trained LLMs with information retrieval systems to improve the accuracy of factual question-answering. However, current libraries for building R-LLMs provide high-level abstractions without sufficient transparency for evaluating and optimizing prompts within specific inference processes such as retrieval and generation. To address this gap, we present RaLLe, an open-source framework designed to facilitate the development, evaluation, and optimization of R-LLMs for knowledge-intensive tasks. With RaLLe, developers can easily develop and evaluate R-LLMs, improving hand-crafted prompts, assessing individual inference processes, and objectively measuring overall system performance quantitatively. By leveraging these features, developers can enhance the performance and accuracy of their R-LLMs in knowledge-intensive generation tasks. We open-source our code at https://github.com/yhoshi3/RaLLe.
- Gradio: Hassle-free sharing and testing of ml models in the wild. arXiv preprint arXiv:1906.02569.
- Akari Asai and Eunsol Choi. 2021. Challenges in information-seeking QA: Unanswerable questions and paragraph retrieval. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1492–1504, Online. Association for Computational Linguistics.
- A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023.
- Ali Borji. 2023. A categorical archive of ChatGPT failures. arXiv preprint arXiv:2302.03494.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
- Harrison Chase. 2023. LangChain. https://langchain.com/.
- Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- PaLM: Scaling language modeling with pathways. arxiv:2204.02311.
- Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- Nick Craswell. 2016. R-Precision, pages 1–1. Springer New York, New York, NY.
- Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929–3938. PMLR.
- Benjamin Heinzerling and Kentaro Inui. 2021. Language models as knowledge bases: On entity representations, storage capacity, and paraphrased queries. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1772–1791, Online. Association for Computational Linguistics.
- Diskann: Fast accurate billion-point nearest neighbor search on a single node. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
- Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547.
- Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
- Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
- June Lee. 2023. WizardVicunaLM. https://github.com/melodysdreamj/WizardVicunaLM.
- BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
- Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474. Curran Associates, Inc.
- LF Projects. 2023. MLflow – a platform for the machine learning lifecycle. https://mlflow.org/.
- Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), pages 2356–2362.
- StreamingQA: A benchmark for adaptation to new knowledge over time in question answering models. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 13604–13622. PMLR.
- Query rewriting for retrieval-augmented large language models. arXiv preprint arXiv:2305.14283.
- Yu A. Malkov and D. A. Yashunin. 2020. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell., 42(4):824–836.
- Augmented language models: a survey. arXiv preprint arXiv:2302.07842.
- MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, Dubrovnik, Croatia. Association for Computational Linguistics.
- Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
- SimplyRetrieve: A private and lightweight retrieval-centric generative ai tool. arXiv preprint arXiv:2308.03983.
- OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774, abs/2303.08774.
- KILT: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2523–2544, Online. Association for Computational Linguistics.
- What are you token about? dense retrieval as distributions over the vocabulary. arXiv preprint arXiv:2212.10380.
- Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389.
- RePlug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652.
- Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, volume 33, pages 3008–3021. Curran Associates, Inc.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
- WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
- ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.
- Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations.