
Improving Retrieval-Augmented Large Language Models via Data Importance Learning (2307.03027v1)

Published 6 Jul 2023 in cs.LG, cs.CL, and cs.IR

Abstract: Retrieval augmentation enables LLMs to take advantage of external knowledge, for example on tasks like question answering and data imputation. However, the performance of such retrieval-augmented models is limited by the data quality of their underlying retrieval corpus. In this paper, we propose an algorithm based on multilinear extension for evaluating the data importance of retrieved data points. There are exponentially many terms in the multilinear extension, and one key contribution of this paper is a polynomial time algorithm that computes exactly, given a retrieval-augmented model with an additive utility function and a validation set, the data importance of data points in the retrieval corpus using the multilinear extension of the model's utility function. We further propose an even more efficient (ε, δ)-approximation algorithm. Our experimental results illustrate that we can enhance the performance of LLMs by only pruning or reweighting the retrieval corpus, without requiring further training. For some tasks, this even allows a small model (e.g., GPT-JT), augmented with a search engine API, to outperform GPT-3.5 (without retrieval augmentation). Moreover, we show that weights based on multilinear extension can be computed efficiently in practice (e.g., in less than ten minutes for a corpus with 100 million elements).
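
As a rough illustration of the idea, the sketch below estimates multilinear-extension importance weights by plain Monte Carlo sampling: the weight of a corpus element is its expected marginal contribution to an additive validation utility when the other elements are included independently with probability p. This is a generic sampling baseline, not the paper's exact polynomial-time algorithm or its (ε, δ)-approximation; the `utility` callable, its signature, and all parameter names are placeholders assumed for illustration.

```python
import random

def importance_weights(corpus, validation_set, utility, p=0.5, num_samples=100, seed=0):
    """Monte Carlo sketch of multilinear-extension data importance.

    The importance of corpus element i is taken as the partial derivative of the
    multilinear extension of the utility at inclusion probability p, i.e.
    E[utility(S ∪ {i}) - utility(S)], with S drawn by including every other
    element independently with probability p. `utility(subset, validation_set)`
    stands in for an additive validation utility of a retrieval-augmented model
    restricted to retrieving from `subset` (e.g., number of validation questions
    answered correctly); it is an assumed placeholder, not the paper's API.
    """
    rng = random.Random(seed)
    n = len(corpus)
    weights = [0.0] * n
    for _ in range(num_samples):
        # Draw one random subset by independent Bernoulli(p) inclusion.
        included = [rng.random() < p for _ in range(n)]
        for i in range(n):
            s_without_i = [corpus[j] for j in range(n) if included[j] and j != i]
            s_with_i = s_without_i + [corpus[i]]
            # Marginal contribution of element i for this sampled subset.
            weights[i] += utility(s_with_i, validation_set) - utility(s_without_i, validation_set)
    return [w / num_samples for w in weights]
```

Under this reading, elements with negative estimated weight are candidates for pruning, and the remaining weights can be used to reweight retrieval, mirroring the corpus pruning and reweighting described in the abstract.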

Authors (7)
  1. Xiaozhong Lyu (3 papers)
  2. Stefan Grafberger (6 papers)
  3. Samantha Biegel (2 papers)
  4. Shaopeng Wei (8 papers)
  5. Meng Cao (107 papers)
  6. Sebastian Schelter (20 papers)
  7. Ce Zhang (215 papers)
Citations (11)

