
ARL2: Aligning Retrievers for Black-box Large Language Models via Self-guided Adaptive Relevance Labeling (2402.13542v2)

Published 21 Feb 2024 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: Retrieval-augmented generation enhances LLMs by incorporating relevant information from external knowledge sources. This enables LLMs to adapt to specific domains and mitigate hallucinations in knowledge-intensive tasks. However, existing retrievers are often misaligned with LLMs due to their separate training processes and the black-box nature of LLMs. To address this challenge, we propose ARL2, a retriever learning technique that harnesses LLMs as labelers. ARL2 leverages LLMs to annotate and score relevant evidence, enabling learning the retriever from robust LLM supervision. Furthermore, ARL2 uses an adaptive self-training strategy for curating high-quality and diverse relevance data, which can effectively reduce the annotation cost. Extensive experiments demonstrate the effectiveness of ARL2, achieving accuracy improvements of 5.4% on NQ and 4.6% on MMLU compared to the state-of-the-art methods. Additionally, ARL2 exhibits robust transfer learning capabilities and strong zero-shot generalization abilities. Our code will be published at \url{https://github.com/zhanglingxi-cs/ARL2}.

Leveraging LLMs for Enhanced Retrieval-Augmented Task Performance with ARL2

Introduction to ARL2

ARL2 is a retrieval-augmented generation (RAG) technique that addresses the critical challenge of aligning retrievers with LLMs. The misalignment arises because retrievers and LLMs are traditionally trained in separate, siloed processes, which hinders effective use of external knowledge sources. ARL2 closes this gap by employing the LLM itself as an annotator that produces relevance labels, so the retriever is trained under robust LLM supervision. This is paired with an adaptive self-training strategy that curates high-quality and diverse relevance data, reducing annotation cost while improving LLM performance on knowledge-intensive tasks.
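
The labeling idea can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration rather than the authors' released code: it assumes a generic `llm` callable that takes a prompt string and returns text, and asks the LLM to score how well a candidate passage supports answering a question, which is the kind of self-guided relevance signal ARL2 uses to supervise the retriever.

```python
# Minimal sketch of LLM-guided relevance labeling (hypothetical helper, not the ARL2 release).
# `llm` is assumed to be any callable that takes a prompt string and returns a string.
from typing import Callable, List, Tuple

PROMPT = (
    "Question: {question}\n"
    "Passage: {passage}\n"
    "On a scale from 0 (irrelevant) to 1 (fully supports the answer), "
    "how relevant is the passage to answering the question? "
    "Reply with a single number."
)

def label_relevance(
    llm: Callable[[str], str],
    question: str,
    passages: List[str],
) -> List[Tuple[str, float]]:
    """Ask the LLM to score each candidate passage; fall back to 0.0 on parse errors."""
    labeled = []
    for passage in passages:
        reply = llm(PROMPT.format(question=question, passage=passage))
        try:
            score = max(0.0, min(1.0, float(reply.strip().split()[0])))
        except (ValueError, IndexError):
            score = 0.0
        labeled.append((passage, score))
    return labeled
```

In the paper, such scores are further filtered by the adaptive self-training strategy so that only high-quality, diverse (question, evidence, score) tuples are kept, which limits the number of LLM calls and hence the annotation cost.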

Methodology Overview

ARL2's self-guided adaptive relevance labeling departs from indirect supervision methods: the LLM assesses document relevance directly. The core components of ARL2's methodology are:

  • Data Construction: ARL2 prompts the LLM to build training tuples, generating diverse questions and evidence together with relevance scores. This lets the retriever learn to separate truly relevant documents from similar but irrelevant ones.
  • Retrieval Model Learning: ARL2 fine-tunes the retriever with a learning objective that combines pairwise and list-wise losses (a minimal sketch of such an objective follows this list). The adaptive relevance labeling strategy reuses LLM-generated annotations efficiently, reducing dependence on costly LLM calls.
  • Inference Process: ARL2 augments the LLM with supporting evidence by reordering retrieved documents by relevance and using an ensemble method to compute the final answer score, making the LLM's use of external knowledge more robust.
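
To make the training objective more concrete, here is a minimal sketch, assuming a dual-encoder retriever that produces one similarity score per candidate passage and LLM-provided relevance labels; the weighting `alpha`, the `margin`, and the exact loss forms are assumptions, not the exact ARL2 implementation. The pairwise term pushes relevant passages above irrelevant ones, while the list-wise term matches the retriever's score distribution to the LLM's relevance distribution.

```python
# Sketch of a pairwise + list-wise retriever objective (assumed dual-encoder scores
# and LLM relevance labels; not the exact ARL2 implementation).
import torch
import torch.nn.functional as F

def pairwise_loss(scores: torch.Tensor, labels: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Hinge loss over all (relevant, irrelevant) passage pairs for one question.

    scores: retriever similarities for the candidate passages, shape (n,)
    labels: LLM relevance scores in [0, 1], shape (n,)
    """
    pos = labels > 0.5
    neg = ~pos
    if pos.sum() == 0 or neg.sum() == 0:
        return scores.new_zeros(())
    diff = scores[pos].unsqueeze(1) - scores[neg].unsqueeze(0)  # (n_pos, n_neg)
    return F.relu(margin - diff).mean()

def listwise_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """KL divergence between the label distribution and the retriever's score distribution."""
    target = F.softmax(labels, dim=-1)
    log_pred = F.log_softmax(scores, dim=-1)
    return F.kl_div(log_pred, target, reduction="sum")

def retriever_loss(scores: torch.Tensor, labels: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Combined objective; alpha is an assumed weighting hyperparameter."""
    return alpha * pairwise_loss(scores, labels) + (1 - alpha) * listwise_loss(scores, labels)
```

At inference time the same retriever scores can be reused to reorder the retrieved evidence before it is placed in the LLM's context, as described in the third bullet above.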

Experimental Insights

Empirical evaluation across open-domain question answering (QA) datasets such as Natural Questions (NQ) and domain-specific benchmarks such as Massive Multitask Language Understanding (MMLU) demonstrates ARL2's strong performance. Key findings include:

  • ARL2 improves accuracy by 5.4% on NQ and 4.6% on MMLU, a notable gain over state-of-the-art methods.
  • The framework exhibits strong transfer learning capabilities and delivers promising results in zero-shot generalization settings.
  • ARL2 adapts well to specific domains, thanks to its self-training strategy and the diversity of the questions it generates.

Theoretical and Practical Implications

ARL2's use of LLMs for relevance labeling and retriever training has both theoretical and practical implications. Theoretically, it offers a new perspective on aligning retrieval with the intrinsic capabilities of black-box LLMs, challenging existing paradigms in the field. Practically, it provides a cost-effective mechanism for improving LLM performance on knowledge-intensive tasks without access to the LLM's parameters.

Future Directions

While ARL2 marks a significant advance in retrieval-augmented generation, several avenues remain open: making relevance data curation more efficient, broadening the diversity of the training data, and applying the framework to more specialized domains. Such work would deepen our understanding of the interplay between retrievers and LLMs and pave the way for more effective RAG systems.

Conclusion

ARL2 is a significant step toward resolving the misalignment between retrievers and LLMs in retrieval-augmented generation. Through its use of LLMs for relevance labeling and its adaptive self-training strategy, it sets the stage for further work on aligning external knowledge sources with the evolving capabilities of LLMs.

Authors (4)
  1. Lingxi Zhang
  2. Yue Yu
  3. Kuan Wang
  4. Chao Zhang