Self-Retrieval: End-to-End Information Retrieval with One Large Language Model (2403.00801v2)

Published 23 Feb 2024 in cs.IR, cs.CL, and cs.AI

Abstract: The rise of LLMs has significantly transformed both the construction and application of information retrieval (IR) systems. However, current interactions between IR systems and LLMs remain limited, with LLMs merely serving as part of components within IR systems, and IR systems being constructed independently of LLMs. This separated architecture restricts knowledge sharing and deep collaboration between them. In this paper, we introduce Self-Retrieval, a novel end-to-end LLM-driven information retrieval architecture. Self-Retrieval unifies all essential IR functions within a single LLM, leveraging the inherent capabilities of LLMs throughout the IR process. Specifically, Self-Retrieval internalizes the retrieval corpus through self-supervised learning, transforms the retrieval process into sequential passage generation, and performs relevance assessment for reranking. Experimental results demonstrate that Self-Retrieval not only outperforms existing retrieval approaches by a significant margin, but also substantially enhances the performance of LLM-driven downstream applications like retrieval-augmented generation.

Self-Retrieval: An End-to-End LLM-driven Information Retrieval Architecture

Introduction

The integration of LLMs with information retrieval (IR) systems has been progressing, yet traditional IR systems still lag behind the requirements posed by modern LLMs. This paper introduces Self-Retrieval, a novel architecture that internalizes the functionality of an IR system entirely within an LLM. The model not only transforms how documents are indexed, retrieved, and assessed, but also harnesses the innate abilities of LLMs for a more integrated retrieval process. The architecture outperforms conventional retrieval methods and shows clear benefits for downstream applications such as retrieval-augmented generation.

Self-Retrieval Architecture

Self-Retrieval is structured around three core processes: indexing, natural language index-driven retrieval, and self-assessment. The approach integrates the document corpus into the LLM via self-supervised learning, creating a natural-language index inside the model. Upon receiving a query, the model generates relevant passages through a two-step generative process and then assesses the quality of these generated passages against the query. This method leverages the LLM's comprehensive capabilities, including semantic understanding and generation, thereby recasting retrieval as deep semantic understanding with end-to-end processing.
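To make the data flow concrete, here is a minimal sketch of the three-stage loop in Python. The `llm` object and its `generate_index`, `generate_passage`, and `assess_relevance` methods are illustrative placeholders for the single fine-tuned model, and `corpus_trie` stands in for whatever structure enforces corpus-constrained decoding; none of these names come from the paper.

```python
# Minimal sketch of the Self-Retrieval flow described above.
# All names are illustrative placeholders, not the authors' code.

from dataclasses import dataclass

@dataclass
class ScoredPassage:
    index_text: str   # generated natural-language index (e.g., a title or summary)
    passage: str      # generated passage, constrained to match the corpus
    score: float      # self-assessment score used for reranking

def self_retrieve(query: str, llm, corpus_trie, k: int = 5) -> list[ScoredPassage]:
    """One LLM handles indexing (internalized at training time),
    generation-based retrieval, and self-assessment."""
    results = []
    for _ in range(k):
        index_text = llm.generate_index(query)                    # step 1: generate an index
        passage = llm.generate_passage(query, index_text,
                                       constraint=corpus_trie)    # step 2: constrained generation
        score = llm.assess_relevance(query, passage)              # step 3: self-assessment
        results.append(ScoredPassage(index_text, passage, score))
    return sorted(results, key=lambda r: r.score, reverse=True)
```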

Internalization and Indexing

The first step internalizes the document corpus into the LLM, enabling it to build implicit index structures described in natural language. Self-supervised learning is used to memorize the documents together with their natural-language indexes and the broader corpus, establishing a basis for retrieval that leverages the LLM's deep understanding capabilities.
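One plausible way to realize this internalization is to convert each document into self-supervised (prompt, target) pairs that tie passages to a natural-language index and force the model to reproduce corpus text. The title-as-index objectives below are an assumption for illustration, not the paper's exact training recipe.

```python
# Hedged sketch of corpus internalization: turn each document into
# self-supervised (prompt, target) pairs so the LLM memorizes the corpus
# and an associated natural-language index.

def build_internalization_examples(doc_title: str, passages: list[str]) -> list[dict]:
    examples = []
    for passage in passages:
        # Learn an index description for the passage.
        examples.append({
            "prompt": f"Passage: {passage}\nWrite an index for this passage:",
            "target": doc_title,
        })
        # Memorize the passage itself, keyed by its index.
        examples.append({
            "prompt": f"Index: {doc_title}\nReproduce the passage:",
            "target": passage,
        })
    return examples
```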

Natural Language Index-driven Retrieval

This procedure first generates a natural-language index from the input query and then generates the relevant passage conditioned on that index. To ensure the generated content corresponds to text actually present in the corpus, the model employs constrained decoding, which keeps generation aligned with the internalized documents and further improves retrieval accuracy.
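A common way to implement corpus-constrained generation with an autoregressive LM is a prefix trie over tokenized passages, masking the vocabulary at each step to tokens that continue some corpus passage. The sketch below assumes a Hugging Face causal LM and uses `generate(prefix_allowed_tokens_fn=...)`; the trie layout is illustrative and not necessarily the authors' implementation.

```python
# Hedged sketch of trie-constrained decoding: restrict generation so every
# emitted token continues some passage that actually exists in the corpus.

def build_trie(token_sequences):
    """token_sequences: iterable of token-id lists, one per corpus passage."""
    trie = {}
    for seq in token_sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie

def make_prefix_fn(trie, prompt_len):
    def prefix_allowed_tokens_fn(batch_id, input_ids):
        node = trie
        for tok in input_ids[prompt_len:].tolist():  # walk only the generated part
            node = node.get(tok)
            if node is None:
                return []                 # dead end: no corpus passage matches
        return list(node.keys())          # next tokens that stay inside the corpus
    return prefix_allowed_tokens_fn

# Usage (model/tokenizer assumed to be the fine-tuned causal LM):
# outputs = model.generate(
#     **inputs,
#     prefix_allowed_tokens_fn=make_prefix_fn(trie, inputs["input_ids"].shape[1]),
# )
```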

Self-Assessment

Self-Retrieval uniquely incorporates a self-assessment phase in which the model evaluates the relevance of each retrieved passage to the input query. Functioning as a form of pseudo-relevance feedback, this phase has the LLM assign each generated passage a relevance score that is used to rerank the output, ensuring a more targeted and meaningful retrieval result.
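One simple way to obtain such relevance scores from the same model is to compare the likelihood it assigns to a positive versus a negative relevance verdict after seeing the query and passage. The prompt wording and yes/no scoring rule below are assumptions for illustration, not the paper's exact self-assessment procedure.

```python
import torch

# Hedged sketch of self-assessment: score a generated passage by the model's
# own preference for a "yes" over a "no" relevance verdict.

@torch.no_grad()
def assess_relevance(model, tokenizer, query: str, passage: str) -> float:
    prompt = (f"Query: {query}\nPassage: {passage}\n"
              f"Is this passage relevant to the query? Answer yes or no: ")
    inputs = tokenizer(prompt, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]            # next-token logits
    log_probs = torch.log_softmax(logits, dim=-1)
    yes_id = tokenizer(" yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" no", add_special_tokens=False).input_ids[0]
    # Higher score means the model judges the passage more relevant; use for reranking.
    return (log_probs[yes_id] - log_probs[no_id]).item()
```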

Experimental Insights

Experiments on open-domain question answering tasks show that Self-Retrieval significantly outperforms existing sparse, dense, and generative retrieval methods. Beyond superior retrieval quality, the model also markedly improves the performance of downstream tasks such as retrieval-augmented generation (RAG), suggesting a powerful synergy between retrieval and generation within a single model.
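As a rough illustration of this downstream use, the self-retrieved and self-ranked passages can be fed straight back into the same LLM as context for answer generation. This reuses the hypothetical `self_retrieve` helper sketched earlier and an assumed `generate_answer` method; the prompt format is likewise an assumption.

```python
# Hedged sketch of the RAG application: retrieve with the model, then answer
# with the same model using the top-ranked passages as context.

def answer_with_rag(query: str, llm, corpus_trie, k: int = 3) -> str:
    passages = self_retrieve(query, llm, corpus_trie, k=k)  # from the earlier sketch
    context = "\n\n".join(p.passage for p in passages)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm.generate_answer(prompt)
```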

Theoretical and Practical Implications

Self-Retrieval presents a groundbreaking shift in the interaction between IR systems and LLMs, proposing a unified model that internalizes the retrieval process. Theoretically, it advances our understanding of how LLMs can be leveraged for complex retrieval tasks, integrating storage, semantic understanding, and document generation within a single model. Practically, it sets the groundwork for more sophisticated IR systems that can seamlessly integrate with LLM-driven applications, enhancing both retrieval accuracy and the efficiency of downstream processes.

Future Directions

While Self-Retrieval has demonstrated significant advancements, it also opens new avenues for research, particularly in understanding the scaling laws between document corpus sizes and model parameters, and expanding the model's application across various domains and tasks. Probing further into these areas could elucidate the full potential of LLM-driven IR systems and their impact on information access and knowledge discovery.

Conclusion

The Self-Retrieval architecture heralds a new era in information retrieval, wherein the capacities of LLMs are fully harnessed to perform end-to-end retrieval tasks. By internalizing the entire retrieval process within an LLM, it bridges the gap between traditional IR systems and the adaptive, semantic-rich capabilities of modern LLMs, offering a promising direction for future IR system development.

Limitations and Considerations

The paper also acknowledges limitations of the current implementation of Self-Retrieval, specifically the need to explore the optimal scaling between the size of the document corpus and the number of model parameters. Moreover, the model's effectiveness across downstream tasks beyond retrieval-augmented generation remains a fruitful area for future research.

Authors (13)
  1. Qiaoyu Tang
  2. Jiawei Chen
  3. Bowen Yu
  4. Yaojie Lu
  5. Cheng Fu
  6. Haiyang Yu
  7. Hongyu Lin
  8. Fei Huang
  9. Ben He
  10. Xianpei Han
  11. Le Sun
  12. Yongbin Li
  13. Zhuoqun Li