
Dense X Retrieval: What Retrieval Granularity Should We Use? (2312.06648v3)

Published 11 Dec 2023 in cs.CL, cs.AI, and cs.IR

Abstract: Dense retrieval has become a prominent method to obtain relevant context or world knowledge in open-domain NLP tasks. When we use a learned dense retriever on a retrieval corpus at inference time, an often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g. document, passage, or sentence. We discover that the retrieval unit choice significantly impacts the performance of both retrieval and downstream tasks. Distinct from the typical approach of using passages or sentences, we introduce a novel retrieval unit, proposition, for dense retrieval. Propositions are defined as atomic expressions within text, each encapsulating a distinct factoid and presented in a concise, self-contained natural language format. We conduct an empirical comparison of different retrieval granularity. Our experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks. Moreover, constructing prompts with fine-grained retrieved units for retrieval-augmented LLMs improves the performance of downstream QA tasks given a specific computation budget.

Introduction

Dense retrieval systems are integral to open-domain NLP applications: they source relevant information by sifting through large corpora. A crucial yet often overlooked design choice is the granularity of the retrieval unit, i.e., whether the corpus is indexed and retrieved at the level of documents, passages, or sentences. This paper examines how that choice affects both retrieval quality and downstream task performance, and introduces a new, finer-grained unit for dense retrieval.

Propositions as Retrieval Units

While passages and sentences are routinely used as retrieval units, this paper proposes a third option: the proposition. Propositions are atomic expressions within a text, each stating a single factoid in a concise, self-contained natural language form. Unlike longer passages or syntactically complex sentences, a proposition index presents each fact as a standalone unit, which can improve retrieval precision.
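To make the unit concrete, here is a minimal sketch of how a passage might be decomposed into propositions with an instruction-following LLM. The prompt wording and the `generate` callable are illustrative assumptions, not the paper's exact decomposition pipeline.

```python
# Minimal sketch (assumptions): decompose a passage into atomic,
# self-contained propositions using an instruction-following LLM.
# `generate` is an assumed text-in/text-out callable; the prompt
# wording is illustrative, not the paper's exact prompt.
import json

PROMPT = (
    "Decompose the passage into a JSON list of atomic, self-contained "
    "propositions. Resolve pronouns so each proposition stands alone.\n\n"
    "Passage: {passage}"
)

def propositionize(passage: str, generate) -> list[str]:
    """Return the passage's propositions as a list of strings."""
    raw = generate(PROMPT.format(passage=passage))
    # e.g. ["The Leaning Tower of Pisa is 55.86 m tall on its low side.", ...]
    return json.loads(raw)
```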

Empirical Evaluation of Retrieval Granularity

The paper compares retrieval granularities empirically on FACTOID WIKI, a processed version of the English Wikipedia corpus indexed at three levels: 100-word passages, sentences, and propositions. Six dual-encoder retrievers are evaluated on five open-domain QA datasets. A key finding is that proposition-based retrieval substantially outperforms traditional passage- or sentence-based retrieval in dense retrieval tasks.
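As a rough illustration of the setup, the sketch below indexes one granularity with a generic dual encoder and retrieves by inner product, keeping a pointer from each unit back to its source passage so fine-grained hits can be scored at the passage level. The sentence-transformers checkpoint and helper names are assumptions, not one of the six retrievers evaluated in the paper.

```python
# Sketch (assumptions): index units of one granularity with a generic
# dual encoder and retrieve by maximum inner product. The encoder
# checkpoint is a stand-in, not one of the paper's six retrievers.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in dual encoder

def build_index(units: list[str]) -> np.ndarray:
    """Embed retrieval units (passages, sentences, or propositions)."""
    return encoder.encode(units, normalize_embeddings=True)

def retrieve(query: str, index: np.ndarray,
             source_passage_ids: list[int], k: int = 5):
    """Return (source passage id, score) for the top-k units, so that
    fine-grained hits can be mapped back to passages for evaluation."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                     # inner-product similarity
    top = np.argsort(-scores)[:k]
    return [(source_passage_ids[i], float(scores[i])) for i in top]
```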

Downstream Task Performance and Contributions

Proposition-level retrieval improves not only retrieval metrics but also downstream QA performance. Because propositions are condensed, they deliver a higher density of question-relevant information per input token and limit the amount of irrelevant content passed to the reader model. The paper's main contributions are the proposition as a novel retrieval unit for dense retrieval and the FACTOID WIKI corpus. Under the same input-token limit, proposition retrieval generalizes across retrievers and yields higher accuracy on downstream question answering, underscoring the practicality of propositions for efficient information access.
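The fixed-budget comparison can be pictured with the sketch below, which greedily packs the top-ranked retrieved units into a QA prompt until an approximate token budget is exhausted; denser units therefore fit more relevant facts into the same budget. The whitespace token proxy and the default budget value are simplifying assumptions.

```python
# Sketch (assumptions): build a reader prompt from ranked retrieval
# units under a fixed token budget. Whitespace word count stands in
# for a real tokenizer, and the default budget is illustrative.
def build_prompt(question: str, ranked_units: list[str], budget: int = 100) -> str:
    chosen, used = [], 0
    for unit in ranked_units:              # units arrive best-first
        cost = len(unit.split())           # crude token estimate
        if used + cost > budget:
            break
        chosen.append(unit)
        used += cost
    context = "\n".join(chosen)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```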

Authors (8)
  1. Tong Chen
  2. Hongwei Wang
  3. Sihao Chen
  4. Wenhao Yu
  5. Kaixin Ma
  6. Xinran Zhao
  7. Dong Yu
  8. Hongming Zhang