EHI: End-to-end Learning of Hierarchical Index for Efficient Dense Retrieval (2310.08891v2)

Published 13 Oct 2023 in cs.LG and cs.IR

Abstract: Dense embedding-based retrieval is widely used for semantic search and ranking. However, conventional two-stage approaches, which pair contrastive embedding learning with approximate nearest neighbor search (ANNS), can suffer from misalignment between the two stages, degrading retrieval performance. We propose End-to-end Hierarchical Indexing (EHI), a novel method that addresses this issue by jointly optimizing embedding generation and the ANNS structure. EHI uses a dual encoder to embed queries and documents while simultaneously learning an inverted file index (IVF)-style tree structure. To enable effective learning of this discrete structure, EHI introduces dense path embeddings that encode the path traversed by queries and documents within the tree. Extensive evaluations on standard benchmarks, including MS MARCO (Dev set) and TREC DL19, demonstrate EHI's superiority over traditional ANNS indices. Under the same computational constraints, EHI outperforms existing state-of-the-art methods by +1.45% in MRR@10 on MS MARCO (Dev) and +8.2% in nDCG@10 on TREC DL19, highlighting the benefits of our end-to-end approach.
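The joint training idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the linear "encoders", the single-level bucket classifier standing in for the IVF-style tree, and the specific loss weights are all simplifying assumptions. The point is only to show the two coupled objectives, a contrastive loss on embeddings and an alignment term between the dense path embeddings of a query and its relevant document.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "dual encoder" for queries and documents (the paper uses
# transformer encoders; these linear maps are a stand-in).
d_in, d_out, n_buckets = 16, 8, 4
Wq = rng.normal(scale=0.1, size=(d_in, d_out))
Wd = rng.normal(scale=0.1, size=(d_in, d_out))
# One-level IVF-style "tree": a learned classifier over leaf buckets.
Wb = rng.normal(scale=0.1, size=(d_out, n_buckets))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def path_embedding(emb):
    # Dense path embedding: the soft bucket-assignment distribution,
    # standing in for the path traversed through the index tree.
    return softmax(emb @ Wb)

queries = rng.normal(size=(2, d_in))
docs = rng.normal(size=(2, d_in))  # docs[i] is the positive for queries[i]

qe, de = queries @ Wq, docs @ Wd
q_path, d_path = path_embedding(qe), path_embedding(de)

# Joint objective (illustrative): an InfoNCE-style contrastive loss on
# embeddings, plus a term pulling each query's path toward its relevant
# document's path, so encoder and index structure are trained together.
sim = qe @ de.T
contrastive = -np.mean(np.diag(sim) - np.log(np.exp(sim).sum(axis=1)))
path_align = np.mean(np.sum((q_path - d_path) ** 2, axis=1))
total_loss = contrastive + path_align
```

In a real system both terms would be minimized by gradient descent over the encoder and bucket parameters jointly, which is what distinguishes EHI from the usual pipeline of training embeddings first and building the index afterward.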

Authors (6)
  1. Ramnath Kumar (6 papers)
  2. Anshul Mittal (9 papers)
  3. Nilesh Gupta (6 papers)
  4. Aditya Kusupati (28 papers)
  5. Inderjit Dhillon (25 papers)
  6. Prateek Jain (131 papers)
