
ESPN: Memory-Efficient Multi-Vector Information Retrieval (2312.05417v1)

Published 9 Dec 2023 in cs.IR and cs.LG

Abstract: Recent advances in LLMs have demonstrated remarkable effectiveness in information retrieval (IR) tasks. While many neural IR systems encode queries and documents into single-vector representations, multi-vector models improve retrieval quality by producing per-token representations and performing similarity search at the granularity of individual tokens. However, these models amplify the memory and storage requirements of retrieval indices by an order of magnitude, making multi-vector IR models increasingly difficult to scale. We introduce Embedding from Storage Pipelined Network (ESPN), which offloads the entire re-ranking embedding table to SSDs and reduces memory requirements by 5-16x. We design a software prefetcher with hit rates exceeding 90%, improving SSD-based retrieval by up to 6.4x, and demonstrate that we can maintain near-memory levels of query latency even for large query batch sizes.
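The core idea — issue SSD reads for the re-ranking embeddings of approximate candidates while the first stage is still running, so the exact re-ranker mostly hits an in-memory cache — can be sketched as follows. This is a minimal toy, not the paper's implementation: `SSD_TABLE`, `ssd_read`, and the `Prefetcher` class are hypothetical names, the "SSD" is a dict plus a sleep standing in for NVMe latency, and the thread pool stands in for the asynchronous I/O pipeline.

```python
import concurrent.futures
import time

# Toy "SSD-resident" embedding table: doc_id -> embedding vector.
# In ESPN the full multi-vector re-ranking index lives on SSD.
SSD_TABLE = {i: [float(i)] * 4 for i in range(1000)}

def ssd_read(doc_id):
    """Simulate a slow SSD read of one document's embeddings."""
    time.sleep(0.001)  # stand-in for NVMe read latency
    return SSD_TABLE[doc_id]

class Prefetcher:
    """Overlap SSD reads with candidate generation.

    prefetch() takes the approximate candidate list from the
    first-stage retriever and starts reads in the background;
    fetch() is called later by the exact re-ranker and is a
    cache hit whenever the candidate was prefetched.
    """
    def __init__(self, workers=8):
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=workers)
        self.cache = {}   # doc_id -> Future holding the embedding
        self.hits = 0
        self.misses = 0

    def prefetch(self, candidate_ids):
        for d in candidate_ids:
            if d not in self.cache:
                self.cache[d] = self.pool.submit(ssd_read, d)

    def fetch(self, doc_id):
        fut = self.cache.get(doc_id)
        if fut is not None:
            self.hits += 1
            return fut.result()   # ready or nearly ready
        self.misses += 1
        return ssd_read(doc_id)   # demand read on a prefetch miss

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

A usage pattern mirroring the paper's numbers: if the approximate candidate list covers 90 of the 100 documents the re-ranker ultimately touches, the prefetcher serves a 90% hit rate and only the 10 misses pay full SSD latency on the critical path.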

Authors (3)
  1. Susav Shrestha (3 papers)
  2. Narasimha Reddy (5 papers)
  3. Zongwang Li (5 papers)
Citations (1)