Bottleneck-Minimal Indexing for Generative Document Retrieval (2405.10974v2)

Published 12 May 2024 in cs.IR, cs.AI, cs.CL, and cs.LG

Abstract: We apply an information-theoretic perspective to reconsider generative document retrieval (GDR), in which a document $x \in X$ is indexed by $t \in T$, and a neural autoregressive model is trained to map queries $Q$ to $T$. GDR can be considered to involve information transmission from documents $X$ to queries $Q$, with the requirement to transmit more bits via the indexes $T$. By applying Shannon's rate-distortion theory, the optimality of indexing can be analyzed in terms of the mutual information, and the design of the indexes $T$ can then be regarded as a *bottleneck* in GDR. After reformulating GDR from this perspective, we empirically quantify the bottleneck underlying GDR. Finally, using the NQ320K and MARCO datasets, we evaluate our proposed bottleneck-minimal indexing method in comparison with various previous indexing methods, and we show that it outperforms those methods.
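
To make the abstract's framing concrete: in the classical information bottleneck of Tishby et al. (reference 48 below), one chooses an encoding $p(t|x)$ to minimize $I(X;T) - \beta I(T;Q)$, i.e., an index $T$ that compresses the documents $X$ while preserving what is informative about the queries $Q$. The sketch below is illustrative only, not the paper's algorithm: it assigns cluster-based indexes to toy document embeddings via k-means (a common indexing choice in prior GDR systems) and computes a plug-in estimate of the mutual information between a query's nearest index and its target index. The toy data, cluster count, and all variable names are hypothetical.

```python
# Illustrative sketch (assumptions: toy Gaussian embeddings, k-means
# indexing, plug-in MI estimator). Not the paper's bottleneck-minimal method.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy stand-ins: 1000 "document" embeddings; each document gets 3
# "queries" perturbed around it.
docs = rng.normal(size=(1000, 64))
queries = np.repeat(docs, 3, axis=0) + 0.5 * rng.normal(size=(3000, 64))
doc_of_query = np.repeat(np.arange(1000), 3)

# Indexing: map each document x in X to an index t in T via k-means.
K = 32
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(docs)
t_of_doc = kmeans.labels_              # t in T for each x in X
t_of_query = t_of_doc[doc_of_query]    # index each query should decode to

# Plug-in estimate of I(Q;T): discretize each query by its nearest index
# centroid, then compare against the target index.
q_cluster = kmeans.predict(queries)
joint = np.zeros((K, K))
for q, t in zip(q_cluster, t_of_query):
    joint[q, t] += 1
joint /= joint.sum()
p_q = joint.sum(axis=1, keepdims=True)   # marginal over query clusters
p_t = joint.sum(axis=0, keepdims=True)   # marginal over indexes
nz = joint > 0
mi = np.sum(joint[nz] * np.log2(joint[nz] / (p_q @ p_t)[nz]))
print(f"I(Q;T) ~ {mi:.2f} bits (ceiling: log2 K = {np.log2(K):.2f})")
```

Under the abstract's rate-distortion view, how much of this mutual information an indexing preserves governs how well an autoregressive model can learn to map $Q$ to $T$: an indexing whose target assignments are easy to predict from queries makes the retrieval task easier to learn.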

References (57)
  1. Berger, T. Rate distortion theory for sources with abstract alphabets and memory. Information and Control, 13:254–273, 1968.
  2. Autoregressive search engines: Generating substrings as document identifiers. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=Z4kZxAjg8Y.
  3. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, pp. 89–96, 2005.
  4. Autoregressive entity retrieval. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=5k8F6UU39V.
  5. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pp. 129–136, 2007.
  6. Variational deep semantic hashing for text documents. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.  75–84, 2017.
  7. CorpusBrain: Pre-train a generative retrieval model for knowledge-intensive language tasks. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 191–200, 2022.
  8. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp. 253–262, 2004.
  9. BERT: Pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
  10. Multi-modal transformer for video retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pp.  214–229. Springer, 2020.
  11. Recommendation as language processing (RLP): A unified pretrain, personalized prompt & predict paradigm (P5). In Proceedings of the 16th ACM Conference on Recommender Systems, pp. 299–315, 2022.
  12. Gray, R. Vector quantization. IEEE ASSP Magazine, 1(2):4–29, 1984.
  13. Grefenstette, G. Cross-language information retrieval, volume 2. Springer Science & Business Media, 2012.
  14. BLISS: A billion-scale index using iterative re-partitioning. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 486–495, 2022.
  15. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):100–108, 1979.
  16. Language models as semantic indexers. arXiv preprint arXiv:2310.07815, 2023.
  17. Unsupervised semantic deep hashing. Neurocomputing, 351:19–25, 2019.
  18. Dense passage retrieval for open-domain question answering. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  6769–6781, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.550. URL https://aclanthology.org/2020.emnlp-main.550.
  19. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 39–48, 2020.
  20. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
  21. Learning to rank in generative retrieval. arXiv preprint arXiv:2306.15222, 2023.
  22. Deep unsupervised hashing with latent semantic components. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  7488–7496, 2022.
  23. Lloyd, S. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
  24. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  25. Knowledge distillation for high dimensional search index. Advances in Neural Information Processing Systems, 36, 2023.
  26. Foundations of statistical natural language processing. MIT Press, 1999.
  27. DSI++: Updating transformer memory with new documents. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  8198–8213, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.510. URL https://aclanthology.org/2023.emnlp-main.510.
  28. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 2013.
  29. Generative retrieval as dense retrieval. arXiv preprint arXiv:2306.11397, 2023.
  30. MS MARCO: A human generated machine reading comprehension dataset. In Besold, T. R., Bordes, A., d’Avila Garcez, A. S., and Wayne, G. (eds.), Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016, volume 1773 of CEUR Workshop Proceedings. CEUR-WS.org, 2016. URL https://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf.
  31. Scalable recognition with a vocabulary tree. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pp. 2161–2168. IEEE, 2006.
  32. From doc2query to doctttttquery. Online preprint, 6:2, 2019.
  33. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  34. Recommender systems with generative retrieval. Advances in Neural Information Processing Systems, 36, 2023.
  35. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389, 2009.
  36. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009.
  37. On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124020, 2019.
  38. Shannon, C. E. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
  39. Shannon, C. E. Coding theorems for a discrete source with a fidelity criterion. IRE National Convention Record, 4:142–163, 1959.
  40. Slonim, N. The information bottleneck: Theory and applications. PhD thesis, Hebrew University of Jerusalem, Jerusalem, Israel, 2002.
  41. Agglomerative information bottleneck. In Solla, S., Leen, T., and Müller, K. (eds.), Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999. URL https://proceedings.neurips.cc/paper_files/paper/1999/file/be3e9d3f7d70537357c67bb3f4086846-Paper.pdf.
  42. Document clustering using word clusters via the information bottleneck method. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 208–215, 2000.
  43. Geometric clustering using the information bottleneck method. In Advances in Neural Information Processing Systems, 2003.
  44. Learning to tokenize for generative retrieval. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=UKd6dpVGdu.
  45. Semantic-enhanced differentiable search index inspired by learning strategies. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’23, pp.  4904–4913, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701030. doi: 10.1145/3580305.3599903. URL https://doi.org/10.1145/3580305.3599903.
  46. Transformer memory as a differentiable search index. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp.  21831–21843. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/892840a6123b5ec99ebaab8be1530fba-Paper-Conference.pdf.
  47. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5. IEEE, 2015.
  48. The information bottleneck method. arXiv preprint physics/0004057, 2000. URL https://api.semanticscholar.org/CorpusID:8936496.
  49. Neural discrete representation learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/7a98af17e63a0ac09ce2e96d03992fbc-Paper.pdf.
  50. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
  51. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
  52. Deep hashing network for unsupervised domain adaptation. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.  5385–5394, 2017. URL https://api.semanticscholar.org/CorpusID:2928248.
  53. A neural corpus indexer for document retrieval. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=fSfcEYQP_qc.
  54. NOVO: Learnable and interpretable document identifiers for model-based IR. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM ’23, pp. 2656–2665, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701245. doi: 10.1145/3583780.3614993. URL https://doi.org/10.1145/3583780.3614993.
  55. Vector quantization: a review. Frontiers of Information Technology & Electronic Engineering, 20(4):507–524, 2019.
  56. Scalable and effective generative information retrieval. arXiv preprint arXiv:2311.09134, 2023.
  57. Ultron: An ultimate retriever on corpus with a model-based indexer. arXiv preprint arXiv:2208.09257, 2022.