Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation (2206.10128v3)

Published 21 Jun 2022 in cs.IR and cs.CL

Abstract: The Differentiable Search Index (DSI) is an emerging paradigm for information retrieval. Unlike traditional retrieval architectures where index and retrieval are two different and separate components, DSI uses a single transformer model to perform both indexing and retrieval. In this paper, we identify and tackle an important issue of current DSI models: the data distribution mismatch that occurs between the DSI indexing and retrieval processes. Specifically, we argue that, at indexing, current DSI methods learn to build connections between the text of long documents and the identifier of the documents, but then retrieval of document identifiers is based on queries that are commonly much shorter than the indexed documents. This problem is further exacerbated when using DSI for cross-lingual retrieval, where document text and query text are in different languages. To address this fundamental problem of current DSI models, we propose a simple yet effective indexing framework for DSI, called DSI-QG. When indexing, DSI-QG represents documents with a number of potentially relevant queries generated by a query generation model and re-ranked and filtered by a cross-encoder ranker. The presence of these queries at indexing allows the DSI models to connect a document identifier to a set of queries, hence mitigating data distribution mismatches present between the indexing and the retrieval phases. Empirical results on popular mono-lingual and cross-lingual passage retrieval datasets show that DSI-QG significantly outperforms the original DSI model.

The paper introduces DSI-QG, a framework designed to enhance the Differentiable Search Index (DSI) by addressing the data distribution mismatch between its indexing and retrieval phases. The method offers a clear advance over prior DSI models, particularly in settings requiring cross-lingual information retrieval. DSI-QG uses query generation to change how documents are represented at indexing time, bridging the gap between the data seen at indexing (long-form documents) and at retrieval (short queries).

Core Contributions and Methodology

The authors' primary contribution is identifying the data distribution mismatch in existing DSI models: indexing is trained on full document text, while retrieval operates on much shorter user queries. The issue is especially pronounced in cross-lingual environments, where document and query languages differ. In response, DSI-QG combines query generation with cross-encoder ranking.

  • Query Generation: A transformer-based sequence-to-sequence model generates plausible queries for each document at indexing time. Representing documents by these queries means that indexing and retrieval both operate over the same data distribution, that of queries, mitigating the mismatch.
  • Cross-Encoder Ranking: A cross-encoder ranks the generated queries and keeps only the most relevant subset, improving the quality of the document representation used within the model.
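The two steps above can be sketched as a small data-preparation pipeline. The sketch below is illustrative only: `generate_queries` and `cross_encoder_score` are hypothetical stand-ins for the paper's actual models (a seq2seq query generator and a neural cross-encoder), replaced here with trivial heuristics so the pipeline logic is self-contained.

```python
# Sketch of DSI-QG indexing-data construction. The generator and scorer
# below are toy stand-ins, not the models used in the paper.

def generate_queries(doc_text, n=10):
    """Stand-in for a seq2seq query generator: emits short pseudo-queries
    built from consecutive words of the document."""
    words = doc_text.split()
    return [" ".join(words[i:i + 3]) for i in range(min(n, len(words)))]

def cross_encoder_score(query, doc_text):
    """Stand-in for a cross-encoder relevance score over (query, document):
    fraction of query terms that appear in the document."""
    doc_terms = set(doc_text.lower().split())
    q_terms = query.lower().split()
    return sum(t in doc_terms for t in q_terms) / max(len(q_terms), 1)

def build_indexing_pairs(corpus, n_gen=10, keep_top=3):
    """For each (docid, text): generate candidate queries, rank them with
    the cross-encoder, keep the top-k, and emit (query, docid) training
    pairs for the DSI model to index."""
    pairs = []
    for docid, text in corpus.items():
        queries = generate_queries(text, n=n_gen)
        ranked = sorted(queries,
                        key=lambda q: cross_encoder_score(q, text),
                        reverse=True)
        pairs.extend((q, docid) for q in ranked[:keep_top])
    return pairs

corpus = {"d1": "differentiable search index maps queries to identifiers",
          "d2": "query generation bridges indexing and retrieval"}
pairs = build_indexing_pairs(corpus, n_gen=5, keep_top=2)
```

The key design point this sketch captures is that the DSI model never indexes raw document text: it is trained only on (query, docid) pairs, so its training inputs match the distribution it will see at retrieval time.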

Implications and Results

Empirically, DSI-QG demonstrates substantial improvements over baseline DSI implementations in standard retrieval metrics on datasets such as NQ 320k and XOR QA 100k. Hits@1 and Hits@10 improve markedly, indicating that indexing on generated, ranked queries lets the model map retrieval-time queries to document identifiers far more effectively.
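Hits@k, the metric cited above, counts a query as a success if the relevant document identifier appears among the model's top-k returned identifiers. A minimal implementation (names and data are illustrative, not from the paper's code):

```python
def hits_at_k(rankings, relevant, k):
    """Fraction of queries whose gold docid appears in the top-k ranked
    identifiers. `rankings` maps query -> ranked list of docids;
    `relevant` maps query -> its gold docid."""
    hits = sum(1 for q, gold in relevant.items() if gold in rankings[q][:k])
    return hits / len(relevant)

rankings = {"q1": ["d3", "d1", "d7"],
            "q2": ["d2", "d9", "d4"],
            "q3": ["d8", "d5", "d2"]}
relevant = {"q1": "d1", "q2": "d2", "q3": "d6"}

hits1 = hits_at_k(rankings, relevant, k=1)    # only q2 hits at rank 1
hits10 = hits_at_k(rankings, relevant, k=10)  # q1 and q2 hit within top 10
```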

The method's graceful extension to cross-lingual scenarios is especially noteworthy. By generating queries in the target language for each document, DSI-QG handles retrieval settings where the query language differs from the document language, demonstrating the approach's adaptability across use cases.

Future Directions and Theoretical Considerations

The implications of DSI-QG extend beyond empirical enhancements, hinting at broader theoretical and practical developments. This framework exemplifies a movement towards more integrated, adaptive retrieval systems that merge elements of natural language understanding with robust, flexible indexing approaches.

Potential future developments include refining query generation models to yield even richer and more contextually diverse query representations and exploring the computational trade-offs inherent in ranking generated queries. Additionally, further exploration into the scalability of these methods on larger and more diverse datasets would be valuable, particularly when addressing real-time querying in multilingual and multimodal datasets.

In conclusion, the paper makes a substantial contribution to information retrieval by aligning the data distributions of indexing and querying, positioning differentiable retrieval architectures well for both research and practical deployment in cross-lingual and other demanding querying environments.

Authors (7)
  1. Shengyao Zhuang
  2. Houxing Ren
  3. Linjun Shou
  4. Jian Pei
  5. Ming Gong
  6. Guido Zuccon
  7. Daxin Jiang
Citations (57)