Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval (2304.08138v2)

Published 17 Apr 2023 in cs.IR

Abstract: Current dense retrievers (DRs) are limited in their ability to effectively process misspelled queries, which constitute a significant portion of query traffic in commercial search engines. The main issue is that the pre-trained language model-based encoders used by DRs are typically trained and fine-tuned on clean, well-curated text data. Misspelled queries are typically not found in the data used for training these models, and thus misspelled queries observed at inference time are out-of-distribution compared to the data used for training and fine-tuning. Previous efforts to address this issue have focused on fine-tuning strategies, but their effectiveness on misspelled queries remains lower than that of pipelines that employ separate state-of-the-art spell-checking components. To address this challenge, we propose ToRoDer (TypOs-aware bottlenecked pre-training for RObust DEnse Retrieval), a novel re-training strategy for DRs that increases their robustness to misspelled queries while preserving their effectiveness in downstream retrieval tasks. ToRoDer utilizes an encoder-decoder architecture where the encoder takes misspelled text with masked tokens as input and outputs bottlenecked information to the decoder. The decoder then takes as input the bottlenecked embeddings, along with token embeddings of the original text with the misspelled tokens masked out. The pre-training task is to recover the masked tokens for both the encoder and decoder. Our extensive experimental results and detailed ablation studies show that DRs pre-trained with ToRoDer exhibit significantly higher effectiveness on misspelled queries, substantially closing the gap with pipelines that use a separate, complex spell-checker component, while retaining their effectiveness on correctly spelled queries.
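
The abstract describes a bottlenecked encoder-decoder pre-training objective: a strong encoder compresses a masked, misspelled input into a single embedding, and a weak decoder must reconstruct the masked tokens from that embedding plus the clean text with the misspelled tokens masked out. The following is a minimal, illustrative PyTorch sketch of that idea, not the authors' implementation: the class name `ToRoDerSketch`, the layer counts, the BERT-style vocabulary/mask ids, and the toy masking of a single position are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

VOCAB, MASK_ID = 30522, 103          # assumed BERT-style vocabulary size and [MASK] id

class ToRoDerSketch(nn.Module):
    """Strong encoder + weak decoder joined by a single bottleneck embedding (illustrative)."""
    def __init__(self, d_model=256, enc_layers=6, dec_layers=2, nhead=4):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), enc_layers)
        # A deliberately shallow decoder forces reconstruction to rely on
        # the bottleneck embedding produced by the encoder.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), dec_layers)
        self.mlm_head = nn.Linear(d_model, VOCAB)

    def forward(self, typo_masked_ids, clean_masked_ids):
        # Encoder sees the misspelled text with some tokens masked.
        enc_hidden = self.encoder(self.tok_emb(typo_masked_ids))
        bottleneck = enc_hidden[:, :1, :]        # first position acts as the bottleneck
        # Decoder sees the bottleneck vector prepended to embeddings of the
        # original (clean) text whose misspelled tokens are masked out.
        dec_in = torch.cat([bottleneck, self.tok_emb(clean_masked_ids)], dim=1)
        dec_hidden = self.decoder(dec_in)[:, 1:, :]
        # Both encoder and decoder predict the masked tokens (masked language modelling).
        return self.mlm_head(enc_hidden), self.mlm_head(dec_hidden)

# Toy usage: batch of 2 sequences of length 8; only position 3 is masked.
typo_ids = torch.randint(0, VOCAB, (2, 8))       # stand-in for misspelled input text
clean_ids = torch.randint(0, VOCAB, (2, 8))      # stand-in for the original clean text
labels = torch.full((2, 8), -100, dtype=torch.long)
labels[:, 3] = clean_ids[:, 3]                   # supervise masked positions only
typo_ids[:, 3] = MASK_ID
clean_ids[:, 3] = MASK_ID

model = ToRoDerSketch()
enc_logits, dec_logits = model(typo_ids, clean_ids)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
loss = (loss_fn(enc_logits.reshape(-1, VOCAB), labels.reshape(-1)) +
        loss_fn(dec_logits.reshape(-1, VOCAB), labels.reshape(-1)))
loss.backward()
```

In a setup like this, the shallow decoder cannot recover the masked tokens from its own inputs alone, so the reconstruction loss pushes the encoder to pack typo-invariant, sentence-level information into the bottleneck embedding, the kind of representation a dense retriever ultimately ranks with.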
