SpaDE: Improving Sparse Representations using a Dual Document Encoder for First-stage Retrieval (2209.05917v3)

Published 13 Sep 2022 in cs.IR

Abstract: Sparse document representations have been widely used to retrieve relevant documents via exact lexical matching. Backed by a pre-computed inverted index, they support fast ad-hoc search but suffer from the vocabulary mismatch problem. Although recent neural ranking models built on pre-trained language models can address this problem, they usually incur expensive query inference costs, implying a trade-off between effectiveness and efficiency. To tackle this trade-off, we propose a novel uni-encoder ranking model, the Sparse retriever using a Dual document Encoder (SpaDE), which learns document representations via a dual encoder. The two encoders play complementary roles: (i) adjusting the importance of terms that appear in the document to improve lexical matching, and (ii) expanding the document with additional terms to support semantic matching. Furthermore, our co-training strategy trains the dual encoder effectively while preventing the two encoders from unnecessarily interfering with each other. Experimental results on several benchmarks show that SpaDE outperforms existing uni-encoder ranking models.
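
The abstract compresses the core design into two sentences, so a rough illustration may help: the sketch below shows how a term-weighting encoder and a term-expansion encoder could each emit a vocabulary-sized sparse vector, which is then merged into a single document representation that fits an inverted index. This is a minimal PyTorch sketch based only on the abstract; the module names, ReLU scoring heads, additive merge, and dimensions are all illustrative assumptions, not SpaDE's actual architecture or training code.

```python
# Minimal sketch of the dual-document-encoder idea described in the abstract.
# All module names, shapes, and the merge rule are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB_SIZE = 30522   # assumed BERT-style vocabulary size
HIDDEN_DIM = 768     # assumed encoder hidden size

class TermWeightingEncoder(nn.Module):
    """Scores the importance of terms already present in the document
    (supports exact lexical matching)."""
    def __init__(self):
        super().__init__()
        self.scorer = nn.Linear(HIDDEN_DIM, 1)

    def forward(self, token_ids, token_states):
        # token_ids: (batch, seq_len); token_states: (batch, seq_len, HIDDEN_DIM)
        weights = torch.relu(self.scorer(token_states)).squeeze(-1)  # (batch, seq_len)
        rep = torch.zeros(token_ids.size(0), VOCAB_SIZE)
        # Scatter per-token importance into a vocabulary-sized sparse vector.
        rep.scatter_add_(1, token_ids, weights)
        return rep

class TermExpansionEncoder(nn.Module):
    """Predicts weights for terms *not* in the document
    (supports semantic matching via expansion)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def forward(self, pooled_state):
        # pooled_state: (batch, HIDDEN_DIM), e.g. a [CLS]-style summary vector
        return torch.relu(self.proj(pooled_state))  # (batch, VOCAB_SIZE)

def dual_document_representation(token_ids, token_states, pooled_state,
                                 weighting, expansion):
    # Merge the two sparse vectors into one vocabulary-sized representation
    # that can be pre-computed offline and stored in an inverted index,
    # keeping query-time inference cheap.
    return weighting(token_ids, token_states) + expansion(pooled_state)

if __name__ == "__main__":
    batch, seq_len = 2, 16
    token_ids = torch.randint(0, VOCAB_SIZE, (batch, seq_len))
    token_states = torch.randn(batch, seq_len, HIDDEN_DIM)  # stand-in for encoder output
    pooled_state = torch.randn(batch, HIDDEN_DIM)
    rep = dual_document_representation(token_ids, token_states, pooled_state,
                                       TermWeightingEncoder(), TermExpansionEncoder())
    print(rep.shape)  # torch.Size([2, 30522])
```

Note that all of the neural computation happens on the document side: because only documents pass through the dual encoder, queries can stay as plain lexical lookups against the index, which is what lets a uni-encoder model like this sidestep per-query inference cost.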

Authors (6)
  1. Eunseong Choi (8 papers)
  2. Sunkyung Lee (9 papers)
  3. Minjin Choi (22 papers)
  4. Hyeseon Ko (1 paper)
  5. Young-In Song (2 papers)
  6. Jongwuk Lee (24 papers)
Citations (15)