SpaDE: Improving Sparse Representations using a Dual Document Encoder for First-stage Retrieval (2209.05917v3)
Abstract: Sparse document representations have been widely used to retrieve relevant documents via exact lexical matching. Thanks to a pre-computed inverted index, they support fast ad-hoc search but suffer from the vocabulary mismatch problem. Although recent neural ranking models built on pre-trained language models can address this problem, they usually incur expensive query inference costs, implying a trade-off between effectiveness and efficiency. To tackle this trade-off, we propose a novel uni-encoder ranking model, Sparse retriever using a Dual document Encoder (SpaDE), which learns document representations via a dual encoder. Each encoder plays a central role in (i) adjusting the importance of terms to improve lexical matching and (ii) expanding additional terms to support semantic matching. Furthermore, our co-training strategy trains the dual encoder effectively and avoids unnecessary interference between the two encoders during training. Experimental results on several benchmarks show that SpaDE outperforms existing uni-encoder ranking models.
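To make the dual-encoder idea concrete, here is a minimal, illustrative sketch (not the authors' actual architecture; the hand-picked scores stand in for model predictions): one component re-weights terms already present in the document, the other adds expansion terms absent from it, and the merged sparse vector is scored by exact lexical matching, as an inverted index would compute.

```python
def encode_document(doc_terms, weighter, expander):
    """Merge term weighting and term expansion into one sparse vector.

    weighter: predicted importance for terms in the document (component i).
    expander: predicted scores for related terms NOT in the document (component ii).
    Both are toy dicts here; in SpaDE these scores come from learned encoders.
    """
    rep = {}
    for term in doc_terms:                      # (i) adjust term importance
        rep[term] = rep.get(term, 0.0) + weighter.get(term, 1.0)
    for term, s in expander.items():            # (ii) expand extra terms
        if term not in rep and s > 0.0:
            rep[term] = s
    return rep

def score(query_terms, doc_rep):
    """Exact lexical matching: sum the weights of overlapping terms."""
    return sum(doc_rep.get(t, 0.0) for t in query_terms)

# Toy example: "search" is matched only because of expansion.
doc = ["sparse", "retrieval", "index"]
weighter = {"sparse": 2.0, "retrieval": 1.5, "index": 0.5}
expander = {"search": 0.8, "lexical": 0.4}

rep = encode_document(doc, weighter, expander)
print(score(["retrieval", "search"], rep))      # 1.5 + 0.8 = 2.3
```

Because the final representation is still a term-to-weight mapping, it can be served from a standard inverted index with no neural inference at query time, which is the efficiency side of the trade-off the abstract describes.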
Authors: Eunseong Choi, Sunkyung Lee, Minjin Choi, Hyeseon Ko, Young-In Song, Jongwuk Lee