Contextualization with SPLADE for High Recall Retrieval (2405.03972v1)
Abstract: High Recall Retrieval (HRR), such as eDiscovery and medical systematic review, is a search problem that seeks to minimize the cost of retrieving most of the relevant documents in a given collection. Iterative approaches, such as iterative relevance feedback and uncertainty sampling, have been shown to be effective under various operational scenarios. Although neural models have demonstrated success in other text-related tasks, linear models such as logistic regression generally remain more effective and efficient in HRR, since the model is trained on, and retrieves documents from, the same fixed collection. In this work, we leverage SPLADE, an efficient retrieval model that transforms documents into contextualized sparse vectors, for HRR. Our approach combines the best of both worlds: the contextualization of pretrained language models and the efficiency of linear models. It reduces review cost by 10% and 18% on two HRR evaluation collections under a one-phase review workflow with a target recall of 80%. The experiments are implemented with TARexp and are available at https://github.com/eugene-yang/LSR-for-TAR.
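The core recipe the abstract describes, SPLADE vectors as fixed features for an iteratively retrained linear classifier, can be sketched in a few lines. Below is a minimal illustration, not the authors' exact pipeline (see the linked repository for that): it assumes the `naver/splade-cocondenser-ensembledistil` Hugging Face checkpoint and a toy four-document collection, encodes documents with the standard SPLADE pooling (max over token positions of log(1 + ReLU(MLM logits))), then runs one relevance-feedback round with scikit-learn logistic regression.

```python
import numpy as np
import torch
from scipy.sparse import csr_matrix, vstack
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForMaskedLM, AutoTokenizer

CKPT = "naver/splade-cocondenser-ensembledistil"  # assumed checkpoint; any SPLADE MLM checkpoint works
tok = AutoTokenizer.from_pretrained(CKPT)
mlm = AutoModelForMaskedLM.from_pretrained(CKPT).eval()

@torch.no_grad()
def splade_encode(texts, batch_size=16):
    """Encode texts into SPLADE sparse vectors over the model's vocabulary."""
    chunks = []
    for i in range(0, len(texts), batch_size):
        enc = tok(texts[i:i + batch_size], padding=True, truncation=True,
                  max_length=256, return_tensors="pt")
        logits = mlm(**enc).logits                     # (batch, seq_len, vocab)
        # SPLADE pooling: log-saturated ReLU, max over token positions,
        # with padding positions zeroed out via the attention mask.
        w = torch.log1p(torch.relu(logits))
        w = w * enc["attention_mask"].unsqueeze(-1)
        chunks.append(csr_matrix(w.max(dim=1).values.numpy()))
    return vstack(chunks)

# Toy collection; real HRR collections hold thousands to millions of documents.
docs = [
    "contract termination and breach of agreement",
    "quarterly earnings call and revenue guidance",
    "dispute over termination of an employment contract",
    "scheduling the team offsite lunch",
]
X = splade_encode(docs)           # encode the whole fixed collection once

# One TAR iteration: fit on the reviewed seed set, score everything else,
# and send the top-scoring unreviewed documents to the next review batch.
reviewed = np.array([0, 1])       # indices a human has already labeled
labels = np.array([1, 0])         # 1 = relevant, 0 = not relevant
clf = LogisticRegression(max_iter=1000).fit(X[reviewed], labels)
scores = clf.predict_proba(X)[:, 1]
scores[reviewed] = -np.inf        # never re-queue reviewed documents
next_batch = np.argsort(-scores)[:2]
print("review next:", next_batch)
```

In a full TAR run this loop repeats until a stopping rule fires. The design point the abstract emphasizes is that the expensive contextualized encoding happens once for the fixed collection, while each iteration only retrains a cheap linear model over the precomputed sparse matrix.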
- Jason R. Baron, Ralph C. Losey, and Michael D. Berman (Eds.). 2016. Perspectives on Predictive Coding: And Other Advanced Search Methods for the Legal Practitioner. American Bar Association, Section of Litigation. https://books.google.com/books?id=TdJ2AQAACAAJ
- The Sedona Conference. 2007. The Sedona Conference® Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery. The Sedona Conference Journal 8, 189–223.
- David C. Blair and Melvin E. Maron. 1985. An evaluation of retrieval effectiveness for a full-text document-retrieval system. Commun. ACM 28, 3 (1985), 289–299.
- Gordon V. Cormack and Maura R. Grossman. 2014. Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In SIGIR 2014. 153–162. https://doi.org/10.1145/2600428.2609601
- Gordon V. Cormack and Maura R. Grossman. 2016. Engineering Quality and Reliability in Technology-Assisted Review. In SIGIR 2016. ACM Press, Pisa, Italy, 75–84. https://doi.org/10.1145/2911451.2911510
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
- Thibault Formal, Carlos Lassance, Benjamin Piwowarski, and Stéphane Clinchant. 2022. From distillation to hard negative sampling: Making sparse neural IR models more effective. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2353–2359.
- Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant. 2021. SPLADE: Sparse lexical and expansion model for first stage ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2288–2292.
- Luyu Gao and Jamie Callan. 2022. Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 2843–2853.
- Maura R. Grossman, Gordon V. Cormack, and Adam Roegiest. 2016. TREC 2016 Total Recall Track Overview. In TREC 2016.
- Karyn Harty. 2017. Discovery Program. In Law Society Gazette. Vol. 111. Dublin, Ireland, 44–47.
- Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data 7, 3 (2019), 535–547.
- Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020).
- Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 39–48.
- David D. Lewis. 2016. Defining and Estimating Effectiveness in Document Review. In Perspectives on Predictive Coding: And Other Advanced Search Methods for the Legal Practitioner. American Bar Association, Section of Litigation.
- David D. Lewis and Jason Catlett. 1994. Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994. Elsevier, 148–156.
- David D. Lewis and William A. Gale. 1994. A sequential algorithm for training text classifiers. In SIGIR 1994. 3–12.
- David D. Lewis, Eugene Yang, and Ophir Frieder. 2021. Certifying One-Phase Technology-Assisted Reviews. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (CIKM). 893–902. https://arxiv.org/abs/2108.12746
- David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. 2004. RCV1: A New Benchmark Collection for Text Categorization Research. JMLR 5 (2004), 361–397.
- Dan Li and Evangelos Kanoulas. 2020. When to stop reviewing in technology-assisted reviews: Sampling from an adaptive distribution to estimate residual relevant documents. ACM Transactions on Information Systems (TOIS) 38, 4 (2020), 1–36.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs.CL]
- Sean MacAvaney, Franco Maria Nardini, Raffaele Perego, Nicola Tonellotto, Nazli Goharian, and Ophir Frieder. 2020. Expansion via prediction of importance with contextualization. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1573–1576.
- Antonio Mallia, Michał Siedlaczek, Joel Mackenzie, and Torsten Suel. 2019. PISA: Performant indexes and search for academia. In Proceedings of the Open-Source IR Replicability Challenge (2019).
- Xinyu Mao, Bevan Koopman, and Guido Zuccon. 2024. A Reproducibility Study of Goldilocks: Just-Right Tuning of BERT for TAR. In European Conference on Information Retrieval.
- Thong Nguyen, Sean MacAvaney, and Andrew Yates. 2023. A Unified Framework for Learned Sparse Retrieval. In Advances in Information Retrieval: 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2–6, 2023, Proceedings, Part III. Springer, 101–116.
- Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv preprint arXiv:1611.09268 (2016). http://arxiv.org/abs/1611.09268
- Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document Ranking with a Pretrained Sequence-to-Sequence Model. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 708–718. https://doi.org/10.18653/v1/2020.findings-emnlp.63
- Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. 2019. Multi-stage document ranking with BERT. arXiv preprint arXiv:1910.14424 (2019).
- Douglas W. Oard, Jason R. Baron, Bruce Hedin, David D. Lewis, and Stephen Tomlinson. 2010. Evaluation of information retrieval for E-discovery. Artificial Intelligence and Law 18 (2010), 347–386.
- Douglas W. Oard and William Webber. 2013. Information Retrieval for E-Discovery. Foundations and Trends® in Information Retrieval 7, 2–3 (2013), 99–237. https://doi.org/10.1561/1500000025
- Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research 21, 140 (2020), 1–67. http://jmlr.org/papers/v21/20-074.html
- Adam Roegiest, Gordon V. Cormack, Maura R. Grossman, and Charles L. A. Clarke. 2015. TREC 2015 Total Recall Track Overview. In TREC 2015.
- Keshav Santhanam, Omar Khattab, Christopher Potts, and Matei Zaharia. 2022. PLAID: An efficient engine for late interaction retrieval. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 1747–1756.
- Mark Stevenson and Reem Bin-Hezam. 2023. Stopping Methods for Technology-assisted Reviews Based on Point Processes. ACM Transactions on Information Systems 42, 3 (2023), 1–37.
- Byron C. Wallace, Thomas A. Trikalinos, Joseph Lau, Carla E. Brodley, and Christopher H. Schmid. 2010. Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinformatics 11, 1 (2010), 55.
- Eugene Yang, David Grossman, Ophir Frieder, and Roman Yurchak. 2017. Effectiveness results for popular e-discovery algorithms. In Proceedings of the 16th International Conference on Artificial Intelligence and Law. 261–264.
- Eugene Yang and David D Lewis. 2022. TARexp: A Python Framework for Technology-Assisted Review Experiments. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3256–3261.
- Eugene Yang, David D. Lewis, and Ophir Frieder. 2019. A regularization approach to combining keywords and training data in technology-assisted review. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law. 153–162.
- Eugene Yang, David D. Lewis, and Ophir Frieder. 2019. Text retrieval priors for Bayesian logistic regression. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1045–1048.
- Eugene Yang, David D. Lewis, and Ophir Frieder. 2021. Heuristic stopping rules for technology-assisted review. In Proceedings of the 21st ACM Symposium on Document Engineering. 1–10.
- Eugene Yang, David D. Lewis, and Ophir Frieder. 2021. On Minimizing Cost in Legal Document Review Workflows. In Proceedings of the 21st ACM Symposium on Document Engineering.
- Eugene Yang, Sean MacAvaney, David D. Lewis, and Ophir Frieder. 2022. Goldilocks: Just-right tuning of BERT for technology-assisted review. In European Conference on Information Retrieval. Springer, 502–517.