
InRanker: Distilled Rankers for Zero-shot Information Retrieval (2401.06910v1)

Published 12 Jan 2024 in cs.IR

Abstract: Despite multi-billion parameter neural rankers being common components of state-of-the-art information retrieval pipelines, they are rarely used in production due to the enormous amount of compute required for inference. In this work, we propose a new method for distilling large rankers into their smaller versions focusing on out-of-domain effectiveness. We introduce InRanker, a version of monoT5 distilled from monoT5-3B with increased effectiveness on out-of-domain scenarios. Our key insight is to use LLMs and rerankers to generate as much as possible synthetic "in-domain" training data, i.e., data that closely resembles the data that will be seen at retrieval time. The pipeline consists of two distillation phases that do not require additional user queries or manual annotations: (1) training on existing supervised soft teacher labels, and (2) training on teacher soft labels for synthetic queries generated using a LLM. Consequently, models like monoT5-60M and monoT5-220M improved their effectiveness by using the teacher's knowledge, despite being 50x and 13x smaller, respectively. Models and code are available at https://github.com/unicamp-dl/InRanker.

References (29)
  1. Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 1869–1873. https://doi.org/10.1145/3539618.3591960
  2. InPars: Unsupervised Dataset Generation for Information Retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (Madrid, Spain) (SIGIR ’22). Association for Computing Machinery, New York, NY, USA, 2387–2392. https://doi.org/10.1145/3477495.3531863
  3. InPars-Light: Cost-Effective Unsupervised Training of Efficient Rankers. arXiv:2301.02998 [cs.IR]
  4. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
  5. PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311 [cs.CL]
  6. Scaling Instruction-Finetuned Language Models. arXiv:2210.11416 [cs.LG]
  7. Overview of the TREC 2020 deep learning track. arXiv:2102.07662 [cs.IR]
  8. Overview of the TREC 2021 Deep Learning Track. In Proceedings of the Thirtieth Text REtrieval Conference, TREC 2021, online, November 15-19, 2021 (NIST Special Publication, Vol. 500-335), Ian Soboroff and Angela Ellis (Eds.). National Institute of Standards and Technology (NIST). https://trec.nist.gov/pubs/trec30/papers/Overview-DL.pdf
  9. Overview of the TREC 2022 Deep Learning Track. In Proceedings of the Thirty-First Text REtrieval Conference, TREC 2022, online, November 15-19, 2022 (NIST Special Publication, Vol. 500-338), Ian Soboroff and Angela Ellis (Eds.). National Institute of Standards and Technology (NIST). https://trec.nist.gov/pubs/trec31/papers/Overview_deep.pdf
  10. Promptagator: Few-shot Dense Retrieval From 8 Examples. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=gmL46YMpu2J
  11. From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (Madrid, Spain) (SIGIR ’22). Association for Computing Machinery, New York, NY, USA, 2353–2359. https://doi.org/10.1145/3477495.3531857
  12. Specializing Smaller Language Models towards Multi-Step Reasoning. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 10421–10430. https://proceedings.mlr.press/v202/fu23d.html
  13. Dense Retrieval Adaptation Using Target Domain Description. In Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval (Taipei, Taiwan) (ICTIR ’23). Association for Computing Machinery, New York, NY, USA, 95–104. https://doi.org/10.1145/3578337.3605127
  14. Distilling the Knowledge in a Neural Network. arXiv:1503.02531 [stat.ML]
  15. Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. arXiv:2010.02666 [cs.IR]
  16. InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval. arXiv:2301.01820 [cs.IR]
  17. Teaching Small Language Models to Reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 1773–1781. https://doi.org/10.18653/v1/2023.acl-short.151
  18. CTR-BERT: Cost-effective knowledge distillation for billion-parameter teacher models. In NeurIPS Efficient Natural Language and Speech Processing Workshop.
  19. Text and Code Embeddings by Contrastive Pre-Training. arXiv:2201.10005 [cs.CL]
  20. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268 [cs.CL]
  21. Large Dual Encoders Are Generalizable Retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 9844–9855. https://doi.org/10.18653/v1/2022.emnlp-main.669
  22. Document Ranking with a Pretrained Sequence-to-Sequence Model. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 708–718. https://doi.org/10.18653/v1/2020.findings-emnlp.63
  23. Improving Content Retrievability in Search with Controllable Query Generation. In Proceedings of the ACM Web Conference 2023 (Austin, TX, USA) (WWW ’23). Association for Computing Machinery, New York, NY, USA, 3182–3192. https://doi.org/10.1145/3543507.3583261
  24. The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models. arXiv:2101.05667 [cs.IR]
  25. A Thorough Examination on Zero-shot Dense Retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 15783–15796. https://doi.org/10.18653/v1/2023.findings-emnlp.1057
  26. No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval. arXiv:2206.02873 [cs.IR]
  27. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://openreview.net/forum?id=wCu6T5xFjeJ
  28. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 5776–5788. https://proceedings.neurips.cc/paper_files/paper/2020/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  29. RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 2308–2313. https://doi.org/10.1145/3539618.3592047
Authors (4)
  1. Thiago Laitz (6 papers)
  2. Konstantinos Papakostas (4 papers)
  3. Roberto Lotufo (41 papers)
  4. Rodrigo Nogueira (70 papers)
Citations (2)

Summary

Analyzing "InRanker: Distilled Rankers for Zero-shot Information Retrieval"

The paper "InRanker: Distilled Rankers for Zero-shot Information Retrieval" introduces a method for addressing the computational challenges of deploying large-scale neural rankers in information retrieval (IR) systems. The premise is that while multi-billion parameter neural models are highly effective across IR tasks, their computational requirements make them impractical for latency-critical production environments. The authors therefore propose a distillation technique that compresses large rankers into smaller, more efficient models without significantly sacrificing performance, particularly on out-of-domain retrieval tasks.

Methodological Framework

The core contribution of this paper is the InRanker methodology, which entails a two-phase distillation approach. The authors distill the large monoT5-3B model into smaller versions, namely InRanker-60M and InRanker-220M, with a focus on enhancing zero-shot performance on unseen data. The innovation lies in generating synthetic "in-domain" training data through LLMs and rerankers, circumventing the need for additional user queries or manual annotations.

  1. Phase One - Soft Label Training: This initial distillation phase uses pre-existing supervised soft labels from the large teacher model (monoT5-3B) to train the smaller student models. The objective is to align the student's logits with the teacher's on datasets such as MS MARCO, giving the student a strong starting point before domain adaptation.
  2. Phase Two - Synthetic Query Distillation: In this phase, synthetic queries are generated with an LLM to build a corpus that reflects the target domain's semantic characteristics. The student is then further trained on soft labels the teacher assigns to these synthetic query-document pairs, so that training approximates, as closely as possible, the data the model will encounter at retrieval time. A combined code sketch of both phases follows this list.
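
The sketch below illustrates how such a two-phase distillation pipeline can be wired together. It assumes a monoT5-style reranker that scores a query-document pair via the probabilities of the "true" and "false" tokens, a KL-divergence loss between teacher and student score distributions, and Hugging Face checkpoints and a query-generation prompt chosen purely for illustration; it is not the authors' exact implementation.

```python
# Minimal sketch of a two-phase distillation pipeline in the spirit of InRanker.
# Assumptions (not taken from the paper's code): monoT5-style checkpoints, scoring
# via the "true"/"false" token logits, a KL-divergence loss, and an illustrative
# query-generation prompt and LLM.
import torch
import torch.nn.functional as F
from transformers import T5ForConditionalGeneration, T5Tokenizer, pipeline

tokenizer = T5Tokenizer.from_pretrained("t5-base")
teacher = T5ForConditionalGeneration.from_pretrained(
    "castorini/monot5-3b-msmarco-10k"  # assumed teacher checkpoint
).eval()
student = T5ForConditionalGeneration.from_pretrained("t5-small")  # ~60M parameters
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

TRUE_ID = tokenizer.encode("true", add_special_tokens=False)[0]
FALSE_ID = tokenizer.encode("false", add_special_tokens=False)[0]


def relevance_logits(model, query, passage):
    """Logits of the 'true' and 'false' tokens for one query-passage pair."""
    text = f"Query: {query} Document: {passage} Relevant:"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=start).logits[0, 0]
    return torch.stack([logits[TRUE_ID], logits[FALSE_ID]])


def distill_step(query, passage):
    """One update pulling the student's score distribution toward the teacher's."""
    with torch.no_grad():
        teacher_probs = F.softmax(relevance_logits(teacher, query, passage), dim=-1)
    student_log_probs = F.log_softmax(relevance_logits(student, query, passage), dim=-1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="sum")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Phase 1: distill on existing supervised query-passage pairs (e.g., MS MARCO).
for query, passage in [("what is a lobster roll", "A lobster roll is a sandwich ...")]:
    distill_step(query, passage)

# Phase 2: generate synthetic "in-domain" queries with an LLM, then distill on the
# teacher's soft labels for those synthetic query-document pairs.
query_generator = pipeline("text2text-generation", model="google/flan-t5-large")


def synthetic_query(document):
    """Ask the LLM for a plausible query that the document answers."""
    prompt = (
        "Write a search query that the following document answers.\n"
        f"Document: {document}\nQuery:"
    )
    return query_generator(prompt, max_new_tokens=32, do_sample=True)[0]["generated_text"]


for passage in ["Corals are marine invertebrates that build reefs ..."]:
    distill_step(synthetic_query(passage), passage)
```

In practice each phase would loop over its full dataset (the MS MARCO training split, then queries generated for the target corpus); only the data flow and loss are shown here.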

Empirical Results

The evaluations show considerable improvements in nDCG@10 for the distilled models, particularly InRanker-60M and InRanker-220M, on the BEIR benchmark. The results are noteworthy given that these models, despite being 50x and 13x smaller than their teacher, retain most of its effectiveness with only marginal trade-offs. Interestingly, the results also suggest gains from self-distillation, in which a model is trained on soft labels produced by itself.
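
For reference, nDCG@10 can be computed as in the generic sketch below. It uses the standard linear-gain formulation (evaluation toolkits, including the one behind BEIR, may use the exponential gain 2^rel - 1), so it is illustrative rather than the authors' evaluation harness.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the system ranking normalized by the ideal (sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Graded relevance labels of retrieved documents, in the order the reranker returned them.
print(round(ndcg_at_k([3, 2, 0, 1, 0, 0, 2, 0, 0, 0]), 4))
```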

The paper compares the distilled models against strong baselines such as Promptagator and RankT5, showing competitive performance on several tasks despite lower scores on some datasets. The analysis also acknowledges that the synthetic data could be made more representative, as evidenced by the superior results obtained when real dataset queries are used for distillation instead.

Implications and Future Directions

The implications of this work extend into practical and theoretical domains. Practically, the ability to deploy more resource-efficient models without significant performance degradation is vital for scalable IR applications that operate under strict latency constraints. Theoretically, the paper underscores the viability of knowledge distillation in enhancing model generalization across domains while also highlighting the beneficial impact of well-crafted synthetic data in model training.

The framework sets a precedent for further exploration into optimizing the synthetic query generation process, potentially through advanced LLM techniques or further refinements in soft label usage. Future research could investigate diverse application domains, exploring how domain-specific characteristics influence the efficacy of distilled models. Furthermore, distilling from a variety of teacher models could provide insights into the robustness and adaptability of student models across different IR landscapes.

Through these explorations, the paper contributes significantly to the discourse on efficient IR model deployment, highlighting pathways towards achieving greater adaptability with limited computational trade-offs.
