
InRanker: Distilled Rankers for Zero-shot Information Retrieval (2401.06910v1)

Published 12 Jan 2024 in cs.IR

Abstract: Despite multi-billion parameter neural rankers being common components of state-of-the-art information retrieval pipelines, they are rarely used in production due to the enormous amount of compute required for inference. In this work, we propose a new method for distilling large rankers into their smaller versions focusing on out-of-domain effectiveness. We introduce InRanker, a version of monoT5 distilled from monoT5-3B with increased effectiveness on out-of-domain scenarios. Our key insight is to use LLMs and rerankers to generate as much as possible synthetic "in-domain" training data, i.e., data that closely resembles the data that will be seen at retrieval time. The pipeline consists of two distillation phases that do not require additional user queries or manual annotations: (1) training on existing supervised soft teacher labels, and (2) training on teacher soft labels for synthetic queries generated using a LLM. Consequently, models like monoT5-60M and monoT5-220M improved their effectiveness by using the teacher's knowledge, despite being 50x and 13x smaller, respectively. Models and code are available at https://github.com/unicamp-dl/InRanker.

References (29)
  1. Can Generative LLMs Create Query Variants for Test Collections? An Exploratory Study. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 1869–1873. https://doi.org/10.1145/3539618.3591960
  2. InPars: Unsupervised Dataset Generation for Information Retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (Madrid, Spain) (SIGIR ’22). Association for Computing Machinery, New York, NY, USA, 2387–2392. https://doi.org/10.1145/3477495.3531863
  3. InPars-Light: Cost-Effective Unsupervised Training of Efficient Rankers. arXiv:2301.02998 [cs.IR]
  4. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
  5. PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311 [cs.CL]
  6. Scaling Instruction-Finetuned Language Models. arXiv:2210.11416 [cs.LG]
  7. Overview of the TREC 2020 deep learning track. arXiv:2102.07662 [cs.IR]
  8. Overview of the TREC 2021 Deep Learning Track. In Proceedings of the Thirtieth Text REtrieval Conference, TREC 2021, online, November 15-19, 2021 (NIST Special Publication, Vol. 500-335), Ian Soboroff and Angela Ellis (Eds.). National Institute of Standards and Technology (NIST). https://trec.nist.gov/pubs/trec30/papers/Overview-DL.pdf
  9. Overview of the TREC 2022 Deep Learning Track. In Proceedings of the Thirty-First Text REtrieval Conference, TREC 2022, online, November 15-19, 2022 (NIST Special Publication, Vol. 500-338), Ian Soboroff and Angela Ellis (Eds.). National Institute of Standards and Technology (NIST). https://trec.nist.gov/pubs/trec31/papers/Overview_deep.pdf
  10. Promptagator: Few-shot Dense Retrieval From 8 Examples. In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=gmL46YMpu2J
  11. From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (Madrid, Spain) (SIGIR ’22). Association for Computing Machinery, New York, NY, USA, 2353–2359. https://doi.org/10.1145/3477495.3531857
  12. Specializing Smaller Language Models towards Multi-Step Reasoning. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 10421–10430. https://proceedings.mlr.press/v202/fu23d.html
  13. Dense Retrieval Adaptation Using Target Domain Description. In Proceedings of the 2023 ACM SIGIR International Conference on Theory of Information Retrieval (Taipei, Taiwan) (ICTIR ’23). Association for Computing Machinery, New York, NY, USA, 95–104. https://doi.org/10.1145/3578337.3605127
  14. Distilling the Knowledge in a Neural Network. arXiv:1503.02531 [stat.ML]
  15. Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. arXiv:2010.02666 [cs.IR]
  16. InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval. arXiv:2301.01820 [cs.IR]
  17. Teaching Small Language Models to Reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (Eds.). Association for Computational Linguistics, Toronto, Canada, 1773–1781. https://doi.org/10.18653/v1/2023.acl-short.151
  18. CTR-BERT: Cost-effective knowledge distillation for billion-parameter teacher models. In NeurIPS Efficient Natural Language and Speech Processing Workshop.
  19. Text and Code Embeddings by Contrastive Pre-Training. arXiv:2201.10005 [cs.CL]
  20. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268 [cs.CL]
  21. Large Dual Encoders Are Generalizable Retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 9844–9855. https://doi.org/10.18653/v1/2022.emnlp-main.669
  22. Document Ranking with a Pretrained Sequence-to-Sequence Model. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 708–718. https://doi.org/10.18653/v1/2020.findings-emnlp.63
  23. Improving Content Retrievability in Search with Controllable Query Generation. In Proceedings of the ACM Web Conference 2023 (Austin, TX, USA) (WWW ’23). Association for Computing Machinery, New York, NY, USA, 3182–3192. https://doi.org/10.1145/3543507.3583261
  24. The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models. arXiv:2101.05667 [cs.IR]
  25. A Thorough Examination on Zero-shot Dense Retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 15783–15796. https://doi.org/10.18653/v1/2023.findings-emnlp.1057
  26. No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval. arXiv:2206.02873 [cs.IR]
  27. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://openreview.net/forum?id=wCu6T5xFjeJ
  28. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 5776–5788. https://proceedings.neurips.cc/paper_files/paper/2020/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  29. RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 2308–2313. https://doi.org/10.1145/3539618.3592047
Authors (4)
  1. Thiago Laitz (6 papers)
  2. Konstantinos Papakostas (4 papers)
  3. Roberto Lotufo (41 papers)
  4. Rodrigo Nogueira (70 papers)
Citations (2)

Summary

Analyzing "InRanker: Distilled Rankers for Zero-shot Information Retrieval"

The paper "InRanker: Distilled Rankers for Zero-shot Information Retrieval" introduces a method for addressing the computational challenges of deploying large-scale neural rankers in information retrieval (IR) systems. The premise is that while multi-billion parameter neural models are highly effective across IR tasks, their computational requirements make them impractical for latency-critical production environments. The authors therefore propose a distillation technique that compresses large rankers into smaller, more efficient models without significantly sacrificing performance, particularly on out-of-domain retrieval tasks.

Methodological Framework

The core contribution of this paper is the InRanker methodology, which entails a two-phase distillation approach. The authors distill the large monoT5-3B model into smaller versions, namely InRanker-60M and InRanker-220M, with a focus on enhancing zero-shot performance on unseen data. The innovation lies in generating synthetic "in-domain" training data through LLMs and rerankers, circumventing the need for additional user queries or manual annotations.

  1. Phase One - Soft Label Training: This initial distillation phase uses pre-existing supervised soft labels from the large teacher model (monoT5-3B) to train the smaller student models. The objective is to align the student's logits with the teacher's on datasets such as MS MARCO, giving the student a strong starting point before domain adaptation.
  2. Phase Two - Synthetic Query Distillation: In this phase, synthetic queries are generated with an LLM to build a corpus that reflects the target domain's semantic characteristics. The student is then further trained on soft labels the teacher assigns to these synthetic query-document pairs, so that training approximates, as closely as possible, the data the model will encounter at retrieval time. A combined code sketch of both phases follows this list.
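
The sketch below illustrates how such a two-phase distillation pipeline can be wired together. It assumes a monoT5-style reranker that scores a query-document pair via the probabilities of the "true" and "false" tokens, a KL-divergence loss between teacher and student score distributions, and Hugging Face checkpoints and a query-generation prompt chosen purely for illustration; it is not the authors' exact implementation.

```python
# Minimal sketch of a two-phase distillation pipeline in the spirit of InRanker.
# Assumptions (not taken from the paper's code): monoT5-style checkpoints, scoring
# via the "true"/"false" token logits, a KL-divergence loss, and an illustrative
# query-generation prompt and LLM.
import torch
import torch.nn.functional as F
from transformers import T5ForConditionalGeneration, T5Tokenizer, pipeline

tokenizer = T5Tokenizer.from_pretrained("t5-base")
teacher = T5ForConditionalGeneration.from_pretrained(
    "castorini/monot5-3b-msmarco-10k"  # assumed teacher checkpoint
).eval()
student = T5ForConditionalGeneration.from_pretrained("t5-small")  # ~60M parameters
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

TRUE_ID = tokenizer.encode("true", add_special_tokens=False)[0]
FALSE_ID = tokenizer.encode("false", add_special_tokens=False)[0]


def relevance_logits(model, query, passage):
    """Logits of the 'true' and 'false' tokens for one query-passage pair."""
    text = f"Query: {query} Document: {passage} Relevant:"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=start).logits[0, 0]
    return torch.stack([logits[TRUE_ID], logits[FALSE_ID]])


def distill_step(query, passage):
    """One update pulling the student's score distribution toward the teacher's."""
    with torch.no_grad():
        teacher_probs = F.softmax(relevance_logits(teacher, query, passage), dim=-1)
    student_log_probs = F.log_softmax(relevance_logits(student, query, passage), dim=-1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="sum")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Phase 1: distill on existing supervised query-passage pairs (e.g., MS MARCO).
for query, passage in [("what is a lobster roll", "A lobster roll is a sandwich ...")]:
    distill_step(query, passage)

# Phase 2: generate synthetic "in-domain" queries with an LLM, then distill on the
# teacher's soft labels for those synthetic query-document pairs.
query_generator = pipeline("text2text-generation", model="google/flan-t5-large")


def synthetic_query(document):
    """Ask the LLM for a plausible query that the document answers."""
    prompt = (
        "Write a search query that the following document answers.\n"
        f"Document: {document}\nQuery:"
    )
    return query_generator(prompt, max_new_tokens=32, do_sample=True)[0]["generated_text"]


for passage in ["Corals are marine invertebrates that build reefs ..."]:
    distill_step(synthetic_query(passage), passage)
```

In practice each phase would loop over its full dataset (the MS MARCO training split, then queries generated for the target corpus); only the data flow and loss are shown here.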

Empirical Results

The evaluations show considerable improvements in nDCG@10 for the distilled models, particularly InRanker-60M and InRanker-220M, on the BEIR benchmark. The results are noteworthy given that these models, despite being 50x and 13x smaller than their teacher, retain most of its effectiveness with only marginal trade-offs. Interestingly, the results also suggest gains from self-distillation, in which a model is trained on soft labels produced by itself.
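
For reference, nDCG@10 can be computed as in the generic sketch below. It uses the standard linear-gain formulation (evaluation toolkits, including the one behind BEIR, may use the exponential gain 2^rel - 1), so it is illustrative rather than the authors' evaluation harness.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the system ranking normalized by the ideal (sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Graded relevance labels of retrieved documents, in the order the reranker returned them.
print(round(ndcg_at_k([3, 2, 0, 1, 0, 0, 2, 0, 0, 0]), 4))
```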

The paper compares the distilled models against strong baselines such as Promptagator and RankT5, showing competitive performance on several tasks despite lower scores on some datasets. The analysis also acknowledges that the synthetic data could be made more representative, as evidenced by the superior results obtained when real dataset queries are used for distillation instead.

Implications and Future Directions

The implications of this work extend into practical and theoretical domains. Practically, the ability to deploy more resource-efficient models without significant performance degradation is vital for scalable IR applications that operate under strict latency constraints. Theoretically, the paper underscores the viability of knowledge distillation in enhancing model generalization across domains while also highlighting the beneficial impact of well-crafted synthetic data in model training.

The framework sets a precedent for further exploration into optimizing the synthetic query generation process, potentially through advanced LLM techniques or further refinements in soft label usage. Future research could investigate diverse application domains, exploring how domain-specific characteristics influence the efficacy of distilled models. Furthermore, distilling from a variety of teacher models could provide insights into the robustness and adaptability of student models across different IR landscapes.

Through these explorations, the paper contributes significantly to the discourse on efficient IR model deployment, highlighting pathways towards achieving greater adaptability with limited computational trade-offs.
