Best Practices for Distilling Large Language Models into BERT for Web Search Ranking (2411.04539v1)

Published 7 Nov 2024 in cs.IR and cs.CL

Abstract: Recent studies have highlighted the significant potential of LLMs as zero-shot relevance rankers. These methods predominantly utilize prompt learning to assess the relevance between queries and documents by generating a ranked list of potential documents. Despite their promise, the substantial costs associated with LLMs pose a significant challenge for their direct implementation in commercial search systems. To overcome this barrier and fully exploit the capabilities of LLMs for text ranking, we explore techniques to transfer the ranking expertise of LLMs to a more compact model similar to BERT, using a ranking loss to enable the deployment of less resource-intensive models. Specifically, we enhance the training of LLMs through Continued Pre-Training, taking the query as input and the clicked title and summary as output. We then proceed with supervised fine-tuning of the LLM using a rank loss, assigning the final token as a representative of the entire sentence. Given the inherent characteristics of autoregressive LLMs, only the final token </s> can encapsulate all preceding tokens. Additionally, we introduce a hybrid point-wise and margin MSE loss to transfer the ranking knowledge from LLMs to smaller models like BERT. This method creates a viable solution for environments with strict resource constraints. Both offline and online evaluations have confirmed the efficacy of our approach, and our model has been successfully integrated into a commercial web search engine as of February 2024.

Best Practices for Distilling LLMs into BERT for Web Search Ranking

The paper "Best Practices for Distilling LLMs into BERT for Web Search Ranking" investigates a pragmatic approach to harnessing the efficacy of LLMs for web search ranking, while mitigating the computational overhead associated with their deployment. Dezhi Ye and colleagues from Tencent address the challenges posed by the resource-intensive nature of LLMs, particularly when applied to commercial search engines, and propose a novel framework dubbed DisRanker. This methodology aims to transfer the ranking competencies of LLMs into a more efficient model akin to BERT.

Methodology

The proposed workflow begins with a continued pre-training (CPT) phase, where the LLM ingests clickstream data: queries serve as inputs and the model is trained to generate the clicked titles and summaries, improving its grasp of query-document interactions. Following CPT, the LLM undergoes supervised fine-tuning with a pairwise rank loss. The design uses the end-of-sequence token as a representation of the entire query-document pair, a strategy that diverges from the typical reliance on the [CLS] token in bidirectional encoders.
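Conceptually, the fine-tuning step can be sketched as follows: the hidden state of the final end-of-sequence token is projected to a scalar relevance score, and a pairwise rank loss pushes the clicked document above the non-clicked one for the same query. The PyTorch sketch below assumes a Hugging Face-style causal LM backbone; the `LLMRanker` class, the linear scoring head, and the softplus pairwise loss are illustrative choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class LLMRanker(nn.Module):
    """Scores a query-document pair using the hidden state of the final
    end-of-sequence token of a causal LM (assumed backbone, e.g. loaded
    via Hugging Face AutoModel). Illustrative, not the paper's code."""

    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                # [batch, seq, dim]
        # Index of the last non-padded token, i.e. the appended </s>.
        last_idx = attention_mask.sum(dim=1) - 1
        eos_hidden = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.score_head(eos_hidden).squeeze(-1)     # [batch]


def pairwise_rank_loss(pos_scores, neg_scores):
    """Logistic pairwise loss: the clicked (positive) document should
    outscore the non-clicked (negative) one for the same query."""
    return torch.nn.functional.softplus(neg_scores - pos_scores).mean()
```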

The distillation phase employs a hybrid loss strategy combining Point-wise Mean Squared Error (MSE) and Margin-MSE to facilitate knowledge transfer from the LLM to a streamlined BERT model. This approach allows the BERT model to not only mirror the absolute scores from the teacher model but also uphold the relational ranking, ensuring that the student model retains the ordinal characteristics necessary for ranking tasks.
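A minimal sketch of such a hybrid objective is shown below, assuming the teacher LLM and the student BERT each emit scalar scores for the same positive and negative documents. The `alpha` weighting and function names are assumptions made for illustration rather than details taken from the paper.

```python
import torch.nn.functional as F


def hybrid_distill_loss(student_pos, student_neg,
                        teacher_pos, teacher_neg, alpha=0.5):
    """Hybrid distillation loss: point-wise MSE pulls the student's absolute
    scores toward the teacher's, while Margin-MSE matches the score *gap*
    between positive and negative documents so the pairwise ordering is
    preserved. `alpha` is an illustrative weighting."""
    pointwise = F.mse_loss(student_pos, teacher_pos) + \
                F.mse_loss(student_neg, teacher_neg)
    margin = F.mse_loss(student_pos - student_neg,
                        teacher_pos - teacher_neg)
    return alpha * pointwise + (1.0 - alpha) * margin
```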

Experimental Validation

Extensive experimentation reveals that unsupervised LLM rankers lag behind domain-specific, fine-tuned BERT models. After supervised fine-tuning with the rank loss, however, performance improves markedly. Distillation with the combined point-wise and margin-based losses further enhances the student's ranking capacity, as evidenced by a 1.6% increase in Positive-Negative Ratio (PNR) over purely point-wise distillation.
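For reference, PNR is commonly computed per query as the number of concordant document pairs divided by the number of discordant pairs under human relevance labels. The sketch below reflects that common definition and is not taken from the paper's evaluation code.

```python
import itertools


def positive_negative_ratio(scores, labels):
    """Positive-Negative Ratio for one query: concordant pairs divided by
    discordant pairs, given model scores and graded relevance labels.
    Common definition, used here only as an illustrative sketch."""
    concordant = discordant = 0
    for (s_i, l_i), (s_j, l_j) in itertools.combinations(zip(scores, labels), 2):
        if l_i == l_j:
            continue  # ties in the labels contribute to neither count
        if (s_i - s_j) * (l_i - l_j) > 0:
            concordant += 1
        else:
            discordant += 1
    return concordant / discordant if discordant else float("inf")
```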

In online A/B testing, deploying the distilled model yielded significant improvements in user engagement metrics, including a 0.47% increase in page click-through rate (CTR) and a 1.2% increase in average dwell time, highlighting its practical benefits.

Implications and Prospects

The integration of DisRanker into existing search engine infrastructures underscores its potential to substantially reduce deployment costs and enhance performance efficiency. This is particularly pertinent given the increasing scale and complexity of web queries in commercial search engines. The approach offers a scalable solution adaptable to stringent computational resource constraints.

Moving forward, the framework presented could spur further research into optimizing distillation procedures and loss functions tailored for specific tasks within NLP. It could also inspire the design of more adaptive models that strike a balance between performance and computational efficiency, fostering advancements in AI applications beyond web search. Such models may pave the way for broader accessibility of powerful NLP tools across various domains, allowing widespread deployment without prohibitive resource investments.

Authors (7)
  1. Dezhi Ye (1 paper)
  2. Junwei Hu (9 papers)
  3. Jiabin Fan (2 papers)
  4. Bowen Tian (10 papers)
  5. Jie Liu (492 papers)
  6. Haijin Liang (4 papers)
  7. Jin Ma (64 papers)