Improving Bi-encoder Document Ranking Models with Two Rankers and Multi-teacher Distillation (2103.06523v2)

Published 11 Mar 2021 in cs.IR

Abstract: BERT-based Neural Ranking Models (NRMs) can be classified according to how the query and document are encoded through BERT's self-attention layers - bi-encoder versus cross-encoder. Bi-encoder models are highly efficient because all the documents can be pre-processed before the query time, but their performance is inferior compared to cross-encoder models. Both models utilize a ranker that receives BERT representations as the input and generates a relevance score as the output. In this work, we propose a method where multi-teacher distillation is applied to a cross-encoder NRM and a bi-encoder NRM to produce a bi-encoder NRM with two rankers. The resulting student bi-encoder achieves an improved performance by simultaneously learning from a cross-encoder teacher and a bi-encoder teacher and also by combining relevance scores from the two rankers. We call this method TRMD (Two Rankers and Multi-teacher Distillation). In the experiments, TwinBERT and ColBERT are considered as baseline bi-encoders. When monoBERT is used as the cross-encoder teacher, together with either TwinBERT or ColBERT as the bi-encoder teacher, TRMD produces a student bi-encoder that performs better than the corresponding baseline bi-encoder. For P@20, the maximum improvement was 11.4%, and the average improvement was 6.8%. As an additional experiment, we considered producing cross-encoder students with TRMD, and found that it could also improve the cross-encoders.

Improving Bi-encoder Document Ranking Models with Two Rankers and Multi-teacher Distillation

The paper presents a novel approach to enhancing the performance of bi-encoder document ranking models in the context of BERT-based Neural Ranking Models (NRMs), focusing on their use in Information Retrieval (IR) systems. The proposed method, Two Rankers and Multi-teacher Distillation (TRMD), employs a two-ranker structure along with multi-teacher knowledge distillation to produce a stronger bi-encoder model. The approach directly addresses the known effectiveness gap between bi-encoder and cross-encoder models by leveraging the strengths of both types of encoders.

Methodology

The TRMD method uses two distinct rankers and combines the strengths of cross-encoder and bi-encoder models through distillation. A cross-encoder, such as monoBERT, excels at modeling query-document interactions because its self-attention spans the concatenated query and document, but it is computationally expensive at query time. Bi-encoders like TwinBERT and ColBERT, in contrast, pre-compute document representations, gaining efficiency at the cost of interaction modeling. TRMD uses monoBERT together with either TwinBERT or ColBERT as teachers in a knowledge distillation framework, enriching the student bi-encoder with representations from both teachers.
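To make the multi-teacher setup concrete, the following is a minimal sketch of how such a training objective could be assembled: a supervised ranking loss on the student's combined score plus a representation-level distillation term for each teacher. The MSE matching, the loss weights `alpha` and `beta`, and the choice of which representations are matched are illustrative assumptions, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def trmd_style_loss(student_scores,       # combined relevance scores from the two student rankers
                    relevance_labels,     # float relevance labels for the (query, document) pairs
                    student_cross_repr,   # student representation fed to the cross-style ranker
                    teacher_cross_repr,   # representation from the cross-encoder teacher (e.g. monoBERT)
                    student_bi_repr,      # student representation fed to the bi-style ranker
                    teacher_bi_repr,      # representation from the bi-encoder teacher (e.g. ColBERT/TwinBERT)
                    alpha=1.0, beta=1.0): # hypothetical weights; the paper's weighting may differ
    # Supervised ranking loss on the student's combined relevance score.
    rank_loss = F.binary_cross_entropy_with_logits(student_scores, relevance_labels)

    # Representation distillation from each teacher; teacher outputs are detached
    # so gradients flow only into the student.
    cross_distill = F.mse_loss(student_cross_repr, teacher_cross_repr.detach())
    bi_distill = F.mse_loss(student_bi_repr, teacher_bi_repr.detach())

    return rank_loss + alpha * cross_distill + beta * bi_distill
```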

In the student architecture, two parallel rankers each ingest a distinct BERT representation corresponding to one of the teachers. The bi-encoder student thus retains the advantage of pre-computed document representations and complements them with query-document interaction knowledge distilled from the cross-encoder. Combining the relevance scores of the two rankers improves the model's predictions without drastically increasing inference cost. A minimal sketch of this two-ranker structure follows.
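The sketch below illustrates one way a TRMD-style student could be organized: a shared BERT-style backbone that encodes query and document independently, feeding two parallel ranker heads whose scores are summed. The class and attribute names (`TwoRankerStudent`, `cross_style_ranker`, `bi_style_ranker`), the [CLS] pooling, and the simple concatenation-plus-linear heads are illustrative assumptions; the paper's rankers and pooling may differ (ColBERT, for instance, uses late interaction over token embeddings).

```python
import torch
import torch.nn as nn

class TwoRankerStudent(nn.Module):
    """Minimal sketch of a TRMD-style student: one bi-encoder backbone feeding
    two parallel ranker heads whose relevance scores are combined."""

    def __init__(self, encoder, hidden_size=768):
        super().__init__()
        self.encoder = encoder  # a pretrained BERT-style encoder, shared by query and document
        # Ranker 1: guided toward the cross-encoder teacher during distillation.
        self.cross_style_ranker = nn.Linear(2 * hidden_size, 1)
        # Ranker 2: guided toward the bi-encoder teacher during distillation.
        self.bi_style_ranker = nn.Linear(2 * hidden_size, 1)

    def forward(self, query_inputs, doc_inputs):
        # Encode query and document independently so document vectors can be pre-computed.
        q_repr = self.encoder(**query_inputs).last_hidden_state[:, 0]  # [CLS] pooling (assumption)
        d_repr = self.encoder(**doc_inputs).last_hidden_state[:, 0]

        joint = torch.cat([q_repr, d_repr], dim=-1)
        score_cross = self.cross_style_ranker(joint).squeeze(-1)
        score_bi = self.bi_style_ranker(joint).squeeze(-1)

        # Final relevance score combines the two rankers' outputs (simple sum here).
        return score_cross + score_bi
```

At query time, only the query needs to be encoded; document representations can be computed offline, which is the efficiency property the bi-encoder student preserves.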

Experimental Results

Experiments on the Robust04 and ClueWeb09-B datasets demonstrate the efficacy of TRMD in improving bi-encoder models such as TwinBERT and ColBERT. Bi-encoder students trained with TRMD exceeded their baseline performance, with P@20 improving by up to 11.4% for TwinBERT and 6.0% for ColBERT. TRMD also yielded improvements when applied to cross-encoder students, indicating its applicability across different model architectures. The loss convergence analysis indicates that the distillation process successfully transferred semantic and interaction knowledge from the teachers to the student models, highlighting the effectiveness of multi-teacher distillation in this context.

Implications and Future Directions

The implications of this research are significant in practical settings where computational efficiency is paramount, such as in large-scale information retrieval operations. By narrowing the performance gap between bi-encoder and cross-encoder models while maintaining the efficiency advantage, TRMD offers a viable path forward in improving BERT-based NRMs. The method's adaptability to both bi-encoder and cross-encoder models further enhances its utility in a wide range of applications.

Future developments may explore extending TRMD to other neural architectures or incorporating additional types of distillation techniques to strengthen the model's ability to capture complex relationships in the data. Exploring alternative ranker structures or further optimizing the distillation objectives could yield additional performance gains. The interplay between the efficiency of bi-encoders and the superior performance of cross-encoders provides a rich avenue for continued research and development within the domain of information retrieval and beyond.

In conclusion, the paper provides a well-grounded and methodologically sound approach to bi-encoder improvement, with clear experimental evidence supporting the efficacy of TRMD in enhancing neural ranking models. This work opens new possibilities for achieving a balance between efficiency and performance in BERT-based NRMs used in contemporary information retrieval systems.

Authors (4)
  1. Jaekeol Choi
  2. Euna Jung
  3. Jangwon Suh
  4. Wonjong Rhee
Citations (21)