TwinBERT: Efficient Retrieval with Decoupled BERT Structures
The paper "TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for Efficient Retrieval" addresses the challenges posed by the computational intensity of pre-trained LLMs, particularly BERT, in the domain of low-latency information retrieval (IR) systems. This research introduces TwinBERT, a novel model architecture designed to reduce serving latency while retaining the performance gains offered by BERT.
Model Architecture and Efficiency
TwinBERT diverges from the conventional BERT architecture by employing twin-structured encoders that independently process the query and the document. The key innovation lies in decoupling these inputs during encoding, which allows document embeddings to be pre-computed offline. This approach significantly reduces the real-time computational load by limiting it to query encoding and the interaction between the query and document embeddings via a crossing layer.
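To make the decoupling concrete, here is a minimal PyTorch sketch. It is illustrative only: the towers are generic Transformer encoders and the pooling is a plain mean, whereas the paper distills knowledge from a BERT teacher into each tower.

```python
import torch
import torch.nn as nn

# Minimal sketch of the twin-structured idea, not the paper's actual code.
# The key property: the two towers run independently, so document vectors
# can be computed offline and cached.

def make_encoder(dim: int = 128, layers: int = 2) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

class TwinTower(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.query_encoder = make_encoder(dim)  # runs online, per request
        self.doc_encoder = make_encoder(dim)    # runs offline; outputs cached

    def encode_query(self, q_tokens: torch.Tensor) -> torch.Tensor:
        # (batch, seq, dim) -> (batch, dim); mean pooling stands in for
        # the paper's pooling strategy.
        return self.query_encoder(q_tokens).mean(dim=1)

    def encode_doc(self, d_tokens: torch.Tensor) -> torch.Tensor:
        # Same shape contract; in production these vectors are precomputed
        # for every document and stored.
        return self.doc_encoder(d_tokens).mean(dim=1)
```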
The paper describes two versions of this crossing layer: TwinBERT_cos, which scores relevance with cosine similarity, and TwinBERT_res, which employs a residual network structure. TwinBERT_cos is compatible with approximate nearest neighbor (ANN) search, enabling efficient large-scale retrieval. TwinBERT_res, while computationally more expensive due to its deeper crossing network, yields better relevance prediction.
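Continuing the sketch, the two crossing variants can be written as follows. The combination operator in the residual variant (element-wise max here) and the logistic output head are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn.functional as F
from torch import nn

def cross_cos(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    # TwinBERT_cos: relevance is a single cosine similarity, which is what
    # makes the document embeddings directly usable in an ANN index.
    return F.cosine_similarity(q, d, dim=-1)

class CrossRes(nn.Module):
    # TwinBERT_res sketch: combine the two embeddings, apply a residual
    # feed-forward block, then score with a logistic head.
    def __init__(self, dim: int = 128):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.head = nn.Linear(dim, 1)

    def forward(self, q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
        x = torch.maximum(q, d)  # assumed combination; the paper's may differ
        x = x + self.ff(x)       # residual connection
        return torch.sigmoid(self.head(x)).squeeze(-1)
```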
Experimental Results
The effectiveness of TwinBERT is demonstrated through experiments on relevance and retrieval tasks. On relevance prediction, TwinBERT achieved AUC close to the BERT-Base teacher, with TwinBERT_res surpassing other compressed baselines such as a 3-layer BERT and C-DSSM. On retrieval, TwinBERT produced more relevant document matches than the C-DSSM baseline.
The inference-time results are particularly noteworthy: TwinBERT runs up to 663 times faster than BERT-Base, serving a query against 100 documents in under 20 milliseconds on CPU. This efficiency is pivotal for real-time IR systems, where latency is a hard constraint.
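The latency numbers follow from the architecture: with document vectors cached, serving a query is one small encoder pass plus a batch of cheap vector comparisons. A toy serving loop, reusing the sketches above with random tensors standing in for tokenized text:

```python
model = TwinTower(dim=128).eval()

with torch.no_grad():
    # Precomputed offline: one vector per candidate document.
    doc_vectors = model.encode_doc(torch.randn(100, 64, 128))
    # Online: encode the query once, then score all 100 candidates at once.
    q_vec = model.encode_query(torch.randn(1, 12, 128))
    scores = F.cosine_similarity(q_vec, doc_vectors, dim=-1)  # shape (100,)
    top5 = scores.topk(5).indices
```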
Implications and Deployment
TwinBERT's deployment in a major sponsored search system demonstrates its practical applicability and scalability. The model integrated into the production stack, reducing poor ad impressions while keeping serving latency within budget. The deployment shows that BERT-inspired architectures can serve contexts that demand fast, high-volume processing.
Future Directions
Looking forward, the paper hints at potential expansions of TwinBERT beyond IR, including domains such as question answering and digital assistants. The adaptability of its architecture suggests promising avenues for further refinements, such as integrating newer advancements in Transformers or enhancing the model's capacity by leveraging larger teacher models like BERT-Large.
In conclusion, TwinBERT is a thoughtful response to the limitations of deploying deep neural networks in real-time applications. By decoupling the query and document encoders and distilling from a BERT teacher, it offers a blueprint for serving BERT-quality models under strict latency budgets. This makes TwinBERT a valuable contribution to the ongoing discussion of efficient AI deployment strategies.