Scalable Attentive Sentence-Pair Modeling via Distilled Sentence Embedding (1908.05161v3)

Published 14 Aug 2019 in cs.LG, cs.CL, and stat.ML

Abstract: Recent state-of-the-art natural language understanding models, such as BERT and XLNet, score a pair of sentences (A and B) using multiple cross-attention operations, a process in which each word in sentence A attends to all words in sentence B and vice versa. As a result, computing the similarity between a query sentence and a set of candidate sentences requires propagating every query-candidate sentence pair through a stack of cross-attention layers. This exhaustive process becomes computationally prohibitive when the number of candidate sentences is large. In contrast, sentence embedding techniques learn a sentence-to-vector mapping and compute the similarity between the sentence vectors via simple elementary operations. In this paper, we introduce Distilled Sentence Embedding (DSE), a model based on knowledge distillation from cross-attentive models, focusing on sentence-pair tasks. The outline of DSE is as follows: given a cross-attentive teacher model (e.g., a fine-tuned BERT), we train a sentence-embedding-based student model to reconstruct the sentence-pair scores obtained by the teacher model. We empirically demonstrate the effectiveness of DSE on five GLUE sentence-pair tasks. DSE significantly outperforms several ELMo variants and other sentence embedding methods, while accelerating the computation of query-candidate sentence-pair similarities by several orders of magnitude, with an average relative degradation of 4.6% compared to BERT. Furthermore, we show that DSE produces sentence embeddings that reach state-of-the-art performance on universal sentence representation benchmarks. Our code is made publicly available at https://github.com/microsoft/Distilled-Sentence-Embedding.
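
The two ideas in the abstract, training a student to reconstruct teacher scores and scoring a query against precomputed candidate vectors, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract only says the student "reconstructs" teacher scores and that embeddings are compared via "simple elementary operations", so the mean-squared-error loss, the cosine similarity, and the function names `distillation_loss`, `cosine`, and `rank_candidates` are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def distillation_loss(student_scores: Sequence[float],
                      teacher_scores: Sequence[float]) -> float:
    """Mean squared error between student and teacher pair scores.

    Assumption: MSE stands in for whatever reconstruction loss is used;
    the abstract does not name one.
    """
    if len(student_scores) != len(teacher_scores):
        raise ValueError("score lists must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_scores, teacher_scores)]
    return sum(squared) / len(squared)


def rank_candidates(query_vec: Sequence[float],
                    candidate_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return candidate indices sorted from most to least similar.

    Candidate vectors are assumed to be embedded once, ahead of time,
    so each query costs one cheap comparison per candidate instead of
    one full cross-attention pass per query-candidate pair.
    """
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for precomputed sentence embeddings.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]
ranking = rank_candidates(query, candidates)
loss = distillation_loss([0.5, 1.0], [1.0, 1.0])
```

With these toy vectors the query is closest to the first candidate, and the cost per query grows with the number of candidates only through the inexpensive vector comparisons, which is the source of the speedup the abstract describes.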

Authors (6)

Oren Barkan (29 papers)
Noam Razin (15 papers)
Itzik Malkiel (19 papers)
Ori Katz (66 papers)
Avi Caciularu (46 papers)
Noam Koenigstein (31 papers)

Citations (37)

GitHub

GitHub - microsoft/Distilled-Sentence-Embedding: Scalable Attentive Sentence-Pair Modeling via Distilled Sentence Embedding (AAAI 2020) - PyTorch Implementation (31 stars)
