Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
The paper "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" by Nils Reimers and Iryna Gurevych addresses a critical challenge in NLP: the derivation of semantically meaningful sentence embeddings that are computationally efficient for tasks such as semantic similarity search, clustering, and information retrieval. By modifying the BERT architecture to use siamese and triplet networks, they propose Sentence-BERT (SBERT), a method that creates fixed-size sentence embeddings which can be easily compared using cosine-similarity.
Introduction
BERT and its variant RoBERTa have shown excellent performance on a range of NLP tasks, including sentence-pair regression tasks such as Semantic Textual Similarity (STS). These models, however, operate as cross-encoders: both sentences must be fed into the network together, so comparing every pair in a collection of 10,000 sentences requires roughly 50 million inference passes (about 65 hours with BERT), which makes them impractical for large-scale similarity search and clustering. The paper introduces SBERT to address this limitation by encoding each sentence independently in a single pass, after which embeddings can be compared with inexpensive similarity measures such as cosine similarity.
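As a concrete illustration of this usage pattern, the sketch below encodes a few sentences independently and compares them with cosine similarity. It assumes the authors' publicly released sentence-transformers package and one of its pretrained checkpoint names; the exact API and model name may differ across versions.

```python
# Minimal usage sketch: encode each sentence once, then compare with cosine similarity.
# Assumes the sentence-transformers package (pip install sentence-transformers);
# the checkpoint name below is illustrative and may vary by package version.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bert-base-nli-mean-tokens")

sentences = [
    "A man is playing a guitar.",
    "Someone is playing an instrument.",
    "The weather is cold today.",
]

# Each sentence is encoded independently into a fixed-size vector.
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities between all sentence embeddings.
scores = util.cos_sim(embeddings, embeddings)
print(scores)
```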
Model and Methodology
SBERT adds a pooling operation to the output of BERT to derive fixed-size sentence embeddings. The network is then fine-tuned with siamese and triplet structures so that semantically similar sentences end up close in the vector space. The training objectives experimented with include the following (sketched in code after the list):
- Classification Objective Function: concatenates the sentence embeddings u and v with their element-wise difference |u - v|, multiplies the result by a trainable weight matrix, and trains a softmax classifier with cross-entropy loss.
- Regression Objective Function: computes the cosine similarity between the two sentence embeddings and trains with mean squared error loss against the gold similarity score.
- Triplet Objective Function: given an anchor, a positive, and a negative sentence, minimizes a triplet loss so that the anchor is closer to the positive than to the negative by at least a margin.
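A rough sketch of how these three objectives might be wired up is shown below. It assumes the sentence embeddings u and v have already been produced by the shared BERT-plus-pooling encoder; tensor shapes, label counts, and placeholder data are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch of 16 sentence pairs, 768-dim embeddings
# (as produced by a shared BERT encoder followed by mean-pooling).
u = torch.randn(16, 768)   # embeddings of the first sentences
v = torch.randn(16, 768)   # embeddings of the second sentences

# 1) Classification objective: softmax over (u, v, |u - v|) with a trainable weight matrix.
num_labels = 3                                   # e.g. entailment / contradiction / neutral
W = torch.nn.Linear(3 * 768, num_labels)
logits = W(torch.cat([u, v, torch.abs(u - v)], dim=-1))
labels = torch.randint(0, num_labels, (16,))     # placeholder gold labels
classification_loss = F.cross_entropy(logits, labels)

# 2) Regression objective: cosine similarity trained with mean squared error
#    against gold similarity scores.
gold_scores = torch.rand(16)                     # placeholder gold similarities
cosine = F.cosine_similarity(u, v, dim=-1)
regression_loss = F.mse_loss(cosine, gold_scores)

# 3) Triplet objective: the anchor should be closer to the positive than to the
#    negative by at least a margin (Euclidean distance; margin value illustrative).
anchor, positive, negative = torch.randn(16, 768), torch.randn(16, 768), torch.randn(16, 768)
triplet_loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
```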
The pooling strategies compared were the CLS-token output, mean-pooling over token embeddings, and max-pooling, with mean-pooling generally yielding the best results.
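For mean-pooling, the token embeddings are averaged while padding tokens are masked out. A minimal sketch using the HuggingFace transformers library (model name and variable names are illustrative, not the paper's code) might look like this:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative mean-pooling over BERT token embeddings; padding tokens are
# excluded from the average via the attention mask.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["A man is playing a guitar.", "The weather is cold today."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state      # (batch, seq_len, hidden)

mask = batch["attention_mask"].unsqueeze(-1).float()          # (batch, seq_len, 1)
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)                               # (batch, hidden)
```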
Evaluation
The evaluation covered both unsupervised and supervised tasks across multiple datasets, including STS12-16, the STS benchmark, and SICK-R. The results show that SBERT significantly outperforms embeddings derived directly from BERT (averaging the output vectors or using the CLS token), as well as other sentence embedding methods such as InferSent and Universal Sentence Encoder.
Unsupervised STS Tasks
SBERT demonstrated superior performance on the unsupervised STS tasks, improving the average Spearman correlation by 11.7 points over InferSent and 5.5 points over Universal Sentence Encoder. Just as importantly, SBERT is dramatically more efficient: finding the most similar pair in a collection of 10,000 sentences drops from about 65 hours with BERT cross-encoding to roughly 5 seconds for computing the SBERT embeddings, plus a fraction of a second for the cosine-similarity comparison.
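The speed-up comes from encoding each sentence only once and performing the pairwise comparison directly on the embedding matrix. A sketch of that search step (sizes and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

# Assume the 10,000 sentences have already been encoded once by SBERT.
embeddings = torch.randn(10_000, 768)

# Normalize so that a matrix product yields cosine similarities.
normalized = F.normalize(embeddings, p=2, dim=1)
similarity = normalized @ normalized.T          # (10000, 10000) cosine-similarity matrix

# Ignore self-similarity on the diagonal, then find the most similar pair.
similarity.fill_diagonal_(-1.0)
best = torch.argmax(similarity)
i, j = divmod(best.item(), similarity.size(1))
print(f"Most similar pair: sentences {i} and {j} (score {similarity[i, j].item():.3f})")
```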
Supervised STS with STS Benchmark
For the supervised setting, fine-tuning SBERT on the STS benchmark training data achieved results comparable to or better than state-of-the-art methods, underscoring the effectiveness of SBERT's architecture across settings.
Specialized Evaluations
SBERT was also evaluated on more specialized datasets, the Argument Facet Similarity (AFS) corpus and the Wikipedia Sections Distinction dataset, further demonstrating its robustness and versatility:
- AFS Corpus: SBERT nearly matched the BERT cross-encoder in 10-fold cross-validation but dropped noticeably in the cross-topic evaluation, highlighting the difficulty of generalizing across diverse argument topics.
- Wikipedia Sections Distinction: trained with the triplet objective, SBERT achieved high accuracy, outperforming previous approaches and demonstrating its capacity for fine-grained semantic distinctions.
SentEval Toolkit
Using the SentEval toolkit, SBERT also performed strongly on a range of transfer tasks such as sentiment classification, surpassing InferSent and Universal Sentence Encoder on most of them. This demonstrates that SBERT's embeddings are useful well beyond semantic similarity.
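SentEval treats the sentence embeddings as fixed features and trains a simple logistic regression classifier on top of them. A minimal stand-in for that protocol, using scikit-learn and placeholder data rather than the actual toolkit, could look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder: in SentEval, these would be SBERT embeddings of a task's
# sentences and their class labels (e.g. sentiment polarity).
embeddings = np.random.randn(1000, 768)
labels = np.random.randint(0, 2, size=1000)

# The embeddings stay frozen; only the logistic regression classifier is trained.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, embeddings, labels, cv=5)
print(f"Mean accuracy: {scores.mean():.3f}")
```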
Computational Efficiency
Efficiency tests show that, on GPU with a smart batching strategy (grouping sentences of similar length to reduce padding), SBERT encodes sentences faster than both InferSent and Universal Sentence Encoder, while on CPU InferSent remains faster owing to its simpler network architecture.
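Smart batching here means grouping sentences of similar length so that each mini-batch is padded only to the length of its own longest sentence. A simplified sketch of the idea (the function name and length heuristic are illustrative, not the paper's implementation):

```python
def smart_batches(sentences, batch_size=32):
    """Group sentences of similar length to minimize padding per mini-batch.

    Simplified illustration: sort sentences by length, then slice the sorted
    order into batches so each batch contains similarly long sentences.
    """
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i].split()))
    for start in range(0, len(order), batch_size):
        batch_indices = order[start:start + batch_size]
        yield [sentences[i] for i in batch_indices]

# Example: sentences within each batch have similar lengths, so padding to the
# longest sentence in the batch wastes far less computation.
sentences = [
    "short one",
    "a slightly longer sentence here",
    "tiny",
    "another moderately sized example sentence",
]
for batch in smart_batches(sentences, batch_size=2):
    print(batch)
```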
Conclusion
SBERT represents a significant step in sentence embedding methods by combining the strengths of BERT with the computational efficiency of siamese and triplet network structures. SBERT delivers high-quality embeddings that can be efficiently computed and effectively applied across a broad range of NLP tasks. The work underscores not only the practical implications for large-scale semantic tasks but also the ongoing evolution of efficient NLP models.
Overall, SBERT delivers substantial improvements in both the quality and the scalability of sentence embeddings, making it a valuable tool for researchers and practitioners alike. Future work could refine and adapt SBERT to a broader set of NLP challenges and datasets.