Analysis of Sentence Embeddings from Pre-trained Language Models
The paper, "On the Sentence Embeddings from Pre-trained LLMs," examines the limitations of sentence embeddings derived from models like BERT, arguing that the semantic information within these embeddings is underutilized. The authors propose a new method, referred to as BERT-flow, to enhance the performance of sentence embeddings on semantic textual similarity (STS) tasks.
Problem Statement and Research Questions
Despite the success of BERT and similar models on a wide range of NLP tasks, their raw sentence embeddings perform poorly on semantic similarity tasks. This is surprising, since the training objectives of these models should, in principle, encode semantic relationships. The paper investigates two primary questions:
- Why do BERT's sentence embeddings perform poorly at retrieving semantically similar sentences?
- How can the latent semantic information within these embeddings be effectively utilized without labeled data?
Theoretical and Empirical Analysis
The authors first establish a theoretical connection between the masked language model (MLM) training objective and semantic similarity tasks. This connection suggests that BERT should capture semantic similarity, which conflicts with the empirical results. The analysis attributes the gap to the anisotropic embedding space that BERT produces: word frequency heavily biases the geometry of the space and low-frequency words are sparsely distributed, which distorts cosine-similarity comparisons.
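To make the anisotropy diagnosis concrete, one simple check (a common diagnostic, not taken from the paper itself) is the average pairwise cosine similarity among sentence embeddings: roughly isotropic vectors average near zero, while vectors squeezed into a narrow cone score much higher. A minimal NumPy sketch, assuming the embeddings have already been extracted:

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of embeddings.

    A value far above zero suggests the vectors occupy a narrow cone,
    i.e. the embedding space is anisotropic.
    """
    # L2-normalize each row so dot products become cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    # Exclude the diagonal (self-similarity of 1.0) from the average.
    off_diagonal = sims[~np.eye(n, dtype=bool)]
    return float(off_diagonal.mean())

# Example: random isotropic vectors average near 0; averaged BERT sentence
# embeddings would be expected to score noticeably higher.
rng = np.random.default_rng(0)
print(mean_pairwise_cosine(rng.normal(size=(100, 768))))
```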
Proposed Solution: BERT-flow
To address these issues, the authors introduce BERT-flow, which uses normalizing flows to map the anisotropic BERT sentence embeddings onto a smooth, isotropic standard Gaussian distribution. Normalizing flows are invertible transformations parameterized by neural networks; the flow is trained without supervision to maximize the likelihood of the embeddings under the Gaussian prior, while the BERT parameters remain fixed.
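The sketch below illustrates the affine-coupling building block that flow models in the Glow family stack to obtain an invertible mapping; it is not the paper's implementation, the class name and toy linear "networks" are illustrative, and the maximum-likelihood training step is omitted.

```python
import numpy as np

class AffineCoupling:
    """Minimal affine coupling layer: split the vector in half, leave one
    half unchanged, and apply an invertible scale-and-shift to the other
    half conditioned on the first. Stacking such layers yields a normalizing
    flow that can map embeddings toward a standard Gaussian."""

    def __init__(self, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        half = dim // 2
        # Toy "networks": single linear maps producing log-scale and shift.
        self.w_s = rng.normal(scale=0.01, size=(half, half))
        self.w_t = rng.normal(scale=0.01, size=(half, half))

    def forward(self, x: np.ndarray) -> np.ndarray:
        x1, x2 = np.split(x, 2, axis=-1)
        log_s, t = x1 @ self.w_s, x1 @ self.w_t
        y2 = x2 * np.exp(log_s) + t          # invertible elementwise affine
        return np.concatenate([x1, y2], axis=-1)

    def inverse(self, y: np.ndarray) -> np.ndarray:
        y1, y2 = np.split(y, 2, axis=-1)
        log_s, t = y1 @ self.w_s, y1 @ self.w_t
        x2 = (y2 - t) * np.exp(-log_s)       # exact inverse of the forward pass
        return np.concatenate([y1, x2], axis=-1)

# Round-trip check: the mapping is exactly invertible by construction.
layer = AffineCoupling(dim=768)
z = np.random.default_rng(1).normal(size=(4, 768))
assert np.allclose(layer.inverse(layer.forward(z)), z)
```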
Experimental Findings
Extensive experimental results demonstrate BERT-flow's superior performance over state-of-the-art sentence embeddings across various STS tasks:
- Without NLI supervision, BERT-flow substantially outperforms both raw BERT embeddings and previous baselines such as averaged GloVe embeddings.
- When NLI supervision is used, BERT-flow further improves performance, surpassing competitive models like Sentence-BERT and SRoBERTa in most cases.
The results indicate that BERT-flow effectively mitigates the detrimental impacts of anisotropy and captures semantic information more reliably.
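For context, STS benchmarks are conventionally scored with the Spearman correlation between model-predicted cosine similarities and human similarity ratings. A minimal evaluation sketch (the function name and inputs are illustrative, not from the paper):

```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(emb_a: np.ndarray, emb_b: np.ndarray, gold: np.ndarray) -> float:
    """Spearman correlation between cosine similarities of sentence-pair
    embeddings and human-annotated similarity scores, the standard STS metric."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    cosine = (a * b).sum(axis=1)
    rho, _ = spearmanr(cosine, gold)
    return float(rho)
```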
Lexical vs. Semantic Similarity
A notable finding is that similarity scores derived directly from BERT embeddings correlate strongly with lexical similarity rather than semantic similarity, particularly for sentence pairs with low edit distance. BERT-flow reduces this correlation, aligning its scores more closely with human-annotated semantic similarity.
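One illustrative way to probe this effect (not the paper's exact protocol) is to correlate a model's similarity scores with a simple lexical signal such as negated edit distance; a strong positive correlation suggests the embeddings mainly track surface overlap. The helper names below are hypothetical.

```python
import numpy as np

def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def lexical_vs_model_correlation(pairs, model_sims):
    """Correlation between negated edit distance and model similarity scores.
    A strongly positive value means the model mostly tracks surface overlap."""
    lexical = np.array([-edit_distance(a, b) for a, b in pairs])
    return float(np.corrcoef(lexical, np.asarray(model_sims))[0, 1])
```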
Implications and Future Directions
This research matters for any task that relies on semantic similarity assessments. BERT-flow offers a path toward better semantic representations without the need for extensive labeled data. Future work could explore further applications of flow-based models in other areas of NLP, examining how embedding spaces behave under such transformations.
Conclusion
The paper provides a thorough examination of the challenges surrounding sentence embeddings from pre-trained models. By correcting the anisotropy of the embedding space, the proposed BERT-flow method improves performance on semantic textual similarity tasks and sheds light on the latent capabilities of pre-trained language models.