Text Transformations in Contrastive Self-Supervised Learning: A Review
(2203.12000v2)
Published 22 Mar 2022 in cs.CL and cs.LG
Abstract: Contrastive self-supervised learning has become a prominent technique in representation learning. The main step in these methods is to contrast semantically similar and dissimilar pairs of samples. However, in the domain of NLP, the augmentation methods used in creating similar pairs with regard to contrastive learning (CL) assumptions are challenging. This is because, even simply modifying a word in the input might change the semantic meaning of the sentence, and hence, would violate the distributional hypothesis. In this review paper, we formalize the contrastive learning framework, emphasize the considerations that need to be addressed in the data transformation step, and review the state-of-the-art methods and evaluations for contrastive representation learning in NLP. Finally, we describe some challenges and potential directions for learning better text representations using contrastive methods.
This paper, "Text Transformations in Contrastive Self-Supervised Learning: A Review" (Bhattacharjee et al., 2022), provides a review of contrastive self-supervised learning (CL) methods in NLP, with a specific focus on the challenges and techniques related to data transformations (augmentations) and negative sampling for text.
The core idea of contrastive learning is to learn representations by pulling similar samples closer together in an embedding space and pushing dissimilar samples farther apart. In a typical setup, an anchor sample is paired with a positive sample (semantically similar) and several negative samples (semantically dissimilar). The paper highlights that applying this framework to text is challenging because even small modifications to the input text can drastically change its semantic meaning, violating the assumption that positive pairs are semantically invariant.
The review formalizes the CL framework for NLP and discusses the crucial steps of creating positive and negative samples. It emphasizes that unlike in image processing where transformations like cropping or rotation are more intuitive and tend to preserve semantic meaning, text transformations require careful consideration.
The paper categorizes text transformations used to generate positive pairs into three main types:
Input-Space Transformations: These are operations performed directly on the text tokens. Examples include span sampling (selecting adjacent, overlapping, or subsumed segments), word or span deletion, token reordering, and synonym substitution. Standard augmentation techniques like random insertion, swap, and deletion are also mentioned; a short code sketch of such operations appears after the three categories.
Latent-Space Transformations: These methods manipulate the text in an intermediate representation space. Techniques such as back-translation through an intermediate language, or using pretrained word-embedding models (e.g., word2vec or GloVe) to replace words with distributionally similar ones, fall into this category. Document-level CL can also use different views, such as the original document, its gold summary, and a generated summary, as positive pairs.
Transformations via Architecture and Combined Methods: This category includes methods that use architectural variations or combine different approaches. A notable example is using different dropout masks when encoding the same input twice to create positive pairs (e.g., SimCSE). Adversarial training techniques, which perturb the input to generate positive examples, and leveraging inference relations from NLI datasets (where entailment pairs are positive, and neutral/contradiction are negative) are also discussed.
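To make these categories concrete, here is a minimal Python sketch of two of the transformation styles above: simple input-space edits and SimCSE-style dropout views. The function names and the generic `encoder` are illustrative assumptions, not implementations from the papers reviewed.

```python
import random

def random_deletion(tokens, p=0.1):
    """Input-space transformation: drop each token with probability p."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]  # never return an empty sentence

def random_swap(tokens, n_swaps=1):
    """Input-space transformation: swap two randomly chosen token positions."""
    if len(tokens) < 2:
        return tokens
    tokens = tokens[:]
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def dropout_views(encoder, sentence):
    """SimCSE-style positive pair: encode the same input twice with dropout
    active, so the two embeddings differ only in their dropout masks.
    `encoder` is a placeholder for any dropout-regularized sentence encoder
    (assumed here to be a PyTorch-style module)."""
    encoder.train()          # keep dropout enabled during both forward passes
    z1 = encoder(sentence)
    z2 = encoder(sentence)   # second pass draws a different dropout mask
    return z1, z2
```

In practice such input-space edits are applied sparingly, since aggressive deletion or reordering can alter the meaning of the sentence and break the positive-pair assumption discussed above.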
Beyond positive pair generation, the paper stresses the importance of negative sampling. While negative samples are often sampled uniformly from a batch, this can lead to sub-optimal representations if false negatives (semantically similar samples incorrectly labeled as negative) are included. The concept of "hard negatives" – negative samples that are semantically similar to the anchor but belong to a different latent class – is crucial but challenging to mine in unsupervised settings. Techniques discussed to address this include increasing batch size (though this has memory costs and theoretical limitations), using heuristics based on similarity metrics, and methods like Approximate Nearest Neighbor Negative Contrastive Estimation (ANCE) which sample negatives from the entire dataset.
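As an illustration of the in-batch negative issue, the sketch below drops candidates that are highly similar to the anchor (likely false negatives) and keeps the most similar of the remainder as hard negatives. The similarity threshold and top-k heuristic are assumptions made for illustration; they are not the ANCE procedure itself, which retrieves negatives from the whole corpus via an approximate-nearest-neighbor index.

```python
import torch
import torch.nn.functional as F

def mine_in_batch_negatives(anchor, candidates, k=5, false_neg_threshold=0.9):
    """Pick hard negatives for one anchor from in-batch candidates.

    anchor:     (dim,) embedding of the anchor sentence
    candidates: (batch, dim) embeddings of other sentences in the batch
    Candidates whose cosine similarity exceeds `false_neg_threshold` are
    treated as likely false negatives and excluded; among the rest, the
    k most similar are returned as hard negatives.
    """
    sims = F.cosine_similarity(anchor.unsqueeze(0), candidates, dim=-1)
    keep = sims < false_neg_threshold                  # heuristic false-negative filter
    kept_sims = sims.masked_fill(~keep, float("-inf"))
    k = min(k, int(keep.sum()))
    hard_idx = kept_sims.topk(k).indices               # most similar of the remaining
    return candidates[hard_idx]
```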
The paper also reviews different contrastive loss functions used in the literature, starting from early forms for metric learning to those specifically adapted for contrastive representation learning:
Noise Contrastive Estimation (NCE): A general form aiming to distinguish positive pairs from negative ones using a similarity function (often cosine similarity with a temperature parameter); a code sketch of this form appears after the loss descriptions.
Triplet Loss: Minimizes the distance between an anchor and a positive while ensuring the distance between the anchor and a negative is greater than a margin. Its main challenge is hard negative mining.
Lifted Structure Loss: Addresses the hard negative mining issue by considering all positive and negative pairs within a batch, trying to find "difficult" negatives for a set of positive samples.
N-pairs Loss: Improves efficiency over triplet loss by computing the loss using a tuple of one anchor, one positive, and multiple negative samples simultaneously.
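For reference, here is a minimal PyTorch sketch of two of these objectives: the NCE-style loss with cosine similarity and temperature (the in-batch form often called NT-Xent or InfoNCE) and a standard triplet loss over precomputed embeddings. These follow the common formulations rather than any single paper's exact variant.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_a, z_b, temperature=0.05):
    """NCE-style in-batch loss: matching rows of z_a and z_b are positive
    pairs; all other rows in the batch serve as negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature                 # cosine similarities / tau
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, labels)

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Keep the anchor-negative distance at least `margin` larger than
    the anchor-positive distance."""
    d_pos = (anchor - positive).norm(dim=-1)
    d_neg = (anchor - negative).norm(dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()
```

PyTorch also ships torch.nn.TripletMarginLoss with the same margin formulation; the hard negative mining noted above must still be handled separately.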
For evaluating unsupervised CL representations beyond downstream task performance, the paper introduces two metrics proposed by Wang and Isola (2020):
Alignment: Measures how close the representations of positive pairs lie on the unit hypersphere. Ideally, the distance between positive-pair embeddings should be small.
Uniformity: Measures how uniformly the learned representations are distributed on the hypersphere. A uniform distribution is desired to preserve the information content of the data.
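Both metrics have simple closed forms over L2-normalized embeddings: alignment is the mean distance between positive-pair embeddings raised to a power, and uniformity is the log of the mean Gaussian potential over all pairs. A sketch with the exponents commonly used in Wang and Isola (2020), alpha = 2 and t = 2, is shown below.

```python
import torch
import torch.nn.functional as F

def alignment(z_a, z_b, alpha=2):
    """Alignment: E[||f(x) - f(x+)||^alpha] over positive pairs (lower is better).
    z_a, z_b: (n, dim) embeddings of positive pairs."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    return (z_a - z_b).norm(dim=-1).pow(alpha).mean()

def uniformity(z, t=2):
    """Uniformity: log E[exp(-t * ||f(x) - f(y)||^2)] over all pairs (lower is better).
    z: (n, dim) embeddings of a representative sample of the data."""
    z = F.normalize(z, dim=-1)
    sq_dists = F.pdist(z, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()
```

Lower values are better for both, and they are usually reported together, since either one is easy to optimize in isolation (e.g., collapsing all embeddings to a point gives perfect alignment).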
Finally, the review identifies several key challenges and open problems in text CL:
Selecting Good Transformation Functions: Choosing augmentations that preserve semantic meaning and are appropriate for the downstream task and language remains difficult.
Negative Samples and Sampling Bias: Mitigating the risk of sampling false negatives in unsupervised settings is an ongoing challenge.
Counterfactually-Augmented Data: Exploring the potential of using counterfactual examples, which estimate latent classes, to generate plausible positive and negative samples while adhering to CL assumptions.
Euclidean vs. non-Euclidean Spaces: Determining whether Euclidean, hyperbolic (e.g., for hierarchical data), or spherical embedding spaces are best suited for capturing the natural representation of text semantic relationships.
In summary, the paper provides a comprehensive overview of the CL framework applied to text, detailing various data transformation and negative sampling strategies, loss functions, and evaluation metrics. It critically discusses the unique challenges posed by text data compared to other domains like images and highlights promising directions for future research to improve self-supervised text representation learning.