DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning (2401.13621v1)
Abstract: Contrastive-learning-based methods have dominated sentence representation learning. These methods regularize the representation space by pulling similar sentence representations closer and pushing dissimilar ones apart, and have proven effective in various NLP tasks, e.g., semantic textual similarity (STS). However, it is challenging for these methods to learn fine-grained semantics because they learn only from the inter-sentence perspective, i.e., their supervision signal comes from the relationships between data samples. In this work, we propose a novel denoising objective that learns from a complementary perspective, the intra-sentence perspective. By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form. Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks, comparing favorably with contrastive-learning-based methods. Notably, the proposed intra-sentence denoising objective complements existing inter-sentence contrastive methodologies and can be integrated with them to further enhance performance. Our code is available at https://github.com/xinghaow99/DenoSent.
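To make the training signal concrete, the sketch below illustrates one way such an intra-sentence denoising objective can be set up in PyTorch: tokens are corrupted with discrete noise (random deletion) and the encoder inputs with continuous noise (Gaussian perturbation of embeddings), and a decoder that sees only the pooled sentence vector is trained to reconstruct the original tokens. All components here (the toy Transformer sizes, the noise rates, the pooling choice) are illustrative assumptions rather than the paper's actual architecture; the authors' implementation is in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def add_discrete_noise(token_ids, del_prob=0.15, pad_id=0):
    """Discrete noise: randomly delete tokens, then pad back to the original length."""
    noisy = []
    for seq in token_ids:
        keep = torch.rand(seq.size(0)) > del_prob
        keep[0] = True  # never delete every token
        kept = seq[keep]
        noisy.append(F.pad(kept, (0, seq.size(0) - kept.size(0)), value=pad_id))
    return torch.stack(noisy)

class DenoisingSentenceModel(nn.Module):
    """Toy encoder-decoder: the decoder reconstructs the original sentence
    conditioned only on a single pooled sentence vector."""
    def __init__(self, vocab_size=1000, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def encode(self, token_ids, noise_std=0.0):
        x = self.embed(token_ids)  # positional encodings omitted for brevity
        if noise_std > 0:
            # Continuous noise: Gaussian perturbation of the input embeddings.
            x = x + noise_std * torch.randn_like(x)
        return self.encoder(x).mean(dim=1)  # mean-pooled sentence representation

    def forward(self, noisy_ids, target_ids, noise_std=0.1, bos_id=1):
        sent_vec = self.encode(noisy_ids, noise_std=noise_std)
        memory = sent_vec.unsqueeze(1)  # decoder cross-attends only to this vector
        # Shift targets right so the decoder predicts each original token.
        dec_in = torch.cat([torch.full_like(target_ids[:, :1], bos_id),
                            target_ids[:, :-1]], dim=1)
        causal = torch.triu(torch.full((dec_in.size(1), dec_in.size(1)),
                                       float('-inf')), diagonal=1)
        dec = self.decoder(self.embed(dec_in), memory, tgt_mask=causal)
        logits = self.lm_head(dec)
        # Denoising objective: cross-entropy against the uncorrupted tokens.
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               target_ids.reshape(-1))

# Usage with random token ids standing in for a tokenized batch.
tokens = torch.randint(2, 1000, (8, 32))
model = DenoisingSentenceModel()
loss = model(add_discrete_noise(tokens), tokens)
loss.backward()
```

In practice a loss of this kind would be added to, rather than replace, an inter-sentence contrastive objective such as the InfoNCE loss used by SimCSE-style methods, which is the kind of combination the abstract describes.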
- Xinghao Wang
- Junliang He
- Pengyu Wang
- Yunhua Zhou
- Tianxiang Sun
- Xipeng Qiu