A Comprehensive Survey of Sentence Representations: From the BERT Epoch to the ChatGPT Era and Beyond (2305.12641v3)
Abstract: Sentence representations are a critical component of NLP applications such as retrieval, question answering, and text classification. They capture the meaning of a sentence in a form that machines can use to understand and reason over human language. In recent years, significant progress has been made in learning sentence representations through unsupervised, supervised, and transfer learning approaches; however, no comprehensive literature review of sentence representations has been published to date. In this paper, we provide an overview of methods for sentence representation learning, focusing mostly on deep learning models. We organize the literature systematically, highlighting the key contributions and open challenges in this area. Overall, our review underscores the importance of sentence representations in natural language processing, the progress made in learning them, and the challenges that remain. We conclude with directions for future research, suggesting potential avenues for improving the quality and efficiency of sentence representations.
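To make the setting concrete, below is a minimal sketch of how sentence representations are typically used in practice: sentences are encoded into fixed-size dense vectors, and semantic relatedness is scored by cosine similarity between them. The sketch uses the open-source sentence-transformers library (an implementation of the Sentence-BERT bi-encoder approach surveyed here); the specific checkpoint named is one popular public model chosen for illustration, not one prescribed by this survey.

```python
# Minimal sketch: encode sentences into fixed-size vectors and compare
# them with cosine similarity. Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# "all-MiniLM-L6-v2" is an illustrative public checkpoint, not a model
# prescribed by the survey.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "A man is eating food.",
    "Someone is consuming a meal.",
    "The stock market fell sharply today.",
]
embeddings = model.encode(sentences)  # one dense vector per sentence

# Semantically similar sentences should score higher than unrelated ones.
print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # low similarity
```

Because each sentence is embedded independently, vectors can be precomputed and indexed, which is what makes this bi-encoder design practical for large-scale retrieval compared to scoring every sentence pair jointly.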