
DefSent+: Improving sentence embeddings of language models by projecting definition sentences into a quasi-isotropic or isotropic vector space of unlimited dictionary entries (2405.16153v4)

Published 25 May 2024 in cs.CL, cs.AI, and cs.LG

Abstract: This paper presents a significant improvement on the previous conference paper known as DefSent. The prior study seeks to improve sentence embeddings of language models by projecting definition sentences into the vector space of dictionary entries. We discover that this approach is not fully explored due to a methodological limitation: using the word embeddings of language models to represent dictionary entries. This leads to two hindrances. First, dictionary entries are constrained by the single-word vocabulary and thus cannot be fully exploited. Second, semantic representations of language models are known to be anisotropic, but pre-processing word embeddings for DefSent is not possible because their weights are frozen during training and tied to the prediction layer. In this paper, we propose a novel method to progressively build entry embeddings that are not subject to these limitations. As a result, definition sentences can be projected into a quasi-isotropic or isotropic vector space of unlimited dictionary entries, so that sentence embeddings of noticeably better quality are attainable. We abbreviate our approach as DefSent+ (a plus version of DefSent), which offers the following strengths: 1) task performance on measuring sentence similarities is significantly improved compared to DefSent; 2) when DefSent+ is used to further train data-augmented models like SimCSE, SNCSE, and SynCSE, state-of-the-art performance on measuring sentence similarities can be achieved among approaches that do not use manually labeled datasets; 3) DefSent+ is also competitive in feature-based transfer for NLP downstream tasks.
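The anisotropy the abstract refers to is commonly quantified as the average pairwise cosine similarity of an embedding set: values near 0 suggest an (approximately) isotropic space, while values near 1 indicate vectors crowded into a narrow cone. A minimal sketch of that measurement (not code from the paper; the synthetic data and function name are illustrative):

```python
import numpy as np

def average_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of row vectors.

    Near 0 -> roughly isotropic; near 1 -> strongly anisotropic.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = embeddings.shape[0]
    # Average off-diagonal entries (exclude each vector's self-similarity).
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

rng = np.random.default_rng(0)
# Zero-mean Gaussian vectors point in uniformly spread directions.
isotropic = rng.standard_normal((500, 64))
# A large shared offset squeezes all vectors into a narrow cone.
anisotropic = isotropic + 5.0

print(average_pairwise_cosine(isotropic))    # close to 0
print(average_pairwise_cosine(anisotropic))  # close to 1
```

Under this metric, the paper's claim amounts to building entry embeddings whose average pairwise similarity stays low (quasi-isotropic or isotropic) rather than inheriting the anisotropy of the frozen word-embedding matrix.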

References (40)
  1. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability, in: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Association for Computational Linguistics, Denver, Colorado. pp. 252–263. URL: https://aclanthology.org/S15-2045, doi:10.18653/v1/S15-2045.
  2. SemEval-2014 task 10: Multilingual semantic textual similarity, in: Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Association for Computational Linguistics, Dublin, Ireland. pp. 81–91. URL: https://aclanthology.org/S14-2010, doi:10.3115/v1/S14-2010.
  3. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation, in: Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), Association for Computational Linguistics, San Diego, California. pp. 497–511. URL: https://aclanthology.org/S16-1081, doi:10.18653/v1/S16-1081.
  4. *SEM 2013 shared task: Semantic textual similarity, in: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, Association for Computational Linguistics, Atlanta, Georgia, USA. pp. 32–43. URL: https://aclanthology.org/S13-1004.
  5. SemEval-2012 task 6: A pilot on semantic textual similarity, in: Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, Association for Computational Linguistics, USA. pp. 385–393.
  6. A large annotated corpus for learning natural language inference, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal. pp. 632–642. URL: https://aclanthology.org/D15-1075, doi:10.18653/v1/D15-1075.
  7. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation, in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association for Computational Linguistics, Vancouver, Canada. pp. 1–14. URL: https://aclanthology.org/S17-2001, doi:10.18653/v1/S17-2001.
  8. Universal sentence encoder for English, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Brussels, Belgium. pp. 169–174. URL: https://aclanthology.org/D18-2029, doi:10.18653/v1/D18-2029.
  9. SentEval: An evaluation toolkit for universal sentence representations, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan. URL: https://aclanthology.org/L18-1269.
  10. Supervised learning of universal sentence representations from natural language inference data, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark. pp. 670–680. URL: https://aclanthology.org/D17-1070, doi:10.18653/v1/D17-1070.
  11. BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota. pp. 4171–4186. URL: https://aclanthology.org/N19-1423, doi:10.18653/v1/N19-1423.
  12. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings, in: Inui, K., Jiang, J., Ng, V., Wan, X. (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China. pp. 55–65. URL: https://aclanthology.org/D19-1006, doi:10.18653/v1/D19-1006.
  13. Representation degeneration problem in training natural language generation models, in: International Conference on Learning Representations. URL: https://openreview.net/forum?id=SkEYojRqtm.
  14. SimCSE: Simple contrastive learning of sentence embeddings, in: Moens, M.F., Huang, X., Specia, L., Yih, S.W.t. (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic. pp. 6894–6910. URL: https://aclanthology.org/2021.emnlp-main.552, doi:10.18653/v1/2021.emnlp-main.552.
  15. Independent component analysis: algorithms and applications. Neural Networks 13, 411–430. URL: https://www.sciencedirect.com/science/article/pii/S0893608000000265, doi:https://doi.org/10.1016/S0893-6080(00)00026-5.
  16. Learning to describe unknown phrases with local and global contexts, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota. pp. 3467–3476. URL: https://aclanthology.org/N19-1350, doi:10.18653/v1/N19-1350.
  17. PromptBERT: Improving BERT sentence embeddings with prompts, in: Goldberg, Y., Kozareva, Z., Zhang, Y. (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates. pp. 8826–8837. URL: https://aclanthology.org/2022.emnlp-main.603, doi:10.18653/v1/2022.emnlp-main.603.
  18. Skip-thought vectors, in: Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R. (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc. URL: https://proceedings.neurips.cc/paper_files/paper/2015/file/f442d33fa06832082290ad8544a8da27-Paper.pdf.
  19. On the sentence embeddings from pre-trained language models, in: Webber, B., Cohn, T., He, Y., Liu, Y. (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online. pp. 9119–9130. URL: https://aclanthology.org/2020.emnlp-main.733, doi:10.18653/v1/2020.emnlp-main.733.
  20. Parameter-efficient feature-based transfer for paraphrase identification. Natural Language Engineering 29, 1066–1096. doi:10.1017/S135132492200050X.
  21. RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. URL: http://arxiv.org/abs/1907.11692, arXiv:1907.11692.
  22. A SICK cure for the evaluation of compositional distributional semantic models, in: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), European Language Resources Association (ELRA), Reykjavik, Iceland. pp. 216–223. URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/363_Paper.pdf.
  23. WordNet: A lexical database for English. Commun. ACM 38, 39–41. URL: https://doi.org/10.1145/219717.219748, doi:10.1145/219717.219748.
  24. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 .
  25. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830. URL: http://jmlr.org/papers/v12/pedregosa11a.html.
  26. GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar. pp. 1532–1543. URL: https://aclanthology.org/D14-1162, doi:10.3115/v1/D14-1162.
  27. A comprehensive survey of sentence representations: From the BERT epoch to the ChatGPT era and beyond, in: Graham, Y., Purver, M. (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, St. Julian's, Malta. pp. 1738–1751. URL: https://aclanthology.org/2024.eacl-long.104.
  28. Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China. pp. 3982–3992. URL: https://aclanthology.org/D19-1410, doi:10.18653/v1/D19-1410.
  29. Ranking-enhanced unsupervised sentence representation learning, in: Rogers, A., Boyd-Graber, J., Okazaki, N. (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada. pp. 15783–15798. URL: https://aclanthology.org/2023.acl-long.879, doi:10.18653/v1/2023.acl-long.879.
  30. Independent component analysis: an introduction. Trends in Cognitive Sciences 6, 59–64. URL: https://www.sciencedirect.com/science/article/pii/S1364661300018131, doi:https://doi.org/10.1016/S1364-6613(00)01813-1.
  31. Whitening sentence representations for better semantics and faster retrieval. CoRR abs/2103.15316. URL: https://arxiv.org/abs/2103.15316, arXiv:2103.15316.
  32. Independent component analysis: An introduction. Applied Computing and Informatics 17, 222–249.
  33. DefSent: Sentence embeddings using definition sentences, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Online. pp. 411–418. URL: https://aclanthology.org/2021.acl-short.52, doi:10.18653/v1/2021.acl-short.52.
  34. From frequency to meaning: Vector space models of semantics. J. Artif. Int. Res. 37, 141–188.
  35. Attention is all you need, in: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (Eds.), Advances in Neural Information Processing Systems, Curran Associates, Inc. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
  36. SNCSE: Contrastive learning for unsupervised sentence embedding with soft negative samples, in: Advanced Intelligent Computing Technology and Applications: 19th International Conference, ICIC 2023, Zhengzhou, China, August 10–13, 2023, Proceedings, Part IV, Springer-Verlag, Berlin, Heidelberg. pp. 419–431. URL: https://doi.org/10.1007/978-981-99-4752-2_35, doi:10.1007/978-981-99-4752-2_35.
  37. Improving neural language generation with spectrum control, in: International Conference on Learning Representations. URL: https://openreview.net/forum?id=ByxY8CNtvr.
  38. ESimCSE: Enhanced sample building method for contrastive learning of unsupervised sentence embedding, in: Calzolari, N., Huang, C.R., Kim, H., Pustejovsky, J., Wanner, L., Choi, K.S., Ryu, P.M., Chen, H.H., Donatelli, L., Ji, H., Kurohashi, S., Paggio, P., Xue, N., Kim, S., Hahm, Y., He, Z., Lee, T.K., Santus, E., Bond, F., Na, S.H. (Eds.), Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea. pp. 3898–3907. URL: https://aclanthology.org/2022.coling-1.342.
  39. Discovering universal geometry in embeddings with ICA, in: Bouamor, H., Pino, J., Bali, K. (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore. pp. 4647–4675. URL: https://aclanthology.org/2023.emnlp-main.283, doi:10.18653/v1/2023.emnlp-main.283.
  40. Learning semantic textual similarity from conversations, in: Proceedings of the Third Workshop on Representation Learning for NLP, Association for Computational Linguistics, Melbourne, Australia. pp. 164–174. URL: https://aclanthology.org/W18-3022, doi:10.18653/v1/W18-3022.
