Improving Sentence Embeddings with Automatic Generation of Training Data Using Few-shot Examples (2402.15132v2)

Published 23 Feb 2024 in cs.CL and cs.LG

Abstract: Decoder-based LLMs have shown high performance on many tasks in natural language processing. This is also true for sentence embedding learning, where a decoder-based model, PromptEOL, has achieved the best performance on semantic textual similarity (STS) tasks. However, PromptEOL requires a manually annotated natural language inference (NLI) dataset for fine-tuning. We aim to improve sentence embeddings without using large manually annotated datasets by automatically generating an NLI dataset with an LLM and using it for fine-tuning of PromptEOL. To achieve this, we explore methods of data generation suitable for sentence embedding learning in this study. Specifically, we will focus on automatic dataset generation through few-shot learning and explore the appropriate methods to leverage few-shot examples. Experimental results on the STS tasks demonstrate that our approach outperforms existing models in settings without large manually annotated datasets.

Enhancing Sentence Embeddings via Automatically Generated NLI Datasets

Introduction

Research on learning high-quality sentence embeddings has produced a range of methodologies, most notably the fine-tuning of pre-trained LLMs. Historically, encoder-based models such as Sentence-BERT and PromptBERT have taken center stage. More recently, however, decoder-based LLMs have shown promising results on tasks across the NLP spectrum, including semantic textual similarity (STS). A significant step forward came with PromptEOL, which prompts a model to express an entire sentence's meaning as a single word. Despite its superior performance on STS tasks, PromptEOL depends on large, manually annotated natural language inference (NLI) datasets for fine-tuning. Addressing this limitation, the paper introduces a method that generates NLI data automatically with an LLM and uses it to fine-tune PromptEOL, improving sentence embeddings without extensive manual annotation.

PromptEOL: A Focused Analysis

PromptEOL distinguishes itself by using prompts to extract sentence embeddings from decoder-based LLMs. A specially crafted prompt asks the model to condense the semantics of an entire sentence into a single word, exploiting the next-token-prediction objective on which these models are pre-trained: the hidden state at the position where that word would be generated serves as the sentence embedding. Fine-tuning on NLI datasets further refines the embeddings by emphasizing entailment and contradiction relations, which are foundational to semantically rich representations.
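The following is a minimal sketch of this extraction step, assuming the Hugging Face transformers library and the Llama-2-7b checkpoint used in the paper; the prompt template follows the published PromptEOL formulation, while details such as dtype and tokenizer settings are illustrative.

```python
# Sketch of PromptEOL-style embedding extraction (assumes the Hugging Face
# `transformers` library and access to the Llama-2-7b checkpoint).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # model used in the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

def prompteol_embedding(sentence: str) -> torch.Tensor:
    # The prompt asks the model to compress the sentence into one word;
    # the hidden state that would predict that word is taken as the embedding.
    prompt = f'This sentence : "{sentence}" means in one word : "'
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Last layer, final token position.
    return outputs.hidden_states[-1][0, -1]

embedding = prompteol_embedding("A man is playing a guitar.")
```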

Automatic NLI Dataset Generation

The cornerstone of this research is its method for automatic NLI dataset generation. Simple prompts instruct an LLM to transform premise sentences into hypotheses labeled entailment or contradiction, bypassing the extensive manual annotation effort. To raise the quality of the generated hypotheses, the paper incorporates few-shot learning, scaling dataset generation from 0-shot up to 20-shot prompting, with the 20-shot setting matching the quality of manually curated datasets.
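A minimal sketch of this generation step is shown below. It assumes a generic text-generation function generate(prompt) backed by an LLM; the prompt wording and the few-shot example pair are illustrative placeholders, not the paper's verbatim prompts.

```python
# Sketch of few-shot NLI data generation. `generate` is a placeholder for
# any LLM text-generation call; prompts and examples are illustrative.

FEW_SHOT_EXAMPLES = [
    # (premise, entailment hypothesis, contradiction hypothesis)
    ("A soccer game with multiple males playing.",
     "Some men are playing a sport.",
     "Nobody is playing soccer."),
    # ... more human-written pairs, up to 20 in the paper's best setting
]

def build_prompt(premise: str, label: str) -> str:
    """Assemble a few-shot prompt asking for a hypothesis with `label`."""
    blocks = []
    for p, ent, con in FEW_SHOT_EXAMPLES:
        hyp = ent if label == "entailment" else con
        blocks.append(
            f'Premise: "{p}"\n'
            f'Write a sentence that is an {label} of the premise: "{hyp}"'
        )
    blocks.append(
        f'Premise: "{premise}"\n'
        f'Write a sentence that is an {label} of the premise: "'
    )
    return "\n\n".join(blocks)

# For each unlabeled premise, generate one positive and one hard negative:
#   hypothesis_pos = generate(build_prompt(premise, "entailment"))
#   hypothesis_neg = generate(build_prompt(premise, "contradiction"))
```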

Empirical Evaluation

The paper's empirical investigation demonstrates the quality of the generated NLI dataset and its efficacy for training PromptEOL on STS tasks. Fine-tuned with data from 20-shot generation, the model rivaled scores obtained with manually annotated datasets, underscoring the potential of automatically generated NLI data for learning high-quality sentence embeddings. Notably, the model achieved an average Spearman's rank correlation coefficient of 82.21 across the STS benchmarks, demonstrating the effectiveness of this methodology over existing unsupervised approaches and setting a precedent for the use of generated NLI datasets in sentence embedding learning.
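For reference, STS evaluation reduces to correlating model similarities with human ratings. A minimal sketch, assuming sentence pairs with gold similarity scores from an STS benchmark, the prompteol_embedding function sketched above, and SciPy for the correlation:

```python
# Sketch of STS evaluation: Spearman correlation between model cosine
# similarities and human ratings. Reuses `prompteol_embedding` from the
# sketch above; `pairs` and `gold` are placeholders for benchmark data.
import torch.nn.functional as F
from scipy.stats import spearmanr

def evaluate_sts(pairs, gold_scores):
    """Return Spearman's rank correlation over sentence pairs."""
    sims = []
    for s1, s2 in pairs:
        e1, e2 = prompteol_embedding(s1), prompteol_embedding(s2)
        sims.append(F.cosine_similarity(e1, e2, dim=0).item())
    return spearmanr(sims, gold_scores).correlation

# pairs = [("A man is playing a guitar.", "Someone plays guitar."), ...]
# gold  = [4.2, ...]   # human similarity ratings, typically on a 0-5 scale
# print(evaluate_sts(pairs, gold) * 100)  # the paper reports x100, e.g. 82.21
```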

Conclusion and Prospects for Future Work

The proposed framework offers a novel pathway to obtain sentence embeddings by leveraging automatically generated NLI datasets, significantly reducing the dependency on large, manually annotated corpora. While the results on STS tasks are promising, the paper also acknowledges limitations, including the exclusive use of the Llama-2-7b model and the focus on English. Future explorations could extend to other LLMs and languages to broaden the applicability and utility of this approach.

This research points toward more efficient sentence embedding methodologies that can potentially adapt to various languages and models, promising an exciting avenue for further exploration in natural language processing.

References (25)
  1. SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 385–393.
  2. *SEM 2013 shared task: Semantic Textual Similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM 2013), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 32–43.
  3. SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 252–263.
  4. SemEval-2014 Task 10: Multilingual Semantic Textual Similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 81–91.
  5. SemEval-2016 Task 1: Semantic Textual Similarity, Monolingual and Cross-Lingual Evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016), pages 497–511.
  6. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), pages 632–642.
  7. Language models are few-shot learners. arXiv:2005.14165.
  8. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval 2017), pages 1–14.
  9. Alexis Conneau and Douwe Kiela. 2018. SentEval: An Evaluation Toolkit for Universal Sentence Representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 1699–1704.
  10. XNLI: Evaluating Cross-lingual Sentence Representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), pages 2475–2485.
  11. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2019), pages 4171–4186.
  12. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), pages 6894–6910.
  13. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. In International Conference on Learning Representations (ICLR 2021).
  14. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288.
  15. Unsupervised Dense Information Retrieval with Contrastive Learning. Transactions on Machine Learning Research (TMLR 2022).
  16. Scaling Sentence Embeddings with Large Language Models. arXiv:2307.16645.
  17. PromptBERT: Improving BERT Sentence Embeddings with Prompts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), pages 8826–8837.
  18. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pages 6769–6781.
  19. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pages 216–223.
  20. Niklas Muennighoff. 2022. SGPT: GPT Sentence Embeddings for Semantic Search. arXiv preprint arXiv:2202.08904.
  21. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683.
  22. Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1864–1874.
  23. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), pages 3982–3992.
  24. DefSent: Sentence Embeddings using Definition Sentences. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL 2021) and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 411–418.
  25. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2018), pages 1112–1122.
Authors (4)
  1. Soma Sato
  2. Hayato Tsukagoshi
  3. Ryohei Sasano
  4. Koichi Takeda