
2D Matryoshka Sentence Embeddings (2402.14776v3)

Published 22 Feb 2024 in cs.CL and cs.LG

Abstract: Common approaches rely on fixed-length embedding vectors from LLMs as sentence embeddings for downstream tasks such as semantic textual similarity (STS). Such methods are limited in their flexibility due to unknown computational constraints and budgets across various applications. Matryoshka Representation Learning (MRL) (Kusupati et al., 2022) encodes information at finer granularities, i.e., with lower embedding dimensions, to adaptively accommodate ad hoc tasks. Similar accuracy can be achieved with a smaller embedding size, leading to speedups in downstream tasks. Despite its improved efficiency, MRL still requires traversing all Transformer layers before obtaining the embedding, which remains the dominant factor in time and memory consumption. This prompts consideration of whether the fixed number of Transformer layers affects representation quality and whether using intermediate layers for sentence representation is feasible. In this paper, we introduce a novel sentence embedding model called Two-dimensional Matryoshka Sentence Embedding (2DMSE); our code is available at https://github.com/SeanLee97/AnglE/blob/main/README_2DMSE.md. It supports elastic settings for both embedding sizes and Transformer layers, offering greater flexibility and efficiency than MRL. We conduct extensive experiments on STS tasks and downstream applications. The experimental results demonstrate the effectiveness of our proposed model in dynamically supporting different embedding sizes and Transformer layers, allowing it to be highly adaptable to various scenarios.

References (34)
  1. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 252–263, Denver, Colorado. Association for Computational Linguistics.
  2. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 81–91, Dublin, Ireland. Association for Computational Linguistics.
  3. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511, San Diego, California. Association for Computational Linguistics.
  4. SemEval-2012 task 6: A pilot on semantic textual similarity. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 385–393, Montréal, Canada. Association for Computational Linguistics.
  5. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 32–43, Atlanta, Georgia, USA. Association for Computational Linguistics.
  6. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
  7. Semantic re-tuning with contrastive tension. In International conference on learning representations.
  8. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.
  9. Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 169–174, Brussels, Belgium. Association for Computational Linguistics.
  10. DiffCSE: Difference-based contrastive learning for sentence embeddings. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4207–4218, Seattle, United States. Association for Computational Linguistics.
  11. Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
  12. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.
  13. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.
  14. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910. Association for Computational Linguistics.
  15. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.
  16. DeCLUTR: Deep contrastive learning for unsupervised textual representations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 879–895, Online. Association for Computational Linguistics.
  17. Jay Hegdé. 2008. Time course of visual perception: coarse-to-fine processing and beyond. Progress in neurobiology, 84(4):405–439.
  18. Improved universal sentence embeddings with prompt-based contrastive learning and energy-based learning. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 3021–3035. Association for Computational Linguistics.
  19. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
  20. Matryoshka representation learning. In Advances in Neural Information Processing Systems, volume 35, pages 30233–30249. Curran Associates, Inc.
  21. Xianming Li and Jing Li. 2023a. Angle-optimized text embeddings. arXiv preprint arXiv:2309.12871.
  22. Xianming Li and Jing Li. 2023b. DeeLM: Dependency-enhanced large language model for sentence embeddings. arXiv preprint arXiv:2311.05296.
  23. Xianming Li and Jing Li. 2024. Generative deduplication for social media data selection. arXiv preprint arXiv:2401.05883.
  24. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 216–223, Reykjavik, Iceland. European Language Resources Association (ELRA).
  25. Distributed representations of words and phrases and their compositionality. In 27th Annual Conference on Neural Information Processing Systems 2013., pages 3111–3119.
  26. OpenAI. 2022. Introducing ChatGPT.
  27. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 3980–3990. Association for Computational Linguistics.
  28. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  29. Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368.
  30. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
  31. DistillCSE: Distilled contrastive learning for sentence embeddings. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8153–8165. Association for Computational Linguistics.
  32. ConSERT: A contrastive framework for self-supervised sentence representation transfer. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 5065–5075. Association for Computational Linguistics.
  33. An unsupervised sentence embedding method by mutual information maximization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1601–1610, Online. Association for Computational Linguistics.
  34. WhitenedCSE: Whitening-based contrastive learning of sentence embeddings. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12135–12148, Toronto, Canada. Association for Computational Linguistics.
Authors (5)
  1. Xianming Li (11 papers)
  2. Zongxi Li (10 papers)
  3. Jing Li (621 papers)
  4. Haoran Xie (106 papers)
  5. Qing Li (430 papers)
Citations (3)

Summary

Enhancing Sentence Embedding Flexibility with 2D Matryoshka Sentence Embeddings

Introduction to 2D Matryoshka Sentence Embeddings (2DMSE)

Two-dimensional Matryoshka Sentence Embeddings (2DMSE) extend the principles of Matryoshka Representation Learning (MRL) along a second axis: in addition to varying the embedding dimension, the framework exploits the depth of the Transformer, allowing intermediate layers to serve as sentence encoders. The result is a single model whose embeddings remain efficient at reduced depth and dimension while preserving strong semantic accuracy across standard benchmarks.
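To make the nested-dimension (Matryoshka) idea concrete, the sketch below shows how an embedding trained with nested objectives can be truncated to a shorter prefix and still used for cosine-similarity scoring. This is a minimal illustration, not the authors' code; the dimension list and the toy tensors are hypothetical.

```python
import torch
import torch.nn.functional as F

# Matryoshka-style embeddings are trained so that a prefix of the full vector is
# itself a usable embedding; downstream code can pick a prefix size per budget.
NESTED_DIMS = [64, 128, 256, 768]  # hypothetical granularities

def nested_cosine_scores(emb_a: torch.Tensor, emb_b: torch.Tensor) -> dict:
    """Cosine similarity of two batches of sentence embeddings at each nested prefix size."""
    scores = {}
    for d in NESTED_DIMS:
        a = F.normalize(emb_a[:, :d], dim=-1)  # truncate, then re-normalize
        b = F.normalize(emb_b[:, :d], dim=-1)
        scores[d] = (a * b).sum(dim=-1)        # per-pair cosine similarity
    return scores

# Usage: with full 768-d embeddings from any encoder, a 64-d prefix already yields a score.
a, b = torch.randn(4, 768), torch.randn(4, 768)
print({d: s.shape for d, s in nested_cosine_scores(a, b).items()})
```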

Technical Overview

2DMSE trains sentence embeddings with elastic settings along two axes: model depth and embedding dimension. This two-dimensional scalability improves computational efficiency while keeping the embeddings adaptable. During training, each step randomly samples one Transformer layer; the embeddings from that layer and from the last layer are both optimized with the training objective. An alignment term then minimizes the Kullback-Leibler divergence between the sampled layer's embeddings and the last layer's, encouraging consistent semantic representations across the model's depth.
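The sketch below illustrates one plausible training step matching the description above. It is an assumption-laden illustration, not the released implementation: `encode_all_layers` and `sts_loss` are hypothetical stand-ins for the model's per-layer pooled embeddings and its supervised embedding loss, and the KL term is written over in-batch similarity distributions as one reading of the alignment step.

```python
import random
import torch
import torch.nn.functional as F

def two_d_mse_step(encode_all_layers, sts_loss, batch, n_layers: int,
                   kl_weight: float = 1.0) -> torch.Tensor:
    """One illustrative 2DMSE-style training step (sketch, not the official code)."""
    layer_embs = encode_all_layers(batch)      # list of [batch, dim] pooled embeddings
    last = layer_embs[n_layers - 1]            # full-depth embeddings
    k = random.randrange(n_layers - 1)         # randomly sampled shallower layer
    shallow = layer_embs[k]

    # Optimize both the last-layer and the sampled-layer embeddings on the task loss.
    loss = sts_loss(last, batch) + sts_loss(shallow, batch)

    # Align the shallow layer to the last layer: KL divergence between in-batch
    # similarity distributions, with the last layer treated as the (detached) target.
    p = F.log_softmax(shallow @ shallow.t(), dim=-1)
    q = F.softmax((last @ last.t()).detach(), dim=-1)
    loss = loss + kl_weight * F.kl_div(p, q, reduction="batchmean")
    return loss
```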

Experimental Insights

2DMSE was evaluated through extensive experiments on standard Semantic Textual Similarity (STS) benchmarks. Key findings include:

  • Embedding vectors taken from intermediate Transformer layers are markedly better than those produced by traditional sentence embedding methods and by the MRL framework.
  • Across layers and embedding dimensions, 2DMSE attains STS scores that exceed strong baselines, including SBERT, USE, and the state-of-the-art AnglE framework.

Theoretical and Practical Implications

From a theoretical standpoint, 2DMSE changes how sentence embeddings are generated by making fuller use of the hierarchical structure of representations inside Transformer models. Practically, it lets users tailor embedding generation to a specific computational budget, trading depth and dimension for speed with little loss in performance. This adaptability matters most in resource-constrained environments, where compute and memory efficiency are paramount, as the sketch below illustrates.
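As a hedged illustration of that adaptability, the snippet below sketches how a caller might trade quality for cost by choosing the encoder depth and the number of retained dimensions at inference time. `encode_up_to_layer` and the budget table are hypothetical and not part of the released code.

```python
import torch
import torch.nn.functional as F

# Illustrative budget presets: shallower layers and shorter prefixes are cheaper
# to compute and store, deeper/longer settings give the best quality.
BUDGETS = {
    "low":    {"layer": 4,  "dim": 64},   # fastest, smallest index footprint
    "medium": {"layer": 8,  "dim": 256},
    "full":   {"layer": 12, "dim": 768},  # maximum quality
}

def embed(encode_up_to_layer, sentences, budget: str = "medium") -> torch.Tensor:
    """Encode sentences with a 2DMSE-style model under a chosen compute budget (sketch)."""
    cfg = BUDGETS[budget]
    emb = encode_up_to_layer(sentences, layer=cfg["layer"])  # [n, full_dim] pooled embeddings
    return F.normalize(emb[:, :cfg["dim"]], dim=-1)          # truncate, then re-normalize
```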

Looking Forward

2DMSE opens new avenues for research and application in NLP. Future work could optimize the layer sampling strategy or investigate alternative alignment objectives to further improve embedding quality. The framework's flexibility also suggests broad applicability across NLP tasks, from information retrieval to real-time language understanding systems.

In conclusion, the Two-dimensional Matryoshka Sentence Embeddings framework advances the development of scalable, efficient, and high-performing sentence embedding models. By making both model depth and embedding size adaptable, 2DMSE paves the way for more versatile approaches to capturing semantic information in text.