Starbucks: Improved Training for 2D Matryoshka Embeddings (2410.13230v2)

Published 17 Oct 2024 in cs.IR

Abstract: Effective approaches that can scale embedding model depth (i.e. layers) and embedding size allow for the creation of models that are highly scalable across different computational resources and task requirements. While the recently proposed 2D Matryoshka training approach can efficiently produce a single embedding model such that its sub-layers and sub-dimensions can measure text similarity, its effectiveness is significantly worse than if smaller models were trained separately. To address this issue, we propose Starbucks, a new training strategy for Matryoshka-like embedding models, which encompasses both the fine-tuning and pre-training phases. For the fine-tuning phase, we discover that, rather than sampling a random sub-layer and sub-dimension for each training step, providing a fixed list of layer-dimension pairs, from small to large sizes, and computing the loss across all pairs significantly improves the effectiveness of 2D Matryoshka embedding models, bringing them on par with their separately trained counterparts. To further enhance performance, we introduce a new pre-training strategy, which applies masked autoencoder language modelling to sub-layers and sub-dimensions during pre-training, resulting in a stronger backbone for subsequent fine-tuning of the embedding model. Experimental results on both semantic text similarity and retrieval benchmarks demonstrate that the proposed pre-training and fine-tuning strategies significantly improve effectiveness over 2D Matryoshka models, enabling Starbucks models to perform more efficiently and effectively than separately trained models.

Summary

  • The paper presents a two-phase training method combining fine-tuning over a fixed list of layer-dimension pairs (Starbucks Representation Learning, SRL) with Matryoshka-style masked autoencoder pre-training (Starbucks Masked Autoencoding, SMAE) to improve embedding performance.
  • The paper demonstrates significant improvements in Spearman’s correlation and retrieval metrics (MRR@10, nDCG@10) on key benchmark datasets.
  • The paper suggests that Starbucks can bridge the gap between flexible multi-dimensional embeddings and independently tuned models, enhancing scalability and practical NLP applications.

Starbucks: Improved Training for 2D Matryoshka Embeddings

The paper introduces an innovative training strategy named Starbucks, designed to enhance the performance of 2D Matryoshka embedding models. This approach addresses key limitations observed in previous methods for training flexible embedding models, particularly the 2D Matryoshka Sentence Embeddings (2DMSE). Researchers observed that while 2DMSE allows for adaptable embedding generation across varying dimensions and layers, its effectiveness was consistently inferior to that of bespoke, smaller models trained independently.

Key Contributions

Starbucks introduces a two-phase training methodology:

  1. Fine-Tuning (Starbucks Representation Learning - SRL): Instead of randomly sampling sub-layer and sub-dimension combinations during fine-tuning, Starbucks computes the loss over a fixed, small-to-large sequence of layer-dimension pairs at every training step. This targeted approach aligns training with practical application scenarios where only a limited number of model sizes are required, and it significantly improves the effectiveness of the resulting embedding models (a minimal sketch of this loss appears after this list).
  2. Pre-Training (Starbucks Masked Autoencoding - SMAE): Inspired by MatFormer and standard MAE techniques, Starbucks integrates masked autoencoding with a Matryoshka-like structure during the pre-training phase. This strategy applies masked language modelling across the same sub-layers and sub-dimensions, establishing a robust backbone for SRL and helping the model generalize better to downstream applications (a corresponding SMAE sketch follows the SRL one below).
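
The snippet below is a minimal PyTorch-style sketch of the SRL fine-tuning loss described in item 1: a similarity loss is summed over a fixed small-to-large list of (layer, dimension) pairs rather than a randomly sampled pair. The specific pair list, CLS pooling, cosine-similarity MSE objective, and helper names are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of Starbucks Representation Learning (SRL) fine-tuning.
# Assumptions (not from the paper's code): a HuggingFace BERT-like encoder,
# CLS pooling, a cosine-similarity MSE loss for an STS-style objective, and an
# illustrative fixed list of (layer, dim) pairs ordered small -> large.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)

# Fixed small-to-large layer/dimension schedule (illustrative values).
LAYER_DIM_PAIRS = [(2, 32), (4, 64), (6, 128), (8, 256), (10, 512), (12, 768)]

def sub_embedding(hidden_states, layer, dim):
    """CLS vector from a given sub-layer, truncated to the first `dim` dims."""
    cls = hidden_states[layer][:, 0, :dim]            # [batch, dim]
    return F.normalize(cls, p=2, dim=-1)

def srl_loss(sent_a, sent_b, gold_scores):
    """Sum the similarity loss over every (layer, dim) pair in the fixed list."""
    batch_a = tokenizer(sent_a, padding=True, return_tensors="pt")
    batch_b = tokenizer(sent_b, padding=True, return_tensors="pt")
    hs_a = encoder(**batch_a).hidden_states           # tuple: embeddings + 12 layers
    hs_b = encoder(**batch_b).hidden_states
    total = 0.0
    for layer, dim in LAYER_DIM_PAIRS:
        emb_a = sub_embedding(hs_a, layer, dim)
        emb_b = sub_embedding(hs_b, layer, dim)
        pred = (emb_a * emb_b).sum(dim=-1)             # cosine similarity
        total = total + F.mse_loss(pred, gold_scores)  # loss for this (layer, dim)
    return total / len(LAYER_DIM_PAIRS)

# Usage: loss = srl_loss(["a cat"], ["a kitten"], torch.tensor([0.9])); loss.backward()
```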
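
For item 2, the following sketch illustrates the idea of applying masked language modelling across the same sub-layers and sub-dimensions during pre-training. The masking scheme, the per-width prediction heads, and the pair list are simplifying assumptions for illustration; the paper's actual SMAE setup may differ.

```python
# Minimal sketch of Starbucks Masked Autoencoding (SMAE) pre-training.
# Assumptions (illustrative, not the authors' exact setup): BERT-style MLM with
# 15% masking, and a separate vocabulary-prediction head per embedding width so
# truncated hidden states of size `dim` can be decoded back to the vocabulary.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)

LAYER_DIM_PAIRS = [(2, 32), (4, 64), (6, 128), (8, 256), (10, 512), (12, 768)]
heads = nn.ModuleDict({str(d): nn.Linear(d, tokenizer.vocab_size)
                       for _, d in LAYER_DIM_PAIRS})
mlm_loss = nn.CrossEntropyLoss()  # labels set to -100 are ignored by default

def smae_loss(texts):
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    labels = batch["input_ids"].clone()
    # Randomly mask 15% of tokens (simplified: always replace with [MASK],
    # and special/pad tokens are not excluded as a real setup would do).
    mask = torch.rand(labels.shape) < 0.15
    labels[~mask] = -100                               # only score masked positions
    batch["input_ids"][mask] = tokenizer.mask_token_id
    hidden = encoder(**batch).hidden_states
    total = 0.0
    for layer, dim in LAYER_DIM_PAIRS:
        logits = heads[str(dim)](hidden[layer][..., :dim])   # [batch, seq, vocab]
        total = total + mlm_loss(logits.reshape(-1, logits.size(-1)),
                                 labels.reshape(-1))
    return total / len(LAYER_DIM_PAIRS)
```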

Experimental Results

The research substantiates the superiority of Starbucks through comprehensive experiments on semantic text similarity (STS) and retrieval tasks. The Starbucks models match the effectiveness of independently trained models while surpassing 2DMSE, demonstrating consistent improvements across benchmark datasets.

  • STS Tasks: The empirical results show an enhanced Spearman's correlation, with improvements consistently observed across different embedding sizes.
  • Retrieval Tasks: Starbucks models yield higher MRR@10 and nDCG@10 scores on datasets like MS MARCO, DL19, and DL20 compared to 2DMSE, showcasing their robust retrieval capabilities.
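
For readers unfamiliar with the retrieval metric quoted above, the snippet below is a small, self-contained sketch of MRR@10 as commonly computed for MS MARCO-style rankings (nDCG@10 follows the same pattern with graded gains); the input format is an assumption for illustration, not tied to the paper's evaluation code.

```python
# Small illustration of MRR@10, the metric reported for MS MARCO-style retrieval.
# `ranked_ids` maps each query to its top-ranked passage ids (best first);
# `relevant_ids` maps each query to the set of judged-relevant passage ids.
def mrr_at_10(ranked_ids, relevant_ids):
    total = 0.0
    for qid, ranking in ranked_ids.items():
        for rank, pid in enumerate(ranking[:10], start=1):
            if pid in relevant_ids.get(qid, set()):
                total += 1.0 / rank          # reciprocal rank of the first hit
                break
    return total / len(ranked_ids)

# Example: the relevant passage for q1 is ranked 2nd -> MRR@10 = 0.5
print(mrr_at_10({"q1": ["p9", "p3", "p7"]}, {"q1": {"p3"}}))
```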

Implications and Future Directions

The Starbucks method demonstrates a critical advancement in training adaptable embedding models. By effectively narrowing the efficacy gap between dynamic embeddings and individually tuned counterparts, Starbucks augments the practicality and scalability of NLP models for varied computational budgets and task demands.

This research paves the way for further exploration in optimizing layer-dimension configurations. Future work might delve into automatic selection techniques, predicting optimal sizes for specific tasks, or extending the Starbucks methodology to other architectures beyond BERT. Further investigation could unveil methods to strike a balance between computational efficiency and model effectiveness, expanding the utility of Starbucks in adaptive AI systems.

Starbucks stands as a promising approach, streamlining the process of developing efficient, scalable, and versatile embedding models, suitable for diverse AI applications in NLP and beyond.