Starbucks-v2: Improved Training for 2D Matryoshka Embeddings (2410.13230v3)

Published 17 Oct 2024 in cs.IR

Abstract: 2D Matryoshka training enables a single embedding model to generate sub-network representations across different layers and embedding dimensions, offering adaptability to diverse computational and task constraints. However, its effectiveness remains well below that of individually trained models of equivalent sizes. To address this, we propose Starbucks, a new training strategy for Matryoshka-style embedding models that combines structured fine-tuning with masked autoencoder (MAE) pre-training. During fine-tuning, we compute the loss over a fixed set of layer-dimension pairs, from small to large, which significantly improves performance over randomly sampled sub-networks and matches that of separately trained models. Our MAE-based pre-training further enhances the representation quality of sub-networks, providing a stronger backbone for downstream tasks. Experiments on both in-domain (semantic similarity and passage retrieval) and out-of-domain (BEIR) benchmarks show that Starbucks consistently outperforms 2D Matryoshka models and matches or exceeds the performance of individually trained models, while maintaining high efficiency and adaptability. Ablation studies confirm our loss design choices, the impact of SMAE pre-training and demonstrate the applicability of Starbucks across backbones. We further show that depth- and width-wise Starbucks variants capture complementary information, and that their hybridization yields additional performance gains with minimal latency overhead due to parallelization. Code available at https://github.com/ielab/Starbucks

Summary

  • The paper presents a two-phase training method combining structured fine-tuning (Starbucks Representation Learning, SRL) with masked autoencoder pre-training (Starbucks Masked Autoencoding, SMAE) to improve embedding performance.
  • The paper demonstrates significant improvements in Spearman’s correlation and retrieval metrics (MRR@10, nDCG@10) on key benchmark datasets.
  • The paper suggests that Starbucks can bridge the gap between flexible multi-dimensional embeddings and independently tuned models, enhancing scalability and practical NLP applications.

Starbucks: Improved Training for 2D Matryoshka Embeddings

The paper introduces an innovative training strategy named Starbucks, designed to enhance the performance of 2D Matryoshka embedding models. This approach addresses key limitations observed in previous methods for training flexible embedding models, particularly the 2D Matryoshka Sentence Embeddings (2DMSE). Researchers observed that while 2DMSE allows for adaptable embedding generation across varying dimensions and layers, its effectiveness was consistently inferior to that of bespoke, smaller models trained independently.
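
To make this adaptability concrete, the sketch below shows how a single BERT-style encoder can emit embeddings at different layer-dimension pairs at inference time. It is a minimal example assuming a Hugging Face `transformers` checkpoint and CLS pooling; the checkpoint name and pooling choice are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; the paper trains its own Starbucks/2DMSE models.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def encode(texts, layer: int, dim: int) -> torch.Tensor:
    """Return L2-normalised embeddings from a chosen sub-layer,
    truncated to the first `dim` dimensions (Matryoshka-style)."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    # hidden_states[0] is the embedding layer; hidden_states[layer] is the
    # output of transformer layer `layer` (1-indexed).
    cls = outputs.hidden_states[layer][:, 0]   # CLS pooling (assumption)
    sub = cls[:, :dim]                         # keep only the leading dimensions
    return F.normalize(sub, p=2, dim=-1)

# A small sub-network (layer 6, 128 dims) vs. the full model (layer 12, 768 dims).
small = encode(["a query", "a passage"], layer=6, dim=128)
full = encode(["a query", "a passage"], layer=12, dim=768)
print(small.shape, full.shape)  # torch.Size([2, 128]) torch.Size([2, 768])
```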

Key Contributions

Starbucks introduces a two-phase training methodology:

  1. Fine-Tuning (Starbucks Representation Learning - SRL): Instead of randomly sampling sub-layer and sub-dimension combinations during fine-tuning, Starbucks computes the loss over a fixed sequence of layer-dimension pairs, from small to large. This targeted approach aligns training with practical scenarios where only a limited number of model sizes is required, and it markedly improves the effectiveness of the resulting sub-network embeddings (a minimal sketch of this loss follows this list).
  2. Pre-Training (Starbucks Masked Autoencoding - SMAE): Inspired by MatFormer and standard MAE techniques, Starbucks integrates masked autoencoding with a Matryoshka-like structure during the pre-training phase. This strategy applies masked language modelling to sub-networks spanning various layers and embedding dimensions, establishing a robust backbone for SRL and helping the model generalize better to downstream applications.
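
A minimal sketch of the SRL fine-tuning idea is shown below. It assumes an in-batch contrastive loss over query-passage pairs and a fixed list of layer-dimension pairs; the specific pairs, temperature, pooling, and loss form are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

# Fixed layer-dimension pairs, small to large (illustrative values).
LAYER_DIM_PAIRS = [(2, 32), (4, 64), (6, 128), (8, 256), (10, 512), (12, 768)]

def srl_loss(model, queries, passages, temperature: float = 0.05) -> torch.Tensor:
    """Average an in-batch contrastive loss over every fixed (layer, dim) pair.

    `model(**inputs)` is assumed to expose hidden states for all layers
    (e.g., a Hugging Face encoder loaded with output_hidden_states=True).
    """
    q_out = model(**queries).hidden_states   # tuple: embedding layer + each transformer layer
    p_out = model(**passages).hidden_states
    labels = torch.arange(q_out[0].size(0), device=q_out[0].device)

    total = 0.0
    for layer, dim in LAYER_DIM_PAIRS:
        q = F.normalize(q_out[layer][:, 0, :dim], dim=-1)  # CLS pooling, truncated embedding
        p = F.normalize(p_out[layer][:, 0, :dim], dim=-1)
        scores = q @ p.T / temperature                     # in-batch similarity matrix
        total = total + F.cross_entropy(scores, labels)    # diagonal entries are positives
    return total / len(LAYER_DIM_PAIRS)
```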

Experimental Results

The research substantiates these claims through comprehensive experiments on semantic textual similarity (STS) and passage retrieval tasks. The Starbucks models not only match the performance of independently trained models but also surpass 2DMSE, demonstrating consistent improvements across benchmark datasets while retaining the efficiency and adaptability of a single backbone.

  • STS Tasks: The empirical results show improved Spearman's correlation, with gains observed consistently across embedding sizes.
  • Retrieval Tasks: Starbucks models yield higher MRR@10 and nDCG@10 scores than 2DMSE on MS MARCO, DL19, and DL20, showcasing robust retrieval capabilities (MRR@10 is sketched below).
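
For readers unfamiliar with the retrieval metric, the sketch below computes MRR@10 in the conventional way: the reciprocal rank of the first relevant passage within the top 10 results, averaged over queries. This is the standard definition, not code from the paper.

```python
def mrr_at_10(ranked_ids, relevant_ids):
    """ranked_ids: {query_id: [doc_id, ...]} ranked best-first.
    relevant_ids: {query_id: set of relevant doc_ids}."""
    total = 0.0
    for qid, ranking in ranked_ids.items():
        for rank, doc_id in enumerate(ranking[:10], start=1):
            if doc_id in relevant_ids.get(qid, set()):
                total += 1.0 / rank   # reciprocal rank of first relevant hit
                break
    return total / len(ranked_ids)

print(mrr_at_10({"q1": ["d3", "d7", "d1"]}, {"q1": {"d7"}}))  # 0.5
```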

Implications and Future Directions

The Starbucks method marks a meaningful advance in training adaptable embedding models. By narrowing the effectiveness gap between flexible, dynamically sized embeddings and individually tuned counterparts, Starbucks improves the practicality and scalability of NLP models across varied computational budgets and task demands.

This research paves the way for further exploration in optimizing layer-dimension configurations. Future work might delve into automatic selection techniques, predicting optimal sizes for specific tasks, or extending the Starbucks methodology to other architectures beyond BERT. Further investigation could unveil methods to strike a balance between computational efficiency and model effectiveness, expanding the utility of Starbucks in adaptive AI systems.

Starbucks stands as a promising approach, streamlining the process of developing efficient, scalable, and versatile embedding models, suitable for diverse AI applications in NLP and beyond.