Optimal layer–dimension configurations for Starbucks embeddings

Determine whether there exist specific combinations of transformer encoder layer counts and embedding dimensions in BERT-based embedding models trained with Starbucks Representation Learning (and optionally Starbucks Masked Autoencoding pre-training) that yield higher effectiveness than the monotonic trend obtained by simply increasing layers and dimensions together.

Background

The paper introduces Starbucks, a training strategy for Matryoshka-like embedding models that targets a fixed list of layer–dimension pairs across both fine-tuning (Starbucks Representation Learning, SRL) and pre-training (Starbucks Masked Autoencoding, SMAE). Experiments use BERT-base encoders and evaluate on semantic text similarity and passage retrieval tasks.
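A minimal sketch (not the authors' code) of how such a layer–dimension sub-embedding can be read off a BERT-base encoder: pool the hidden states of an intermediate layer and keep only the first `dim` components. The specific (layer, dim) pairs and the mean-pooling choice below are illustrative assumptions, not taken from the paper.

```python
# Sketch: extracting a Starbucks-style sub-embedding at a target (layer, dim) pair.
# Assumes a Hugging Face BERT-base encoder; the LAYER_DIM_PAIRS values are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

LAYER_DIM_PAIRS = [(2, 32), (4, 64), (6, 128), (8, 256), (10, 512), (12, 768)]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def sub_embedding(texts, layer, dim):
    """Mean-pool the hidden states of `layer`, then keep the first `dim` components."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer]
    mask = inputs["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)   # mean pooling over tokens
    emb = pooled[:, :dim]                           # Matryoshka-style truncation
    return torch.nn.functional.normalize(emb, dim=-1)

for layer, dim in LAYER_DIM_PAIRS:
    emb = sub_embedding(["an example sentence"], layer, dim)
    print(layer, dim, tuple(emb.shape))
```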

While results show effectiveness generally improves when increasing both layer count and embedding dimensionality, the authors evaluate only six predefined configurations and do not explore the full space of possible layer–dimension combinations. They explicitly note an unresolved question about whether certain specific combinations might outperform the monotonic trend of increasing size; a hypothetical grid sweep illustrating this is sketched below.
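To make the open question concrete, one could sweep the full layer × dimension grid and compare against the six "diagonal" configurations. The sketch below is hypothetical: the grid values are assumed, and `evaluate` is a stand-in for whatever STS or retrieval metric would actually be computed on each (layer, dim) sub-model.

```python
import random
from itertools import product

# Illustrative grid for a BERT-base backbone; the six diagonal pairs stand in
# for the paper's predefined configurations.
layers = [2, 4, 6, 8, 10, 12]
dims = [32, 64, 128, 256, 512, 768]
diagonal = set(zip(layers, dims))

def evaluate(layer: int, dim: int) -> float:
    # Placeholder for a real metric (e.g., Spearman correlation on STS or
    # nDCG@10 for passage retrieval) of the (layer, dim) sub-model.
    return random.random()

scores = {(l, d): evaluate(l, d) for l, d in product(layers, dims)}
best = max(scores, key=scores.get)
print(f"best (layer, dim) = {best}; off-diagonal: {best not in diagonal}")
```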

References

Our results show that increasing dimension and layer numbers always led to improvements in effectiveness; however, we still do not know if there are specific combinations of layers and dimensions that would be more effective.

Starbucks-v2: Improved Training for 2D Matryoshka Embeddings (2410.13230 - Zhuang et al., 17 Oct 2024) in Section: Limitations