2D Matryoshka Sentence Embedding (2DMSE)
- The paper introduces 2DMSE, a unified framework that adapts transformer depth and embedding width to balance efficiency and accuracy.
- It employs multi-level supervision with KL-divergence alignment and Matryoshka-style masked autoencoder pre-training to enhance shallow representations.
- Experiments show up to 1.5× speedup on semantic tasks, making 2DMSE ideal for deployment in resource-constrained environments.
The two-dimensional Matryoshka Sentence Embedding (2DMSE) paradigm is a unified approach to sentence representation using transformer encoders that supports elastic adaptation along both layer (depth) and embedding width (dimension). This methodology enables a single model to return high-quality embeddings from any intermediate transformer layer and truncated embedding size, offering substantial flexibility for systems operating under varying latency, memory, or accuracy constraints. 2DMSE generalizes the earlier Matryoshka Representation Learning (MRL), with extensive experimentation demonstrating significant efficiency gains as well as improved effectiveness for shallow and low-dimensional representations. More recent work, notably Starbucks-v2, advances 2DMSE through structured fine-tuning and tailored masked autoencoder pre-training, closing the performance gap with individually tuned small models (Li et al., 22 Feb 2024, Wang et al., 26 Nov 2024, Zhuang et al., 17 Oct 2024).
1. Mathematical Foundations and Model Construction
2DMSE defines the embedding extraction space via a transformer encoder with $N$ layers and hidden size $D$; a sub-model output is denoted $\mathbf{e}_n^{(d)}$, the $d$-dimensional pooled embedding obtained from the $n$-th layer. The essential operation comprises running only the first $n$ layers for an input $x$, pooling (commonly over CLS or mean), and truncating or projecting the output to the first $d$ dimensions.
In contrast to 1D MRL (which slices only the embedding width at the final layer), 2DMSE exposes both axes, permitting arbitrary choices of $(n, d)$ at inference. This enables compute-efficient inference by, for example, halving the layers traversed and shrinking vector storage without retraining. Formally, for input $x$,

$$\mathbf{e}_n^{(d)} = \mathrm{Pool}\big(\mathrm{Enc}_{1:n}(x)\big)[{:}d], \qquad 1 \le n \le N,\; 1 \le d \le D.$$
The set of possible sub-models forms a 2D grid indexed by layer $n$ and dimension $d$ (Li et al., 22 Feb 2024).
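The extraction is straightforward to prototype. The sketch below assumes a Hugging Face BERT-style encoder with mean pooling; the function name `encode_sub_model` and the pooling choice are illustrative, not taken from the published 2DMSE code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def encode_sub_model(texts, n_layers: int, dim: int) -> torch.Tensor:
    """Return e_n^(d): mean-pool the n-th layer and keep the first `dim` dimensions."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    hidden = out.hidden_states[n_layers]            # (B, T, D); index 0 = embedding layer
    mask = batch["attention_mask"].unsqueeze(-1)    # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)   # mean pooling over non-padding tokens
    return pooled[:, :dim]                          # truncate to the first d dimensions

emb = encode_sub_model(["an example sentence"], n_layers=6, dim=256)  # e_6^(256)
```

Because the forward pass above still executes all $N$ layers, realizing the latency savings requires actually truncating the layer stack to the first $n$ layers (e.g., by slicing the encoder's layer list), as sketched in Section 6.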
2. Training Objectives and Optimization Strategies
The classic 2DMSE optimization objective involves simultaneous multi-level supervision, with losses computed on embeddings from sampled layers and dimensions. The canonical formulation is

$$\mathcal{L}_{\mathrm{2DMSE}} = \sum_{(n, d) \in \mathcal{S}} \lambda_{n, d}\, \ell\big(\mathbf{e}_n^{(d)}\big),$$

where $\ell(\cdot)$ is a task-specific loss (e.g., contrastive, ranking, or angular), $\mathcal{S}$ is the set of sampled layer–dimension pairs, and $\lambda_{n,d}$ are per-component weights.
A key component is a KL-divergence alignment loss that regularizes shallow representations by aligning their scoring distributions to those of the full model:

$$\mathcal{L}_{\mathrm{align}} = \mathrm{KL}\Big(p\big(\mathbf{e}_N^{(D)}\big) \,\Big\|\, p\big(\mathbf{e}_n^{(d)}\big)\Big),$$

where $p(\cdot)$ denotes the similarity-score distribution induced by an embedding. The overall objective typically uses equal or tuned weighting for each component.
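A minimal sketch of such an alignment term is given below, assuming in-batch cosine similarities softened by a temperature; the exact direction, temperature, and normalization of the published loss may differ.

```python
import torch
import torch.nn.functional as F

def kl_alignment_loss(shallow_emb: torch.Tensor,
                      full_emb: torch.Tensor,
                      temperature: float = 0.05) -> torch.Tensor:
    """KL(full || shallow) over in-batch similarity-score distributions."""
    s = F.normalize(shallow_emb, dim=-1)
    f = F.normalize(full_emb, dim=-1)
    log_p_shallow = F.log_softmax(s @ s.T / temperature, dim=-1)
    log_p_full = F.log_softmax(f @ f.T / temperature, dim=-1)
    # The full model's distribution serves as the (detached) teacher target.
    return F.kl_div(log_p_shallow, log_p_full.detach(),
                    log_target=True, reduction="batchmean")
```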
Recent advances clarify two key published implementations:
- Version 1: Random sampling strategy over layers, aligning each sampled shallow representation against the last layer using KL-divergence (Wang et al., 26 Nov 2024).
- Version 2: Jointly trains all layers and dimensions via a logarithmically weighted sum over both axes, incorporating PCA projections and a combination of MSE and KL-divergence losses for each target sub-model (Wang et al., 26 Nov 2024):

  $$\mathcal{L}_{\mathrm{v2}} = \sum_{n=1}^{N} \sum_{d \in \mathcal{D}} w_{n,d} \Big[ \ell\big(\mathbf{e}_n^{(d)}\big) + \alpha\, \mathrm{MSE}\big(\mathbf{e}_n^{(d)},\, \mathrm{PCA}_d\big(\mathbf{e}_N^{(D)}\big)\big) + \beta\, \mathrm{KL}\big(p\big(\mathbf{e}_N^{(D)}\big) \,\big\|\, p\big(\mathbf{e}_n^{(d)}\big)\big) \Big],$$

  with logarithmic layer- and dimension-dependent weights $w_{n,d}$ and hyperparameters $\alpha, \beta$.
Starbucks-v2 introduces structured fine-tuning (SRL), using a fixed, ordered set $\mathcal{S} = \{(n_1, d_1), \dots, (n_K, d_K)\}$ of layer–dimension pairs. Instead of random sampling, the loss is averaged exhaustively over $\mathcal{S}$:

$$\mathcal{L}_{\mathrm{SRL}} = \frac{1}{|\mathcal{S}|} \sum_{(n, d) \in \mathcal{S}} \ell\big(\mathbf{e}_n^{(d)}\big).$$
Empirically, this structured approach yields markedly higher small-model accuracy and closes gaps with independently tuned models (Zhuang et al., 17 Oct 2024).
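The following sketch illustrates the structured objective, assuming mean pooling and an illustrative diagonal layer–dimension schedule; the exact pairs and any auxiliary alignment term are assumptions rather than the released configuration.

```python
import torch

# Illustrative diagonal layer–dimension schedule for a 12-layer, 768-dim encoder.
SUB_MODELS = [(2, 32), (4, 64), (6, 128), (8, 256), (10, 512), (12, 768)]

def structured_loss(hidden_states, attention_mask, task_loss_fn, align_fn=None):
    """Average a task loss (plus optional alignment) over the fixed (n, d) set.

    hidden_states: tuple of per-layer (B, T, D) tensors, index 0 = embeddings.
    task_loss_fn:  maps a (B, d) embedding batch to a scalar loss.
    align_fn:      optional alignment term, e.g. the kl_alignment_loss sketch above.
    """
    mask = attention_mask.unsqueeze(-1).float()
    full = (hidden_states[-1] * mask).sum(1) / mask.sum(1)  # full-capacity embedding
    losses = []
    for n, d in SUB_MODELS:
        pooled = (hidden_states[n] * mask).sum(1) / mask.sum(1)
        sub = pooled[:, :d]
        loss = task_loss_fn(sub)
        if align_fn is not None:
            loss = loss + align_fn(sub, full)
        losses.append(loss)
    return torch.stack(losses).mean()
```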
3. Masked Autoencoding and Pre-training Enhancements
Starbucks-v2 further augments 2DMSE with Matryoshka-style Masked Autoencoder (SMAE) pre-training. For each sub-network $(n, d) \in \mathcal{S}$, the encoder–decoder MAE reconstructs masked tokens, projecting the truncated sub-layer outputs back to the full dimension via a learnable $W_d \in \mathbb{R}^{D \times d}$. The corresponding loss is

$$\mathcal{L}_{\mathrm{SMAE}} = \frac{1}{|\mathcal{S}|} \sum_{(n, d) \in \mathcal{S}} \mathcal{L}_{\mathrm{MLM}}\Big(\mathrm{Dec}\big(W_d\, \mathbf{h}_n^{(d)}\big),\, x_{\mathrm{mask}}\Big),$$

where $\mathbf{h}_n^{(d)}$ denotes the first $d$ dimensions of the $n$-th layer token representations, $\mathrm{Dec}$ is the shared MAE decoder, $x_{\mathrm{mask}}$ the masked tokens, and $\mathcal{L}_{\mathrm{MLM}}$ the masked-token cross-entropy.
This targeted pre-training preserves information content for shallow layers and low dimensions, substantially improving downstream effectiveness (Zhuang et al., 17 Oct 2024). Dropping the decoder after pre-training ensures that only the encoder backbone is carried forward to fine-tuning.
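A compact sketch of the projection step is given below; the module name `SubDimProjector` and the decoder interface are illustrative, not taken from the released Starbucks code.

```python
import torch
import torch.nn as nn

class SubDimProjector(nn.Module):
    """Learnable W_d mapping a truncated d-dim token state back to the full hidden size D."""

    def __init__(self, sub_dim: int, full_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(sub_dim, full_dim)

    def forward(self, hidden_sub: torch.Tensor) -> torch.Tensor:
        # hidden_sub: (B, T, d) token states from layer n, truncated to d dims.
        # Output: (B, T, D), fed to the shared MAE decoder for masked-token prediction.
        return self.proj(hidden_sub)

# Per-step usage: for each (n, d) in S, truncate the n-th layer token states,
# project back to D with the corresponding SubDimProjector, decode, compute the
# MLM cross-entropy on masked positions, and average over all pairs in S.
```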
4. Empirical Results and Comparative Performance
2DMSE achieves clear performance gains on semantic textual similarity (STS) and retrieval tasks. On STS, full-capacity 2DMSE models attain a higher average Spearman correlation than both SBERT (74.89) and MRL (82.57) at comparable settings (Li et al., 22 Feb 2024). At moderate layer–dimension settings, 2DMSE delivers up to 1.5× inference speedup with a limited accuracy trade-off.
Table: Performance Summary on In-domain Tasks (Starbucks-v2)

| Method | STS avg | MS MARCO MRR | DL19 nDCG | DL20 nDCG |
|------------------|:-------:|:------------:|:---------:|:---------:|
| BERT (full) | 0.6522 | 0.0559 | 0.1084 | 0.1076 |
| BERT-Separate | 0.7644 | 0.2917 | 0.5859 | 0.5857 |
| BERT-2DMSE | 0.7338 | 0.2560 | 0.5218 | 0.5408 |
| BERT-SRL | 0.7682 | 0.2921 | 0.6009 | 0.6126 |
| SMAE-SRL | 0.7837 | 0.3116 | 0.6135 | 0.6039 |
Original 2DMSE models underperform BERT-Separate at small sub-model sizes; Starbucks’s structured fine-tuning and pre-training match or surpass separately trained models across most benchmarks.
Zero-shot generalization is partially validated; transfer from MS MARCO to TREC DL is robust, suggesting broader utility for BEIR benchmarks (Zhuang et al., 17 Oct 2024, Wang et al., 26 Nov 2024).
5. Ablations, Loss Variants, and Analysis
Analyses underscore the necessity of the KL-divergence alignment (a 0.08–0.15 lift) and of last-layer supervision (a 1.5 drop if omitted) (Li et al., 22 Feb 2024). Ablating SMAE pre-training reveals that standard MLM pre-training helps smaller sub-networks, but SMAE is superior overall for large sub-networks and retrieval (Zhuang et al., 17 Oct 2024).
Key loss variant studies:
- Incorporating full-dimension losses per layer restores high-dim accuracy ("FULL-DIM") (Wang et al., 26 Nov 2024).
- Targeting multiple sub-dimensions ("+DIMS") smooths accuracy for small dimensions with minor loss for larger ones (Wang et al., 26 Nov 2024).
- Scoring-focused KL losses marginally stabilize high-dim retrieval, but do not correct low-dim collapse (Wang et al., 26 Nov 2024).
- Fixed-document encoder approaches are suboptimal except under specific resource constraints (Wang et al., 26 Nov 2024).
6. Efficiency, Adaptability, and Implementation Guidelines
2DMSE’s primary efficiency gains arise from reduced transformer depth (inference cost scales with the $n$ layers traversed rather than all $N$) and smaller embedding size. Halving the number of layers nearly halves latency and memory; empirical measurements confirm the up to 1.5× speedup at moderate settings (Li et al., 22 Feb 2024).
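A rough way to check the depth-related savings, assuming a BERT-style Hugging Face encoder, is to truncate the layer stack and time the forward pass; the sketch below is illustrative and not a benchmark from the cited papers.

```python
import copy
import time
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
full = AutoModel.from_pretrained("bert-base-uncased").eval()
half = copy.deepcopy(full)
half.encoder.layer = half.encoder.layer[:6]  # keep only the first 6 of 12 layers

batch = tok(["an example sentence"] * 32, padding=True, return_tensors="pt")
for name, m in [("full (12 layers)", full), ("half (6 layers)", half)]:
    with torch.no_grad():
        m(**batch)                             # warm-up pass
        start = time.perf_counter()
        for _ in range(10):
            m(**batch)
    print(name, f"{(time.perf_counter() - start) / 10:.3f} s/batch")
```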
Practical recommendations:
- Select sub-network $(n, d)$ pairs tailored to deployment hardware.
- Pre-train with SMAE for robust small-model initialization.
- Fine-tune using the structured loss over the fixed set $\mathcal{S}$ of layer–dimension pairs, including KL alignment.
- At inference, choose $(n, d)$ to satisfy latency, memory, or task constraints without separate retraining.
The framework is architecture-agnostic; RoBERTa, DeBERTa, and LLaMA-style transformers can be used (Zhuang et al., 17 Oct 2024).
7. Limitations, Extensions, and Future Directions
While 2DMSE demonstrates strong elasticity and performance for STS and retrieval, caveats remain. Shallow or extremely low-dimensional representations still lag full-capacity quality. Most efficacy is reported for STS and passage retrieval; other downstream tasks require validation. Multi-task training with extensive sampling incurs overhead, though parallelization amortizes much of the extra cost (Zhuang et al., 17 Oct 2024).
Future research directions outlined include:
- Designing alternate layer-weighting and dimensionality schedules;
- Exploring retrieval-specific loss formulations;
- Expanding to diverse backbone architectures and data domains;
- Investigating hybrid depth/width variants for complementary information capture.
8. Synthesis and Significance
2DMSE rigorously generalizes Matryoshka learning and offers a principled means to produce on-demand sentence embeddings across a wide accuracy–efficiency frontier. Structured fine-tuning and Matryoshka-style MAE pre-training (as in Starbucks-v2) are shown to be essential for matching the performance of independently tuned small models. Empirical work establishes 2DMSE as a robust embedding protocol for modern NLP deployments seeking single-model adaptability and efficiency (Li et al., 22 Feb 2024, Wang et al., 26 Nov 2024, Zhuang et al., 17 Oct 2024).