Contextualized Dynamic Meta-Embeddings

Updated 19 May 2026

The paper introduces CDME as an unsupervised framework that combines multiple contextualized language models into a unified sentence embedding.
It projects source token embeddings into a common meta-space and applies dynamic token-level attention to weight contributions based on context.
Empirical results show that CDME outperforms traditional unsupervised methods and competes with supervised approaches on semantic similarity tasks.

Contextualized Dynamic Meta-Embeddings (CDME) are a framework for producing sentence-level representations by combining multiple independently trained contextualized LLMs into a unified meta-embedding. CDME is designed to maximize the complementary strengths of source models, accommodate differences in dimensionality, and remain fully unsupervised and task-agnostic. The architecture operates at the token level by projecting source embeddings into a shared meta-space, weighting their contributions via learnable attention, and producing the sentence-level vector through pooling. Empirical evaluation on semantic textual similarity (STS) benchmarks demonstrates that CDME achieves superior performance compared to both unsupervised and some supervised baselines, without reliance on labeled data (Takahashi et al., 2022).

1. Motivation and Design Rationale

Conventional static meta-embedding methods, designed for non-contextual vectors, are ill-suited to contextualized neural language models (NLMs) such as BERT, RoBERTa, and ELMo for three reasons:

Context dependence: Token vectors from NLMs vary with sentential context.
Heterogeneous dimensionalities: Source models produce vectors in differing dimensions, precluding naive concatenation for more than a few sources.
Unsupervised generality: Fine-tuning for each task is costly; reusable, unsupervised sentence embeddings are preferred.

CDME addresses these challenges by (1) working at the token level with context-sensitive meta-embeddings, (2) mapping all sources into a unified meta-space, and (3) learning dynamic—context- and token-sensitive—attention weights for fusion. No source LLMs are fine-tuned; all pretrained parameters remain frozen, preserving modularity and scalability.

2. Source Embeddings and Meta-Space Projection

For a sentence $s = (w_1, w_2, \ldots, w_T)$, $n$ pretrained NLMs yield contextualized token embeddings $x_{i,t} = f_i(w_t, s) \in \mathbb{R}^{d_i}$, with $i$ indexing the source and $t$ the token. To facilitate integration, source embeddings are projected into a common meta-space of dimension $d_m$ using trainable linear maps $P_i \in \mathbb{R}^{d_m \times d_i}$. This yields

$z_{i,t} = P_i x_{i,t} \in \mathbb{R}^{d_m}.$

Each $P_i$ is regularized towards orthonormality to maintain variance and avoid rank deficiency.
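
As a rough illustration of the projection step, the NumPy sketch below maps two sources of different dimensionality into one meta-space. The dimensions (768, 1024, 256), the random stand-ins for the frozen source embeddings, and the squared-Frobenius form of the orthonormality penalty are all assumptions made for this example; the source does not specify them here.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 5                      # number of tokens in the sentence
d_sources = [768, 1024]    # hypothetical source dimensionalities d_i
d_m = 256                  # hypothetical meta-space dimensionality

# Stand-ins for frozen source-model outputs x_{i,t}: one (T, d_i) array per source.
X = [rng.standard_normal((T, d_i)) for d_i in d_sources]

# Trainable projections P_i of shape (d_m, d_i), randomly initialised here.
P = [rng.standard_normal((d_m, d_i)) / np.sqrt(d_i) for d_i in d_sources]


def project(P_i, X_i):
    """Apply z_{i,t} = P_i x_{i,t} to every token; returns shape (T, d_m)."""
    return X_i @ P_i.T


def orthonormality_penalty(P_i):
    """Squared Frobenius distance of P_i P_i^T from the identity (assumed form)."""
    gram = P_i @ P_i.T
    return float(np.sum((gram - np.eye(P_i.shape[0])) ** 2))


# After projection, every source lives in the same d_m-dimensional space.
Z = [project(P_i, X_i) for P_i, X_i in zip(P, X)]
```

Because each projected array has shape `(T, d_m)` regardless of the source dimensionality, the sources can be fused token by token without concatenation.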

3. Token-Level Dynamic Attention

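Token-level fusion uses learnable attention over sources, enabling context-sensitive weighting. Each source $i$ is assigned an attention vector $a_i \in \mathbb{R}^{d_m}$. For token $w_t$, attention logits are computed as

$e_{i,t} = a_i^\top z_{i,t}$

and normalized with a softmax across sources,

$\alpha_{i,t} = \frac{\exp(e_{i,t})}{\sum_{j=1}^{n} \exp(e_{j,t})}.$

The meta-embedding at position $t$ is the weighted sum

$m_t = \sum_{i=1}^{n} \alpha_{i,t} z_{i,t} \in \mathbb{R}^{d_m}.$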
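
The fusion step can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions: the function name `fuse_tokens`, the array layout `(n, T, d_m)`, and the toy sizes are hypothetical, and the logits are taken to be a plain dot product between a per-source attention vector and the projected token embedding.

```python
import numpy as np


def fuse_tokens(Z, a):
    """Fuse projected token embeddings from n sources.

    Z: array of shape (n, T, d_m), projected embeddings per source and token.
    a: array of shape (n, d_m), one attention vector per source.
    Returns (M, alpha): M has shape (T, d_m), alpha has shape (n, T).
    """
    # Dot-product logits, one per (source, token) pair.
    logits = np.einsum("ntd,nd->nt", Z, a)
    # Softmax across sources (axis 0); subtract the max for numerical stability.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    alpha = weights / weights.sum(axis=0, keepdims=True)
    # Weighted sum of source embeddings at each token position.
    M = np.einsum("nt,ntd->td", alpha, Z)
    return M, alpha


rng = np.random.default_rng(0)
n, T, d_m = 2, 5, 8                      # toy sizes, chosen for illustration
Z = rng.standard_normal((n, T, d_m))
a = rng.standard_normal((n, d_m))
M, alpha = fuse_tokens(Z, a)
```

With all attention vectors set to zero the weights become uniform, so the fused embedding reduces to the plain average of the sources at each token.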

4. Sentence-Level Pooling and Embedding Extraction

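The final sentence-level embedding is obtained by pooling over the token meta-embeddings $m_1, \ldots, m_T$. CDME supports two pooling schemes:

Mean pooling:

$v_s^{\text{mean}} = \frac{1}{T} \sum_{t=1}^{T} m_t$

Element-wise max pooling:

$(v_s^{\max})_k = \max_{1 \le t \le T} (m_t)_k, \quad k = 1, \ldots, d_m$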

Max pooling consistently showed slightly superior STS correlation in reported experiments and is the default pooling operation.
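
The two pooling schemes differ only in the reduction applied over the token axis. The sketch below assumes the token meta-embeddings are stacked in a `(T, d_m)` array; the function name `pool` and the toy values are hypothetical.

```python
import numpy as np


def pool(M, mode="max"):
    """Pool token meta-embeddings M of shape (T, d_m) into one d_m-vector."""
    if mode == "mean":
        return M.mean(axis=0)
    if mode == "max":
        # Element-wise maximum over tokens, independently per dimension.
        return M.max(axis=0)
    raise ValueError(f"unknown pooling mode: {mode!r}")


# Two tokens, three meta-space dimensions.
M = np.array([[1.0, -2.0, 0.5],
              [3.0, 0.0, -1.0]])

mean_vec = pool(M, "mean")   # [2.0, -1.0, -0.25]
max_vec = pool(M, "max")     # [3.0, 0.0, 0.5]
```

Note that max pooling can take each dimension from a different token, whereas mean pooling weights every token equally.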

5. Unsupervised Objective and Optimization

All learnable parameters are updated by minimizing an unsupervised loss, combining four desiderata for token projections:

Same token, same context: $z_{i,t}$ and $z_{j,t}$ for identical $(w_t, s)$ across sources should be close.
Different tokens, same context: Distinct tokens in the same sentence should remain distinguishable.
Same token, different contexts: The same word in different sentences should have context-dependent representations.
Different tokens, different contexts: Different words in different sentences should be maximally separated.
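
To make the first desideratum concrete, the sketch below computes one possible agreement term: the mean squared distance between projections of the same token in the same sentence from different sources. This is an illustrative assumption only, not the loss defined in the source; the function name `cross_source_agreement` and the squared-Euclidean form are hypothetical.

```python
import numpy as np


def cross_source_agreement(Z):
    """Mean squared distance between sources for the same token and context.

    Z: array of shape (n, T, d_m) with projected embeddings per source.
    Smaller values mean the sources agree more closely in the meta-space.
    This squared-Euclidean form is an assumption made for illustration.
    """
    n = Z.shape[0]
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            # Squared distance per token, averaged over the sentence.
            total += float(np.mean(np.sum((Z[i] - Z[j]) ** 2, axis=-1)))
            pairs += 1
    return total / pairs


# Toy check: one source is all zeros, the other all ones, in 3 dimensions.
Z_toy = np.stack([np.zeros((4, 3)), np.ones((4, 3))])
gap = cross_source_agreement(Z_toy)   # each token differs by 1 in 3 dims
```

The remaining three desiderata would need contrastive terms that push representations apart, which this sketch does not attempt.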

The composite loss combines one term per desideratum, with all terms precisely defined in the source and the associated hyperparameters tuned on held-out data. The loss does not reference the attention weights directly; practical learning is supported by a small auxiliary term for attention (see the original appendix for details).

The optimizer is stochastic gradient descent (SGD) with weight decay, using a batch size of 512; the weight-decay coefficient and learning rate are specified in the source. Early stopping uses Pearson correlation on STS-B development data, with convergence reported in approximately 8 hours for two sources on a single Quadro RTX 8000 GPU.

6. Empirical Performance and Comparative Analysis

Evaluation follows canonical STS benchmarks (STS-15, STS-16, STS-B). For each sentence pair $(s_1, s_2)$, the cosine similarity of their meta-embeddings is measured and compared to human annotations via Pearson $r$ and Spearman $\rho$.
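
The evaluation protocol can be sketched with NumPy alone. The helper names below (`cosine`, `pearson`, `spearman`, `average_ranks`) and the toy data are hypothetical; Spearman correlation is computed as the Pearson correlation of average ranks, which is the standard definition. STS papers usually report these correlations multiplied by 100.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two sentence embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def pearson(x, y):
    """Pearson correlation coefficient r."""
    return float(np.corrcoef(x, y)[0, 1])


def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        mask = x == value
        ranks[mask] = ranks[mask].mean()
    return ranks


def spearman(x, y):
    """Spearman rank correlation rho: Pearson r computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: random stand-ins for sentence embeddings and gold scores.
rng = np.random.default_rng(0)
pairs = [(rng.standard_normal(8), rng.standard_normal(8)) for _ in range(6)]
gold = rng.uniform(0.0, 5.0, size=6)      # STS-style human ratings

predicted = np.array([cosine(u, v) for u, v in pairs])
r, rho = pearson(predicted, gold), spearman(predicted, gold)
```

Spearman depends only on the ordering of the predicted similarities, so it is insensitive to any monotone rescaling of the cosine scores, while Pearson is not.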

Method	STS-15	STS-16	STS-B
SSE–BERT	87.15/87.34	81.98/83.05	82.40/82.74
SSE–RoBERTa	87.72/87.88	84.55/84.93	82.46/83.26
CONC	88.64/88.59	84.27/84.88	84.69/85.14
AVG	88.47/88.45	83.97/84.63	84.21/84.47
SVD	88.66/88.65	84.07/84.63	83.98/84.61
GCCA	88.58/88.58	84.00/84.53	83.36/84.14
SUP	89.34/89.30	85.11/85.72	65.21/64.79
UNSUP (CDME)	88.76/88.85	85.06/85.33	85.33/86.08

Observations: CDME's unsupervised meta-embeddings (“UNSUP”) surpass all prior unsupervised methods (CONC, AVG, SVD, GCCA) on every benchmark. Against the supervised meta-embedding “SUP”, the table shows SUP narrowly ahead on STS-15 (89.34/89.30 vs. 88.76/88.85) and STS-16 (85.11/85.72 vs. 85.06/85.33), while on STS-B, whose labels SUP is trained on, the table lists 65.21/64.79 for SUP against 85.33/86.08 for CDME. The improvements are reported as statistically significant in most settings. Ablations confirm the necessity of dynamic token-level attention and max pooling; setting all attention weights to the uniform value $1/n$ degrades STS-B performance from 85.33/86.08 to 81.98/83.00.

7. Strengths, Limitations, and Future Directions

Strengths:

Fully unsupervised, requiring only raw sentence corpora and no labeled similarity data.
Highly modular, supporting any number of source models of arbitrary dimensionality through compact projections.
Dynamic source selection, allowing attention to reflect token- and context-specific reliability.
Yields robust, context-sensitive sentence embeddings outperforming contemporary unsupervised and some supervised methods.

Limitations and Open Questions:

Empirical results are limited to $n = 2$ sources of equal dimensionality; extension to larger, more heterogeneous ensembles remains untested at scale.
The attention mechanism is a source-specific dot product; extensions to richer multi-head or context-aware mechanisms are plausible avenues for improved expressivity.
The present formulation does not process subword-level or hierarchical input structure, limiting granularity and potential performance on longer or nested sequences.
Linear projections suffice in current experiments, but more expressive nonlinear mappings (e.g., small feed-forward networks) could offer additional flexibility.
No ablation evaluates direct integration of more recent generative or discriminative LMs (e.g., GPT, XLNet, ELECTRA) into the meta-embedding workflow.

CDME establishes a lightweight, unsupervised, and universally adaptable framework for synthesizing multiple contextual LLMs into a unified sentence embedding, with demonstrated superiority on standard STS tasks (Takahashi et al., 2022).

References

Takahashi et al. (2022). Unsupervised Attention-based Sentence-Level Meta-Embeddings from Contextualised Language Models.
