Contextualized Dynamic Meta-Embeddings

Updated 19 May 2026
  • The paper introduces CDME as an unsupervised framework that combines multiple contextualized language models into a unified sentence embedding.
  • It projects source token embeddings into a common meta-space and applies dynamic token-level attention to weight contributions based on context.
  • Empirical results show that CDME outperforms traditional unsupervised methods and competes with supervised approaches on semantic similarity tasks.

Contextualized Dynamic Meta-Embeddings (CDME) are a framework for producing sentence-level representations by combining multiple independently trained contextualized neural language models (NLMs) into a unified meta-embedding. CDME is designed to exploit the complementary strengths of its source models, accommodate differences in dimensionality, and remain fully unsupervised and task-agnostic. The architecture operates at the token level: it projects source embeddings into a shared meta-space, weights their contributions via learnable attention, and produces the sentence-level vector through pooling. Empirical evaluation on semantic textual similarity (STS) benchmarks shows that CDME outperforms the unsupervised baselines and is competitive with a supervised meta-embedding baseline, without relying on labeled data (Takahashi et al., 2022).

1. Motivation and Design Rationale

Conventional static meta-embedding methods, designed for non-contextual vectors, are ill-suited to contextualized neural language models (NLMs) such as BERT, RoBERTa, and ELMo for three reasons:

  • Context dependence: Token vectors from NLMs vary with sentential context.
  • Heterogeneous dimensionalities: Source models produce vectors in differing dimensions, precluding naive concatenation for more than a few sources.
  • Unsupervised generality: Fine-tuning for each task is costly; reusable, unsupervised sentence embeddings are preferred.

CDME addresses these challenges by (1) working at the token level with context-sensitive meta-embeddings, (2) mapping all sources into a unified meta-space, and (3) learning dynamic attention weights for fusion that depend on both the token and its context. No source NLMs are fine-tuned; all pretrained parameters remain frozen, preserving modularity and scalability.

2. Source Embeddings and Meta-Space Projection

For a sentence $s = (w_1, w_2, \ldots, w_T)$, $n$ pretrained NLMs yield contextualized token embeddings $x_{i,t} = f_i(w_t, s) \in \mathbb{R}^{d_i}$, with $i$ indexing the source and $t$ the token. To facilitate integration, source embeddings are projected into a common meta-space of dimension $d_m$ using trainable linear maps $P_i \in \mathbb{R}^{d_m \times d_i}$. This yields

$$z_{i,t} = P_i x_{i,t} \in \mathbb{R}^{d_m}.$$

Each $P_i$ is regularized towards orthonormality to maintain variance and avoid rank deficiency.
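The projection step can be illustrated with a minimal NumPy sketch. The source dimensionalities, the random inputs, and the specific penalty (squared Frobenius distance between $P_i^\top P_i$ and the identity) are assumptions chosen for illustration; the text above only states that each projection is regularized towards orthonormality, not which regularizer is used.

```python
import numpy as np


def project_to_meta_space(x, P):
    """Map token embeddings x of shape (T, d_i) into the meta-space.

    P has shape (d_m, d_i); the result has shape (T, d_m).
    """
    return x @ P.T


def orthonormality_penalty(P):
    """Squared Frobenius distance between P^T P and the identity.

    Assumed form of the regularizer (not taken from the source).
    It is zero exactly when the columns of P are orthonormal.
    """
    d_i = P.shape[1]
    gram = P.T @ P
    return float(np.sum((gram - np.eye(d_i)) ** 2))


rng = np.random.default_rng(0)
T, d_m = 5, 8          # sentence length and meta-space dimension (hypothetical)
dims = [6, 4]          # hypothetical source dimensionalities d_1, d_2

# Stand-ins for frozen source embeddings x_{i,t}, one (T, d_i) array per source.
sources = [rng.normal(size=(T, d)) for d in dims]

# Reduced QR yields a (d_m, d_i) matrix with orthonormal columns when d_m >= d_i.
projections = [np.linalg.qr(rng.normal(size=(d_m, d)))[0] for d in dims]

z = [project_to_meta_space(x, P) for x, P in zip(sources, projections)]
penalties = [orthonormality_penalty(P) for P in projections]
```

With orthonormal columns, each projection preserves the norm of the source vectors, which is one way to read the "maintain variance" motivation. In training, the penalty would be added to the loss rather than enforced exactly as it is here.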

3. Token-Level Dynamic Attention

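Token-level fusion uses a learnable attention over sources, enabling context-sensitive weighting. Each source $i$ is assigned an attention vector $a_i \in \mathbb{R}^{d_m}$. For token $t$, attention logits are computed as

$$e_{i,t} = a_i^\top z_{i,t}$$

and normalized with a softmax across sources,

$$\alpha_{i,t} = \frac{\exp(e_{i,t})}{\sum_{j=1}^{n} \exp(e_{j,t})}.$$

The meta-embedding at position $t$ is a weighted sum,

$$m_t = \sum_{i=1}^{n} \alpha_{i,t} \, z_{i,t}.$$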
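As a rough sketch of this fusion step, the NumPy code below scores each projected source vector with a per-source attention vector (a dot product), applies a softmax across sources at each token position, and forms the weighted sum. The function names, array layout, and random inputs are assumptions for illustration and are not taken from the original implementation.

```python
import numpy as np


def softmax(logits, axis):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def fuse_sources(z, a):
    """Fuse projected token embeddings from n sources.

    z: array of shape (n, T, d_m), projected embeddings per source and token.
    a: array of shape (n, d_m), one attention vector per source.
    Returns (meta, alpha): meta has shape (T, d_m); alpha has shape (n, T).
    """
    logits = np.einsum("ntd,nd->nt", z, a)    # dot product per source and token
    alpha = softmax(logits, axis=0)           # normalize across sources
    meta = np.einsum("nt,ntd->td", alpha, z)  # weighted sum over sources
    return meta, alpha


rng = np.random.default_rng(0)
n, T, d_m = 2, 4, 8                           # hypothetical sizes
z = rng.normal(size=(n, T, d_m))
a = rng.normal(size=(n, d_m))

meta, alpha = fuse_sources(z, a)

# Zero attention vectors give equal logits, i.e. uniform attention,
# which reduces the fusion to a plain average of the sources.
uniform_meta, uniform_alpha = fuse_sources(z, np.zeros((n, d_m)))
```

Because the logits depend on the contextualized vector at each position, the same source can receive different weights for different tokens and for the same token in different sentences.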

4. Sentence-Level Pooling and Embedding Extraction

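The final sentence-level embedding $h_s \in \mathbb{R}^{d_m}$ is obtained by pooling over the token meta-embeddings $m_1, \ldots, m_T$. CDME supports two pooling schemes:

  • Mean pooling:

$$h_s = \frac{1}{T} \sum_{t=1}^{T} m_t$$

  • Element-wise max pooling:

$$(h_s)_k = \max_{1 \le t \le T} (m_t)_k, \quad k = 1, \ldots, d_m$$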

Max pooling consistently showed slightly superior STS correlation in reported experiments and is the default pooling operation.
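The two pooling schemes are simple reductions over the token axis. The sketch below uses a hypothetical helper name and a toy 2-token, 2-dimensional input purely for illustration.

```python
import numpy as np


def pool_tokens(meta, mode="max"):
    """Pool token meta-embeddings of shape (T, d_m) into one (d_m,) vector."""
    if mode == "max":
        return meta.max(axis=0)    # element-wise maximum over tokens
    if mode == "mean":
        return meta.mean(axis=0)   # average over tokens
    raise ValueError(f"unknown pooling mode: {mode!r}")


# Toy input: two tokens, two meta-space dimensions.
meta = np.array([[1.0, -2.0],
                 [3.0,  0.0]])

max_pooled = pool_tokens(meta, "max")     # array([3., 0.])
mean_pooled = pool_tokens(meta, "mean")   # array([ 2., -1.])
```

Max pooling keeps, per dimension, the strongest activation from any token, whereas mean pooling dilutes it as the sentence grows.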

5. Unsupervised Objective and Optimization

All learnable parameters are updated by minimizing an unsupervised loss, combining four desiderata for token projections:

  1. Same token, same context: The projections $z_{i,t}$ and $z_{j,t}$ of an identical token-in-context $(w_t, s)$ from different sources $i$ and $j$ should be close.
  2. Different tokens, same context: Distinct tokens in the same sentence should remain distinguishable.
  3. Same token, different contexts: The same word in different sentences should have context-dependent representations.
  4. Different tokens, different contexts: Different words in different sentences should be maximally separated.

The composite loss combines one term per desideratum. All terms are precisely defined in the source, and the loss hyperparameters are tuned on held-out data. The loss does not reference the attention weights directly; practical learning is supported by a small auxiliary term for attention (see the Appendix of the original paper for details).

The optimizer is stochastic gradient descent (SGD) with weight decay and a batch size of 512; the specific weight-decay and learning-rate values are given in the source. Early stopping uses Pearson correlation on STS-B development data, with convergence reported in roughly 8 hours for two sources on a single Quadro RTX 8000 GPU.

6. Empirical Performance and Comparative Analysis

Evaluation follows canonical STS benchmarks (STS-15, STS-16, STS-B). For each sentence pair $(s_1, s_2)$, the cosine similarity of the two meta-embeddings is computed and compared to human annotations via Pearson $r$ and Spearman $\rho$.
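The evaluation protocol can be sketched with NumPy alone. The helper names and the toy scores below are made up for illustration, and the rank-based Spearman helper assumes there are no tied values; a production evaluation would more likely use an established statistics library.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two sentence embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def pearson(x, y):
    """Pearson correlation coefficient between two score lists."""
    return float(np.corrcoef(x, y)[0, 1])


def spearman(x, y):
    """Spearman correlation as Pearson on ranks (assumes no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)


# Toy data: predicted cosine similarities and human ratings for four pairs.
predicted = np.array([0.10, 0.40, 0.35, 0.80])
gold = np.array([1.0, 3.0, 2.0, 5.0])

r = pearson(predicted, gold)
rho = spearman(predicted, gold)   # both lists have the same ordering
```

Spearman depends only on the ordering of the scores, so it is insensitive to any monotonic rescaling of the cosine similarities, while Pearson also rewards a linear relationship with the human ratings.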

STS results; each cell lists Pearson / Spearman correlation (×100).

Method          STS-15         STS-16         STS-B
SSE–BERT        87.15/87.34    81.98/83.05    82.40/82.74
SSE–RoBERTa     87.72/87.88    84.55/84.93    82.46/83.26
CONC            88.64/88.59    84.27/84.88    84.69/85.14
AVG             88.47/88.45    83.97/84.63    84.21/84.47
SVD             88.66/88.65    84.07/84.63    83.98/84.61
GCCA            88.58/88.58    84.00/84.53    83.36/84.14
SUP             89.34/89.30    85.11/85.72    65.21/64.79
UNSUP (CDME)    88.76/88.85    85.06/85.33    85.33/86.08

Observations: CDME's unsupervised meta-embeddings ("UNSUP") score above the prior unsupervised meta-embedding methods (CONC, AVG, SVD, GCCA) and both single-source baselines on every benchmark. Against the supervised meta-embedding "SUP", the reported numbers show SUP slightly ahead on STS-15 (89.34/89.30 vs. 88.76/88.85) and STS-16 (85.11/85.72 vs. 85.06/85.33), while CDME is well ahead on STS-B (85.33/86.08 vs. 65.21/64.79), even though SUP is trained directly on STS-B labels. The improvements are reported as statistically significant in most settings. Ablations confirm the necessity of dynamic token-level attention and max pooling; replacing the learned weights with uniform attention (each of the $n$ sources weighted $1/n$) degrades STS-B performance from 85.33/86.08 to 81.98/83.00.

7. Strengths, Limitations, and Future Directions

Strengths:

  • Fully unsupervised, requiring only raw sentence corpora and no labeled similarity data.
  • Highly modular, supporting any number of source models of arbitrary dimensionality through compact projections.
  • Dynamic source selection, allowing attention to reflect token- and context-specific reliability.
  • Yields robust, context-sensitive sentence embeddings outperforming contemporary unsupervised and some supervised methods.

Limitations and Open Questions:

  • Empirical results are limited to $n = 2$ sources of equal dimensionality; extension to larger, more heterogeneous ensembles remains untested at scale.
  • The attention mechanism is a source-specific dot product; extensions to richer multi-head or context-aware mechanisms are plausible avenues for improved expressivity.
  • The present formulation does not model subword-level or hierarchical input structure, which limits granularity and may limit performance on longer or nested sequences.
  • Linear projections suffice in current experiments, but more expressive nonlinear mappings (e.g., small feed-forward networks) could offer additional flexibility.
  • No ablation evaluates direct integration of recent generator or discriminative LMs (e.g., GPT, XLNet, ELECTRA) into the meta-embedding workflow.

CDME establishes a lightweight, unsupervised, and model-agnostic framework for combining multiple contextualized language models into a unified sentence embedding, with strong results on standard STS tasks (Takahashi et al., 2022).
