Contextualized Dynamic Meta-Embeddings
- The paper introduces CDME as an unsupervised framework that combines multiple contextualized language models into a unified sentence embedding.
- It projects source token embeddings into a common meta-space and applies dynamic token-level attention to weight contributions based on context.
- Empirical results show that CDME outperforms traditional unsupervised methods and competes with supervised approaches on semantic similarity tasks.
Contextualized Dynamic Meta-Embeddings (CDME) are a framework for producing sentence-level representations by combining multiple independently trained contextualized language models into a unified meta-embedding. CDME is designed to exploit the complementary strengths of its source models, accommodate differences in their dimensionality, and remain fully unsupervised and task-agnostic. The architecture operates at the token level: it projects source embeddings into a shared meta-space, weights their contributions via learnable attention, and pools the result into a sentence-level vector. On semantic textual similarity (STS) benchmarks, CDME outperforms unsupervised baselines and, on some benchmarks, a supervised meta-embedding baseline, without relying on labeled data (Takahashi et al., 2022).
1. Motivation and Design Rationale
Conventional static meta-embedding methods, designed for non-contextual word vectors, are ill-suited to contextualized neural language models (NLMs) such as BERT, RoBERTa, and ELMo for three reasons:
- Context dependence: Token vectors from NLMs vary with sentential context.
- Heterogeneous dimensionalities: Source models produce vectors of differing dimensions, which rules out direct averaging, while naive concatenation scales poorly beyond a few sources.
- Unsupervised generality: Fine-tuning for each task is costly; reusable, unsupervised sentence embeddings are preferred.
CDME addresses these challenges by (1) working at the token level with context-sensitive meta-embeddings, (2) mapping all sources into a unified meta-space, and (3) learning dynamic attention weights for fusion that depend on both the token and its context. No source NLMs are fine-tuned; all pretrained parameters remain frozen, preserving modularity and scalability.
2. Source Embeddings and Meta-Space Projection
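For a sentence $S = (w_1, \dots, w_T)$ of $T$ tokens, $N$ pretrained NLMs yield contextualized token embeddings $\mathbf{s}_i^{(j)} \in \mathbb{R}^{d_j}$, with $j$ indexing the source and $i$ the token. To facilitate integration, source embeddings are projected into a common meta-space of dimension $d$ using trainable linear maps $\mathbf{P}_j \in \mathbb{R}^{d \times d_j}$. This yields
$$\mathbf{p}_i^{(j)} = \mathbf{P}_j \, \mathbf{s}_i^{(j)}.$$
Each $\mathbf{P}_j$ is regularized towards orthonormality to maintain variance and avoid rank deficiency.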
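The projection step can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the names `META_DIM` and `SOURCE_DIMS`, the two stand-in sources, and the random initialisation are hypothetical, and the regularizer shown (the squared Frobenius norm of $\mathbf{P}\mathbf{P}^\top - \mathbf{I}$, which presumes the meta-space is no larger than each source space) is only one plausible way to encourage orthonormality.

```python
import random


def matvec(matrix, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def orthonormality_penalty(matrix):
    """Squared Frobenius norm of (P P^T - I); zero when the rows of P are orthonormal.

    Assumed form of the regularizer, chosen for illustration only.
    """
    penalty = 0.0
    for a, row_a in enumerate(matrix):
        for b, row_b in enumerate(matrix):
            gram = sum(x * y for x, y in zip(row_a, row_b))
            target = 1.0 if a == b else 0.0
            penalty += (gram - target) ** 2
    return penalty


random.seed(0)
META_DIM = 3                                  # hypothetical meta-space dimension
SOURCE_DIMS = {"source_a": 4, "source_b": 5}  # hypothetical source dimensions

# One projection matrix per source (META_DIM rows, source-dimension columns).
# In training these would be learned; here they are just randomly initialised.
projections = {
    name: [[random.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(META_DIM)]
    for name, dim in SOURCE_DIMS.items()
}

# Stand-in contextualized embeddings of one token, one vector per source.
token_embeddings = {
    name: [random.gauss(0.0, 1.0) for _ in range(dim)]
    for name, dim in SOURCE_DIMS.items()
}

# After projection, every source lives in the same META_DIM-dimensional space.
projected = {
    name: matvec(projections[name], token_embeddings[name]) for name in SOURCE_DIMS
}
```

Although the two sources start with 4 and 5 dimensions, both projected vectors have length `META_DIM`, which is what makes the later weighted sum well defined.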
3. Token-Level Dynamic Attention
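Token-level fusion leverages a learnable attention over sources, enabling context-sensitive weighting. Each source $j$ is assigned an attention vector $\mathbf{a}_j \in \mathbb{R}^{d}$. For token $i$, with $\mathbf{p}_i^{(j)}$ denoting the projected embedding of that token from source $j$, attention logits are computed:
$$e_i^{(j)} = \mathbf{a}_j^\top \mathbf{p}_i^{(j)}$$
and normalized with softmax across the $N$ sources,
$$\alpha_i^{(j)} = \frac{\exp\big(e_i^{(j)}\big)}{\sum_{k=1}^{N} \exp\big(e_i^{(k)}\big)}.$$
The meta-embedding at position $i$ is a weighted sum,
$$\mathbf{m}_i = \sum_{j=1}^{N} \alpha_i^{(j)} \, \mathbf{p}_i^{(j)}.$$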
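The fusion of a single token can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the logit for each source is the dot product of that source's attention vector with the token's projected embedding, and the function name `fuse_token` and the toy values are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_token(projected, attention_vectors):
    """Fuse one token's projected source embeddings into a meta-embedding.

    projected[j] is the token's embedding from source j, already in the meta-space;
    attention_vectors[j] is that source's attention vector (same dimension).
    Returns the meta-embedding and the per-source attention weights.
    """
    logits = [dot(a, p) for a, p in zip(attention_vectors, projected)]
    weights = softmax(logits)
    dim = len(projected[0])
    meta = [sum(w * p[k] for w, p in zip(weights, projected)) for k in range(dim)]
    return meta, weights


# Toy values: two sources, meta-space dimension 2.
projected = [[1.0, 0.0], [0.0, 1.0]]
attention_vectors = [[2.0, 0.0], [0.0, 0.0]]
meta, weights = fuse_token(projected, attention_vectors)

# With all-zero attention vectors every logit is 0, so the weights are uniform.
uniform_meta, uniform_weights = fuse_token(projected, [[0.0, 0.0], [0.0, 0.0]])
```

In the toy example the first source receives a logit of 2 and the second a logit of 0, so the first source dominates the weighted sum. The all-zero case reduces to plain averaging of the projected embeddings.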
4. Sentence-Level Pooling and Embedding Extraction
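The final sentence-level embedding $\mathbf{m}(S)$ is obtained by pooling over the token meta-embeddings $\mathbf{m}_1, \dots, \mathbf{m}_T$, where $T$ is the number of tokens in the sentence. CDME supports two pooling schemes:
- Mean pooling:
$$\mathbf{m}(S) = \frac{1}{T} \sum_{i=1}^{T} \mathbf{m}_i$$
- Element-wise max pooling:
$$\mathbf{m}(S)_k = \max_{1 \le i \le T} \, (\mathbf{m}_i)_k$$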
Max pooling consistently showed slightly superior STS correlation in reported experiments and is the default pooling operation.
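The two pooling schemes differ only in how each dimension is reduced over the tokens. A minimal sketch, with hypothetical function names and toy token vectors:

```python
def mean_pool(token_vectors):
    """Average each dimension over all tokens of the sentence."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def max_pool(token_vectors):
    """Take the largest value of each dimension over all tokens."""
    return [max(column) for column in zip(*token_vectors)]


# Three token meta-embeddings of dimension 2 (toy values).
tokens = [[1.0, -2.0], [3.0, 0.0], [2.0, 5.0]]

sentence_mean = mean_pool(tokens)  # [2.0, 1.0]
sentence_max = max_pool(tokens)    # [3.0, 5.0]
```

Both functions return a vector of the same dimension as the token meta-embeddings, regardless of sentence length.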
5. Unsupervised Objective and Optimization
All learnable parameters are updated by minimizing an unsupervised loss, combining four desiderata for token projections:
- Same token, same context: The projected embeddings $\mathbf{p}_i^{(j)}$ and $\mathbf{p}_i^{(j')}$ of an identical token $i$ from two sources $j \neq j'$ should be close.
- Different tokens, same context: Distinct tokens in the same sentence should remain distinguishable.
- Same token, different contexts: The same word in different sentences should have context-dependent representations.
- Different tokens, different contexts: Different words in different sentences should be maximally separated.
The composite loss combines these four terms, with all terms precisely defined in the source and its hyperparameters tuned on held-out data. The loss does not reference the attention weights directly; practical learning is supported by a small auxiliary term for attention (see the appendix of the original paper for details).
The optimizer is stochastic gradient descent (SGD) with weight decay and a batch size of 512; the weight-decay coefficient and learning rate are specified in the source. Early stopping uses Pearson correlation on STS-B development data, with convergence reported in approximately 8 hours for two sources on a single Quadro RTX 8000 GPU.
6. Empirical Performance and Comparative Analysis
Evaluation follows canonical STS benchmarks (STS-15, STS-16, STS-B). For each sentence pair $(S_1, S_2)$, the cosine similarity of their meta-embeddings is measured and compared to human annotations via Pearson $r$ and Spearman $\rho$.
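The evaluation protocol can be sketched in a few lines of standard-library Python. The sentence-pair embeddings and gold scores below are invented purely for illustration and are not taken from any benchmark; the helper names are hypothetical. In practice a statistics library would normally supply the correlation functions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pearson(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented sentence-pair embeddings and human similarity scores.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [-1.0, 0.2]),
]
gold = [4.8, 3.1, 1.5, 0.2]

predicted = [cosine(u, v) for u, v in pairs]
r = pearson(predicted, gold)
rho = spearman(predicted, gold)
```

Here the predicted similarities fall in the same order as the gold scores, so `rho` is 1 while `r` is slightly below 1, which illustrates why the two correlations are reported side by side.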
| Method | STS-15 | STS-16 | STS-B |
|---|---|---|---|
| SSE–BERT | 87.15/87.34 | 81.98/83.05 | 82.40/82.74 |
| SSE–RoBERTa | 87.72/87.88 | 84.55/84.93 | 82.46/83.26 |
| CONC | 88.64/88.59 | 84.27/84.88 | 84.69/85.14 |
| AVG | 88.47/88.45 | 83.97/84.63 | 84.21/84.47 |
| SVD | 88.66/88.65 | 84.07/84.63 | 83.98/84.61 |
| GCCA | 88.58/88.58 | 84.00/84.53 | 83.36/84.14 |
| SUP | 89.34/89.30 | 85.11/85.72 | 65.21/64.79 |
| UNSUP (CDME) | 88.76/88.85 | 85.06/85.33 | 85.33/86.08 |
Observations: CDME’s unsupervised meta-embeddings (“UNSUP”) surpass all prior unsupervised methods (CONC, AVG, SVD, GCCA) and both single-source baselines on every benchmark. Against the supervised meta-embedding “SUP”, the tabulated results are mixed: SUP is marginally ahead on STS-15 and STS-16 (for example, 85.11/85.72 vs. 85.06/85.33 on STS-16), whereas CDME scores far higher on STS-B (85.33/86.08 vs. 65.21/64.79). The improvements are reported as statistically significant in most settings. Ablations confirm the necessity of dynamic token-level attention and max pooling; replacing the learned weights with uniform attention degrades STS-B performance from 85.33/86.08 to 81.98/83.00.
7. Strengths, Limitations, and Future Directions
Strengths:
- Fully unsupervised, requiring only raw sentence corpora and no labeled similarity data.
- Highly modular, supporting any number of source models of arbitrary dimensionality through compact projections.
- Dynamic source selection, allowing attention to reflect token- and context-specific reliability.
- Yields robust, context-sensitive sentence embeddings outperforming contemporary unsupervised and some supervised methods.
Limitations and Open Questions:
- Empirical results are limited to a small number of sources of equal dimensionality; extension to larger, more heterogeneous ensembles remains untested at scale.
- The attention mechanism is a source-specific dot product; extensions to richer multi-head or context-aware mechanisms are plausible avenues for improved expressivity.
- Present formulation does not process subword-level or hierarchical input structure, limiting granularity and potential performance on longer or nested sequences.
- Linear projections suffice in current experiments, but more expressive nonlinear mappings (e.g., small feed-forward networks) could offer additional flexibility.
- No ablation evaluates direct integration of recent generator or discriminative LMs (e.g., GPT, XLNet, ELECTRA) into the meta-embedding workflow.
CDME establishes a lightweight, unsupervised framework for synthesizing multiple contextualized NLMs into a unified sentence embedding, with strong results on standard STS tasks (Takahashi et al., 2022).