Contextualized Word Embeddings
- Contextualized word embeddings are dynamic representations that assign unique vectors to each token based on its surrounding linguistic and extralinguistic context.
- They are generated using architectures like deep bidirectional transformers and BiLSTMs, which capture semantic nuances such as polysemy, semantic drift, and fine-grained lexical distinctions.
- These embeddings improve performance in applications like word sense disambiguation, named entity recognition, and topic modeling, while highlighting challenges in bias, dimensional inefficiency, and stability.
Contextualized word embeddings are vector representations of word tokens that dynamically integrate information from their surrounding linguistic and, in advanced systems, extralinguistic context. Unlike traditional static embeddings, which assign a single vector to every word type, contextualized embeddings assign a distinct vector to each token occurrence, enabling the representation of context-dependent meaning, polysemy, semantic drift, and fine-grained lexical distinctions. This paradigm underpins most of the major advances in neural language modeling and downstream NLP tasks over the last decade.
1. Theoretical Foundations and Distinction from Static Embeddings
Conventional word-to-vector embeddings (e.g., word2vec, GloVe) represent each vocabulary item by a single vector learned from global co-occurrence statistics. These representations are insensitive to context, forcing all possible senses and usages of a word into the same location in the vector space. Contextualized word embeddings, in contrast, map each token in a sequence to a context-sensitive vector via a function parameterized by a pretrained language model (PLM):
$$\mathbf{h}_i = f_{\mathrm{LM}}(\mathbf{e}_1, \ldots, \mathbf{e}_n)_i$$
where $\mathbf{e}_i$ is an initial embedding (possibly static) of the $i$-th token, $\mathbf{h}_i$ is its contextualized vector, and the LM is typically a deep transformer (e.g., BERT, RoBERTa). Thus, two tokens of the same type "bank," appearing in "river bank" versus "central bank," receive representations that are distinguishable in the vector space, aligning with their contextual semantics (Liu et al., 2020, Hofmann et al., 2020, Wiedemann et al., 2019).
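The contrast between a static lookup and a context-dependent mapping can be illustrated with a toy sketch. The code below is not BERT or any real PLM: it assumes invented 2-dimensional static vectors (`STATIC`) and a single parameter-free attention step (`contextualize`, a hypothetical helper) in which each token's output is a softmax-weighted average of all static vectors in the sequence.

```python
import math

# Hypothetical 2-d static embeddings; the values are invented for illustration.
STATIC = {
    "river":   [1.0, 0.0],
    "central": [0.0, 1.0],
    "bank":    [0.5, 0.5],
}

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextualize(tokens):
    """Return one vector per token: an attention-weighted average of the
    static vectors of every token in the sequence (dot-product scores)."""
    vecs = [STATIC[t] for t in tokens]
    out = []
    for q in vecs:
        scores = [sum(a * b for a, b in zip(q, k)) for k in vecs]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, vecs))
                    for d in range(len(q))])
    return out

# The static lookup is identical in both phrases ...
static_bank = STATIC["bank"]
# ... while the contextualized vectors differ with the neighbouring word.
bank_river = contextualize(["river", "bank"])[1]      # [0.75, 0.25]
bank_central = contextualize(["central", "bank"])[1]  # [0.25, 0.75]
```

In this toy setting the vector for "bank" is pulled toward whichever neighbour it co-occurs with, which is the qualitative behaviour the section describes; real models learn the query, key, and value projections and stack many such layers.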
Recent theoretical proposals extend this dynamic framework. For example, the quantum-contextual model (Svozil, 18 Apr 2025) represents each word as a static vector in a Hilbert space and defines context by the measurement basis: word meaning is constructed by projecting onto a context-determined orthonormal basis. This captures polysemy through quantum contextuality—one vector encoding multiple, mutually complementary meanings depending on context.
2. Contextualization Mechanisms and Model Architectures
The principal mechanisms for producing contextualized word embeddings are deep neural sequence models, most prominently:
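- Bidirectional language models (BiLSTM-based):
ELMo stacks deep BiLSTM layers, aggregating representations at each depth through learned linear combinations:
$$\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}$$
where $\mathbf{h}_{k,j}$ is the $j$-th layer's state at position $k$, $s_j^{task}$ are softmax-normalized layer weights, and $\gamma^{task}$ is a task-specific scalar (Liu et al., 2020).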
- Transformer Encoders:
BERT, RoBERTa, and similar models compute representations via multi-layer bidirectional self-attention blocks. Each token's hidden state at the final layer, or the concatenation of its states at the last four layers, serves as its contextualized representation. BERT is pretrained via masked language modeling and next-sentence prediction objectives (Liu et al., 2020).
- Other Architectures:
Autoregressive Transformers (the left-to-right GPT-2 and the permutation-based XLNet), context encoders that interpolate static and local context statistics (Horn, 2017), and approaches that use extralinguistic context, such as dynamic embeddings incorporating social/temporal graphs (Hofmann et al., 2020), extend contextualization to richer settings.
Across architectures, the defining property is the production of distinct token vectors for the same word type, as a function of global, local, or even extralinguistic context (Hofmann et al., 2020, Horn, 2017).
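Two of the pooling strategies mentioned above, ELMo's softmax-weighted layer mix and the concatenation of a transformer's last four layers, can be sketched for a single token as follows. This is a minimal illustration under stated assumptions: the per-layer states in `layers` are invented numbers, and `scalar_mix` and `concat_last_layers` are hypothetical helper names rather than functions from any library.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def scalar_mix(layer_states, layer_logits, gamma):
    """ELMo-style mix for one token:
    gamma * sum_j softmax(layer_logits)_j * layer_states[j]."""
    s = softmax(layer_logits)
    dim = len(layer_states[0])
    return [gamma * sum(s_j * h[d] for s_j, h in zip(s, layer_states))
            for d in range(dim)]

def concat_last_layers(layer_states, n=4):
    """Concatenate the top-n layer vectors of one token into one feature vector."""
    out = []
    for h in layer_states[-n:]:
        out.extend(h)
    return out

# Invented per-layer states for one token: 5 layers, 2 dimensions each.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]

# Equal logits give uniform layer weights, so the mix is the layer average.
mixed = scalar_mix(layers, [0.0] * 5, gamma=1.0)
# Concatenating the last four layers yields a 4 * 2 = 8-dimensional feature.
features = concat_last_layers(layers, n=4)
```

In a trained system the layer logits and the scalar would be learned jointly with the downstream task; the mix keeps the original dimensionality, whereas concatenation multiplies it by the number of layers used.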
3. Geometric and Statistical Properties
Contextualized embeddings exhibit pronounced geometric characteristics:
- Anisotropy:
Contextualized embedding spaces are highly anisotropic: token vectors are not spread uniformly over directions but concentrate in a narrow region of the space, so that even unrelated tokens can have a non-trivial cosine similarity. Upper layers of transformers are more context-sensitive, with the average cosine self-similarity among occurrences of the same word type falling to as low as 0.05–0.20 after subtracting random baselines (Ethayarajh, 2019, Wang et al., 2022). Less than 5% of the variance in a word's contextualized representations is explained by a static vector (Ethayarajh, 2019).
- Sense Clustering:
BERT and similar models separate different senses of polysemous/homonymous words into distinct spatial clusters (Nair et al., 2020, Wiedemann et al., 2019). Homonymy and polysemy are quantitatively distinguished via distance: homonymous senses are further apart, mirroring human semantic judgments (Nair et al., 2020).
- Variance and Consistency:
The variance of contextualized embeddings for a given word sense depends on part-of-speech, degree of polysemy, sentence length, and positional bias. For instance, nouns tend to show higher cross-context similarity (Sim_ss) than verbs or adjectives; increasing polysemy and longer sentences decrease sense invariance (Wang et al., 2022).
- Position Bias:
Embeddings of sentence-initial tokens are systematically more similar across contexts, a structural artifact of model architecture and positional encoding. Prompt-based debiasing can partially mitigate this effect, which matters for similarity-based tasks such as WSD (Wang et al., 2022).
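The baseline-adjusted self-similarity described under anisotropy above can be sketched as a mean pairwise cosine over a word's occurrences minus the same statistic over randomly sampled tokens. The vectors below are invented, and `adjusted_self_similarity` is a hypothetical helper; this is an illustration of the idea, not the exact procedure of the cited studies.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def adjusted_self_similarity(word_vectors, baseline_vectors):
    """Self-similarity of one word's occurrences minus an anisotropy
    baseline estimated from vectors of randomly sampled tokens."""
    return (mean_pairwise_cosine(word_vectors)
            - mean_pairwise_cosine(baseline_vectors))

# Invented contextualized vectors for three occurrences of one word.
occurrences = [[1.0, 0.2], [1.0, -0.1], [0.8, 0.4]]
# Invented vectors standing in for randomly sampled tokens.
random_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

raw = mean_pairwise_cosine(occurrences)
baseline = mean_pairwise_cosine(random_tokens)
adjusted = adjusted_self_similarity(occurrences, random_tokens)
```

Subtracting the baseline matters because, in an anisotropic space, a high raw cosine between two occurrences may reflect the geometry of the space rather than shared meaning.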
4. Practical Implementations and Applications
Contextualized word embeddings have been operationalized in a wide range of tasks:
- Word Sense Disambiguation (WSD):
Nonparametric nearest-neighbor classifiers over contextualized vector spaces approach or surpass state-of-the-art on several lexical-sample and all-words benchmarks (Wiedemann et al., 2019). Embeddings produced by BERT, when grouped by sense, reveal clear clusters that correspond to discrete WordNet synsets (Nair et al., 2020, Wiedemann et al., 2019).
- Metaphor Detection:
Deep contextualized models, when used as input to BiLSTM–attention architectures, advance the state-of-the-art for metaphor classification in benchmark datasets; direct feature extraction suffices due to the rich sense modeling by the embeddings (Aggarwal et al., 2020).
- Named Entity Recognition (NER):
Domain-adapted contextualized embeddings (e.g., ELMo trained on chemical patents) improve F1 in chemical NER, with the impact especially pronounced on rare or ambiguous entity types (Zhai et al., 2019).
- Topic Modeling:
Contextualized Word Topic Models (CWTMs) leverage BERT representations to induce topics at both word and document levels, outperforming traditional BoW-based models in topic coherence, diversity, and OOV robustness (Fang et al., 2023).
- Dynamic and Extralinguistic Embedding:
Dynamic contextualized word embeddings (DCWEs) combine type-level word identity, social graph embeddings, and time-varying contextualization, achieving improved perplexity, improved modeling of semantic drift, and downstream task performance (Hofmann et al., 2020).
- Argument Clustering and Classification:
BERT/ELMo embedding pipelines—especially when fine-tuned—boost cross-topic classification F1 by up to 20.8 points over static word embeddings, and substantially enhance aspect-based clustering (Reimers et al., 2019).
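The nonparametric nearest-neighbor approach to WSD mentioned above can be sketched as a cosine k-NN vote over sense-labeled token vectors. The vectors and the sense labels (`bank%river`, `bank%finance`) are invented for illustration, and `knn_sense` is a hypothetical helper; a real system would obtain the vectors from a pretrained model and the labels from a sense-annotated corpus.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def knn_sense(query, labeled, k=1):
    """Predict a sense by majority vote among the k labeled vectors
    most similar (by cosine) to the query vector."""
    ranked = sorted(labeled, key=lambda item: cosine(query, item[0]),
                    reverse=True)
    votes = Counter(sense for _, sense in ranked[:k])
    return votes.most_common(1)[0][0]

# Invented sense-labeled contextualized vectors for the word "bank".
train = [
    ([0.9, 0.1], "bank%river"),
    ([0.8, 0.3], "bank%river"),
    ([0.1, 0.9], "bank%finance"),
    ([0.2, 0.8], "bank%finance"),
]

# A new occurrence whose vector lies closer to the finance examples.
pred = knn_sense([0.3, 0.7], train, k=3)
```

Because the classifier has no trainable parameters, its accuracy depends entirely on how well the embedding space already separates senses, which is why it doubles as a diagnostic of sense clustering.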
5. Limitations, Distortions, and Debiasing
Although contextualized embeddings facilitate nuanced semantic modeling, several limitations and systematic distortions are evident:
- Frequency-Driven Distortions:
The geometric spread ("bounding-sphere radius") of contextualized embeddings for frequent words is significantly larger than for rare words, causing under- or over-estimation of semantic similarity. Such artefacts perpetuate societal biases (e.g., geographic biases in country names), persist even in multilingual models, and cannot be trivially eliminated by data augmentation (Zhou et al., 2021).
- Dimensional Inefficiency:
Not all dimensions in contextualized embeddings are equally informative. Explicit masking of unessential dimensions, identified via within-sense clustering objectives, improves WSD performance and interpretability while substantially reducing parameter footprint (Jiang et al., 2019).
- Instability under Paraphrase:
Vectors for the same word in paraphrased contexts can vary substantially, negatively impacting downstream robustness. Retrofitting with paraphrase-based orthogonal transformations (PAR) reduces intra-paraphrase distances and yields significant improvements on classification, entailment, similarity, and QA tasks, without sacrificing context sensitivity (Shi et al., 2019).
- Context Selection and Scalability (Quantum Models):
Quantum-contextual embeddings, while offering interpretable, static alternatives, raise unresolved questions about context/basis selection, scalability, and empirical integration with end-to-end systems (Svozil, 18 Apr 2025).
- Context Encoder Limitations:
Bag-of-words context encoders (e.g., ConEc) can only perform document-level contextualization, assume single-sense per document, and ignore fine-grained positional and word-order information (Horn, 2017).
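The geometric spread discussed under frequency-driven distortions above can be approximated with a simple proxy. The sketch below does not compute the minimum bounding sphere; it assumes a cheaper centroid-based radius (the largest distance from the mean vector), which is an upper bound on the minimum bounding-sphere radius. The vectors are invented and `centroid_radius` is a hypothetical helper.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def centroid_radius(vectors):
    """Largest Euclidean distance from the centroid to any vector.
    A proxy for spread, not the exact minimum bounding-sphere radius."""
    c = centroid(vectors)
    return max(math.dist(v, c) for v in vectors)

# Invented occurrences of a frequent word (widely spread) ...
frequent_word = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
# ... and of a rare word (tightly clustered).
rare_word = [[1.0, 1.0], [1.2, 1.0]]

r_frequent = centroid_radius(frequent_word)
r_rare = centroid_radius(rare_word)
```

Comparing such radii across frequency bands gives a quick check on whether similarity scores for frequent and rare words are being computed over regions of very different size.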
6. Extending Contextualization: Cross-Lingual, Social, and Temporal Contexts
Contextualized models extend beyond strictly sentence-level linguistic inputs:
- Cross-Lingual Pretraining:
Multilingual BERT and XLM(-R) perform joint pretraining over massive multilingual corpora, enabling zero-shot transfer and robust performance in settings with limited language-specific annotation (Liu et al., 2020).
- Social and Temporal Dynamics:
DCWE frameworks encode words as functions of both social and temporal context, parameterizing embeddings by social graph neighborhoods and temporal windows. These models capture semantic diffusion, temporal drift, and social-group-specific meaning evolution in large-scale, real-world corpora such as Reddit and ArXiv abstracts (Hofmann et al., 2020).
- Dynamic Downstream Adaptation:
Models that jointly optimize on both static and dynamic signals (including social, temporal, and context-specific features) are able to track and interpret emergent meaning shifts and adapt representations to author, community, and time (Hofmann et al., 2020).
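The idea of conditioning a word's representation on social and temporal context can be sketched as a type-level vector plus a context-dependent offset. Everything below is an assumption made for illustration: the additive form, the lookup tables, the community names, and the `dynamic_embedding` helper are hypothetical and are not taken from the DCWE implementation, where the offset would be produced by learned components.

```python
# Invented type-level embedding for one word.
TYPE_EMB = {"mask": [1.0, 0.0]}
# Invented social-context vectors, one per (hypothetical) community.
SOCIAL_EMB = {"r/health": [0.0, 0.5], "r/gaming": [0.0, -0.5]}
# Invented temporal vectors, one per time window (here: year).
TIME_EMB = {2019: [0.0, 0.0], 2020: [0.3, 0.0]}

def dynamic_embedding(word, community, year, scale=1.0):
    """Type vector plus a scaled offset built from the social and
    temporal context vectors (additive form assumed for illustration)."""
    base = TYPE_EMB[word]
    offset = [s + t for s, t in zip(SOCIAL_EMB[community], TIME_EMB[year])]
    return [b + scale * o for b, o in zip(base, offset)]

# The same word type receives different input vectors depending on
# who uses it and when; these would then be fed to a contextualizer.
health_2020 = dynamic_embedding("mask", "r/health", 2020)
gaming_2019 = dynamic_embedding("mask", "r/gaming", 2019)
```

Setting `scale` to zero recovers the plain type-level embedding, which makes it easy to compare a model with and without the extralinguistic signal.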
7. Interpretability, Probing, and Future Directions
Analytical and interpretability advances complement the engineering of contextualized embeddings:
- Probe-Based Analysis:
Structural and linguistic property probes reveal encoding of part-of-speech, syntactic structure, and semantic features at specific layers (Liu et al., 2020). Probing also highlights that contextualization is more pronounced at higher layers, with mid-level layers balancing local syntax and global semantics (Ethayarajh, 2019, Wang et al., 2022).
- Sense Variance Quantification:
Layer-wise and corpus-wide studies demonstrate that sense-consistency remains high across most contexts, but more polysemous terms, or those in long sentences, exhibit increased variance (Wang et al., 2022). Prompt-based debiasing offers practical remedies for over-aggregation due to positional artefacts.
- Contextualized Lexicons and Semantic Resources:
Embedding-based sense relatedness provides continuous, graded measures not captured in discrete lexicons, offering the potential to refine resources like WordNet and enhance WSD and lexical-semantic applications (Nair et al., 2020).
- Research Directions:
Open questions persist regarding the systematic debiasing of geometric distortions, integration of additional extralinguistic signals (e.g., geography), scaling to lower-resource languages and domains, learning interpretable or modular bases (as in the quantum-contextual model), and developing unsupervised or weakly supervised sense-induction methodologies (Svozil, 18 Apr 2025, Hofmann et al., 2020, Jiang et al., 2019, Zhou et al., 2021).
In sum, contextualized word embeddings constitute a foundational advancement in NLP, enabling nuanced token-level semantic modeling, robust cross-domain transfer, richer downstream applications, and new diagnostics for meaning, sense, and variation. Their ongoing evolution integrates an ever wider range of signals (social, temporal, cross-lingual, and quantum-theoretic) while raising new challenges in efficiency, interpretability, and fairness.