Dynamic Word Embeddings

Updated 12 May 2026

Dynamic word embeddings are vector representations that adapt to changes in context, time, or domain while capturing semantic drift and polysemy.
Methodologies include dynamic skip-gram models with Gaussian diffusion, tensor factorization with alignment, and neural architectures conditioned on extralinguistic metadata.
They support bias analysis, sentiment classification, and event tracking, and they expose changes in meaning that a single static vector per word cannot represent.

Dynamic word embeddings are vector-space representations of lexical items that capture semantic, syntactic, or discourse-level properties that change as a function of extrinsic attributes—most commonly time, domain, or context. Unlike static embeddings, which assign each word type a single vector regardless of context or epoch, dynamic word embeddings seek to model semantic drift, polysemy, social variation, and emerging linguistic phenomena by learning time- or context-dependent representations. Methodologies span continuous dynamical systems, probabilistic models with temporal priors, low-rank matrix/tensor decompositions, and contextualized neural architectures adapted to extralinguistic metadata.

1. Mathematical Formulations and Model Classes

Dynamic word embeddings are built upon several mathematically distinct frameworks, unified by the central operation of mapping a tuple (word, context, extralinguistic attribute) to a vector in $\mathbb{R}^d$ or (in quantum-inspired models) a unit-norm vector in a Hilbert space.

Continuous and Discrete-Time Trajectory Models

A canonical formulation is the use of a sequence of embedding matrices $\{U_t\}$, where $t$ indexes discrete time slices or domains, and each $U_t[w]$ is the vector embedding for word $w$ at $t$. This is prominent in time-series generalizations of skip-gram and CBOW:

Dynamic Skip-Gram (DSG): Each word vector $u_{i,t}$ (row $i$ of $U_t$) evolves via a Gaussian (or Ornstein–Uhlenbeck) prior, written here at the matrix level with diffusion constant $D$:

$U_t | U_{t-1} \sim \mathcal{N}(U_{t-1}, D I)$

Embeddings are optimized for likelihood under observed co-occurrence data, regularized by these temporal priors (Bamler et al., 2017, Montariol et al., 2019).

Dynamic Bernoulli Embeddings (DBE): A random-walk prior is placed on word-specific target embeddings $\rho_v^{(t)}$:

$\rho_v^{(t)} | \rho_v^{(t-1)} \sim \mathcal{N}(\rho_v^{(t-1)}, \lambda^{-1}I)$

Context embeddings remain static (Rudolph et al., 2017, Montariol et al., 2019).

Matrix/Tensor Factorization with Alignment: Regularized joint factorizations simultaneously enforce slice-specific fidelity and temporal smoothness:

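$\min_{U_1,\dots,U_T} \; \frac{1}{2}\sum_{t=1}^{T} \|Y_t - U_t U_t^\top\|_F^2 + \frac{\gamma}{2}\sum_{t=1}^{T} \|U_t\|_F^2 + \frac{\tau}{2}\sum_{t=2}^{T} \|U_t - U_{t-1}\|_F^2$

where $Y_t$ is the co-occurrence (e.g., PPMI) matrix for slice $t$, $\gamma$ weights the norm penalty, and $\tau$ weights temporal smoothness (Yao et al., 2017, Brandl et al., 2022).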
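As a rough illustration of temporally smoothed factorization, the NumPy sketch below evaluates a loss with a reconstruction term, a norm penalty, and a smoothness term between adjacent slices, then runs plain gradient descent on synthetic data. The weights `gamma` and `tau`, the toy matrices `Y`, and the gradient-descent solver are assumptions made for illustration; the cited papers use their own solvers and real co-occurrence statistics.

```python
import numpy as np

def loss(Y, U, gamma, tau):
    """0.5*sum_t ||Y_t - U_t U_t^T||_F^2 + 0.5*gamma*sum_t ||U_t||_F^2
    + 0.5*tau*sum_{t>=2} ||U_t - U_{t-1}||_F^2."""
    recon = sum(np.sum((Yt - Ut @ Ut.T) ** 2) for Yt, Ut in zip(Y, U))
    ridge = sum(np.sum(Ut ** 2) for Ut in U)
    smooth = sum(np.sum((U[t] - U[t - 1]) ** 2) for t in range(1, len(U)))
    return 0.5 * recon + 0.5 * gamma * ridge + 0.5 * tau * smooth

def gradients(Y, U, gamma, tau):
    """Gradient of the loss with respect to each slice matrix U_t."""
    T = len(U)
    grads = []
    for t in range(T):
        R = U[t] @ U[t].T - Y[t]              # reconstruction residual
        g = (R + R.T) @ U[t] + gamma * U[t]
        if t > 0:
            g += tau * (U[t] - U[t - 1])      # coupling to the previous slice
        if t < T - 1:
            g += tau * (U[t] - U[t + 1])      # coupling to the next slice
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
V, d, T = 20, 4, 3                            # toy vocabulary size, dimension, slices
A = [0.5 * rng.normal(size=(V, d)) for _ in range(T)]
Y = [a @ a.T for a in A]                      # synthetic symmetric targets (assumption)
U = [0.1 * rng.normal(size=(V, d)) for _ in range(T)]

gamma, tau, lr = 0.01, 0.1, 1e-3              # hypothetical hyperparameters
loss_before = loss(Y, U, gamma, tau)
for _ in range(200):
    U = [Ut - lr * g for Ut, g in zip(U, gradients(Y, U, gamma, tau))]
loss_after = loss(Y, U, gamma, tau)
```

Because every slice is updated against its neighbours, the slices stay in one coordinate system throughout training rather than being aligned afterwards.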

Attribute-Conditioned and Contextualized Representations

Attribute-conditioned models extend the embedding function to arbitrary attributes: $u_{w,a} = \bar{u}_w + \delta_{w,a}$, where $\bar{u}_w$ is a global (attribute-invariant) embedding, $a$ indexes attribute values (e.g., time, domain, city), and $\delta_{w,a}$ are learned offsets (Gillani et al., 2019).
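A minimal sketch of the additive lookup, assuming one shared global table plus one offset table per attribute value. The vocabulary, attribute labels, and random parameter values are hypothetical placeholders, not learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"bank": 0, "river": 1, "loan": 2}        # hypothetical vocabulary
attrs = {"1990": 0, "2000": 1, "2010": 2}          # hypothetical attribute values
d = 8

global_emb = rng.normal(size=(len(vocab), d))                  # one vector per word
offsets = 0.1 * rng.normal(size=(len(attrs), len(vocab), d))   # one offset per (attribute, word)

def embed(word, attr=None):
    """Global vector, plus the attribute-specific offset when attr is given."""
    w = vocab[word]
    if attr is None:
        return global_emb[w]
    return global_emb[w] + offsets[attrs[attr], w]

bank_2000 = embed("bank", "2000")
bank_2010 = embed("bank", "2010")
# Both vectors share the same global component, so their difference
# depends only on the two offsets.
shift = bank_2010 - bank_2000
```

Since every attribute-specific vector is the same global vector plus an offset, vectors for different attribute values are directly comparable without a separate alignment step.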

Dynamic Contextualized Embeddings

Neural architectures based on pre-trained language models (PLMs) have been adapted to dynamic settings by augmenting or conditioning input embeddings dynamically:

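Dynamic Contextualized Word Embeddings (DCWE): For token $x^{(k)}$ at (social unit $s_i$, time $t_j$), the dynamic input is

$e_{ij}^{(k)} = \tilde{e}^{(k)} + \mathrm{FFN}_j\big(\tilde{e}^{(k)} \,\Vert\, \tilde{s}_{ij}\big)$

where $\tilde{e}^{(k)}$ is the static PLM input embedding of the token, $\Vert$ denotes concatenation, and $\tilde{s}_{ij}$ is a social-context embedding obtained via a time-specific graph attention network, regularized by Gaussian anchoring and temporal random-walk priors (Hofmann et al., 2020).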

Template-Based Temporal Adaptation: Masked language models (MLMs) are adapted to later timestamps using temporally sensitive prompts derived from anchor/pivot term extraction, yielding embedding parameters specific to the target epoch (Tang et al., 2022).
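The DCWE-style input construction described above, a static token embedding plus an offset computed from the token and a social-context vector, can be sketched as a single forward pass. This is an assumption-laden illustration: the one-hidden-layer feed-forward network, the dimensions, and the random `s_ctx` (a stand-in for the graph attention output) are all hypothetical, and no PLM is involved.

```python
import numpy as np

def dynamic_input(e_static, s_ctx, W1, b1, W2, b2):
    """Return e_static + FFN([e_static ; s_ctx]) with one ReLU hidden layer."""
    x = np.concatenate([e_static, s_ctx])     # token embedding joined with context
    h = np.maximum(x @ W1 + b1, 0.0)          # hidden layer
    offset = h @ W2 + b2                      # additive offset in embedding space
    return e_static + offset

rng = np.random.default_rng(0)
d, d_s, d_h = 16, 8, 32                       # hypothetical sizes
e_static = rng.normal(size=d)                 # stand-in for a PLM input embedding
s_ctx = rng.normal(size=d_s)                  # stand-in for the graph attention output
W1 = 0.1 * rng.normal(size=(d + d_s, d_h))
b1 = np.zeros(d_h)
W2 = 0.1 * rng.normal(size=(d_h, d))
b2 = np.zeros(d)

e_dyn = dynamic_input(e_static, s_ctx, W1, b1, W2, b2)
```

When the output layer is zero the offset vanishes and the dynamic input equals the static embedding, which is the behaviour an anchoring prior on the offsets pulls toward.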

Hilbert-Space and Quantum-Contextual Approaches

A recent alternative is based on quantum contextuality:

Quantum Contextual Embeddings: Each word $w$ is a unit vector $|\psi_w\rangle \in \mathcal{H}$, and each context $C$ is an orthonormal basis $\{|e_i^C\rangle\}_{i=1}^{d}$ of $\mathcal{H}$. Word sense in context $C$ is determined probabilistically via the Born rule:

$P(i \mid w, C) = |\langle e_i^C | \psi_w \rangle|^2$

Polysemy arises from vectors $|\psi_w\rangle$ appearing in multiple, possibly incompatible, bases (Svozil, 18 Apr 2025).
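To make the Born-rule step concrete, the sketch below builds a random orthonormal basis in a small real vector space and computes the squared projections of a unit vector onto it. Using a real space, a random basis from a QR decomposition, and the names `psi` and `basis` are assumptions for illustration only; the source does not describe an implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Columns of `basis` form an orthonormal basis (a stand-in for one context).
basis, _ = np.linalg.qr(rng.normal(size=(d, d)))

# A unit vector standing in for one word.
psi = rng.normal(size=d)
psi /= np.linalg.norm(psi)

def born_probs(psi, basis):
    """Squared inner products of psi with each basis vector."""
    return (basis.T @ psi) ** 2

probs = born_probs(psi, basis)

# A vector that is itself a basis element is assigned to that element with certainty.
probs_member = born_probs(basis[:, 0], basis)
```

The probabilities sum to one because the basis is orthonormal and the vector has unit norm.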

2. Training Objectives and Optimization Algorithms

Training objectives for dynamic word embeddings combine likelihood under observed corpora and explicit temporal, structural, or contextual regularization.

Temporal Priors and Drift Regularization

Random Walk / Diffusion Priors: Temporal smoothing is enforced via Gaussian penalties on drift:

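$\mathcal{L}_{\text{drift}} = \frac{\lambda}{2} \sum_{t=2}^{T} \|U_t - U_{t-1}\|_F^2$

With $\lambda$ the precision of the Gaussian prior, this is the negative log-prior up to an additive constant. It penalizes abrupt changes and yields smooth, interpretable trajectories (Bamler et al., 2017, Rudolph et al., 2017, Montariol et al., 2019).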

HardThreshold Drift Regularizer: A thresholded penalty on per-word drift that sharpens the separation between stable and drifting words when data are scarce (Montariol et al., 2019).
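A small numerical sketch of the quadratic drift penalty, assuming embeddings are stored as an array of shape (T, V, d) and the penalty is half the prior precision times the summed squared differences between consecutive slices. The sizes, the precision `lam`, and the scrambled ordering are hypothetical choices for the demonstration.

```python
import numpy as np

def drift_penalty(U, lam):
    """0.5 * lam * sum_t ||U_t - U_{t-1}||_F^2 for U of shape (T, V, d)."""
    steps = U[1:] - U[:-1]
    return 0.5 * lam * np.sum(steps ** 2)

rng = np.random.default_rng(0)
T, V, d, lam = 10, 50, 16, 4.0

# Sample a random walk whose increments have variance 1/lam per coordinate.
increments = rng.normal(scale=lam ** -0.5, size=(T, V, d))
smooth = np.cumsum(increments, axis=0)

# The same slices visited in a non-temporal order (hypothetical scrambling).
order = np.array([0, 5, 1, 6, 2, 7, 3, 8, 4, 9])
scrambled = smooth[order]

p_smooth = drift_penalty(smooth, lam)
p_scrambled = drift_penalty(scrambled, lam)
```

The scrambled sequence jumps four or five time steps at once, so its penalty is several times larger than that of the temporally ordered trajectory; this is the sense in which the prior favours gradual change.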

Matrix/Tensor Factorization Alignment

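Joint Regularized Factorization: Loss functions combine reconstruction, pairwise alignment, and latent structural affinity weights $W_{t,t'}$, in the general form:

$\mathcal{L} = \frac{1}{2}\sum_{t=1}^{T} \|Y_t - U_t U_t^\top\|_F^2 + \frac{\gamma}{2}\sum_{t=1}^{T} \|U_t\|_F^2 + \frac{\tau}{2}\sum_{t \neq t'} W_{t,t'} \|U_t - U_{t'}\|_F^2$

where $Y_t$ is the co-occurrence matrix for slice $t$, with $W_{t,t'}$ learned by inverting slice distances (Brandl et al., 2022).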
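One way to read "learned by inverting slice distances" is sketched below: compute pairwise Frobenius distances between slice embeddings, take reciprocals off the diagonal, and normalize each row. The row normalization and the `eps` guard are assumptions of this sketch, not details confirmed by the source.

```python
import numpy as np

def affinity_from_distances(U, eps=1e-8):
    """Affinity W[t, s] proportional to 1 / ||U_t - U_s||_F, zero on the diagonal."""
    T = len(U)
    D = np.zeros((T, T))
    for t in range(T):
        for s in range(T):
            D[t, s] = np.linalg.norm(U[t] - U[s])
    W = np.zeros_like(D)
    off_diag = ~np.eye(T, dtype=bool)
    W[off_diag] = 1.0 / (D[off_diag] + eps)   # closer slices get larger weights
    W /= W.sum(axis=1, keepdims=True)         # row normalization (assumption)
    return W

rng = np.random.default_rng(0)
base = rng.normal(size=(30, 8))
U = [
    base,
    base + 0.05 * rng.normal(size=base.shape),   # slice close to slice 0
    base + 1.00 * rng.normal(size=base.shape),   # slice far from slice 0
]
W = affinity_from_distances(U)
```

In an alternating scheme, weights of this kind would be recomputed from the current embeddings and then held fixed while the embeddings are updated.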

Neural and Contextualized Optimization

DCWE and Temporal/Attribute Adaptation: Models are optimized end-to-end with cross-entropy on masked LM or task objectives, plus anchoring and random-walk priors on offset parameters. Graph-based and feed-forward modules modeling external structure are updated by backpropagation alongside the base PLM parameters (Hofmann et al., 2020, Tang et al., 2022).

Quantum Contextual Training (Theoretical)

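KL-Divergence from Target Sense Distribution: Not implemented at scale; proposed as:

$\mathcal{L} = \sum_{(w, C)} \mathrm{KL}\big(P^{*}(\cdot \mid w, C) \,\Vert\, P(\cdot \mid w, C)\big)$

where $P^{*}$ is the target sense distribution and $P$ the Born-rule distribution, with joint optimization over the word vectors $|\psi_w\rangle$ and the context bases $\{|e_i^C\rangle\}$ under orthonormality constraints (Svozil, 18 Apr 2025).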
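The proposed objective can be evaluated for a single word and context as follows. This is only a sketch of the loss computation under assumed inputs (a real vector space, a random basis, and a made-up target distribution); it does not attempt the constrained joint optimization, which the source describes as not implemented at scale.

```python
import numpy as np

def born_probs(psi, basis):
    """Squared projections of the unit vector psi onto the basis columns."""
    return (basis.T @ psi) ** 2

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, with a small guard against log(0)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

rng = np.random.default_rng(0)
d = 3
basis, _ = np.linalg.qr(rng.normal(size=(d, d)))   # one context (orthonormal columns)
target = np.array([0.7, 0.2, 0.1])                 # hypothetical target sense distribution

# A random unit vector generally mismatches the target.
psi = rng.normal(size=d)
psi /= np.linalg.norm(psi)
loss_random = kl_divergence(target, born_probs(psi, basis))

# A vector with amplitudes sqrt(target) in this basis reproduces the target.
psi_match = basis @ np.sqrt(target)
loss_match = kl_divergence(target, born_probs(psi_match, basis))
```

For a single context a matching vector always exists; the difficulty the source points to arises when one vector must fit several intertwined bases at once.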

3. Temporal and Contextual Alignment Techniques

The "alignment problem"—the lack of consistent coordinate systems across independently trained time/domain slices—necessitated the development of alignment-aware dynamic models.

Joint-Smoothing and Alignment: Regularized models (temporal priors, structural constraints) jointly couple embeddings at all time points, eliminating the need for post-hoc orthogonal Procrustes alignment and producing consistent trajectories (Bamler et al., 2017, Yao et al., 2017, Brandl et al., 2022).
Unified Embedding Space: Attribute-conditioned additive models inherently align all attribute-specific embeddings via the global embedding $\bar{u}_w$ (Gillani et al., 2019).
Contextual Adaptation in PLMs: Dynamic contextualization is achieved by reparameterizing input layers and fine-tuning transformer-based architectures with temporal/social prompts, ensuring embeddings are adapted yet still comparable across epochs or social units (Hofmann et al., 2020, Tang et al., 2022).
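For reference, the post-hoc orthogonal Procrustes step that joint models avoid can be written in a few lines. The sketch below assumes two embedding matrices over the same vocabulary that differ by an unknown rotation, and recovers that rotation from a singular value decomposition; the synthetic data and function name are illustrative.

```python
import numpy as np

def procrustes_rotation(A, B):
    """Orthogonal R minimizing ||A @ R - B||_F, via the SVD of A^T B."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(0)
V, d = 50, 5
A = rng.normal(size=(V, d))                        # embeddings from one slice

# A second slice that is the first one under an arbitrary orthogonal map.
R_true, _ = np.linalg.qr(rng.normal(size=(d, d)))
B = A @ R_true

R_hat = procrustes_rotation(A, B)
aligned = A @ R_hat                                # slice A expressed in B's coordinates
```

With real slices the two matrices differ by more than a rotation, so the residual after alignment mixes genuine semantic change with alignment error, which is one motivation for coupling the slices during training instead.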

4. Evaluation, Empirical Findings, and Applications

Dynamic embeddings have been empirically validated via intrinsic and extrinsic tasks. Key experimental paradigms include:

Intrinsic Metrics

Held-Out Likelihood/Perplexity: Dynamic models consistently outperform static or incrementally-trained baselines in predictive fit, particularly under data scarcity (Bamler et al., 2017, Montariol et al., 2019, Rudolph et al., 2017, Hofmann et al., 2020).
Semantic Drift Analysis: Distance metrics $d(U_t[w], U_{t'}[w])$ between a word's vectors at different slices and t-SNE/trajectory visualizations reveal smooth, interpretable evolution of semantic neighborhoods (e.g., “computer” drifting from mechanical calculators to digital contexts) (Rudolph et al., 2017, Yao et al., 2017, Montariol et al., 2019).
Temporal Analogy Tasks: Retrieval of time-aligned equivalents and role-holders; dynamic models achieve higher mean reciprocal rank and precision at K over baselines dependent on alignment (Yao et al., 2017, Brandl et al., 2022).
Structure Prediction: In domain/context partitioned datasets, techniques that jointly recover sub-corpus structure (e.g., latent affinity matrices) improve recall of known taxonomic or temporal relationships (Brandl et al., 2022).
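The two retrieval metrics named for temporal analogy tasks can be computed as below, assuming each query has exactly one gold answer and the model returns a ranked candidate list. The queries and rankings are invented for illustration and are not results from any cited paper.

```python
def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the gold answer; contributes 0 when it is absent."""
    total = 0.0
    for ranked, answer in zip(ranked_lists, gold):
        if answer in ranked:
            total += 1.0 / (ranked.index(answer) + 1)
    return total / len(gold)

def precision_at_k(ranked_lists, gold, k):
    """Fraction of queries whose gold answer appears in the top k candidates."""
    hits = sum(answer in ranked[:k] for ranked, answer in zip(ranked_lists, gold))
    return hits / len(gold)

# Hypothetical ranked outputs for three temporal analogy queries.
ranked_lists = [
    ["trump", "obama", "clinton"],
    ["bush", "reagan", "clinton"],
    ["iphone", "walkman", "ipod"],
]
gold = ["trump", "clinton", "ipod"]

mrr = mean_reciprocal_rank(ranked_lists, gold)     # (1 + 1/3 + 1/3) / 3
p_at_1 = precision_at_k(ranked_lists, gold, 1)     # 1 of 3 queries
p_at_3 = precision_at_k(ranked_lists, gold, 3)     # 3 of 3 queries
```

With a single gold answer per query, precision at K reduces to a hit rate; other conventions divide by K, so the definition in use should be stated alongside reported numbers.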

Extrinsic and Downstream Tasks

Bias Analysis: Dynamic embeddings enable measurement of gender and ethnic occupation bias trajectories and their alignment with demographic data (Gillani et al., 2019).
Sentiment and Classification: Incorporating dynamic contextualization yields modest but statistically significant improvements in classification accuracy and $F_1$ (Hofmann et al., 2020).
Event and Concept Tracking: Changes in nearest-neighbor sets over time have been used to track sociological and technological shifts (“blackberry” from fruit to device and back) (Brandl et al., 2022).
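Tracking change through nearest-neighbor sets can be sketched as a Jaccard distance between a word's k nearest neighbors in two slices. The cosine-based neighbor search, the value of `k`, and the synthetic matrices are assumptions for this example.

```python
import numpy as np

def nearest_neighbors(E, idx, k):
    """Indices of the k rows of E most cosine-similar to row idx (excluding idx)."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = En @ En[idx]
    sims[idx] = -np.inf                        # never return the query word itself
    return set(np.argsort(-sims)[:k].tolist())

def neighbor_shift(E_a, E_b, idx, k=5):
    """Jaccard distance between the neighbor sets of word idx in two slices."""
    a = nearest_neighbors(E_a, idx, k)
    b = nearest_neighbors(E_b, idx, k)
    return 1.0 - len(a & b) / len(a | b)

rng = np.random.default_rng(0)
E_a = rng.normal(size=(30, 8))                 # slice A: 30 words, 8 dimensions
E_b = E_a.copy()
E_b[0] = -E_a[0]                               # word 0 moves to the opposite direction

shift_moved = neighbor_shift(E_a, E_b, idx=0)
shift_same = neighbor_shift(E_a, E_a, idx=0)
```

Neighbor sets are computed within each slice, so this comparison needs only a shared vocabulary index and not a shared coordinate system.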

Polysemy and Dimensionality

Stochastic-Dimensionality Models: The number of embedding dimensions per word, inferred nonparametrically, reflects word frequency and degree of polysemy, with broad terms allocated more active dimensions (Nalisnick et al., 2015).
Quantum Contextuality: Proposed as an alternative mechanism to statically encode context/prominence of word senses via joint participation in distinct bases, offering an explicit probabilistic sense distribution (Svozil, 18 Apr 2025).

5. Practical Considerations: Data Scarcity, Initialization, and Scalability

Dynamic embedding models must address challenges aggravated by temporal/data sparsity and high dimensionality.

Data Scarcity

Smoothing and Sharing: Temporal priors (e.g., diffusion, random walk) and global embeddings (e.g., $\bar{u}_w$) smooth over sparse slices, preserving continuity and suppressing noise (Montariol et al., 2019, Gillani et al., 2019).
Initialization: Static pre-training (on concatenated corpora) yields significant gains under data scarcity. Backward-initialization—aligning from large, late-period corpora—can be optimal for long diachronic ranges (Montariol et al., 2019).
Regularizers: Hard-thresholded drift penalties enhance interpretability of semantic drift under low-resource conditions (Montariol et al., 2019).

Scalability

Block Coordinate and Minibatch Optimization: Efficient sparse matrix operations, block coordinate updates, and scalable variational inference enable learning on corpora spanning tens to hundreds of time slices and large vocabularies (Yao et al., 2017, Bamler et al., 2017).
Contextualized PLMs: Augmentations to BERT-scale models remain tractable through modular feed-forward and graph attention layers, only marginally increasing wall-time or memory (Hofmann et al., 2020).
Quantum Contextual Models: While mathematically attractive, scaling joint learning of intertwining bases and orthonormality constraints to large lexicons remains an unsolved challenge (Svozil, 18 Apr 2025).

6. Limitations, Open Questions, and Frontier Directions

Dynamic embedding research highlights several unresolved issues:

Abrupt Change and Non-Gaussian Dynamics: Existing models predominantly assume smooth Gaussian (Brownian/O-U) drift, limiting detection of sudden concept shifts or change-points. Extensions to piecewise or nonstationary priors are an open problem (Yao et al., 2017).
Cross-Linguistic Trajectory Analysis: Dynamic embeddings facilitate cross-lingual comparison of semantic drift post static-alignment, but aligning trajectories with fine temporal granularity in multilingual settings presents both computational and theoretical difficulties (Montariol et al., 2019).
Polysemy Modeling: While stochastic-dimensional and quantum models offer interpretable proxies for word complexity and sense distribution, robust benchmarks for evaluating polysemous and context-sensitive representations over time are scarce (Nalisnick et al., 2015, Svozil, 18 Apr 2025).
Evaluation Paradigms: Most work is restricted to intrinsic evaluation (semantic similarity, analogy, or drift visualization); few extrinsic tasks exist that are specifically sensitive to temporal or contextual adaptation. Development of gold-standard benchmarks for dynamic sense disambiguation remains a priority (Yao et al., 2017, Montariol et al., 2019).
Template and Prompt-Based Adaptation: Techniques for automated template generation and selection in the temporal adaptation of PLMs are active research areas. The balance between template diversity and noise, as well as the generalization to low-resource or multilingual settings, remains underexplored (Tang et al., 2022).

7. Summary Table: Core Dynamic Embedding Methodologies

Model/Method	Core Mathematical Device	Temporal Regularization
Dynamic Skip-Gram / DSG	Sequential Gaussian diffusion, ELBO	Explicit, via prior
Dynamic Bernoulli Embeddings / DBE	Random-walk prior on embeddings	Explicit, via prior
Structure Prediction (W2VPred)	Joint factorization, latent affinity matrix	Implicit, via $W$
Unified Additive Attribute Models	Global $\bar{u}_w$ + per-attribute $\delta_{w,a}$	Implicit/global
Dynamic Contextualized Embedding (DCWE)	FFN + GAT over social/time, PLM adaptation	Anchoring, random-walk
Stochastic Dimensionality Skip-Gram (SD-SG)	Distribution over embedding dimension $z$	Geometric+nonparametric
Quantum Contextual Word Embedding	Hilbert space, intertwining contexts	Theoretical

Each model is distinguished by (i) the locus of representation dynamics (type, token, context, attribute), (ii) the form and explicitness of temporal or contextual priors, and (iii) the relationship between embedding alignment and learning.

Dynamic word embeddings constitute a broad, technically rigorous field integrating time series analysis, Bayesian inference, graph and matrix factorization, and neural contextualization. The current frontier spans quantum formalizations, contextually adaptive PLMs, and scalable cross-linguistic models, with empirical focus steadily shifting from static, temporally-agnostic word spaces to architectures reflecting the true dynamism of language in use. Further progress depends on advances in scalable optimization, enriched annotation and evaluation paradigms, and theoretical innovations that reconcile continuity, abruptness, and interpretability in lexical semantics.