
Dynamic Bernoulli Embeddings

Updated 10 February 2026
  • The paper introduces DBE, a probabilistic framework that models temporal evolution of embeddings using a Bernoulli likelihood combined with a Gaussian random-walk prior.
  • It employs MAP estimation with negative sampling and HardShrink regularization to ensure temporal smoothness and robust drift detection.
  • DBE effectively tracks semantic and structural changes in language and networks, outperforming static and alternative dynamic methods under data scarcity.

Dynamic Bernoulli Embeddings (DBE) are a probabilistic framework for modeling temporally-evolving word or node embeddings based on observed co-occurrences. Originally developed for diachronic word semantics, DBE combines a Bernoulli likelihood for observed context with a time-indexed Gaussian random-walk prior, enforcing temporal smoothness while capturing lexical or structural drift. The DBE model has been applied to language evolution (Rudolph et al., 2017), sparse temporal text (Montariol et al., 2019, Montariol et al., 2019), and dynamic networks (Chen et al., 2019).

1. Probabilistic Formulation

Let $T$ denote the number of time slices, $L$ the vocabulary size (or number of nodes), and $d$ the embedding dimension. For each word (or node) $w$ at time $t$, DBE introduces a temporal embedding $\rho_{t,w} \in \mathbb{R}^d$ and a context embedding $\alpha_i \in \mathbb{R}^d$, with $i$ ranging over word or node indices. For each observed pair $(w, i)$ at time $t$:

$$p(x_{t,w,i}=1 \mid \rho_{t,w}, \alpha_i) = \sigma(\rho_{t,w}^\top \alpha_i)$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$.
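As a minimal NumPy sketch of this link function (array names such as `rho_tw` and `alpha_i` are illustrative, not from the papers):

```python
import numpy as np

def sigmoid(z):
    """Logistic link: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def cooccurrence_prob(rho_tw, alpha_i):
    """Bernoulli probability that word/node w co-occurs with context i at time t."""
    return sigmoid(rho_tw @ alpha_i)

rng = np.random.default_rng(0)
rho_tw = rng.normal(size=8)   # temporal embedding of w at time t, d = 8
alpha_i = rng.normal(size=8)  # static context embedding of i
p = cooccurrence_prob(rho_tw, alpha_i)  # a value in (0, 1)
```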

The complete likelihood for all observations is:

$$\prod_{t=1}^T \prod_{w=1}^L \prod_{i=1}^L \sigma(\rho_{t,w}^\top \alpha_i)^{x_{t,w,i}} \left[1-\sigma(\rho_{t,w}^\top \alpha_i)\right]^{1-x_{t,w,i}}$$

Temporal smoothness is encoded via a zero-mean Gaussian prior on $\rho_{1,w}$ and a Markovian random-walk prior for $t>1$:

$$p(\rho_{1,w}) = \mathcal{N}(0, \lambda_0^{-1} I), \quad p(\rho_{t,w} \mid \rho_{t-1,w}) = \mathcal{N}(\rho_{t-1,w}, \lambda^{-1} I)$$

and

$$p(\alpha_i) = \mathcal{N}(0, \lambda_0^{-1} I)$$

This coupling enforces that most words/nodes evolve gradually. Equivalent objectives arise in both word embedding and dynamic network contexts (Rudolph et al., 2017, Chen et al., 2019, Montariol et al., 2019).
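A sketch of the corresponding (unnormalized) log-prior, assuming `rho` is a $T \times L \times d$ array and `alpha` is $L \times d$ (variable names are illustrative):

```python
import numpy as np

def log_prior(rho, alpha, lam=1.0, lam0=1e-3):
    """Unnormalized Gaussian log-prior of DBE: zero-mean prior on the first
    slice and on context vectors, random-walk coupling between slices."""
    walk = rho[1:] - rho[:-1]                # rho_{t,w} - rho_{t-1,w}
    lp = -0.5 * lam0 * np.sum(rho[0] ** 2)   # p(rho_{1,w}) = N(0, lam0^{-1} I)
    lp -= 0.5 * lam * np.sum(walk ** 2)      # p(rho_t | rho_{t-1}) = N(rho_{t-1}, lam^{-1} I)
    lp -= 0.5 * lam0 * np.sum(alpha ** 2)    # p(alpha_i) = N(0, lam0^{-1} I)
    return lp

rng = np.random.default_rng(1)
rho = rng.normal(size=(5, 10, 4))   # T = 5 slices, L = 10 words, d = 4
alpha = rng.normal(size=(10, 4))
static = np.broadcast_to(rho[0], rho.shape).copy()  # perfectly smooth trajectory
```

A perfectly static trajectory incurs no random-walk cost, so it always scores a higher log-prior than a noisy one with the same first slice.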

2. MAP Learning and Negative Sampling

Maximum a posteriori (MAP) estimates are obtained by minimizing:

$$\mathcal{L}_{\text{DBE}}(\rho, \alpha) = -\sum_{t=1}^T \sum_{w=1}^L \sum_{i=1}^L \Big[ x_{t,w,i} \log \sigma(\rho_{t,w}^\top \alpha_i) + (1-x_{t,w,i}) \log\big(1-\sigma(\rho_{t,w}^\top \alpha_i)\big) \Big] + \mathcal{R}(\rho, \alpha)$$

where the quadratic regularizer is

$$\mathcal{R}(\rho, \alpha) = \frac{\lambda_0}{2} \sum_w \|\rho_{1,w}\|^2 + \frac{\lambda}{2} \sum_{t=2}^T \sum_w \|\rho_{t,w} - \rho_{t-1,w}\|^2 + \frac{\lambda_0}{2} \sum_i \|\alpha_i\|^2$$

Exact likelihood sums are replaced by negative sampling: for each positive pair $(w, i)$, a fixed number of negatives is sampled (from a unigram or power-law distribution). The stochastic objective is optimized with minibatch SGD or Adam. Notably, DBE requires no Kalman filtering, variational Bayes, or explicit post-hoc alignment across time (Rudolph et al., 2017, Montariol et al., 2019, Chen et al., 2019).
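A minimal sketch of the negative-sampling surrogate for one time slice, using uniform negatives for simplicity (the papers use unigram or power-law sampling; all names here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(rho_t, alpha, pos_pairs, n_neg, rng):
    """Stochastic Bernoulli loss for time slice t: each positive pair (w, i)
    contributes -log sigma(rho^T alpha_i); n_neg sampled contexts j contribute
    -log(1 - sigma(rho^T alpha_j))."""
    L = alpha.shape[0]
    loss = 0.0
    for w, i in pos_pairs:
        loss -= np.log(sigmoid(rho_t[w] @ alpha[i]))
        for j in rng.integers(0, L, size=n_neg):  # uniform negatives (simplified)
            loss -= np.log(1.0 - sigmoid(rho_t[w] @ alpha[j]))
    return loss

rng = np.random.default_rng(2)
rho_t = rng.normal(scale=0.1, size=(50, 16))  # one slice: L = 50 words, d = 16
alpha = rng.normal(scale=0.1, size=(50, 16))
loss = neg_sampling_loss(rho_t, alpha, [(0, 1), (2, 3)], n_neg=5, rng=rng)
```

In a full implementation this term is summed over minibatches of slices and combined with the quadratic regularizer before each gradient step.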

3. Extensions: Data Scarcity, Initialization, and Drift Regularization

When corpora are sparse, the robustness of DBE depends crucially on initialization and regularization (Montariol et al., 2019, Montariol et al., 2019):

  • Initialization:
    • Random: Embeddings are initialized independently from a Gaussian.
    • Internal: A static Bernoulli embedding is fit to the full corpus, initializing $\rho_{1,w}$ or all $\rho_{t,w}$.
    • Backward external: Pretrained embeddings (e.g., from Wikipedia) are loaded for $\rho_{T,w}$, and the model is then run backwards in time.
  • Drift Threshold Regularization: To better discriminate between stable and high-drift words/nodes under high sparsity, a drift-penalty using the HardShrink operator is added:

$$\mathrm{Reg}_\beta = \alpha_{\mathrm{reg}} \sum_{t>1} \sum_w \mathrm{HardShrink}\big(\| \rho_{t,w} - \rho_{t-1,w} \|, \beta\big)$$

where $\beta$ is the mean empirical drift; HardShrink zeroes drifts below $\beta$, so minor fluctuations are ignored while large drifts remain penalized.

  • Drift Prior Variants: Comparative evaluation of chronological (DBE), non-chronological ($\ell_2$ anchoring to $t=0$), semi-chronological, and incremental (no penalty) priors reveals only the standard DBE model achieves directed, smooth, and robust drift trajectories in empirical word drift detection (Montariol et al., 2019).
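The HardShrink drift penalty above can be sketched as follows, with `beta` set to the mean empirical drift (function and variable names are illustrative):

```python
import numpy as np

def hardshrink(x, beta):
    """HardShrink: zero inside [-beta, beta], identity outside."""
    return np.where(np.abs(x) > beta, x, 0.0)

def drift_penalty(rho, alpha_reg=0.1):
    """Penalize only drifts exceeding the mean empirical drift beta,
    so small fluctuations are suppressed and large jumps stand out."""
    drifts = np.linalg.norm(rho[1:] - rho[:-1], axis=-1)  # ||rho_t - rho_{t-1}|| per (t, w)
    beta = drifts.mean()
    return alpha_reg * hardshrink(drifts, beta).sum()

rng = np.random.default_rng(3)
rho = rng.normal(scale=0.01, size=(4, 20, 8))  # T = 4, L = 20, d = 8
rho[1:, 0] += 1.0   # word 0 jumps between the first two slices, then stays
pen = drift_penalty(rho)  # dominated by word 0's single large drift
```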

4. Empirical Performance and Qualitative Analysis

Predictive power:

DBE achieves superior or comparable held-out log-likelihood on positive examples under data scarcity:

| Model | Random init | Internal | Backward external |
|---|---|---|---|
| ISG (SGNS incr.) | –3.17 | –2.59 | –2.69 |
| DSG (dyn. SGNS) | –0.749 | –0.686 | –0.695 |
| DBE | –2.935 | –2.236 | –2.459 |

Internal initialization is most beneficial for short, low-variance periods, while backward external becomes advantageous with longer periods or larger slices (Montariol et al., 2019).

Directed drift and stability:

  • Directed drift: On moderate sparsity (10% of the data), both DBE and dynamic SGNS (DSG) recover monotonic drift, whereas at extreme sparsity (1%), only DBE and DSG maintain directionality; incremental SGNS fails because noise dominates.
  • Stability vs. extreme drifts: The Gaussian-walk prior clusters the majority of words around small drift values. Under extreme scarcity, drift distribution tails contract and model discriminability for highly drifting entities drops, but introducing HardShrink-based penalization restores separation between extreme and stable elements (Montariol et al., 2019, Montariol et al., 2019).

Semantic and node trajectory case studies:

In congressional debates or scientific corpora, DBE captures smooth trajectories: e.g., "computer" transitions from occupational to technological senses; "bush" evolves from plant-names to political context (Rudolph et al., 2017). In dynamic networks, evolving nodes (e.g., students during recess, researchers switching departments) trace interpretable paths in the learned space, corroborated by metadata (Chen et al., 2019).

5. Applications and Evaluation Protocols

Textual data:

  • Analyzing semantic change across decades of legislative, scientific, or news text.
  • Cross-lingual comparison using alignment-initialized DBEs, revealing both convergent and divergent semantic evolution (e.g., "Barbie" anchored to a war criminal in French vs. rapid drift toward the fashion toy in English) (Montariol et al., 2019).

Network data:

  • Link prediction: DBE outperforms static and other dynamic baselines on network evolution benchmarks, as measured by AUC in predicting emerging links across multiple real-world graphs (Chen et al., 2019).
  • Evolving-node detection: Nodes with top drift metrics are interpretable as "active" or "evolving"; DBE achieves higher mean average precision and recall relative to alternatives, especially at low sample sizes.
  • Visualization: Embedding trajectories reveal dynamic group structure, mobility, and organizational change.

6. Hyperparameter Choices and Training Details

Practical DBE instantiations use embedding dimensions of $d=100$ (language) or $d=128$ (networks), context windows of $c=4$, between 1 and 20 negative samples per positive, and drift regularization coefficients such as $\lambda = 1$ and $\lambda_0 = \lambda / 1000$. Optimization is performed with SGD or Adam, using mini-batches interleaved over time slices and early stopping on validation log-likelihood (Rudolph et al., 2017, Montariol et al., 2019, Chen et al., 2019, Montariol et al., 2019).
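For reference, these choices can be collected into a single configuration (the dictionary layout itself is illustrative; the values are those summarized above):

```python
dbe_config = {
    "dim": 100,                 # d = 100 for language (d = 128 for networks)
    "context_window": 4,        # c = 4
    "n_negatives": 10,          # typically between 1 and 20 per positive
    "lam": 1.0,                 # random-walk precision lambda
    "lam0": 1.0 / 1000,         # lambda_0 = lambda / 1000
    "optimizer": "adam",        # SGD or Adam over time-interleaved minibatches
    "early_stopping": "validation log-likelihood",
}
```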

7. Significance and Theoretical Positioning

Dynamic Bernoulli Embeddings provide a principled method for learning temporally-indexed embeddings that achieve temporal continuity, directed semantic drift, and robust separation between stable and rapidly-evolving entities. The Gaussian random-walk prior is essential for this behavior. Unlike approaches requiring explicit time-step alignment or expensive variational frameworks, DBE leverages a simple and scalable MAP objective with negative sampling. This balance of statistical structure, computational tractability, and empirical effectiveness positions DBE as a canonical method for temporally-aware representation learning on both text and network data (Rudolph et al., 2017, Montariol et al., 2019, Chen et al., 2019, Montariol et al., 2019).
