Dynamic Bernoulli Embeddings
- The paper introduces DBE, a probabilistic framework that models temporal evolution of embeddings using a Bernoulli likelihood combined with a Gaussian random-walk prior.
- It employs MAP estimation with negative sampling and HardShrink regularization to ensure temporal smoothness and robust drift detection.
- DBE effectively tracks semantic and structural changes in language and networks, outperforming static and alternative dynamic methods under data scarcity.
Dynamic Bernoulli Embeddings (DBE) are a probabilistic framework for modeling temporally evolving word or node embeddings from observed co-occurrences. Originally developed for diachronic word semantics, DBE combines a Bernoulli likelihood for observed contexts with a time-indexed Gaussian random-walk prior, enforcing temporal smoothness while capturing lexical or structural drift. The model has been applied to language evolution (Rudolph et al., 2017), sparse temporal text (Montariol et al., 2019), and dynamic networks (Chen et al., 2019).
1. Probabilistic Formulation
Let $T$ denote the number of time slices, $V$ the vocabulary size (or number of nodes), and $D$ the embedding dimension. For each word (or node) $v$ at time $t \in \{0, \dots, T-1\}$, DBE introduces a temporal embedding $\rho_v^{(t)} \in \mathbb{R}^D$ and a time-invariant context embedding $\alpha_v \in \mathbb{R}^D$, with $v$ ranging over word or node indices. For each observed pair $(i, j)$ at time $t$:

$$x_{ij}^{(t)} \sim \operatorname{Bernoulli}\!\left(\sigma\!\left(\rho_i^{(t)\top} \alpha_j\right)\right),$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic function.

The complete-data likelihood for all observations is given by:

$$p(x \mid \rho, \alpha) = \prod_{t} \prod_{(i,j)} \sigma\!\left(\rho_i^{(t)\top} \alpha_j\right)^{x_{ij}^{(t)}} \left(1 - \sigma\!\left(\rho_i^{(t)\top} \alpha_j\right)\right)^{1 - x_{ij}^{(t)}}.$$

Temporal smoothness is encoded via a zero-mean Gaussian prior on $\alpha_v$ and $\rho_v^{(0)}$, and a Markovian random-walk prior for $t \geq 1$:

$$\alpha_v \sim \mathcal{N}\!\left(0, \lambda_0^{-1} I\right), \qquad \rho_v^{(0)} \sim \mathcal{N}\!\left(0, \lambda_0^{-1} I\right),$$

and

$$\rho_v^{(t)} \sim \mathcal{N}\!\left(\rho_v^{(t-1)}, \lambda^{-1} I\right).$$
This coupling enforces that most words/nodes evolve gradually. Equivalent objectives arise in both word embedding and dynamic network contexts (Rudolph et al., 2017, Chen et al., 2019, Montariol et al., 2019).
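The model components above can be sketched as a negative log-joint in a few lines of NumPy. The array sizes, the prior precisions `lam0` (context and initial embeddings) and `lam` (random walk), and the toy data are illustrative assumptions, not values from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)

T, V, D = 4, 50, 8          # time slices, vocabulary size, embedding dimension
lam0, lam = 1.0, 10.0       # assumed prior precisions (illustrative)

rho = rng.normal(scale=0.1, size=(T, V, D))   # temporal embeddings rho_v^(t)
alpha = rng.normal(scale=0.1, size=(V, D))    # shared context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_log_joint(rho, alpha, pairs):
    """pairs: list of (t, i, j, y) with y=1 for observed, y=0 for negatives."""
    nll = 0.0
    for t, i, j, y in pairs:
        p = sigmoid(rho[t, i] @ alpha[j])
        nll -= y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)
    # Gaussian priors: context vectors, initial embeddings, and the random walk
    nll += 0.5 * lam0 * (np.sum(alpha**2) + np.sum(rho[0]**2))
    nll += 0.5 * lam * np.sum((rho[1:] - rho[:-1])**2)
    return nll

pairs = [(0, 1, 2, 1), (1, 3, 4, 0)]
val = neg_log_joint(rho, alpha, pairs)
```

The random-walk term `(rho[1:] - rho[:-1])` is what ties consecutive slices together; setting `lam` to zero would decouple the time slices entirely.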
2. MAP Learning and Negative Sampling
Maximum a posteriori (MAP) estimation is performed by minimizing the negative log-posterior:

$$\mathcal{L}(\rho, \alpha) = -\log p(x \mid \rho, \alpha) + \Omega(\rho, \alpha),$$

where the quadratic regularizer is

$$\Omega(\rho, \alpha) = \frac{\lambda_0}{2} \sum_{v} \left( \|\alpha_v\|^2 + \|\rho_v^{(0)}\|^2 \right) + \frac{\lambda}{2} \sum_{v} \sum_{t=1}^{T-1} \left\| \rho_v^{(t)} - \rho_v^{(t-1)} \right\|^2.$$
Exact likelihood sums are replaced by negative sampling: for each positive pair $(i, j)$ at time $t$, a fixed number of negative contexts $j'$ are sampled (from a unigram or smoothed power-law distribution). The stochastic objective is optimized using minibatch SGD or Adam. Notably, DBE requires no Kalman filtering, variational Bayes, or explicit post-hoc alignment across time (Rudolph et al., 2017, Montariol et al., 2019, Chen et al., 2019).
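A minimal sketch of one such stochastic update, with hand-derived logistic-loss gradients, uniform negative sampling (the papers use unigram or power-law noise distributions), and plain SGD; all sizes, precisions, and the learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

T, V, D = 4, 50, 8            # time slices, vocabulary, embedding dimension
lam0, lam = 1.0, 10.0         # assumed prior precisions
lr, n_neg = 0.05, 5           # learning rate, negatives per positive

rho = rng.normal(scale=0.1, size=(T, V, D))    # temporal embeddings
alpha = rng.normal(scale=0.1, size=(V, D))     # context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(batch):
    """One stochastic MAP update over a minibatch of positive (t, i, j) pairs."""
    for t, i, j in batch:
        # the positive context plus n_neg uniformly sampled negatives
        contexts = [(j, 1.0)] + [(int(c), 0.0) for c in rng.integers(0, V, n_neg)]
        grad_rho = np.zeros(D)
        for c, y in contexts:
            g = sigmoid(rho[t, i] @ alpha[c]) - y     # logistic-loss gradient
            grad_rho += g * alpha[c]
            alpha[c] -= lr * (g * rho[t, i] + lam0 * alpha[c])
        # the random-walk prior couples rho_i^(t) to its temporal neighbours
        grad_rho += (lam * (rho[t, i] - rho[t - 1, i]) if t > 0
                     else lam0 * rho[t, i])
        if t < T - 1:
            grad_rho += lam * (rho[t, i] - rho[t + 1, i])
        rho[t, i] -= lr * grad_rho

sgd_step([(0, 1, 2), (2, 3, 4)])
```

Because the prior gradient involves only neighbouring slices, each update stays local in time, which is what makes the MAP objective scale without Kalman-style message passing.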
3. Extensions: Data Scarcity, Initialization, and Drift Regularization
When corpora are sparse, the robustness of DBE depends crucially on initialization and regularization (Montariol et al., 2019):
- Initialization:
- Random: Embeddings are initialized independently from a Gaussian.
- Internal: A static Bernoulli embedding is fit to the full corpus, initializing $\rho_v^{(0)}$, or all $\rho_v^{(t)}$, with the static vectors.
- Backward external: Pretrained embeddings (e.g. Wikipedia) are loaded for the final slice $\rho_v^{(T-1)}$, then the model is run backwards in time.
- Drift Threshold Regularization: To better discriminate between stable and high-drift words/nodes under high sparsity, a drift penalty using the HardShrink operator is added: $\Omega_{\text{drift}} = \lambda_a \sum_{v} \sum_{t=1}^{T-1} \operatorname{HardShrink}_{\bar{d}}\!\left( \left\| \rho_v^{(t)} - \rho_v^{(t-1)} \right\| \right)$, where $\bar{d}$ is the mean empirical drift, and HardShrink suppresses minor fluctuations while capping outliers.
- Drift Prior Variants: Comparative evaluation of chronological (DBE), non-chronological (anchoring each $\rho_v^{(t)}$ to $\rho_v^{(0)}$), semi-chronological, and incremental (no penalty) priors reveals that only the standard DBE prior achieves directed, smooth, and robust drift trajectories in empirical word drift detection (Montariol et al., 2019).
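For concreteness, the HardShrink operator used in the drift penalty can be sketched as follows; the threshold value here is an illustrative stand-in for the mean empirical drift:

```python
import numpy as np

def hard_shrink(x, threshold):
    """HardShrink: zero values whose magnitude is at or below the threshold;
    larger values pass through unchanged."""
    return np.where(np.abs(x) > threshold, x, 0.0)

# per-word drift magnitudes; 0.05 stands in for the mean empirical drift
drifts = np.array([0.01, 0.03, 0.4, 0.02, 0.9])
shrunk = hard_shrink(drifts, 0.05)   # -> [0., 0., 0.4, 0., 0.9]
```

Drifts at or below the threshold are mapped to zero, so noise-level fluctuations contribute nothing to the penalty, while genuinely drifting entities remain distinguishable.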
4. Empirical Performance and Qualitative Analysis
Predictive power:
DBE achieves superior or comparable held-out log-likelihood on positive examples under data scarcity:
| Model | Random init | Internal | Backward external |
|---|---|---|---|
| ISG (SGNS incr.) | –3.17 | –2.59 | –2.69 |
| DSG (dyn. SGNS) | –0.749 | –0.686 | –0.695 |
| DBE | –2.935 | –2.236 | –2.459 |
Internal initialization is most beneficial for short, low-variance periods, while backward external becomes advantageous with longer periods or larger slices (Montariol et al., 2019).
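The evaluation metric in the table, held-out log-likelihood on positive examples, amounts to the mean log-probability the model assigns to unseen co-occurrences. A minimal sketch with synthetic embeddings (sizes and pairs are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
T, V, D = 4, 30, 8
rho = rng.normal(scale=0.1, size=(T, V, D))    # synthetic temporal embeddings
alpha = rng.normal(scale=0.1, size=(V, D))     # synthetic context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def heldout_pos_ll(rho, alpha, pairs):
    """Mean log-likelihood of held-out positive (t, i, j) pairs."""
    scores = np.array([rho[t, i] @ alpha[j] for t, i, j in pairs])
    return float(np.mean(np.log(sigmoid(scores) + 1e-12)))

ll = heldout_pos_ll(rho, alpha, [(0, 1, 2), (1, 3, 4), (2, 5, 6)])
```

Values are always negative, and less negative is better, which is how the figures in the table should be read.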
Directed drift and stability:
- Directed drift: On moderate sparsity (10% data), both DBE and dynamic SGNS (DSG) recover monotonic drift, whereas at extreme sparsity (1%), only DBE/DSG maintain directionality—incremental SGNS fails due to noise dominance.
- Stability vs. extreme drifts: The Gaussian-walk prior clusters the majority of words around small drift values. Under extreme scarcity, drift distribution tails contract and model discriminability for highly drifting entities drops, but introducing HardShrink-based penalization restores separation between extreme and stable elements (Montariol et al., 2019).
Semantic and node trajectory case studies:
In congressional debates or scientific corpora, DBE captures smooth trajectories: e.g., "computer" transitions from occupational to technological senses; "bush" evolves from plant-names to political context (Rudolph et al., 2017). In dynamic networks, evolving nodes (e.g., students during recess, researchers switching departments) trace interpretable paths in the learned space, corroborated by metadata (Chen et al., 2019).
5. Applications and Evaluation Protocols
Textual data:
- Analyzing semantic change across decades of legislative, scientific, or news text.
- Cross-lingual comparison using alignment-initialized DBEs, revealing both convergent and divergent semantic evolution (e.g., "Barbie" anchored to a war criminal in French vs. rapid drift toward the fashion toy in English) (Montariol et al., 2019).
Network data:
- Link prediction: DBE outperforms static and other dynamic baselines on network evolution benchmarks, as measured by AUC in predicting emerging links across multiple real-world graphs (Chen et al., 2019).
- Evolving-node detection: Nodes with top drift metrics are interpretable as "active" or "evolving"; DBE achieves higher mean average precision and recall relative to alternatives, especially at low sample sizes.
- Visualization: Embedding trajectories reveal dynamic group structure, mobility, and organizational change.
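Per-node drift metrics of the kind used for evolving-node detection can be sketched as cumulative embedding displacement across time slices; the synthetic data below plants one drifting node purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
T, V, D = 5, 20, 8
rho = rng.normal(scale=0.05, size=(T, V, D))   # mostly stable node embeddings
rho[:, 3] += np.linspace(0, 1, T)[:, None]     # node 3 drifts steadily

# total drift of each node: summed step-to-step displacement of its embedding
drift = np.linalg.norm(np.diff(rho, axis=0), axis=2).sum(axis=0)
top_nodes = np.argsort(drift)[::-1][:3]        # highest-drift nodes first
```

Ranking nodes by this metric is what surfaces the "active" or "evolving" entities referenced above; here the planted node dominates the ranking.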
6. Hyperparameter Choices and Training Details
Practical DBE instantiations tune the embedding dimension, the context window, the number of negative samples per positive (ranging from 1 to 20), and the drift regularization coefficients to the corpus or network at hand. Optimization is performed with SGD or Adam, using mini-batches interleaved over time slices and early stopping on validation log-likelihood (Rudolph et al., 2017, Montariol et al., 2019, Chen et al., 2019).
7. Significance and Theoretical Positioning
Dynamic Bernoulli Embeddings provide a principled method for learning temporally-indexed embeddings that achieve temporal continuity, directed semantic drift, and robust separation between stable and rapidly-evolving entities. The Gaussian random-walk prior is essential for this behavior. Unlike approaches requiring explicit time-step alignment or expensive variational frameworks, DBE leverages a simple and scalable MAP objective with negative sampling. This balance of statistical structure, computational tractability, and empirical effectiveness positions DBE as a canonical method for temporally-aware representation learning on both text and network data (Rudolph et al., 2017, Montariol et al., 2019, Chen et al., 2019).