
Word–Context Matrices: Modeling Lexical Semantics

Updated 26 January 2026
  • Word–Context Matrices are foundational constructs in lexical semantics, encoding association strengths between words and contexts with count-based and PMI methods.
  • They enable efficient low-rank factorization techniques like SVD and negative sampling that boost performance in word similarity and analogy tasks.
  • Advanced variants extend these matrices with dependency, positional, and tensor-based contexts to capture richer syntactic and semantic nuances.

Word–context matrices are central to the quantitative modeling of lexical semantics and the construction of distributional word representations. Structurally, a word–context matrix is defined as M \in \mathbb{R}^{|V| \times |C|}, where |V| and |C| are the sizes of the word and context vocabularies, respectively. Each entry M_{wc} quantifies the association strength between target word w and context c, most commonly by empirical co-occurrence counts or by statistics such as pointwise mutual information (PMI) and its positive variant, PPMI. Modern embedding methodologies (matrix factorization, window sampling, negative sampling, and tensor generalizations) are built upon this matrix, which encodes the observed word–context associations within a corpus.

1. Construction of the Word–Context Matrix

Let C be a corpus, and let V denote the vocabulary. Sliding a symmetric window of size win over C, the number of times a word w and a context c co-occur is accumulated as M_{wc}, the raw co-occurrence count. The global statistics are

  • M_{w*} = \sum_c M_{wc} (row sum for word w),
  • M_{*c} = \sum_w M_{wc} (column sum for context c),
  • M_{**} = \sum_{w,c} M_{wc} (total windowed pairs).

The transformation to probabilities is given by

P(w,c) = \frac{M_{wc}}{M_{**}}, \quad P(w) = \frac{M_{w*}}{M_{**}}, \quad P(c) = \frac{M_{*c}}{M_{**}}

and the positive pointwise mutual information matrix (PPMI) is defined entrywise as

\mathrm{PPMI}(w,c) = \max\left(0, \log \frac{P(w,c)}{P(w)\,P(c)}\right)

or, equivalently, directly in terms of counts,

\mathrm{PPMI}_{wc} = \max\left(0, \log \frac{M_{wc}\,M_{**}}{M_{w*}\,M_{*c}}\right)

This matrix forms the basis for most count-based embeddings, including LexVec, GloVe, and their modern variants (Salle et al., 2016, Salle et al., 2016, Ibrahim et al., 2021).
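As a concrete illustration, the count-then-transform pipeline above can be sketched in a few lines of NumPy (a minimal sketch; the function name and windowing details are illustrative, not taken from the cited papers):

```python
import numpy as np
from collections import Counter

def ppmi_matrix(corpus, win=2):
    """Slide a symmetric window of size `win` over the token list `corpus`,
    accumulate raw co-occurrence counts M_wc, then apply the PPMI transform."""
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}
    counts = Counter()
    for i, w in enumerate(corpus):
        lo, hi = max(0, i - win), min(len(corpus), i + win + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, corpus[j])] += 1
    M = np.zeros((len(vocab), len(vocab)))
    for (w, c), n in counts.items():
        M[idx[w], idx[c]] = n
    total = M.sum()                       # M_**
    row = M.sum(axis=1, keepdims=True)    # M_w*
    col = M.sum(axis=0, keepdims=True)    # M_*c
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(M * total / (row * col))
    # PPMI clips the -inf produced by zero counts up to 0
    return vocab, np.maximum(0.0, np.nan_to_num(pmi, neginf=0.0))
```

With a symmetric window and identical word/context vocabularies, the resulting PPMI matrix is symmetric and non-negative.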

2. Low-Rank Factorization and Embedding Learning

The high-dimensional, sparse word–context matrix must be compressed for practical representation. The classic approach is low-rank approximation:

  • SVD (Singular Value Decomposition): Given A \in \mathbb{R}^{n \times n} (a PMI/PPMI matrix), A = U \Sigma V^\top is truncated to the first d singular values, yielding word embeddings as the rows of U_d \Sigma_d^{1/2} \in \mathbb{R}^{n \times d} (Sorokina et al., 2019).
  • NMF (Non-negative Matrix Factorization): For A \ge 0, A \approx WH with W, H \ge 0. Row i of W gives the embedding for word i; the results are typically sparse and non-negative but lag behind SVD in empirical semantic quality.
  • QR with Pivoting: AP = QR; embeddings are either the rows of Q_d or the relevant columns of R_d P^\top (Sorokina et al., 2019).
  • SGNS and Weighted Losses: Models such as LexVec minimize a weighted squared-error objective, penalizing frequent (positive) co-occurrences heavily and actively pushing apart negative (zero-PPMI) co-occurrences via negative sampling. In effect, stochastic sampling and local optimization sidestep explicit materialization of the full PPMI matrix (Salle et al., 2016).

Truncated SVD has been empirically shown to outperform NMF and QR approaches in both word similarity (e.g., WordSim-353, Spearman ρ ≈ 0.70) and analogy completion tasks, with QR(Q) performing notably poorly (similarity ≈ 0.27, analogy < 0.02) (Sorokina et al., 2019).
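A minimal NumPy sketch of the truncated-SVD route (illustrative only; for realistic vocabulary sizes one would use a sparse solver such as scipy.sparse.linalg.svds):

```python
import numpy as np

def svd_embeddings(A, d):
    """Factor A (e.g., a PPMI matrix) as U S V^T and keep the top-d singular
    values. Word vectors are the rows of U_d S_d^{1/2}, context vectors the
    rows of V_d S_d^{1/2}, so W @ C.T is the best rank-d approximation of A
    in Frobenius norm (Eckart-Young)."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    W = U[:, :d] * np.sqrt(S[:d])   # word embeddings, shape (n, d)
    C = Vt[:d].T * np.sqrt(S[:d])   # context embeddings, shape (n, d)
    return W, C
```

The symmetric square-root split of \Sigma matches the U_d \Sigma_d^{1/2} convention above; other splits (e.g., keeping all of \Sigma on the word side) also appear in the literature.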

3. Context Type, Structural Features, and Positional Extensions

The definition of "context" in M_{wc} is extensible. Most frequently, c is a word appearing within a symmetric window around w, but context can be generalized:

  • Dependency and Syntactic Features: Structural features from dependency parses augment the context set. The context c may encode dependency arcs, relation patterns, or subject–object links, treated as discrete features (Cross et al., 2015).
  • Positional Contexts: Rather than collapsing contexts by type, each occurrence is indexed by its positional offset i relative to the focus word. The new context is c^i, and the matrix grows to |V| \times (2\,win\,|V|) in size. Factorization and sampling definitions are updated accordingly, leading to large boosts in syntactic performance (e.g., GSyn analogy accuracy improves from 0.54/0.56 to 0.63/0.66) (Salle et al., 2016).
  • Word Order Representations: GloVe variants (e.g., WOVe) build separate co-occurrence matrices for each relative position d in the window, then merge the resulting position-specific embeddings (usually via concatenation). This modification significantly improves analogy performance: direct concatenation achieves a 36.3% improvement over GloVe on analogy accuracy (Ibrahim et al., 2021).
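Tagging contexts with their offset, as in the positional scheme above, amounts to a small change in the counting loop (a hypothetical sketch; identifier names and the string encoding of offsets are illustrative):

```python
from collections import Counter

def positional_cooccurrences(corpus, win=2):
    """Count (word, context^offset) pairs: each context token is tagged with
    its signed offset, so 'cat_-1' and 'cat_+1' become distinct contexts and
    the context vocabulary grows to 2 * win * |V|."""
    counts = Counter()
    for i, w in enumerate(corpus):
        for off in range(-win, win + 1):
            j = i + off
            if off != 0 and 0 <= j < len(corpus):
                counts[(w, f"{corpus[j]}_{off:+d}")] += 1
    return counts
```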

A summary of context expansion strategies:

Context Formulation    Matrix/Tensor Shape              Notable Impact
Linear window          |V| \times |V|                   Semantic similarity; standard in SVD/SGNS
Dependency features    |V| \times |C| (|C| \gg |V|)     Improved matching, compositional nuances
Positional (offset)    |V| \times 2\,win\,|V|           Boosts syntactic analogy accuracy
Per-position order     |V| \times k (fusion)            Marked gains in analogy and rare-word synonymy

4. Memory, Scalability, and Q-Contexts

Materializing MwcM_{wc} or its factorizations for large vocabularies is prohibitive. Several schemes address this:

  • External Memory Schemes: Write all training pairs (w, c^i), both positive and negative, to disk, then sort and aggregate them. Training proceeds by iterating through this external list, enabling embedding learning for unbounded corpus sizes with O(|V|) RAM (for the marginal arrays) (Salle et al., 2016).
  • Q-Contexts (WEQ): Instead of the full n \times n PMI/PPMI matrix, randomly sample k \ll n rows (contexts) according to their squared \ell_2 norms, forming a k \times n sketch R. Then perform SVD on R to approximate the optimal embeddings, with theoretical guarantees that R^\top R \approx M^\top M in operator norm (Kong et al., 2021). Empirically, with k \approx 0.1\,n, performance is within 1–2 points of full-matrix methods while running 10–30× faster in practice.
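The length-squared row sampling behind Q-contexts can be sketched as follows (a simplified illustration of the idea, not the WEQ implementation; rescaling each sampled row by 1/\sqrt{k\,p_i} makes R^\top R an unbiased estimator of M^\top M):

```python
import numpy as np

def q_context_sketch(M, k, seed=0):
    """Sample k rows of M with probability proportional to their squared
    l2 norms, rescaling each kept row by 1/sqrt(k * p_i) so that
    E[R^T R] = M^T M. SVD of the small k x n sketch R then approximates
    the embeddings obtained from the full matrix."""
    rng = np.random.default_rng(seed)
    p = np.sum(M * M, axis=1)       # squared row norms
    p = p / p.sum()                 # sampling distribution
    rows = rng.choice(M.shape[0], size=k, replace=True, p=p)
    R = M[rows] / np.sqrt(k * p[rows])[:, None]
    return R
```

Rows with larger norm contribute more to M^\top M and are therefore sampled more often, which is what keeps the approximation error controlled.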

5. Higher-Order Generalizations and Matrix Theory

The word–context matrix generalizes to tensors to model N-way relationships:

  • Tensor Factorization: Extensions such as CP-S and JCP-S fit symmetric CP decompositions to order-2 (matrix) and order-3 (tensor) PPMI arrays, enabling the modeling of rich N-way co-occurrence statistics. Third-order decompositions allow the extraction of sense-specific word compositionalities. For instance, outlier-detection metrics (OD_3) reveal that tensor-based embeddings outperform matrix-based methods at detecting trivariate semantic coherence (Bailey et al., 2017).
  • Linguistic Matrix Theory: For relational words (verbs, adjectives), a D \times D empirical matrix M_w is constructed via regularized least-squares regressions over noun arguments. Statistical analysis of the collection of such M_w leads to permutation-symmetric Gaussian ensembles, with higher-order invariant moments serving as signatures for cross-corpus comparison. Deviations from Gaussianity are captured via perturbative invariants, suggesting a notion of universality classes in corpus linguistics (Kartsaklis et al., 2017).

6. Empirical Evaluation and Performance Benchmarks

Benchmarking across word similarity (WordSim-353, MEN, SimLex-999) and analogy completion tasks (Google Semantic/Syntactic, MSR) reveals key findings:

  • LexVec (PPMI plus weighted matrix factorization) achieves similarity scores of \rho \approx 0.76–0.77 and analogy accuracies of 0.79–0.81 (semantic) and 0.54–0.66 (syntactic), matching or exceeding the previous state of the art on most tasks (Salle et al., 2016, Salle et al., 2016).
  • SVD-based PMI factorization (dimension d = 250 or 500) consistently outperforms non-negative and QR-based factorizations for both similarity (\approx 0.70) and analogy (\approx 0.35–0.38) (Sorokina et al., 2019).
  • WOVe (GloVe with explicit order modeling) offers a 36.3% gain in analogy accuracy and a 2% boost in synonym discovery rates over baseline GloVe, at the expense of larger embeddings and increased training cost (Ibrahim et al., 2021).
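The analogy benchmarks above are typically scored with the 3CosAdd rule; a minimal sketch over a toy embedding matrix (the vocabulary and vectors below are invented purely for illustration):

```python
import numpy as np

def analogy_3cosadd(W, vocab, a, b, c):
    """Answer 'a is to b as c is to ?' by maximizing cos(x, b - a + c)
    over row-normalized embeddings, excluding the three query words."""
    idx = {w: i for i, w in enumerate(vocab)}
    X = W / np.linalg.norm(W, axis=1, keepdims=True)
    q = X[idx[b]] - X[idx[a]] + X[idx[c]]
    sims = X @ q                      # cosine up to a constant factor
    for w in (a, b, c):
        sims[idx[w]] = -np.inf        # never return a query word
    return vocab[int(np.argmax(sims))]
```

Excluding the query words is standard practice in these benchmarks, since b - a + c is often closest to b or c itself.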

7. Complexity, Trade-offs, and Theoretical Insights

  • Context expansion (dependency/structural, positional, per-position fusion) increases model parameter count and, for WOVe, storage by up to 10× over baseline embeddings. However, gains in syntactic or compositional capacity may be substantial (Salle et al., 2016, Ibrahim et al., 2021).
  • External memory and Q-contexts enable scaling to unrestricted corpora, trading disk I/O and shuffling complexity for main-memory savings (Salle et al., 2016, Kong et al., 2021).
  • The choice of factorization method has a highly significant impact on downstream performance, more so than increasing dimensionality beyond setup-specific thresholds (Sorokina et al., 2019).
  • Matrix-theoretic approaches, analyzing empirical word–context matrices via random matrix statistics, suggest a small space of geometric signatures that could systematically distinguish linguistic genres or languages (Kartsaklis et al., 2017).

Word–context matrices thus remain a foundational and continually evolving object, underlying both traditional and advanced distributional semantics models in computational linguistics and NLP.
