
Word–Context Matrices: Modeling Lexical Semantics

Updated 26 January 2026
  • Word–Context Matrices are foundational constructs in lexical semantics, encoding association strengths between words and contexts with count-based and PMI methods.
  • They enable efficient low-rank factorization techniques like SVD and negative sampling that boost performance in word similarity and analogy tasks.
  • Advanced variants extend these matrices with dependency, positional, and tensor-based contexts to capture richer syntactic and semantic nuances.

Word–context matrices are central to the quantitative modeling of lexical semantics and the construction of distributional word representations. Structurally, a word–context matrix is defined as M \in \mathbb{R}^{|V| \times |C|}, where |V| and |C| are the sizes of the word and context vocabularies, respectively. Each entry M_{wc} quantifies the association strength between target word w and context c, most commonly by empirical co-occurrence counts or by statistics such as pointwise mutual information (PMI) and its positive variant, PPMI. Modern embedding methodologies (matrix factorization, window sampling, negative sampling, and tensor generalizations) are built upon this matrix, which encodes the observed word–context associations within a corpus.

1. Construction of the Word–Context Matrix

Let C be a corpus, and let V denote the vocabulary. Sliding a symmetric window of size win over C, the number of times a word w and a context c co-occur is accumulated as M_{wc}, the raw co-occurrence count. The global statistics are

  • M_{w*} = \sum_c M_{wc} (row sum for word w),
  • M_{*c} = \sum_w M_{wc} (column sum for context c),
  • M_{**} = \sum_{w,c} M_{wc} (total windowed pairs).

The transformation to probabilities is given by

P(w,c) = \frac{M_{wc}}{M_{**}}, \quad P(w) = \frac{M_{w*}}{M_{**}}, \quad P(c) = \frac{M_{*c}}{M_{**}}

and the positive pointwise mutual information matrix (PPMI) is defined entrywise as

\mathrm{PPMI}(w,c) = \max\left(0, \log \frac{P(w,c)}{P(w)\,P(c)}\right)

or, equivalently, directly in terms of counts,

\mathrm{PPMI}_{wc} = \max\left(0, \log \frac{M_{wc}\,M_{**}}{M_{w*}\,M_{*c}}\right)

This matrix forms the basis for most count-based embeddings, including LexVec, GloVe, and their modern variants (Salle et al., 2016, Salle et al., 2016, Ibrahim et al., 2021).
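As a concrete illustration, the count-then-transform pipeline above can be sketched in a few lines of NumPy (a minimal sketch; the function name and windowing details are illustrative, not taken from the cited papers):

```python
import numpy as np
from collections import Counter

def ppmi_matrix(corpus, win=2):
    """Slide a symmetric window of size `win` over the token list `corpus`,
    accumulate raw co-occurrence counts M_wc, then apply the PPMI transform."""
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}
    counts = Counter()
    for i, w in enumerate(corpus):
        lo, hi = max(0, i - win), min(len(corpus), i + win + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, corpus[j])] += 1
    M = np.zeros((len(vocab), len(vocab)))
    for (w, c), n in counts.items():
        M[idx[w], idx[c]] = n
    total = M.sum()                       # M_**
    row = M.sum(axis=1, keepdims=True)    # M_w*
    col = M.sum(axis=0, keepdims=True)    # M_*c
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(M * total / (row * col))
    # PPMI clips the -inf produced by zero counts up to 0
    return vocab, np.maximum(0.0, np.nan_to_num(pmi, neginf=0.0))
```

With a symmetric window and identical word/context vocabularies, the resulting PPMI matrix is symmetric and non-negative.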

2. Low-Rank Factorization and Embedding Learning

The high-dimensional, sparse word–context matrix must be compressed for practical representation. The classic approach is low-rank approximation:

  • SVD (Singular Value Decomposition): Given A \in \mathbb{R}^{n \times n} (a PMI/PPMI matrix), A = U \Sigma V^\top is truncated to the first d singular values, yielding word embeddings as the rows of U_d \Sigma_d^{1/2} \in \mathbb{R}^{n \times d} (Sorokina et al., 2019).
  • NMF (Non-negative Matrix Factorization): For A \ge 0, A \approx WH with W, H \ge 0. Row i of W gives the embedding for word i; the results are typically sparse and non-negative but lag behind SVD in empirical semantic quality.
  • QR with Pivoting: AP = QR; embeddings are either the rows of Q_d or the relevant columns of R_d P^\top (Sorokina et al., 2019).
  • SGNS and Weighted Losses: Models such as LexVec minimize a weighted squared-error objective, penalizing frequent (positive) co-occurrences heavily and actively pushing apart negative (zero-PPMI) co-occurrences via negative sampling. In effect, stochastic sampling and local optimization sidestep explicit materialization of the full PPMI matrix (Salle et al., 2016).

Truncated SVD has been empirically shown to outperform NMF and QR approaches in both word similarity (e.g., WordSim-353, Spearman ρ ≈ 0.70) and analogy completion tasks, with QR(Q) performing notably poorly (similarity ≈ 0.27, analogy < 0.02) (Sorokina et al., 2019).
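A minimal NumPy sketch of the truncated-SVD route (illustrative only; for realistic vocabulary sizes one would use a sparse solver such as scipy.sparse.linalg.svds):

```python
import numpy as np

def svd_embeddings(A, d):
    """Factor A (e.g., a PPMI matrix) as U S V^T and keep the top-d singular
    values. Word vectors are the rows of U_d S_d^{1/2}, context vectors the
    rows of V_d S_d^{1/2}, so W @ C.T is the best rank-d approximation of A
    in Frobenius norm (Eckart-Young)."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    W = U[:, :d] * np.sqrt(S[:d])   # word embeddings, shape (n, d)
    C = Vt[:d].T * np.sqrt(S[:d])   # context embeddings, shape (n, d)
    return W, C
```

The symmetric square-root split of \Sigma matches the U_d \Sigma_d^{1/2} convention above; other splits (e.g., keeping all of \Sigma on the word side) also appear in the literature.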

3. Context Type, Structural Features, and Positional Extensions

The definition of "context" in M_{wc} is extensible. Most frequently, c is a word appearing within a symmetric window around w, but context can be generalized:

  • Dependency and Syntactic Features: Structural features from dependency parses augment the context set. The context c may encode dependency arcs, relation patterns, or subject–object links, treated as discrete features (Cross et al., 2015).
  • Positional Contexts: Rather than collapsing contexts by type, each occurrence is indexed by its positional offset i relative to the focus word. The new context is c^i, and the matrix grows to |V| \times (2\,win\,|V|) in size. Factorization and sampling definitions are updated accordingly, leading to large boosts in syntactic performance (e.g., GSyn analogy accuracy improves from 0.54/0.56 to 0.63/0.66) (Salle et al., 2016).
  • Word Order Representations: GloVe variants (e.g., WOVe) build separate co-occurrence matrices for each relative position d in the window, then merge the resulting position-specific embeddings (usually via concatenation). This modification significantly improves analogy performance: direct concatenation achieves a 36.3% improvement over GloVe on analogy accuracy (Ibrahim et al., 2021).
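Tagging contexts with their offset, as in the positional scheme above, amounts to a small change in the counting loop (a hypothetical sketch; identifier names and the string encoding of offsets are illustrative):

```python
from collections import Counter

def positional_cooccurrences(corpus, win=2):
    """Count (word, context^offset) pairs: each context token is tagged with
    its signed offset, so 'cat_-1' and 'cat_+1' become distinct contexts and
    the context vocabulary grows to 2 * win * |V|."""
    counts = Counter()
    for i, w in enumerate(corpus):
        for off in range(-win, win + 1):
            j = i + off
            if off != 0 and 0 <= j < len(corpus):
                counts[(w, f"{corpus[j]}_{off:+d}")] += 1
    return counts
```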

A summary of context expansion strategies:

Context Formulation    Matrix/Tensor Shape              Notable Impact
Linear window          |V| \times |V|                   Semantic similarity; standard in SVD/SGNS
Dependency features    |V| \times |C| (|C| \gg |V|)     Improved matching, compositional nuances
Positional (offset)    |V| \times 2\,win\,|V|           Boosts syntactic analogy accuracy
Per-position order     |V| \times k (fusion)            Marked gains in analogy and rare-word synonymy

4. Memory, Scalability, and Q-Contexts

Materializing MwcM_{wc} or its factorizations for large vocabularies is prohibitive. Several schemes address this:

  • External Memory Schemes: Write all training pairs (w, c^i), both positive and negative, to disk, then sort and aggregate them. Training proceeds by iterating through this external list, enabling embedding learning for unbounded corpus sizes with O(|V|) RAM (for the marginal arrays) (Salle et al., 2016).
  • Q-Contexts (WEQ): Instead of the full n \times n PMI/PPMI matrix, randomly sample k \ll n rows (contexts) according to their squared \ell_2 norms, forming a k \times n sketch R. Then perform SVD on R to approximate the optimal embeddings, with theoretical guarantees that R^\top R \approx M^\top M in operator norm (Kong et al., 2021). Empirically, with k \approx 0.1\,n, performance is within 1–2 points of full-matrix methods while running 10–30× faster in practice.
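The length-squared row sampling behind Q-contexts can be sketched as follows (a simplified illustration of the idea, not the WEQ implementation; rescaling each sampled row by 1/\sqrt{k\,p_i} makes R^\top R an unbiased estimator of M^\top M):

```python
import numpy as np

def q_context_sketch(M, k, seed=0):
    """Sample k rows of M with probability proportional to their squared
    l2 norms, rescaling each kept row by 1/sqrt(k * p_i) so that
    E[R^T R] = M^T M. SVD of the small k x n sketch R then approximates
    the embeddings obtained from the full matrix."""
    rng = np.random.default_rng(seed)
    p = np.sum(M * M, axis=1)       # squared row norms
    p = p / p.sum()                 # sampling distribution
    rows = rng.choice(M.shape[0], size=k, replace=True, p=p)
    R = M[rows] / np.sqrt(k * p[rows])[:, None]
    return R
```

Rows with larger norm contribute more to M^\top M and are therefore sampled more often, which is what keeps the approximation error controlled.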

5. Higher-Order Generalizations and Matrix Theory

The word–context matrix generalizes to tensors to model N-way relationships:

  • Tensor Factorization: Extensions such as CP-S and JCP-S fit symmetric CP decompositions to order-2 (matrix) and order-3 (tensor) PPMI arrays, enabling the modeling of rich N-way co-occurrence statistics. Third-order decompositions allow the extraction of sense-specific word compositionalities. For instance, outlier-detection metrics (OD_3) reveal that tensor-based embeddings outperform matrix-based methods at detecting trivariate semantic coherence (Bailey et al., 2017).
  • Linguistic Matrix Theory: For relational words (verbs, adjectives), a D \times D empirical matrix M_w is constructed via regularized least-squares regressions over noun arguments. Statistical analysis of the collection of such M_w leads to permutation-symmetric Gaussian ensembles, with higher-order invariant moments serving as signatures for cross-corpus comparison. Deviations from Gaussianity are captured via perturbative invariants, suggesting a notion of universality classes in corpus linguistics (Kartsaklis et al., 2017).

6. Empirical Evaluation and Performance Benchmarks

Benchmarking across word similarity (WordSim-353, MEN, SimLex-999) and analogy completion tasks (Google Semantic/Syntactic, MSR) reveals key findings:

  • LexVec (PPMI plus weighted matrix factorization) achieves similarity scores of \rho \approx 0.76–0.77 and analogy accuracies of 0.79–0.81 (semantic) and 0.54–0.66 (syntactic), matching or exceeding the previous state of the art on most tasks (Salle et al., 2016, Salle et al., 2016).
  • SVD-based PMI factorization (dimension d = 250 or 500) consistently outperforms non-negative and QR-based factorizations for both similarity (\approx 0.70) and analogy (\approx 0.35–0.38) (Sorokina et al., 2019).
  • WOVe (GloVe with explicit order modeling) offers a 36.3% gain in analogy accuracy and a 2% boost in synonym discovery rates over baseline GloVe, at the expense of larger embeddings and increased training cost (Ibrahim et al., 2021).
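The analogy benchmarks above are typically scored with the 3CosAdd rule; a minimal sketch over a toy embedding matrix (the vocabulary and vectors below are invented purely for illustration):

```python
import numpy as np

def analogy_3cosadd(W, vocab, a, b, c):
    """Answer 'a is to b as c is to ?' by maximizing cos(x, b - a + c)
    over row-normalized embeddings, excluding the three query words."""
    idx = {w: i for i, w in enumerate(vocab)}
    X = W / np.linalg.norm(W, axis=1, keepdims=True)
    q = X[idx[b]] - X[idx[a]] + X[idx[c]]
    sims = X @ q                      # cosine up to a constant factor
    for w in (a, b, c):
        sims[idx[w]] = -np.inf        # never return a query word
    return vocab[int(np.argmax(sims))]
```

Excluding the query words is standard practice in these benchmarks, since b - a + c is often closest to b or c itself.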

7. Complexity, Trade-offs, and Theoretical Insights

  • Context expansion (dependency/structural, positional, per-position fusion) increases model parameter count and, for WOVe, storage by up to 10× over baseline embeddings. However, gains in syntactic or compositional capacity may be substantial (Salle et al., 2016, Ibrahim et al., 2021).
  • External memory and Q-contexts enable scaling to unrestricted corpora, trading disk I/O and shuffling complexity for main-memory savings (Salle et al., 2016, Kong et al., 2021).
  • The choice of factorization method has a highly significant impact on downstream performance, more so than increasing dimensionality beyond setup-specific thresholds (Sorokina et al., 2019).
  • Matrix-theoretic approaches, analyzing empirical word–context matrices via random matrix statistics, suggest a small space of geometric signatures that could systematically distinguish linguistic genres or languages (Kartsaklis et al., 2017).

Word–context matrices thus remain a foundational and continually evolving object, underlying both traditional and advanced distributional semantics models in computational linguistics and NLP.
