Unsupervised Lexeme Clustering Techniques
- Unsupervised lexeme clustering is a set of methods that partition word types into discrete classes using distributional, structural, or acoustic patterns without relying on annotated data.
- Graph-based, Bayesian, and self-supervised approaches achieve high cluster purity and comparable performance to traditional models, as seen in spectral and DP n-gram methods.
- Dynamic, hybrid, and ensembling techniques demonstrate promising improvements in downstream NLP tasks despite challenges in representation variability and data sparsity.
Unsupervised clustering of lexemes refers to the set of methodologies that partition a lexicon (set of word types) into discrete classes, senses, or cognate sets based solely on distributional, structural, or acoustic patterns, without reliance on gold-standard labels or external lexicographic annotation. This problem is central to natural language processing, language acquisition modeling, historical linguistics, and zero-resource speech applications. Recent advances integrate graph-based, information-theoretic, Bayesian, and neural self-supervised representations; the following sections detail the state of the art across methodologies, evaluation regimes, and empirical findings.
1. Distributional and Graph-Based Clustering of Lexemes
Distributional representations, typically high-dimensional vectors derived from word-context co-occurrences, underlie many influential algorithms for lexeme clustering:
- Spectral Clustering of Count Vectors: Each word is encoded as the concatenation of its left- and right-windowed co-occurrence counts with the $M$ most frequent descriptors, yielding a $2M$-dimensional count vector. Pairwise similarities are computed not with a Gaussian kernel but with symmetrized skew divergence (a smoothed KL-based divergence), yielding an affinity matrix $W$. The random-walk formulation computes the leading eigenvectors of $P = D^{-1}W$, where $D$ is the degree matrix of $W$ (equivalently, it solves the generalized eigenproblem $Wv = \lambda Dv$), projects words into this spectral subspace, and applies $k$-means; the number of clusters is tuned against downstream task performance. Empirically, such spectral clusters nearly match classic hierarchical Brown clustering in supporting tasks like semantic role labeling (SRL $F_1$: $0.596$ versus $0.599$ for Brown) and dependency parsing (UAS $0.867$ vs. $0.881$), despite using less linguistic prior (Levi et al., 2018). A minimal sketch of this pipeline appears after this list.
- Graph-Community Clustering (Vec2GC): Lexemes are embedded (e.g., via skip-gram/Word2Vec), and a weighted similarity graph is constructed by connecting nodes whose cosine similarity exceeds a threshold, with edge weights an increasing function of that similarity. Lexeme clusters correspond to high-modularity graph communities found via the Louvain or Parallel Louvain method, with recursive refinement yielding a full dendrogram. Compared to flat $k$-means or HDBSCAN, Vec2GC produces more semantically coherent and purer clusters (e.g., on document analogues, 89% of clusters are at least 50% pure vs. 76% for HDBSCAN) (Rao et al., 2021). A graph-construction sketch also follows this list.
- Information-Theoretic Clustering with Qualia Structures: Nouns are represented by probability vectors over “FORMAL role” descriptors extracted from large corpora using lexico-syntactic patterns. Unsupervised assignments are obtained via the sequential Information Bottleneck (sIB), which partitions nouns to maximize the mutual information between clusters and descriptor distributions, with Jensen-Shannon divergence as the dissimilarity measure. This approach cleanly separates lexemes into human/location/event classes (cluster purity up to 93%) and iteratively refines polysemous cases (Romeo et al., 2013). A simplified JSD-reassignment sketch is included below as well.
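A minimal sketch of the spectral pipeline, assuming a precomputed word-by-context count matrix. The exponentiated-distance affinity, the number of eigenvectors, and the helper names are illustrative assumptions rather than the exact recipe of Levi et al. (2018); the structural steps are the skew-divergence affinity, the random-walk eigenvectors, and the final $k$-means assignment.

```python
import numpy as np
from sklearn.cluster import KMeans

def skew_divergence(p, q, alpha=0.99):
    """KL(p || alpha*q + (1 - alpha)*p), the standard skew divergence."""
    m = alpha * q + (1 - alpha) * p
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / m[mask])))

def spectral_word_clusters(counts, n_clusters=200, n_eig=50):
    """Cluster words from a word-by-context co-occurrence count matrix.

    counts: (V, 2M) array of left+right windowed context counts per word.
    Returns one integer cluster id per word. Distances are O(V^2), so this
    sketch is only practical for modest vocabularies.
    """
    P = counts / counts.sum(axis=1, keepdims=True)       # rows as distributions
    V = P.shape[0]

    # Symmetrized skew-divergence distances -> affinity matrix W.
    dist = np.zeros((V, V))
    for i in range(V):
        for j in range(i + 1, V):
            d = skew_divergence(P[i], P[j]) + skew_divergence(P[j], P[i])
            dist[i, j] = dist[j, i] = d
    W = np.exp(-dist)                                     # one common distance-to-affinity choice

    # Random-walk normalization: leading eigenvectors of D^{-1} W.
    deg = W.sum(axis=1)
    P_rw = W / deg[:, None]
    eigvals, eigvecs = np.linalg.eig(P_rw)
    top = np.argsort(-eigvals.real)[:n_eig]
    embedding = eigvecs[:, top].real                      # spectral subspace

    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)
```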
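A Vec2GC-style graph-construction and community-detection sketch, assuming pretrained embeddings are supplied as a dict. The similarity threshold and the use of networkx's Louvain implementation are stand-ins; the published method also weights edges by a specific function of similarity and refines communities recursively into a dendrogram.

```python
import numpy as np
import networkx as nx

def vec2gc_communities(embeddings, sim_threshold=0.6, seed=0):
    """Build a cosine-similarity graph over lexemes and find Louvain communities.

    embeddings: dict mapping word -> 1-D numpy vector (e.g., skip-gram vectors).
    Returns a list of sets of words, one set per community.
    """
    words = list(embeddings)
    X = np.vstack([embeddings[w] for w in words])
    X = X / np.linalg.norm(X, axis=1, keepdims=True)      # unit-normalize
    sims = X @ X.T                                        # cosine similarities

    G = nx.Graph()
    G.add_nodes_from(words)
    n = len(words)
    for i in range(n):
        for j in range(i + 1, n):
            s = float(sims[i, j])
            if s > sim_threshold:
                # Edge weight grows with similarity; the exact transform used
                # by Vec2GC may differ from this simple choice.
                G.add_edge(words[i], words[j], weight=s)

    # Louvain modularity maximization (networkx >= 2.8).
    return nx.community.louvain_communities(G, weight="weight", seed=seed)
```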
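As a rough illustration of the information-bottleneck step, the sketch below performs hard sequential reassignment of descriptor distributions by Jensen-Shannon divergence. It is a simplified stand-in for sIB, which optimizes a mutual-information objective with cluster-mass weighting, not a reimplementation of it.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd_sequential_cluster(P, k=3, n_sweeps=20, seed=0):
    """Hard, sIB-flavoured clustering of row-stochastic descriptor profiles.

    P: (n, d) array; each row is a probability distribution over descriptors.
    Items are visited in random order, removed from their cluster, and
    reassigned to the cluster whose mean distribution is closest in
    Jensen-Shannon divergence.
    """
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    labels = rng.integers(0, k, size=n)
    for _ in range(n_sweeps):
        changed = 0
        for i in rng.permutation(n):
            others = np.arange(n) != i
            dists = []
            for c in range(k):
                members = P[(labels == c) & others]
                centroid = members.mean(axis=0) if len(members) else P[i]
                dists.append(jensenshannon(P[i], centroid))
            new = int(np.argmin(dists))
            changed += int(new != labels[i])
            labels[i] = new
        if changed == 0:          # stop once a full sweep makes no changes
            break
    return labels
```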
2. Bayesian and Nonparametric Approaches
Modern lexeme clustering increasingly leverages Bayesian nonparametric models that allow the data to infer an appropriate number of clusters and adopt more flexible, context-sensitive generative assumptions:
- Dirichlet Process Mixture of N-gram Models: Each word is modeled as arising from a latent cluster-specific $n$-gram language model over its phonotactic sequence (typically a trigram model), with a DP prior favoring parsimony in the number of clusters. Morita & O’Donnell (2024) truncate the DP to a finite number of clusters and use variational inference (coordinate ascent on the ELBO). On 38,731 English lemmas, two dominant clusters emerge that closely align with the Germanic-Latinate etymological divide; the resulting quasi-etymological classes achieve a V-measure of $0.198$ against gold etymology (significantly above chance), and predict the double-object construction more accurately than historical origin itself (86.4% vs. 71.4% accuracy) (Morita et al., 16 Apr 2025). A simplified finite-mixture sketch follows this list.
- Acoustic Bayesian Lexicon Discovery: For unlabelled speech, latent lexeme types are discovered by embedding all hypothesized word-length segments via DTW distances to reference exemplars (with kernelization and Laplacian eigenmaps for dimensionality reduction). These embeddings are then clustered with a Gaussian mixture model under a symmetric Dirichlet prior, so the effective number of clusters is inferred from the data rather than fixed in advance. The full model jointly resamples segmentations and cluster assignments via collapsed blocked Gibbs sampling. On TIDigits, this approach achieves clustering purity above 88% and roughly 20% unsupervised word error rate, outperforming an HMM baseline that is supplied with the actual vocabulary size (Kamper et al., 2016). A nonparametric-GMM sketch is also given after this list.
- PMI-Based Online EM for Multilingual Cognate Clustering: For each meaning, candidate word pairs are compared via an online-EM-trained alignment model, with segmental similarity weights estimated via online pointwise mutual information (PMI). Cognate clusters are produced by running the InfoMap community-detection algorithm on graphs weighted by these scores. Tested on 16 language families, this system consistently outperforms HMM- and feature-based baselines: on average, online PMI reaches a B-cubed F1 of $0.8415$, above both the PHMM and LexStat baselines, while converging orders of magnitude faster (Rama et al., 2017).
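To make the mixture-of-n-gram idea concrete, here is a minimal EM sketch over character trigrams with a fixed number of components and add-one smoothing. It treats each word form as a bag of trigrams and uses plain EM, whereas the model above uses proper conditional n-gram probabilities, a truncated DP prior, and variational (ELBO) inference; all names and hyperparameters are illustrative.

```python
import numpy as np
from collections import Counter

def trigrams(word):
    w = f"##{word}#"                                 # simple boundary padding
    return [w[i:i + 3] for i in range(len(w) - 2)]

def em_ngram_mixture(words, K=2, n_iter=30, alpha=1.0, seed=0):
    """Soft-cluster word forms with a K-component mixture of trigram models."""
    rng = np.random.default_rng(seed)
    vocab = sorted({g for w in words for g in trigrams(w)})
    gid = {g: i for i, g in enumerate(vocab)}

    # Per-word trigram count vectors.
    X = np.zeros((len(words), len(vocab)))
    for n, w in enumerate(words):
        for g, c in Counter(trigrams(w)).items():
            X[n, gid[g]] += c

    resp = rng.dirichlet(np.ones(K), size=len(words))    # random soft init
    for _ in range(n_iter):
        # M-step: mixture weights and add-one-smoothed trigram probabilities.
        pi = resp.mean(axis=0)
        counts = resp.T @ X + alpha                      # (K, |vocab|)
        theta = counts / counts.sum(axis=1, keepdims=True)
        # E-step: responsibilities from (stabilized) log-likelihoods.
        loglik = X @ np.log(theta).T + np.log(pi)        # (N, K)
        loglik -= loglik.max(axis=1, keepdims=True)
        resp = np.exp(loglik)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp.argmax(axis=1), resp
```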
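For the acoustic route, a rough scikit-learn approximation of the embed-then-cluster stage is shown below: SpectralEmbedding stands in for the Laplacian-eigenmaps step and BayesianGaussianMixture supplies a symmetric Dirichlet prior over component weights. This is not the collapsed blocked Gibbs sampler of Kamper et al. (2016), and the DTW distance matrix and kernel bandwidth are assumed inputs.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.mixture import BayesianGaussianMixture

def cluster_segments(dtw_dist, n_dims=10, max_types=50, seed=0):
    """Embed speech segments from pairwise DTW distances, then soft-cluster.

    dtw_dist: (n_segments, n_segments) symmetric DTW distance matrix.
    Returns one cluster (lexeme-type) id per segment.
    """
    # Kernelize distances into affinities (bandwidth choice is an assumption).
    sigma = np.median(dtw_dist)
    affinity = np.exp(-(dtw_dist ** 2) / (2 * sigma ** 2))

    # Laplacian-eigenmaps-style dimensionality reduction.
    emb = SpectralEmbedding(
        n_components=n_dims, affinity="precomputed", random_state=seed
    ).fit_transform(affinity)

    # Bayesian GMM with a symmetric Dirichlet prior over mixture weights:
    # components the data does not support get weights driven toward zero.
    bgm = BayesianGaussianMixture(
        n_components=max_types,
        weight_concentration_prior_type="dirichlet_distribution",
        weight_concentration_prior=1.0,
        random_state=seed,
    ).fit(emb)
    return bgm.predict(emb)
```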
3. Clustering in Dynamic and Topical Spaces
Several methodologies explicitly focus on dynamic or contextual aspects of lexeme clustering:
- Word Sense Induction in Topic Space: To induce senses of a polysemous target, a Latent Dirichlet Allocation (LDA) topic model is trained over unlabeled documents containing the word. Each instance is then mapped to its document-level topic distribution, and $k$-means clustering (with cosine distance) groups similar usages. The approach achieved the second-highest V-measure in SemEval-2, demonstrating the efficacy of topic-space representations for sense discrimination without language-specific supervision (Elshamy et al., 2013). A short pipeline sketch appears after this list.
- Dynamic Semantic Spaces and Agglomerative Clustering: Term vectors are constructed online by random-projection accumulation over context windows, with the accumulated vectors kept normalized. For target words, nearest neighbors above a similarity threshold form a “cohort,” and senses are induced by agglomerative clustering (using cosine similarity on prototypes), with the number of clusters chosen by a maximum inter-cluster similarity cutoff. This method suits corpora where vocabulary and senses evolve rapidly (e.g., patent or Wikipedia streams), since no retraining is needed as new data appears (Delpech, 2018); the second sketch below illustrates the cohort-and-agglomerative step.
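A compact sketch of the topic-space sense-induction pipeline using scikit-learn. The vectorizer settings, topic count, and sense count are illustrative, and topic rows are L2-normalized so that Euclidean $k$-means approximates the cosine-distance clustering described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def induce_senses(contexts, n_topics=20, n_senses=4, seed=0):
    """Cluster usages of one polysemous target word into induced senses.

    contexts: list of strings, each a document (or window) containing the target.
    Returns one sense id per context.
    """
    # Bag-of-words counts for the contexts of the target word.
    counts = CountVectorizer(stop_words="english", min_df=2).fit_transform(contexts)

    # Document-level topic distributions from LDA.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    theta = lda.fit_transform(counts)       # (n_contexts, n_topics), rows sum to 1

    # L2-normalize so Euclidean k-means behaves like cosine-distance clustering.
    theta_unit = normalize(theta)
    return KMeans(n_clusters=n_senses, n_init=10, random_state=seed).fit_predict(theta_unit)
```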
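And a sketch of the cohort-plus-agglomerative step, assuming the online random-projection vectors already exist as a dict of unit vectors. The similarity threshold and the distance cutoff (one minus the maximum inter-cluster similarity) are illustrative parameters; older scikit-learn versions spell the `metric` argument `affinity`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def induce_senses_dynamic(target, vectors, sim_threshold=0.4, min_sim=0.25):
    """Group a target word's nearest neighbours into sense-like clusters.

    vectors: dict word -> unit-normalized numpy vector from the dynamic space.
    Returns a dict mapping each cohort word to a sense id.
    """
    t = vectors[target]
    # Cohort: neighbours whose cosine similarity to the target passes a threshold.
    cohort = [w for w, v in vectors.items()
              if w != target and float(t @ v) >= sim_threshold]
    if len(cohort) < 2:
        return {w: 0 for w in cohort}

    X = np.vstack([vectors[w] for w in cohort])
    # Average-linkage agglomerative clustering on cosine distance; merging stops
    # once the most similar pair of clusters falls below min_sim.
    model = AgglomerativeClustering(
        n_clusters=None,
        metric="cosine",
        linkage="average",
        distance_threshold=1.0 - min_sim,
    )
    labels = model.fit_predict(X)
    return dict(zip(cohort, labels))
```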
4. Speech-Based Lexeme Discovery: Representation and Clustering Bottlenecks
The state of the art in lexeme clustering from unlabelled speech is fundamentally constrained by the discriminability of segment representations, not the clustering stage itself:
- Self-supervised Features and K-means/Graph Clustering: Speech is segmented into candidate word-like units by detecting peaks in the dissimilarity curve of adjacent self-supervised features (e.g., HuBERT or WavLM), optionally smoothed and evaluated for “prominence.” Segments are converted to fixed-dimensional embeddings (average of feature frames, PCA-reduced, unit-normalized) and clustered via $k$-means, agglomerative methods, or Leiden graph partitioning (edges weighted by cosine or DTW similarity). On the ZeroSpeech English benchmark, graph clustering with DTW yields the lowest NED (5.2%) and purity/V-measure above 89%. However, controlled experiments indicate that, even with gold-standard boundaries, baseline representations exhibit significant within-type variability: when perfect segment-level representations are substituted, all clustering methods achieve 100% purity and V-measure. Thus, lexicon learning is almost entirely bounded by the expressive adequacy of current acoustic/self-supervised segment encodings (Adendorff et al., 10 Oct 2025); (Malan et al., 22 Sep 2024); (Malan et al., 25 Jul 2025). A boundary-detection and clustering sketch follows this list.
- Bottom-Up vs. Top-Down Segmentation-Clustering Interactions: Systems that predict boundaries and then cluster (bottom-up) are five times faster than dynamic programming approaches where segmentation and clustering are iterated jointly (top-down, e.g., ES-KMeans+). Both yield similar NED and Token F1 on ZeroSpeech Track 2. The bottleneck remains in representation: even with perfect boundary information, over-clustering and splitting of ground-truth words persist due to embedding deficiencies (Malan et al., 25 Jul 2025).
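A minimal sketch of the bottom-up speech route, assuming frame-level self-supervised features are already extracted as a (T, D) array per utterance. The smoothing window, prominence setting, PCA size, and the choice of $k$-means are illustrative stand-ins for the configurations compared in these papers.

```python
import numpy as np
from scipy.signal import find_peaks
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def segment_utterance(feats, smooth=3, prominence=0.1):
    """Propose word-like boundaries from frame-level self-supervised features.

    feats: (T, D) array of HuBERT/WavLM-style frame features for one utterance.
    Returns boundary frame indices, including 0 and T.
    """
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    dissim = 1.0 - np.sum(unit[:-1] * unit[1:], axis=1)    # adjacent-frame cosine dissimilarity
    kernel = np.ones(smooth) / smooth
    dissim = np.convolve(dissim, kernel, mode="same")      # light smoothing
    peaks, _ = find_peaks(dissim, prominence=prominence)   # prominent peaks = boundaries
    return np.concatenate(([0], peaks + 1, [len(feats)]))

def embed_and_cluster(utterances, n_clusters=500, pca_dim=100, seed=0):
    """Mean-pool each segment, PCA-reduce, unit-normalize, and k-means cluster."""
    segs = []
    for feats in utterances:
        bounds = segment_utterance(feats)
        segs += [feats[a:b].mean(axis=0) for a, b in zip(bounds[:-1], bounds[1:]) if b > a]
    X = np.vstack(segs)
    X = PCA(n_components=min(pca_dim, X.shape[1])).fit_transform(X)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return KMeans(n_clusters=n_clusters, n_init=5, random_state=seed).fit_predict(X)
```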
5. Evaluation Metrics, Oracle Analyses, and Complementarity
Unsupervised lexeme clustering is evaluated both intrinsically (purity, V-measure, NED) and extrinsically (downstream NLP task performance):
- Intrinsic Metrics: Cluster purity, V-measure, and normalized edit distance (NED) for acoustic clusters quantitatively assess alignment with gold-standard lemmas, senses, or cognate sets; a small computation sketch follows this list. In multilingual cognate clustering, B-cubed precision/recall/F1 per word/meaning is standard (Rama et al., 2017); in lexical class induction, V-measure and completeness quantify agreement between discovered clusters and manual annotation (Morita et al., 16 Apr 2025).
- Extrinsic Evaluation: Use of induced clusters as features in semantic role labeling, dependency parsing, and syntactic property prediction allows performance benchmarking against supervised and hand-crafted baselines. For example, spectral clusters and Brown clusters yield comparable SRL $F_1$ ($0.596$ vs. $0.599$) and dependency-parsing UAS ($0.867$ vs. $0.881$) (Levi et al., 2018).
- Oracle and Complementarity Analysis: Combining outputs from structurally different clustering algorithms (spectral and Brown) using an oracle that selects the best clustering per instance yields a relative improvement in SRL over either system alone, demonstrating that the clusterings capture complementary information and that ensembling or hybridization is a promising research avenue (Levi et al., 2018).
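These intrinsic metrics are straightforward to compute. Below is a small sketch using scikit-learn's V-measure alongside simple purity and normalized-edit-distance helpers; the exact NED protocol in the ZeroSpeech tooling differs in detail (it is computed over discovered same-cluster pairs).

```python
import numpy as np
from collections import Counter
from sklearn.metrics import v_measure_score

def purity(gold_labels, cluster_labels):
    """Fraction of items whose cluster's majority gold label matches their own."""
    clusters = {}
    for g, c in zip(gold_labels, cluster_labels):
        clusters.setdefault(c, []).append(g)
    correct = sum(Counter(members).most_common(1)[0][1] for members in clusters.values())
    return correct / len(gold_labels)

def edit_distance(a, b):
    """Standard Levenshtein distance between two symbol sequences."""
    d = np.arange(len(b) + 1)
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (x != y))
    return int(d[-1])

def mean_ned(pairs):
    """Mean normalized edit distance over same-cluster segment transcriptions.

    pairs: iterable of (transcription_a, transcription_b) phone/character sequences.
    """
    vals = [edit_distance(a, b) / max(len(a), len(b)) for a, b in pairs]
    return float(np.mean(vals))

# Toy example: gold word types vs. induced cluster ids.
gold = ["cat", "cat", "dog", "dog", "dog"]
pred = [0, 0, 1, 1, 0]
print(purity(gold, pred), v_measure_score(gold, pred))
```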
6. Known Limitations and Future Directions
- Representation-Limited Performance: In both text and speech, the dominant constraint on unsupervised lexeme clustering is not the sophistication of the clustering algorithm but the variability and limited discriminability of segment/lexeme-level representations. Even cutting-edge self-supervised models produce embeddings in which intra-type distances nearly match inter-type distances, limiting achievable purity (Adendorff et al., 10 Oct 2025).
- Scaling and Data Sparsity: Extraction-based vector spaces (e.g., FORMAL roles from surface patterns) are effective for small-to-medium vocabularies but may not scale or generalize to low-frequency lexemes without bootstrapping, smoothing, or extending to other qualia roles (Romeo et al., 2013).
- Language and Modality Transfer: Graph-based, PMI-alignment, and embedding-driven systems show robust transfer across language families and data modalities, but their performance is modulated by the availability of high-quality unsupervised or minimally supervised representations per language (Rama et al., 2017); (Malan et al., 22 Sep 2024); (Morita et al., 16 Apr 2025).
- Innovations in Clustering Algorithms: While classic $k$-means remains widely used, hierarchical, density-based, and graph-community approaches provide improved cluster granularity and adaptivity, especially when features are informative. However, in current practice, these supply diminishing returns relative to improvements in representational fidelity (Rao et al., 2021); (Adendorff et al., 10 Oct 2025).
- Prospective Directions: Continued progress will likely require segment-level contrastive learning, explicit invariance to speaker and prosody in acoustic representations, and learned feature extractors tailored to lexeme identity. Hybrid aggregation techniques exploiting the complementarity between different unsupervised clustering regimes (e.g., spectral-Brown, acoustic-topic, PMI-HMM) are also suggested as promising avenues for boosting overall performance (Levi et al., 2018); (Adendorff et al., 10 Oct 2025).
7. Summary Table: Major Approaches and Representative Results
| Method | Data/Representation | Core Algorithm | Metrics (Best) | Reference |
|---|---|---|---|---|
| Spectral on counts | Windowed counts (text) | Affinity graph + eigenvectors + k-means | SRL F1: 0.596, UAS: 0.867 | (Levi et al., 2018) |
| DP n-gram mixture | Phonotactic sequences | DP mixture of n-gram models | V=0.198 (etymology) | (Morita et al., 16 Apr 2025) |
| Vec2GC | Skip-gram embeddings | Cosine graph + Louvain | Purity up to 89% | (Rao et al., 2021) |
| sIB + FORMAL roles | Descriptor-based distributions | Info Bottleneck w/ JSD | Cluster purity 93% | (Romeo et al., 2013) |
| Acoustic GMM (Bayesian) | Segmental embeddings (speech) | Laplacian eigenmaps + Bayesian GMM | Purity > 88%, WER 20% | (Kamper et al., 2016) |
| Online PMI + InfoMap | Aligned phone sequences | Online EM + community detection | F1: 0.8415 (avg) | (Rama et al., 2017) |
| LDA + k-means | Topic distributions (WSI) | LDA topic model + k-means | 2nd-best V-measure | (Elshamy et al., 2013) |
These advances collectively demonstrate that unsupervised clustering of lexemes, while still constrained by both data sparsity and representation variability, offers a powerful and linguistically interpretable route to semantic, syntactic, and historical generalization in resource-lean and cross-lingual settings.