Word-Adjacency Network (WAN)
- Word-Adjacency Networks are graph-based models in which each node represents a word (or, in restricted variants, a function word) and edges indicate immediate co-occurrence in text.
- They utilize statistical and topological techniques, including Markov chain formalism and network metrics like clustering and average shortest-path length, to analyze language structure and authorship.
- WAN methodologies integrate chain-growth and preferential attachment paradigms, achieving high accuracy in stylometric applications and offering robust cross-linguistic insights.
A Word-Adjacency Network (WAN) is a graph-theoretic representation of text in which nodes correspond to lexical units—typically words or function words—and edges indicate their co-occurrence as adjacent elements within a text. WANs capture both local and global structural properties of language and serve as statistical and topological models for various linguistic phenomena, including stylometric fingerprinting, authorship attribution, and quantitative text analysis. Two principal WAN frameworks are prominent in the literature: (i) general word-adjacency models, where each word type is a node, and (ii) function-word adjacency networks, in which nodes are restricted to a predefined set of function words, with edges capturing directional, weighted co-appearance statistics. WAN-based approaches are extensible across languages and have been applied at multiple granularities and time scales.
1. Definitions and Construction Paradigms
WANs are constructed by mapping sequential linguistic units onto graph structures under well-defined adjacency rules. The most general construction operates as follows:
- Node Set: Each unique word (token or lemma) becomes a node. Some models include punctuation marks as ordinary nodes, while others exclude them (Dec et al., 10 Jan 2026).
- Edge Set: An (undirected or directed, and typically unweighted) edge is formed between pairs of nodes if their corresponding words appear adjacent to each other at least once in the text (Kulig et al., 2014, Lahiri et al., 2013, Amancio, 2014). In function-word WANs, a directed edge is established from function word $f_i$ to function word $f_j$ if $f_j$ follows $f_i$ within a specified window, and edges may be weighted by raw frequency or by a discounted sum that takes proximity into account (Eisen et al., 2016, Segarra et al., 2014).
- Preprocessing Variants: Some approaches lemmatize tokens and strip stopwords and punctuation (Amancio, 2014); others retain all words and treat punctuation as a delimiter only (Lahiri et al., 2013); recent models explicitly retain punctuation as central network nodes (Dec et al., 10 Jan 2026).
A typical WAN construction pipeline (generalized from (Dec et al., 10 Jan 2026, Kulig et al., 2014, Lahiri et al., 2013)):
- Tokenize the text into a sequence $w_1, w_2, \dots, w_T$.
- For each consecutive pair $(w_m, w_{m+1})$, create nodes (if not already present) and insert an edge between $w_m$ and $w_{m+1}$.
- Optionally, use a windowed adjacency: insert edges between tokens that appear within $D$ positions of each other.
- For function-word WANs, form a directed, weighted adjacency matrix $Q$ whose entry $Q(f_i, f_j)$ captures the (possibly discounted) frequency of function word $f_j$ following function word $f_i$ (Segarra et al., 2014, Eisen et al., 2016).
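The general pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any cited paper: the function name `build_wan`, the regular-expression tokenizer, and the `window` and `keep_punctuation` parameters are assumptions made for the example. When punctuation is dropped here, words on either side of a mark become adjacent, which differs from models that treat punctuation as a delimiter.

```python
import re

def build_wan(text, window=1, keep_punctuation=False):
    """Return an undirected, unweighted WAN as {word: set_of_neighbors}."""
    # Hypothetical tokenizer: lowercase word forms, optionally with
    # punctuation marks kept as ordinary tokens (and hence nodes).
    pattern = r"[a-z']+|[.,;:!?]" if keep_punctuation else r"[a-z']+"
    tokens = re.findall(pattern, text.lower())

    graph = {}
    for i, word in enumerate(tokens):
        graph.setdefault(word, set())
        # window=1 links only immediate neighbors; larger values link
        # every token that follows within `window` positions.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:  # skip self-loops
                graph[word].add(other)
                graph.setdefault(other, set()).add(word)
    return graph

wan = build_wan("The cat sat on the mat.")
print(sorted(wan["the"]))
```

With `window=1` the node for "the" is linked to "cat", "on", and "mat"; raising `window` to 2 additionally links it to "sat".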
2. Function-Word WANs: Markov Chain Formalism and Stylometric Application
Function-word adjacency networks (function-word WANs) restrict nodes to a fixed, typically small set $F$ of function words (articles, prepositions, conjunctions, pronouns, auxiliaries) (Eisen et al., 2016, Segarra et al., 2014). The edge-weighted adjacency matrix $Q^{(t)}$ for text $t$ is defined by counting, potentially within a sliding window and using an exponential decay parameter $\alpha$, how often $f_j$ appears after $f_i$ for each pair of function words $f_i, f_j \in F$.
Weighted co-occurrences are normalized row-wise to yield a stochastic transition matrix $P^{(t)}$, interpretable as a discrete-time Markov chain on $F$. This matrix encodes an author's sequential preferences in using function words, operationalizing authorial style as a probabilistic process (Eisen et al., 2016).
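A minimal sketch of this construction follows. It assumes a discount of `alpha ** (d - 1)` for a follower found `d` positions later and a uniform fallback for function words that never occur as a source; both are illustrative choices, and the cited papers should be consulted for the exact weighting and for details such as sentence boundaries. The names `function_word_wan` and `row_normalize` are hypothetical.

```python
def function_word_wan(tokens, function_words, window=10, alpha=0.75):
    """Directed, weighted adjacency matrix Q over the given function words."""
    index = {w: i for i, w in enumerate(function_words)}
    n = len(function_words)
    Q = [[0.0] * n for _ in range(n)]
    for t, word in enumerate(tokens):
        if word not in index:
            continue
        for d in range(1, window + 1):
            if t + d >= len(tokens):
                break
            follower = tokens[t + d]
            if follower in index:
                # Assumed discount: nearer followers count more.
                Q[index[word]][index[follower]] += alpha ** (d - 1)
    return Q

def row_normalize(Q):
    """Turn Q into a row-stochastic matrix P (a Markov transition matrix)."""
    P = []
    for row in Q:
        total = sum(row)
        if total > 0:
            P.append([x / total for x in row])
        else:
            # Assumption: an unseen source word gets a uniform row.
            P.append([1.0 / len(row)] * len(row))
    return P

tokens = "the cat sat on the mat and the dog".split()
Q = function_word_wan(tokens, ["the", "on", "and"], window=2, alpha=0.5)
P = row_normalize(Q)
```

Each row of `P` sums to one, so `P[i][j]` can be read as the probability that function word `j` is the next one encountered after function word `i` under this weighting.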
For authorship attribution:
- Aggregate the adjacency matrices $Q^{(t)}$ of the texts $t$ in a securely attributed corpus to form an author profile matrix $Q_c$ for each candidate author $c$, and normalize row-wise to obtain $P_c$.
- For an unknown text $u$, construct $P_u$.
- Compute the Kullback–Leibler (relative entropy) divergence $H(P_u, P_c)$ between $P_u$ and each $P_c$.
- Assign authorship to the profile with minimal divergence: $\hat{c} = \arg\min_c H(P_u, P_c)$ (Eisen et al., 2016, Segarra et al., 2014).
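The attribution steps above can be sketched as follows. The sketch assumes one common form of the relative entropy between two Markov chains, in which each row is weighted by the stationary distribution of the first chain and entries that are zero in either matrix are skipped; whether this matches the cited papers in every detail is not confirmed here. The function names and the toy profile matrices are hypothetical.

```python
import math

def stationary(P, iters=500):
    """Approximate the stationary distribution by power iteration.
    Assumes the chain is well behaved (e.g. not periodic)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def relative_entropy(P1, P2):
    """Assumed form: sum_ij pi1[i] * P1[i][j] * log(P1[i][j] / P2[i][j])."""
    pi = stationary(P1)
    total = 0.0
    for i, row in enumerate(P1):
        for j, p in enumerate(row):
            q = P2[i][j]
            if p > 0 and q > 0:  # skip undefined terms (an assumption)
                total += pi[i] * p * math.log(p / q)
    return total

def attribute(P_unknown, profiles):
    """Return the profile key with the smallest relative entropy."""
    return min(profiles, key=lambda c: relative_entropy(P_unknown, profiles[c]))

# Toy two-word profiles, invented purely for illustration.
profiles = {
    "author_a": [[0.9, 0.1], [0.5, 0.5]],
    "author_b": [[0.2, 0.8], [0.5, 0.5]],
}
P_u = [[0.8, 0.2], [0.5, 0.5]]
print(attribute(P_u, profiles))
```

In this toy case the unknown matrix is much closer to the first profile, so the sketch selects `author_a`.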
Empirical studies report that WANs achieve 92.6% attribution accuracy across six canonical playwrights, surpassing function-word frequency vector methods and principal component analysis (Eisen et al., 2016).
3. Topological and Statistical Properties of WANs
Structural investigation of WANs leverages classic and novel network topology measures:
- Degree and Clustering: The degree $k_i$ is the number of distinct immediate neighbors of word $i$; the clustering coefficient $C_i$ quantifies interconnectedness among those neighbors (Amancio, 2014, Lahiri et al., 2013).
- Average Shortest-Path Length (ASPL): $L$ characterizes global navigability (Kulig et al., 2014, Amancio, 2014, Dec et al., 10 Jan 2026).
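- Burstiness/Intermittency: Measures such as the intermittency $I = \sigma_\tau / \langle \tau \rangle$, the ratio of the standard deviation to the mean of a word's recurrence intervals $\tau$, capture irregularities in the recurrence intervals of words (Amancio, 2014).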
- Other Metrics: Local and global statistics including coreness, shortest-path distributions, betweenness, and degree exponent estimation have been systematically evaluated (Lahiri et al., 2013, Amancio, 2014).
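As an illustration of the first two measures, the sketch below computes degree, the local clustering coefficient, and ASPL for a network stored as a dictionary of neighbor sets. The function names and the four-node example graph are assumptions made for the example; the ASPL is averaged over reachable pairs only, which is one of several possible conventions for disconnected graphs.

```python
from collections import deque
from itertools import combinations

def degree(adj, node):
    """Number of distinct neighbors of a node."""
    return len(adj[node])

def clustering(adj, node):
    """Fraction of neighbor pairs that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def average_shortest_path_length(adj):
    """Mean BFS distance over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

# A triangle a-b-c with a pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(degree(adj, "c"), clustering(adj, "c"), average_shortest_path_length(adj))
```

For this graph, node `c` has degree 3 and clustering 1/3 (only the pair a, b is linked), and the six node pairs have distances summing to 8, giving an ASPL of 4/3.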
Empirical WANs display a characteristic two-regime ASPL curve $L(N)$: linear growth for small network size $N$ (initial chain regime), a crossover maximum, followed by slow decrease and saturation, distinct from standard random or scale-free network models (Kulig et al., 2014, Dec et al., 10 Jan 2026).
4. Modeling WAN Growth and Universal Patterns
Mathematical modeling of WAN evolution incorporates two coupled mechanisms:
- Chain-Growth Regime: Early in text growth, each new word tends to be unique and links linearly to prior words, producing chain-like growth and linear ASPL scaling, $L(N) \propto N$.
- Accelerated/Preferential Attachment Regime: As vocabulary saturates, edge addition between established nodes accelerates, producing densification. Heaps' law, $N \propto T^{\beta}$ with $0 < \beta < 1$ for a text of $T$ tokens, and power-law degree growth are observed. The network transitions to a phase with decreasing $L(N)$ (Kulig et al., 2014, Dec et al., 10 Jan 2026).
To accurately reproduce the observed $L(N)$ curves, models must blend chain-like and accelerated growth. A hybrid model alternating between chain-extension and preferential/hub attachment quantitatively matches empirical data across several languages (Kulig et al., 2014, Dec et al., 10 Jan 2026).
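The interplay of the two mechanisms can be illustrated with a toy simulation. This is not the model of the cited papers: the update rule, the parameter `p_new`, and the function names are assumptions chosen only to show how mixing chain extension with degree-proportional attachment shortens paths relative to a pure chain.

```python
import random
from collections import deque

def grow_hybrid(steps, p_new=0.3, seed=0):
    """Toy growth: a walker at the 'current word' either appends a new node
    (chain extension) or jumps to an existing node chosen with probability
    proportional to its degree, adding an edge in both cases."""
    rng = random.Random(seed)
    adj = {0: set()}
    endpoints = []  # each node appears once per incident edge
    current = 0
    for _ in range(steps):
        if not endpoints or rng.random() < p_new:
            nxt = len(adj)           # brand-new node
            adj[nxt] = set()
        else:
            nxt = rng.choice(endpoints)  # degree-proportional choice
        if nxt != current and nxt not in adj[current]:
            adj[current].add(nxt)
            adj[nxt].add(current)
            endpoints.extend([current, nxt])
        current = nxt
    return adj

def aspl(adj):
    """Average shortest-path length via BFS from every node."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

chain = grow_hybrid(9, p_new=1.0)    # pure chain: a path on 10 nodes
mixed = grow_hybrid(500, p_new=0.3)  # chain steps mixed with hub attachment
print(aspl(chain), len(mixed), aspl(mixed))
```

With `p_new=1.0` the result is a path, whose ASPL grows linearly with the number of nodes as $(N + 1)/3$; lowering `p_new` adds shortcuts between existing nodes and pulls the ASPL well below that value.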
Universal behaviors include: (i) a crossover from local, chain-dominated topology to global, small-world topology; (ii) significant reduction in $L$ when punctuation is retained as high-degree hubs, with particular flattening in Chinese relative to English; and (iii) striking similarity in $L(N)$ curves between originals and translations when punctuation is considered (Dec et al., 10 Jan 2026).
5. Feature Extraction and Machine Learning Applications
WANs provide a rich source of network-derived features for stylometry and machine learning:
- Local (Node-Level) Features: Degree, coreness, and neighborhood size of a curated set of representative words, particularly stopwords or frequent words, have high discriminative power for authorship (Lahiri et al., 2013).
- Global (Summary) Features: Graph-level metrics aggregated into feature vectors. These capture size, connectivity, density, clustering, and reciprocity, among others.
- Dynamic/Time-Varying Features: By analyzing WANs over sliding windows or subtexts, one may identify stylistic fluctuations, authorial shifts, or detect collaborative/plagiarized content (Amancio, 2014).
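A minimal sketch of how such features might be assembled is given below. The choice of features (degree of a few probe words plus node count, edge count, and density), the function names, and the window parameters are illustrative assumptions, not the feature sets used in the cited studies.

```python
def wan_features(tokens, probe_words):
    """Feature vector: degree of each probe word, then nodes, edges, density."""
    adj = {t: set() for t in tokens}
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    n = len(adj)
    m = sum(len(neighbors) for neighbors in adj.values()) // 2
    density = 2.0 * m / (n * (n - 1)) if n > 1 else 0.0
    local = [len(adj.get(w, ())) for w in probe_words]  # 0 if word absent
    return local + [n, m, density]

def sliding_features(tokens, probe_words, size, step):
    """One feature vector per subtext window of `size` tokens."""
    last_start = max(len(tokens) - size, 0)
    return [
        wan_features(tokens[s : s + size], probe_words)
        for s in range(0, last_start + 1, step)
    ]

tokens = "the cat sat on the mat".split()
print(wan_features(tokens, ["the", "on"]))
print(sliding_features(tokens, ["the", "on"], size=3, step=3))
```

The resulting vectors can be passed to any standard classifier; comparing vectors across windows gives a simple view of how the network features drift through a text.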
Supervised classification protocols using WAN features—such as kNN, SVM, Naive Bayes, and decision trees—achieve authorship recognition accuracy up to 86.7% for SVM on subtexts, with 89–90% accuracy on competition data combining local WAN features and standard term-frequency vectors (Amancio, 2014, Lahiri et al., 2013).
6. Comparative Insights and Empirical Findings Across Languages and Genres
Cross-linguistic and genre studies reveal that WAN-based methodologies are robust and adaptable:
- Chinese vs. English: When punctuation is included as nodes, $L$ asymptotes to nearly identical values for both languages; omitting punctuation increases $L$ in both, with a larger effect in Chinese, attributed to the higher frequency of punctuation mark usage in Chinese texts (Dec et al., 10 Jan 2026).
- Stylometry and Authorship: WANs outperform pure frequency-based stylometry, especially in capturing sequential dynamics and collaborative authorship cases. Scene-level attributions in Early Modern English drama closely match or refine prevailing scholarly author attributions (Eisen et al., 2016).
- Stability and Scalability: Topological WAN metrics remain statistically stable across subtext samples on the order of 1,500 tokens. SVM classifiers using subtexts sometimes outperform those trained on complete texts, suggesting that local network patterns are key stylometric features (Amancio, 2014).
7. Broader Implications and Extensions
WANs synthesize complex-systems, network science, and linguistics, providing versatile frameworks for:
- Graph-theoretic modeling of language as a dynamic process, bridging local grammatical effects and global lexical organization.
- Inclusion of punctuation as network elements, revealing their structural function as topological shortcuts and hubs, and their strong influence on global distance measures and small-world properties (Dec et al., 10 Jan 2026).
- Integration of WANs with traditional and deep learning methods, leveraging complementary information from both relational (adjacency) and unigram frequency-based stylometric cues (Segarra et al., 2014, Lahiri et al., 2013).
- Potential identification of stylistic anomalies, temporal structure, and evolution via time-varying WAN analysis (Amancio, 2014).
Taken together, these results suggest that WAN methodologies are widely applicable, with the capacity to illuminate not only authorship and style but also the fundamental topological organization and evolution of written language.