
Word-Adjacency Network (WAN)

Updated 18 May 2026
  • Word-Adjacency Networks are graph-based models where each node represents a word or function word with edges indicating their immediate co-occurrence in text.
  • They utilize statistical and topological techniques, including Markov chain formalism and network metrics like clustering and average shortest-path length, to analyze language structure and authorship.
  • WAN methodologies integrate chain-growth and preferential attachment paradigms, achieving high accuracy in stylometric applications and offering robust cross-linguistic insights.

A Word-Adjacency Network (WAN) is a graph-theoretic representation of text in which nodes correspond to lexical units—typically words or function words—and edges indicate their co-occurrence as adjacent elements within a text. WANs capture both local and global structural properties of language and serve as statistical and topological models for various linguistic phenomena, including stylometric fingerprinting, authorship attribution, and quantitative text analysis. Two principal WAN frameworks are prominent in the literature: (i) general word-adjacency models, where each word type is a node, and (ii) function-word adjacency networks, in which nodes are restricted to a predefined set of function words, with edges capturing directional, weighted co-appearance statistics. WAN-based approaches are extensible across languages and have been applied at multiple granularities and time scales.

1. Definitions and Construction Paradigms

WANs are constructed by mapping sequential linguistic units onto graph structures under well-defined adjacency rules. The most general construction operates as follows:

  • Node Set: Each unique word (token or lemma) becomes a node. Some models include punctuation marks as ordinary nodes, while others exclude them (Dec et al., 10 Jan 2026).
  • Edge Set: An edge (undirected or directed, and typically unweighted) is formed between two nodes if their corresponding words appear adjacent to each other at least once in the text (Kulig et al., 2014, Lahiri et al., 2013, Amancio, 2014). In function-word WANs, a directed edge is established from function word $f_i$ to $f_j$ if $f_j$ follows $f_i$ within a specified window, and edges may be weighted by raw frequency or by a discounted sum that takes proximity into account (Eisen et al., 2016, Segarra et al., 2014).
  • Preprocessing Variants: Some approaches lemmatize tokens and strip stopwords and punctuation (Amancio, 2014); others retain all words and treat punctuation as a delimiter only (Lahiri et al., 2013); recent models explicitly retain punctuation as central network nodes (Dec et al., 10 Jan 2026).

A typical WAN construction pipeline, generalized from Dec et al. (10 Jan 2026), Kulig et al. (2014), and Lahiri et al. (2013), proceeds as follows:

  1. Tokenize the text into a sequence $T = (t_1, t_2, \ldots, t_N)$.
  2. For each consecutive pair $(t_i, t_{i+1})$, create nodes (if not already present) and insert an edge between $t_i$ and $t_{i+1}$.
  3. Optionally, use a windowed adjacency: insert edges between tokens that appear within $k$ positions of each other.
  4. For function-word WANs, form a directed, weighted adjacency matrix capturing the (possibly discounted) frequency of $f_j$ following $f_i$ (Segarra et al., 2014, Eisen et al., 2016).
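
Steps 1 to 3 of the pipeline can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the cited papers: the regex tokenizer, the function names `tokenize` and `build_wan`, the `keep_punctuation` and `window` parameters, and the choice to skip self-loops are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text, keep_punctuation=True):
    """Lowercase the text and split it into tokens.

    With keep_punctuation=True each punctuation mark becomes its own token,
    so it will later appear as an ordinary node in the network.
    """
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    return re.findall(pattern, text.lower())


def build_wan(tokens, window=1):
    """Return an undirected, unweighted WAN as {word: set_of_neighbors}.

    window=1 links only immediately adjacent tokens; a larger value links
    tokens that occur within `window` positions of each other.
    """
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        graph[word]  # register the node even if it gets no edges
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:  # assumption: ignore self-loops
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)


tokens = tokenize("The cat sat. The dog sat.")
wan = build_wan(tokens)
print(sorted(wan["the"]))  # neighbors of "the"
```

Representing the network as a dictionary of neighbor sets keeps the sketch dependency-free; a graph library could be substituted without changing the construction logic.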

2. Function-Word WANs: Markov Chain Formalism and Stylometric Application

Function-word adjacency networks (function-word WANs) restrict nodes to a fixed, typically small set $F$ of function words (articles, prepositions, conjunctions, pronouns, auxiliaries) (Eisen et al., 2016, Segarra et al., 2014). The edge-weighted adjacency matrix $W_t$ for a text $t$ is defined by counting, potentially within a sliding window and using an exponential decay parameter $\alpha$, how often $f_j$ appears after $f_i$.

Weighted co-occurrences are normalized row-wise to yield a stochastic transition matrix $P_t$, interpretable as a discrete-time Markov chain on the function-word set $F$. This matrix encodes an author's sequential preferences in using function words, operationalizing authorial style as a probabilistic process (Eisen et al., 2016).
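
The counting and normalization described above can be illustrated as follows. This is a sketch under stated assumptions: the discount $\alpha^{d-1}$ for a follower at distance $d$, the default `window` and `alpha` values, the function names, and the uniform fallback for function words that never precede another function word are all choices made for the example and may differ from the cited work.

```python
def function_word_wan(tokens, function_words, window=10, alpha=0.75):
    """Directed, weighted adjacency matrix over a fixed function-word list.

    W[i][j] accumulates alpha**(d - 1) each time function word j occurs
    d positions (1 <= d <= window) after function word i.
    """
    index = {w: i for i, w in enumerate(function_words)}
    n = len(function_words)
    W = [[0.0] * n for _ in range(n)]
    for pos, word in enumerate(tokens):
        if word not in index:
            continue
        i = index[word]
        for d in range(1, window + 1):
            if pos + d >= len(tokens):
                break
            follower = tokens[pos + d]
            if follower in index:
                W[i][index[follower]] += alpha ** (d - 1)
    return W


def row_normalize(W):
    """Turn weighted counts into a row-stochastic transition matrix."""
    P = []
    for row in W:
        total = sum(row)
        if total > 0:
            P.append([x / total for x in row])
        else:
            # Assumption: a row with no observations becomes uniform.
            P.append([1.0 / len(row)] * len(row))
    return P


tokens = "the cat sat on the mat and the dog".split()
W = function_word_wan(tokens, ["the", "on", "and"], window=2, alpha=0.5)
P = row_normalize(W)
```

Each row of `P` sums to one, so it can be read as the transition probabilities of a Markov chain whose states are the chosen function words.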

For authorship attribution:

  1. Aggregate the text-level matrices $W_t$ over a securely attributed corpus to form author profile matrices $W_a$, and normalize them to obtain author transition matrices $P_a$.
  2. For an unknown text $u$, construct its transition matrix $P_u$.
  3. Compute the Kullback–Leibler (relative entropy) divergence between $P_u$ and each $P_a$.
  4. Assign authorship to the profile with minimal divergence: $\hat{a} = \arg\min_a D_{\mathrm{KL}}(P_u \,\|\, P_a)$ (Eisen et al., 2016, Segarra et al., 2014).
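
The comparison and assignment steps can be sketched as below. Two simplifications are assumed here and are not taken from the cited papers: rows are weighted uniformly by default (one could instead pass the stationary distribution of the unknown text's chain as `weights`), and terms where either probability is zero are skipped rather than smoothed. The names `relative_entropy` and `attribute` are illustrative.

```python
import math


def relative_entropy(P, Q, weights=None):
    """Row-weighted KL divergence between two transition matrices.

    Terms with a zero probability on either side are skipped (assumption),
    so the result is an approximation when the matrices are sparse.
    """
    n = len(P)
    if weights is None:
        weights = [1.0 / n] * n  # assumption: uniform row weights
    total = 0.0
    for i in range(n):
        for j in range(n):
            p, q = P[i][j], Q[i][j]
            if p > 0 and q > 0:
                total += weights[i] * p * math.log(p / q)
    return total


def attribute(P_unknown, profiles):
    """Return the author whose profile minimizes the divergence.

    profiles maps an author name to that author's transition matrix.
    """
    return min(profiles, key=lambda a: relative_entropy(P_unknown, profiles[a]))


P_u = [[0.9, 0.1], [0.2, 0.8]]
profiles = {
    "A": [[0.8, 0.2], [0.3, 0.7]],
    "B": [[0.1, 0.9], [0.9, 0.1]],
}
print(attribute(P_u, profiles))
```

In this toy example the unknown matrix is much closer to profile "A" than to profile "B", so "A" is selected.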

Empirical studies report that WANs achieve 92.6% attribution accuracy across six canonical playwrights, surpassing function-word frequency vector methods and principal component analysis (Eisen et al., 2016).

3. Topological and Statistical Properties of WANs

Structural investigation of WANs leverages classic and novel network topology measures:

  • Degree and Clustering: The degree $k_i$ is the number of distinct immediate neighbors of word $i$; the clustering coefficient $C_i$ quantifies interconnectedness among those neighbors (Amancio, 2014, Lahiri et al., 2013).
  • Average Shortest-Path Length (ASPL): The ASPL $L$ characterizes global navigability (Kulig et al., 2014, Amancio, 2014, Dec et al., 10 Jan 2026).
  • Burstiness/Intermittency: Intermittency measures capture irregularities in the recurrence intervals of words (Amancio, 2014).
  • Other Metrics: Local and global statistics including coreness, shortest-path distributions, betweenness, and degree exponent estimation have been systematically evaluated (Lahiri et al., 2013, Amancio, 2014).
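
As a concrete illustration of the two node-level measures, the sketch below computes degree and the local clustering coefficient on a network stored as a dictionary of neighbor sets. The storage format and function names are assumptions for the example; the clustering formula is the standard one for undirected graphs (fraction of neighbor pairs that are themselves linked).

```python
from itertools import combinations


def degree(graph, node):
    """Number of distinct immediate neighbors of a node."""
    return len(graph[node])


def clustering(graph, node):
    """Fraction of pairs of neighbors that are themselves connected."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention: undefined cases are reported as 0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return 2.0 * links / (k * (k - 1))


# Toy network: a triangle a-b-c with an extra node d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(degree(graph, "a"), clustering(graph, "a"))
```

Node "a" has three neighbors, of which only the pair (b, c) is linked, so one of its three neighbor pairs is closed.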

Empirical WANs display a characteristic two-regime ASPL as a function of text length $N$: linear growth for small $N$ (the initial chain regime), a crossover maximum, and then a slow decrease and saturation. This behavior is distinct from that of standard random or scale-free network models (Kulig et al., 2014, Dec et al., 10 Jan 2026).
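
One way to observe this behavior is to grow the network token by token and record the ASPL at chosen text lengths. The following is a small sketch, assuming an undirected, unweighted WAN with immediate adjacency only; the function names and the `checkpoints` parameter are illustrative, and the breadth-first search is recomputed from scratch at each checkpoint, which is adequate only for short texts.

```python
from collections import defaultdict, deque


def aspl(graph):
    """Average shortest-path length over all connected pairs of nodes."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def aspl_curve(tokens, checkpoints):
    """Return {N: ASPL of the WAN built from the first N tokens}."""
    graph = defaultdict(set)
    wanted = set(checkpoints)
    curve = {}
    for n, word in enumerate(tokens, start=1):
        graph[word]  # register the node
        if n > 1 and tokens[n - 2] != word:
            graph[word].add(tokens[n - 2])
            graph[tokens[n - 2]].add(word)
        if n in wanted:
            curve[n] = aspl(graph)
    return curve


# Four distinct words form a chain; repeating "a" closes it into a cycle.
print(aspl_curve(["a", "b", "c", "d", "a"], checkpoints=[2, 3, 4, 5]))
```

While every token is new, the network is a simple chain and the ASPL rises with each added word; the first repeated word adds a shortcut and the ASPL drops, which is the mechanism behind the crossover described above.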

4. Modeling WAN Growth and Universal Patterns

Mathematical modeling of WAN evolution incorporates two coupled mechanisms:

  • Chain-Growth Regime: Early in text growth, each new word tends to be unique and links linearly to prior words, producing chain-like growth and linear ASPL scaling, $L(N) \propto N$.
  • Accelerated/Preferential Attachment Regime: As vocabulary saturates, edge addition between established nodes accelerates, producing densification. Heaps' law, $V(N) \propto N^{\beta}$ with $\beta < 1$ for vocabulary size $V$, and power-law degree growth are observed. The network transitions to a phase with decreasing $L$ (Kulig et al., 2014, Dec et al., 10 Jan 2026).

To accurately reproduce the observed $L(N)$ curves, models must blend chain-like and accelerated growth. A hybrid model alternating between chain-extension and preferential/hub attachment quantitatively matches empirical data across several languages (Kulig et al., 2014, Dec et al., 10 Jan 2026).

Universal behaviors include: (i) a crossover from local, chain-dominated topology to global, small-world topology; (ii) significant reduction in the ASPL $L$ when punctuation is retained as high-degree hubs, with particular flattening in Chinese relative to English; and (iii) striking similarity in $L(N)$ curves between originals and translations when punctuation is considered (Dec et al., 10 Jan 2026).

5. Feature Extraction and Machine Learning Applications

WANs provide a rich source of network-derived features for stylometry and machine learning:

  • Local (Node-Level) Features: Degree, coreness, and neighborhood size of a curated set of representative words, particularly stopwords or frequent words, have high discriminative power for authorship (Lahiri et al., 2013).
  • Global (Summary) Features: Graph-level metrics aggregated into feature vectors. These capture size, connectivity, density, clustering, and reciprocity, among others.
  • Dynamic/Time-Varying Features: By analyzing WANs over sliding windows or subtexts, one may identify stylistic fluctuations, authorial shifts, or detect collaborative/plagiarized content (Amancio, 2014).
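
The three feature families can be combined into a simple feature vector, as in the sketch below. It is an assumption-laden illustration rather than a protocol from the cited studies: the probe-word list, the particular global statistics (node count, edge count, density), and the window `size` and `step` parameters are all hypothetical choices.

```python
from collections import defaultdict


def build_graph(tokens):
    """Undirected WAN linking immediately adjacent, distinct tokens."""
    graph = defaultdict(set)
    for prev, word in zip(tokens, tokens[1:]):
        if prev != word:
            graph[prev].add(word)
            graph[word].add(prev)
    for word in tokens:
        graph[word]  # make sure every token is a node
    return dict(graph)


def wan_features(tokens, probe_words):
    """Local degrees of the probe words followed by three global statistics."""
    graph = build_graph(tokens)
    n = len(graph)
    m = sum(len(nb) for nb in graph.values()) // 2
    density = 2.0 * m / (n * (n - 1)) if n > 1 else 0.0
    local = [len(graph.get(w, ())) for w in probe_words]
    return local + [n, m, density]


def sliding_features(tokens, probe_words, size, step):
    """One feature vector per window of `size` tokens, advanced by `step`."""
    return [
        wan_features(tokens[s:s + size], probe_words)
        for s in range(0, len(tokens) - size + 1, step)
    ]


tokens = "the cat and the dog and the bird".split()
print(wan_features(tokens, ["the", "and", "of"]))
print(len(sliding_features(tokens, ["the", "and", "of"], size=4, step=2)))
```

The resulting vectors can be passed to any standard classifier; comparing vectors from successive windows gives a rough view of how the network statistics drift through a text.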

Supervised classification protocols using WAN features—such as kNN, SVM, Naive Bayes, and decision trees—achieve authorship recognition accuracy up to 86.7% for SVM on subtexts, with 89–90% accuracy on competition data combining local WAN features and standard term-frequency vectors (Amancio, 2014, Lahiri et al., 2013).

6. Comparative Insights and Empirical Findings Across Languages and Genres

Cross-linguistic and genre studies reveal that WAN-based methodologies are robust and adaptable:

  • Chinese vs. English: When punctuation is included as nodes, the ASPL $L$ asymptotes to nearly identical values for both languages; omitting punctuation increases $L$ in both, with a larger effect in Chinese, attributed to the higher frequency of punctuation mark usage in Chinese texts (Dec et al., 10 Jan 2026).
  • Stylometry and Authorship: WANs outperform pure frequency-based stylometry, especially in capturing sequential dynamics and collaborative authorship cases. Scene-level attributions in Early Modern English drama closely match or refine prevailing scholarly author attributions (Eisen et al., 2016).
  • Stability and Scalability: Topological WAN metrics remain statistically stable across subtext samples on the order of 1,500 tokens. SVM classifiers using subtexts sometimes outperform those trained on complete texts, suggesting that local network patterns are key stylometric features (Amancio, 2014).

7. Broader Implications and Extensions

WANs synthesize complex-systems, network science, and linguistics, providing versatile frameworks for:

  • Graph-theoretic modeling of language as a dynamic process, bridging local grammatical effects and global lexical organization.
  • Inclusion of punctuation as network elements, revealing their structural function as topological shortcuts and hubs, and their strong influence on global distance measures and small-world properties (Dec et al., 10 Jan 2026).
  • Integration of WANs with traditional and deep learning methods, leveraging complementary information from both relational (adjacency) and unigram frequency-based stylometric cues (Segarra et al., 2014, Lahiri et al., 2013).
  • Potential identification of stylistic anomalies, temporal structure, and evolution via time-varying WAN analysis (Amancio, 2014).

This suggests a wide applicability of WAN methodologies, with capacity to illuminate not only authorship and style but also the fundamental topological organization and evolution of written language.
