Papers
Topics
Authors
Recent
Search
2000 character limit reached

Stylometric Feature Engineering in Networks

Updated 13 May 2026
  • Stylometric feature engineering is the process of extracting quantitative writing style representations from text corpora using word-adjacency networks.
  • It integrates local network measures like normalized token strength and weighted clustering to create interpretable stylistic signatures across languages.
  • Data-driven feature selection combined with ensemble classifiers has demonstrated attribution accuracies exceeding 90% in robust authorship analysis.

Stylometric feature engineering is the process of extracting quantitative representations of individual writing style from text corpora, enabling automated authorship attribution, author profiling, and related text forensics. Contemporary stylometric systems span lexical, syntactic, and structural levels, but recent advances have shown that network-based representations of text—particularly word-adjacency networks—yield powerful, interpretable, and compact features that generalize across languages and genres. The following sections present the theory, methodology, and practical deployment of stylometric feature engineering using complex network measures, as established by research in linguistic data mining with complex networks (Stanisz et al., 2018).

1. Word-Adjacency Network Construction

Stylometric network-based analysis begins by transforming a text into a word-adjacency network. Each unique token—either a word or a retained punctuation mark (such as special tokens representing periods, commas, dashes, etc.)—forms a network node. Preprocessing involves:

  • Removing non-textual content (annotations, headers, footers).
  • Lowercasing all tokens and collapsing whitespace.
  • Replacing sentence-ending and major pause-marking punctuation with explicit tokens (e.g., "#dot" for period).
  • Omitting or collapsing less-relevant punctuation (e.g., quotation marks are ignored).

The network G = (V, E, w) is constructed by scanning the tokenized text linearly and, for every adjacent token pair (v,u) with v ≠ u, incrementing the weight w_{v,u} by one. The result is an undirected weighted network encoding adjacency frequencies; edge weights reflect the number of local co-occurrences. The unweighted analog simply records which token-pairs ever occur adjacently, ignoring multiplicity.

2. Complex Network-Based Stylometric Feature Definitions

Quantification of writing style is achieved by extracting numerical descriptors—both local (vertex-level) and global (network-level)—from the word-adjacency network. All features are computed on this empirical network and then normalized by comparison to random text surrogates obtained via word shuffling, ensuring sensitivity to non-trivial stylistic structure.

2.1 Local (Vertex-Level) Measures

  • Weighted Degree (Strength): For node v_i, the weighted degree (strength) str(v_i) = ∑{j∈N(i)} w{i,j} measures aggregate adjacent-token co-occurrence frequency. Its normalized form,

str~(vi)=str(vi)uVstr(u)\widetilde{\operatorname{str}}(v_i) = \frac{\operatorname{str}(v_i)}{\sum_{u \in V}\operatorname{str}(u)}

directly equals the relative frequency of that token, establishing equivalence to classic stylometry but grounded in network representation.

  • Weighted Clustering Coefficient: A central measure, calculating the density of triangles (i.e., closed triplets) among v_i’s neighborhood, is given for weighted graphs by:

Ciw=1str(vi)(deg(vi)1)j,kN(i)wij+wik2aijajkakiC_i^w = \frac{1}{\operatorname{str}(v_i)(\deg(v_i)-1)} \sum_{j,k \in N(i)} \frac{w_{ij} + w_{ik}}{2} a_{ij} a_{jk} a_{ki}

Normalized as C~iw=Ciw/Cwrandom\widetilde{C}_i^w = C_i^w / \langle C^w \rangle_{\text{random}}. The unweighted analog uses the number of actual neighbor connections m_i.

  • Local Average Shortest-Path Length: For node v_i, this is

i=1V1jid(i,j)\ell_i = \frac{1}{|V|-1} \sum_{j \neq i} d(i, j)

where d(i, j) is shortest-path length, with weighted paths defined as inverses of edge weights (1/w_{ij}). Normalization is performed against randomized surrogates.

Additional optional measures include betweenness and closeness centralities, but these display lower discriminative power in authorship attribution.

2.2 Global (Network-Level) Measures

  • Global Clustering: Mean over all nodes, normalized:

C~=C/Crandom\widetilde{C} = C / \langle C \rangle_{\text{random}}

  • Average Shortest-Path Length: As above, globally averaged and normalized.
  • Assortativity (Degree Correlation): Pearson correlation of node degrees (or strengths), across edges, again normalized by random baselines.
  • Modularity (Louvain): Measures the strength of community structure; computed by the standard modularity formula, adapted for weighted networks and normalized.

3. Feature Selection and Network-Based Stylistic Fingerprints

Empirical analysis demonstrates that robust attribution accuracy (>90%) is achievable using only a handful of local features from the most frequent tokens—typically the top 10–12, including punctuation tokens. For these, one extracts per-token normalized strength and weighted clustering coefficients, optionally augmented by local unweighted clustering and shortest-path measures. This selection is justified because the most frequent tokens serve as stylistic "hubs": their adjacency and local subgraph structure differentially encode recurrent phraseology, syntactic and functional deployment, and idiosyncratic punctuation use.

Marginal gains in attribution accuracy diminish rapidly beyond 10–12 tokens, and inclusion of punctuation is essential; omission reduces performance by ~15 percentage points. The selection process is data-driven, applying frequency cutoff and ranking by discriminative ability.

4. Classification Framework and Model Performance

Stylometric network features are composed into fixed-length vectors (e.g., concatenating str~\widetilde{\operatorname{str}} and C~w\widetilde{C}^w for top-n tokens), possibly joined with global network statistics. Authorship attribution is performed using an ensemble of decision trees with bagging; training splits are balanced and performance is evaluated via stratified Monte Carlo cross-validation.

  • Global network features alone yield only moderate accuracy (~35–44% depending on language and weighting), reflecting their generality.
  • Local features derived from the top tokens—specifically, C~w\widetilde{C}^w and str~\widetilde{\operatorname{str}}—achieve up to 93% accuracy for English and 86% for Polish with as few as five tokens. The range 10–12 tokens is optimal for robust, language-agnostic performance.

5. Language Sensitivity and Comparative Analysis

Network-based stylometric features generalize across languages but interact with typological properties. In inflectional languages like Polish, fewer top tokens are required: rich inflection diffuses linguistic signal across more types, emphasizing individual token neighborhoods. In languages with rigid word order (English), differences in weighted co-occurrence patterns are sharper, making weighted network features especially potent. Importantly, the weighted average shortest-path length is not discriminative, due to densification among frequent token “cliques.”

Punctuation tokens systematically contribute about 10% of attribution accuracy, validating their inclusion as first-class stylistic markers. Removal leads to significant accuracy deterioration.

6. Practical Guidelines and Engineering Pipeline

The recommended workflow for stylometric feature engineering via word-adjacency networks is as follows:

  1. Normalize and preprocess the corpus: strip nontextual material, lowercase, tokenize, replace key punctuation with tokens.
  2. Construct the word-adjacency network, computing all (weighted) co-occurrence counts.
  3. Identify and select the n ≈ 10–12 most frequent tokens.
  4. For each selected token, extract:
    • Normalized strength (str~\widetilde{\operatorname{str}})
    • Normalized weighted clustering coefficient (Ciw=1str(vi)(deg(vi)1)j,kN(i)wij+wik2aijajkakiC_i^w = \frac{1}{\operatorname{str}(v_i)(\deg(v_i)-1)} \sum_{j,k \in N(i)} \frac{w_{ij} + w_{ik}}{2} a_{ij} a_{jk} a_{ki}0)
    • Optionally, Ciw=1str(vi)(deg(vi)1)j,kN(i)wij+wik2aijajkakiC_i^w = \frac{1}{\operatorname{str}(v_i)(\deg(v_i)-1)} \sum_{j,k \in N(i)} \frac{w_{ij} + w_{ik}}{2} a_{ij} a_{jk} a_{ki}1 and Ciw=1str(vi)(deg(vi)1)j,kN(i)wij+wik2aijajkakiC_i^w = \frac{1}{\operatorname{str}(v_i)(\deg(v_i)-1)} \sum_{j,k \in N(i)} \frac{w_{ij} + w_{ik}}{2} a_{ij} a_{jk} a_{ki}2
  5. For baseline normalization, shuffle each text k ≈ 50 times, recompute networks/features, and use the random ensemble average for normalization.
  6. Assemble feature vectors and train a bagged-tree classifier; validate via repeated stratified splits.
  7. To enhance accuracy, increase n or combine with traditional bag-of-words features.

This protocol yields compact, interpretable, and highly discriminative style signatures suitable for cross-linguistic authorship attribution and robust even in moderate-sample regimes (Stanisz et al., 2018). The approach systematically generalizes classic lexical stylometry by leveraging local structural topology in the token co-occurrence graph, delivering both computational efficiency and theoretical grounding.


Key Reference:

"Linguistic data mining with complex networks: a stylometric-oriented approach" (Stanisz et al., 2018)

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Stylometric Feature Engineering.