
StyloMetrix: Interpretable Stylometric Vectors

Updated 10 December 2025
  • StyloMetrix is a system that computes high-dimensional, interpretable vectors capturing grammatical, syntactic, and lexical features from text.
  • It uses a modular Python pipeline with spaCy to perform rule-based feature extraction across languages including Polish, English, Ukrainian, and Russian.
  • Empirical evidence shows its effectiveness in tasks like authorship attribution, adversarial stylometry, code defect prediction, and poetry style modeling.

StyloMetrix is an open-source, multilingual system for extracting interpretable stylometric vectors from text. It provides a normalized, high-dimensional representation of grammatical, syntactic, and lexical features, supporting empirical authorship attribution, machine-generated text detection, adversarial stylometry, code defect prediction, and poetry style modeling. Serving as both an explanatory and predictive tool, StyloMetrix yields language-specific features in transparent vector formats suitable for machine learning and deep learning pipelines. Its design prioritizes feature interpretability and scalability across domains and languages, including English, Polish, Ukrainian, and Russian (Okulska et al., 2023).

1. Architecture, Language Support, and Feature Extraction

StyloMetrix operates as a modular, extensible Python library, leveraging spaCy for tokenization, POS/morphological tagging, and dependency parsing in supported languages (Polish, English, Ukrainian, Russian). For Polish, specialized models with richer morpho-syntactic annotations are employed; for Ukrainian, hand-crafted rule extensions supplement spaCy’s standard tags to accurately capture fusional grammar (case, conjugation, aspect) (Stetsenko et al., 2023, Okulska et al., 2023).

The unified processing pipeline includes:

  • Preprocessing: Raw text is tokenized, lowercased, and subjected to part-of-speech and morphosyntactic analysis. For code, AST-based parsing and identifier extraction replace natural language pipelines (Yasir et al., 2022).
  • Custom rule sets: Manually defined feature rules operate on spaCy token, morphological, and dependency attributes, annotating linguistic phenomena (e.g., tense, aspect, voice, pronoun types, noun phrase structures, direct speech, parataxis, ellipsis).
  • Metric evaluation and normalization: Each metric computes either a count or ratio over all tokens in the document:

$$F_i(d) = \frac{\sum_{t \in T_d} \mathbf{1}[\text{rule}_i(t)]}{|T_d|},$$

where $T_d$ is the set of tokens in document $d$ and $\text{rule}_i$ is the indicator for metric $i$.

This normalization guarantees output values in [0, 1] and enables robust comparison across documents of variable length.
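The per-metric normalization above can be sketched in a few lines. The whitespace tokenizer and the small function-word list below are illustrative stand-ins for StyloMetrix's spaCy-based rule machinery, not the tool's actual implementation:

```python
# Minimal sketch of a normalized metric F_i(d): the fraction of tokens in a
# document that satisfy rule_i. Tokenizer and rule set are toy stand-ins.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to"}

def metric(document: str, rule) -> float:
    tokens = document.lower().split()
    if not tokens:
        return 0.0  # map empty documents to 0 rather than dividing by zero
    return sum(1 for t in tokens if rule(t)) / len(tokens)

value = metric("The cat sat on the mat", lambda t: t in FUNCTION_WORDS)
# value is 3/6 = 0.5: "the", "on", "the" match out of six tokens
```

Because every metric is a count divided by the token total, each output dimension is directly comparable across documents of different lengths.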

Language-specific feature inventories are as follows (Okulska et al., 2023):

| Language | Metric count | Categories summarized |
|---|---|---|
| Polish | 172 | Grammatical forms, inflection, syntax, lexicon, psycholinguistic, punctuation |
| English | 196 | Tenses, voice, modals, pronouns, POS, syntactic forms, social-media markers |
| Ukrainian | 104 | Cases, animacy, pronoun types, syntax (incl. parataxis/ellipsis), verb forms |
| Russian | 104 | Comparable to Ukrainian, adapted for six-case system |

Outputs are CSV tables with metric names as columns and normalized counts per document. A debug mode records contributing tokens for audits.
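Given that CSV layout (metric names as columns, one normalized value per document), the output can be consumed with the standard library alone. The sample rows below are fabricated; the two metric names follow the tool's naming scheme as cited elsewhere in this article:

```python
# Sketch of loading StyloMetrix CSV output into a feature matrix.
# Sample values are fabricated for illustration.
import csv
import io

sample = io.StringIO(
    "doc_id,ST_TYPE_TOKEN_RATIO_LEMMAS,L_FUNC_A\n"
    "doc1,0.62,0.31\n"
    "doc2,0.48,0.27\n"
)
reader = csv.DictReader(sample)
rows = list(reader)
feature_names = [c for c in reader.fieldnames if c != "doc_id"]
X = [[float(row[c]) for c in feature_names] for row in rows]  # one row per doc
```

The resulting matrix `X` can be passed directly to scikit-learn estimators, as Section 4 describes.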

2. Scope of Stylometric Metrics

The StyloMetrix feature set is comprehensive and language-aware. Each metric quantifies a linguistically motivated unit or pattern, allowing the system to robustly encode stylistic signal. Feature areas include (Okulska et al., 2023, Stetsenko et al., 2023):

  • Lexical metrics: Type-token ratios (lemma-based and surface-based), comparative adjectives/adverbs, named entity counts (proper names, numerals), function and content word rates/types (e.g., L_FUNC_A, L_FUNC_T, L_CONT_A, L_CONT_T), and social-media markers.
  • Grammatical metrics: Detailed tense/aspect/voice/modal counts, conjugation class indicators, participle/infinitive distributions, case-specific and agreement phenomena.
  • Syntactic metrics: Sentence type frequencies (interrogative, negative, narrative), discourse structure (parataxis, fronting, direct speech), noun phrase embedding depth, negation detection, coordination/subordination patterns.
  • Punctuation and graphical metrics: Frequency of dots, commas, semicolons, quotation marks, use of emojis, hashtags, and other digital-era features.
  • Code-specific metrics: Identifier naming statistics (mean/variance), whitespace/indentation regularity, comment density, control structure depth, and token length profiles (Yasir et al., 2022).
  • Poetry metrics (specialized pipelines): Orthographic (word/line counts, averages), syntactic (POS distributions), phonemic (rhyme, alliteration rates), and metrical features for verse (Kaplan et al., 2023, Nagy, 2019).
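As a concrete instance of the lexical metrics, the surface- and lemma-based type-token ratios can be sketched as follows; the whitespace tokenizer and the toy lemma table stand in for spaCy's lemmatizer:

```python
# Surface- vs lemma-based type-token ratio, two of the lexical metrics above.
# The lemma table is a toy stand-in for spaCy lemmatization.
LEMMAS = {"cats": "cat", "ran": "run", "running": "run"}

def ttr(tokens: list[str]) -> float:
    return len(set(tokens)) / len(tokens) if tokens else 0.0

tokens = "the cat ran and the cats running".lower().split()
surface_ttr = ttr(tokens)                            # distinct word forms
lemma_ttr = ttr([LEMMAS.get(t, t) for t in tokens])  # distinct lemmas
```

The lemma-based variant collapses inflected forms ("cats" → "cat", "ran"/"running" → "run"), so it is always less than or equal to the surface-based ratio.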

3. Empirical Performance and Applications

StyloMetrix features are validated by their predictive efficacy and explanatory power across tasks:

  • Human vs. LLM text differentiation: StyloMetrix enables accurate multiclass discrimination (MCC up to 0.74 for 7-class LLM/Wikipedia discrimination; binary accuracy up to 0.99 for Wiki vs. LLaMa 2) (Przystalski et al., 1 Jul 2025). Lexical diversity metrics (type–token ratio), function word use, punctuation norms, and noun-phrase complexity rank highest in Shapley importance explanations.
  • Adversarial stylometry and authorship obfuscation: On Reddit comments anonymized with TraceTarnish, Information Gain ranking isolates five top-performing metrics: ST_TYPE_TOKEN_RATIO_LEMMAS, L_CONT_T, L_FUNC_A, L_CONT_A, L_FUNC_T, with IG scores up to 0.67. Drops in these features reliably indicate stylometric tampering (Dilworth, 3 Dec 2025).
  • Code defect prediction: Sixty code-level style metrics (naming, indentation, comment density, etc.) serve as robust defect predictors (Decision Tree: F1=78.3% within-project) (Yasir et al., 2022). Feature normalizations and VIF-based selection abate multicollinearity.
  • Poetry and literary analysis: 84–196 feature vectors (incl. rhyme, alliteration, function-word usage, metrical patterns) outperform n-gram/Cosine/TF-IDF baselines in clustering and author attribution of American poems and Latin hexameter (Kaplan et al., 2023, Nagy, 2019).
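The VIF-based selection mentioned for the defect-prediction setup can be sketched as follows: each feature's variance inflation factor is computed by regressing it on the remaining features, and features above a threshold are dropped. This is a hedged illustration in pure NumPy, not the cited pipeline; the data, seed, and threshold of 10 are arbitrary:

```python
# Variance-inflation-factor screening: VIF_j = 1 / (1 - R_j^2), where R_j^2
# comes from regressing feature j on all other features.
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([others, np.ones(len(X))])  # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return 1.0 / max(1.0 - r2, 1e-12)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X = np.column_stack([X, X[:, 0] + 0.01 * rng.normal(size=200)])  # near-duplicate
keep = [j for j in range(X.shape[1]) if vif(X, j) < 10.0]  # drops columns 0 and 3
```

The near-duplicate column inflates the VIF of both copies far above the threshold, so only the genuinely independent features survive.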

Example empirical results:

| Classification task | Classifier | Metric set | Performance |
|---|---|---|---|
| Wikipedia vs GPT-4 (10 sent.) | LightGBM | StyloMetrix | Accuracy 0.94 |
| Defect prediction (C++ files) | Decision Tree | StyloMetrix-60 | F1 0.78 |
| Verse author attribution | SVM/LogReg | Metric chunking | Accuracy ≥ 0.95 |
| Poetry author clustering | PCA+Euclidean | StyloMetrix-84 | Δ = 5.2 |

4. Integration with Machine Learning and Deep Learning

StyloMetrix vectors are directly usable in both classical and neural architectures:

  • Classical ML: CSV outputs can be loaded into scikit-learn, and modeled via Random Forests, Voting Classifiers, Decision Trees, SVM, and Logistic Regression (Okulska et al., 2023, Stetsenko et al., 2023). Feature importance and explainability (via Shapley/DALEX) are supported end-to-end.
  • Transformer/embedding fusion: StyloMetrix vectors are concatenated to deep transformer pooled outputs (e.g., RoBERTa, HateBERT) for input to a prediction head. Empirical results on hate-speech tasks show that appending StyloMetrix features yields consistent gains in both weighted and average F1 (Okulska et al., 2023). Example integration:
    # `transformer`, `stylo_vector`, and `mlp` are placeholders for a pooled
    # transformer encoder, the StyloMetrix extractor, and a prediction head.
    import numpy as np

    h = transformer.encode(text)   # pooled contextual embedding
    s = stylo_vector(text)         # normalized StyloMetrix vector
    z = np.concatenate([h, s])     # fused representation
    out = mlp(z)                   # prediction head
  • Low-rank embedding: StyloMetrix-style feature sets can be compressed via Reduced-Rank Ridge Regression (R⁴), yielding interpretable low-dimensional latent style factors (r = 24) with performance comparable to full feature sets, facilitating compact, explainable models (SLIM-LLMs) (Khalid et al., 4 Aug 2025).
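The low-rank idea behind reduced-rank ridge regression can be sketched as: fit ridge coefficients mapping stylometric features X to targets Y, then project them onto the top-r directions of the fitted responses via SVD. This is an illustrative NumPy sketch under standard reduced-rank-regression assumptions, not the cited R⁴ implementation; the shapes, seed, and rank are arbitrary:

```python
# Reduced-rank ridge regression sketch: ridge solution truncated to rank r
# by projecting onto the leading right-singular directions of the fit X @ B.
import numpy as np

def reduced_rank_ridge(X, Y, alpha=1.0, r=2):
    p = X.shape[1]
    B = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ Y)  # ridge solution
    _, _, Vt = np.linalg.svd(X @ B, full_matrices=False)
    P = Vt[:r].T @ Vt[:r]          # projection onto top-r response directions
    return B @ P                   # rank-<=r coefficient matrix

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
Y = X @ rng.normal(size=(8, 4))    # synthetic linear targets
B_r = reduced_rank_ridge(X, Y, r=2)
```

The rank constraint forces all responses to be predicted through a shared r-dimensional latent space, which is what makes the resulting style factors compact and inspectable.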

5. Explainability, Feature Selection, and Forensic Analysis

The interpretability of StyloMetrix arises from direct mapping of each vector dimension to a concise linguistic rule. Feature selection for forensic or adversarial stylometry typically employs entropy- or information gain–based methods. Notably (Dilworth, 3 Dec 2025):

  • Five isolated metrics (function-word/content-word rates/types, type–token ratio) are top indicators of authorship obfuscation or stylometric compromise.
  • SHAP importance and DALEX tools link classification outcomes to human-readable features, allowing analysts to understand and defend against adversarial text transformations or LLM model attribution (Przystalski et al., 1 Jul 2025).
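The entropy-based selection described above can be illustrated with a minimal information-gain computation over a discretized feature. The data and binning below are synthetic; real pipelines would bin continuous StyloMetrix values first:

```python
# Information gain of a binned feature against a binary tamper label:
# IG = H(labels) - sum_b p(b) * H(labels | bin = b).
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_bins, labels):
    n = len(labels)
    gain = entropy(labels)
    for b in set(feature_bins):
        subset = [y for x, y in zip(feature_bins, labels) if x == b]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# A feature that perfectly separates tampered (1) from original (0) texts
bins = ["low", "low", "high", "high"]
labels = [1, 1, 0, 0]
ig = info_gain(bins, labels)  # 1.0 bit: the feature fully determines the label
```

A sharp drop in such a high-IG feature (e.g., a depressed function-word rate) is exactly the kind of signal flagged as stylometric tampering above.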

StyloMetrix’s explicit, theory-grounded features stand in contrast to n-gram or neural representations: while the latter may yield higher raw predictive metrics in some multiclass tasks (MCC 0.87 vs 0.74), StyloMetrix enables transparent audits, model interpretability, and rapid language/genre transfer (Okulska et al., 2023).

6. Limitations, Extensions, and Future Directions

StyloMetrix, while robust and extensible, exhibits certain dependencies and current boundaries:

  • Language coverage: Only Polish, English, Ukrainian, and Russian are currently supported. Extension to agglutinative and Semitic languages, or non-Latin scripts, requires new rule design and adaptation of POS/morphological pipelines (Stetsenko et al., 2023).
  • Parser accuracy: Feature quality is constrained by upstream tagger and parser performance, especially for complex morpho-syntactic phenomena in low-resource settings. Reported group-level metric accuracies for Ukrainian and Russian range from 0.88 to 0.95 (Stetsenko et al., 2023).
  • Document length: For short texts, many sparse vector features may reduce discriminative power; feature merging may mitigate this (Okulska et al., 2023).
  • Metric expansion: Future work may include fractal, information-theoretic, and discourse-cohesion features, and improved support for visual/exploratory dashboards (Stetsenko et al., 2023, Okulska et al., 2023).
  • Hybridization: Network-theoretic and topological features (e.g., from co-occurrence graphs) can be fused with existing StyloMetrix vectors for enhanced performance in authorship and style discrimination (Amancio, 2015).

StyloMetrix continues to serve as a reference implementation of interpretable, domain-expert stylometric vectorization for empirical research in computational linguistics, digital forensics, style transfer, authorship attribution, and NLP explainability (Okulska et al., 2023, Przystalski et al., 1 Jul 2025, Dilworth, 3 Dec 2025, Kaplan et al., 2023).
