Narrative Fingerprints: Multi-Scale Author Identification via Novelty Curve Dynamics

Published 1 Apr 2026 in cs.CL, cs.DL, and cs.IR | (2604.01073v1)

Abstract: We test whether authors have characteristic "fingerprints" in the information-theoretic novelty curves of their published works. Working with two corpora -- Books3 (52,796 books, 759 qualifying authors) and PG-19 (28,439 books, 1,821 qualifying authors) -- we find that authorial voice leaves measurable traces in how novelty unfolds across a text. The signal is multi-scale: at book level, scalar dynamics (mean novelty, speed, volume, circuitousness) identify 43% of authors significantly above chance; at chapter level, SAX motif patterns in sliding windows achieve 30x-above-chance attribution, far exceeding the scalar features that dominate at book level. These signals are complementary, not redundant. We show that the fingerprint is partly confounded with genre but persists within-genre for approximately one-quarter of authors. Classical authors (Twain, Austen, Kipling) show fingerprints comparable in strength to modern authors, suggesting the phenomenon is not an artifact of contemporary publishing conventions.


Authors (2)

Summary

The paper demonstrates that narrative fingerprints, quantified via novelty curves, reveal distinct multi-scale authorship signals beyond traditional stylometry.
It leverages sentence-transformer embeddings and SAX motif extraction to capture both global novelty dynamics and local narrative rhythms.
Empirical results show that scalar features outperform SAX motifs at book level, while SAX motifs excel in local window analysis, offering new insights for AI text detection.

Multi-Scale Author Identification via Narrative Fingerprints in Novelty Curve Dynamics

Information-Theoretic Framing and Prior Work

The paper shows that narrative fingerprints (distinctive authorial signatures) can be quantified in the information-theoretic dynamics of texts, specifically the paragraph-wise novelty curve. This diverges from traditional stylometry, which relies heavily on lexical statistics. Prior authorship attribution research has centered on word/character n-grams and syntactic structures, but little work has examined semantic surprisal as a temporal signal. The approach uses sentence-transformer embeddings (Nomic Embed Text v1.5, 768-dimensional) to represent paragraphs, and computes the cosine distance between successive paragraph embeddings to produce one novelty curve per text.
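The novelty-curve construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `novelty_curve` is hypothetical, and random vectors stand in for real paragraph embeddings so that the block runs without downloading an embedding model.

```python
import numpy as np

def novelty_curve(embeddings: np.ndarray) -> np.ndarray:
    """Cosine distance between each paragraph embedding and its predecessor."""
    # Normalise rows to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    cos_sim = np.sum(unit[1:] * unit[:-1], axis=1)
    # Distance = 1 - similarity; a text with n paragraphs yields n - 1 values.
    return 1.0 - cos_sim

# Stand-in data (assumption): 50 random "paragraph embeddings" of dimension 768.
rng = np.random.default_rng(0)
paragraph_embeddings = rng.normal(size=(50, 768))
curve = novelty_curve(paragraph_embeddings)
```

In practice the rows of `paragraph_embeddings` would come from the embedding model named in the text, one row per paragraph in reading order.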

Representation and Feature Engineering

Two primary corpora serve as the experimental ground: Books3 (modern fiction, 759 authors post-filtering) and PG-19 (historic, heterogeneous, 1,821 authors). The novelty curve is distilled into both scalar features (mean, speed, volume, circuitousness, reversal count, variance, trend-to-irregularity ratio) and Symbolic Aggregate approXimation (SAX) motif distributions.
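A few of the scalar features listed above are simple enough to illustrate directly. The sketch below covers only the mean, the variance, and a reversal count, under the assumption that a "reversal" is a change in the direction of the curve between consecutive steps. Speed, volume, circuitousness, and the trend-to-irregularity ratio are left out because this summary does not define them. The function name `basic_scalar_features` is hypothetical.

```python
import numpy as np

def basic_scalar_features(curve):
    """Mean, variance, and direction-reversal count of a novelty curve."""
    curve = np.asarray(curve, dtype=float)
    steps = np.diff(curve)
    # Drop flat steps so that a plateau is not counted as a reversal.
    signs = np.sign(steps[steps != 0])
    # Assumed definition: a reversal is a sign change between adjacent steps.
    reversals = int(np.sum(signs[1:] != signs[:-1]))
    return {
        "mean": float(curve.mean()),
        "variance": float(curve.var()),
        "reversal_count": reversals,
    }

features = basic_scalar_features([0.2, 0.5, 0.4, 0.4, 0.7, 0.1])
```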

The SAX pipeline has three components. Piecewise Aggregate Approximation (PAA) reduces dimensionality by averaging the novelty curve over a fixed number of segments; Z-normalization makes curves comparable across texts; and discretization into a symbolic alphabet enables motif extraction as overlapping k-grams. The resulting motif distributions serve as high-dimensional fingerprints. Sliding-window SAX analysis further captures local narrative rhythm at finer scales.
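The SAX steps can be sketched as follows. This follows the textbook SAX recipe (z-normalise, then PAA, then discretise with equiprobable Gaussian breakpoints), which may differ in detail from the paper's pipeline. The alphabet size, segment count, k-gram length, and the names `sax_word` and `motif_distribution` are illustrative assumptions.

```python
from collections import Counter
from statistics import NormalDist

import numpy as np

def sax_word(curve, n_segments=16, alphabet="abcd"):
    """Convert a numeric curve into a SAX string of length n_segments."""
    x = np.asarray(curve, dtype=float)
    if len(x) < n_segments:
        raise ValueError("curve must have at least n_segments points")
    # 1. Z-normalise so curves from different texts share a common scale.
    x = (x - x.mean()) / (x.std() or 1.0)
    # 2. PAA: average the curve over n_segments nearly equal chunks.
    paa = np.array([chunk.mean() for chunk in np.array_split(x, n_segments)])
    # 3. Discretise with breakpoints that split N(0, 1) into equal-probability bins.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    symbols = np.searchsorted(breakpoints, paa)
    return "".join(alphabet[i] for i in symbols)

def motif_distribution(word, k=3):
    """Relative frequency of each overlapping k-gram in a SAX word."""
    grams = Counter(word[i:i + k] for i in range(len(word) - k + 1))
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()}

# A steadily rising curve maps to a non-decreasing symbol sequence.
word = sax_word(np.linspace(0.0, 1.0, 64))
motifs = motif_distribution(word, k=3)
```

Raising `n_segments` or `k` increases the resolution of the fingerprint, at the cost of a sparser motif distribution.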

Multi-Scale Attribution and Resolution Scaling

Empirical results show that the fingerprinting signal is strongly scale-dependent. At book level, scalar features dominate: 43.3% of authors are statistically distinguishable from chance using scalars, with 3.8% top-1 attribution (29× above random). SAX motifs are less effective at this scale (14.0% significant, 0.9% top-1 attribution), indicating that overall novelty intensity and global pacing are what distinguish authors across whole books.
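The "above random" multiple can be checked with simple arithmetic. The sketch below assumes that chance means a uniform guess among the 759 qualifying Books3 authors. That assumption reproduces the reported 29× figure for scalars; the SAX multiple is derived here under the same assumption and is not quoted from the paper. The helper name `times_above_chance` is hypothetical.

```python
def times_above_chance(top1_accuracy: float, n_authors: int) -> float:
    """Ratio of observed top-1 accuracy to a uniform random guess."""
    chance = 1.0 / n_authors
    return top1_accuracy / chance

# Assumed candidate pool: the 759 qualifying Books3 authors.
book_level_scalar = times_above_chance(0.038, 759)  # about 28.8, reported as 29x
book_level_sax = times_above_chance(0.009, 759)     # about 6.8 under the same assumption
```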

Increasing SAX resolution (PAA segment count, k-gram length) improves fingerprint detection, suggesting that finer granularity captures more subtle, author-specific temporal dynamics. For instance, moving from 16 to 64 PAA segments nearly doubles the fraction of authors with significant fingerprints.

Figure 1: Distribution of author fingerprint effect sizes across Books3—illustrates the right-skewed tail, with some authors demonstrating much stronger fingerprint signals.

Figure 2: Resolution scaling: both significance rates and effect sizes increase monotonically with higher PAA segment counts (granularity); longer k-grams boost attribution but reduce consistency-test sensitivity.

Feature Discriminativeness and Curse of Dimensionality

Fisher Discriminant Ratios (FDRs) quantify feature effectiveness: scalar features are 6–8× more discriminative than SAX motifs at book level, which explains their dominance in attribution. Notably, combining scalars and motifs dilutes classification performance rather than improving it. This is the curse of dimensionality at work: the discriminative signal is concentrated in a small subset of features, and the many weakly informative motif dimensions swamp it.
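One common per-feature form of the Fisher Discriminant Ratio divides between-author scatter by within-author scatter. The summary does not give the paper's exact formula, so the definition below is an assumption, and the synthetic data exists only to show that an informative feature scores higher than a noise feature.

```python
import numpy as np

def fisher_ratio(features, labels):
    """Per-feature ratio of between-class scatter to within-class scatter."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for author in np.unique(y):
        rows = X[y == author]
        mu = rows.mean(axis=0)
        # Spread of author means around the grand mean, weighted by book count.
        between += len(rows) * (mu - grand_mean) ** 2
        # Spread of each author's books around that author's mean.
        within += ((rows - mu) ** 2).sum(axis=0)
    return between / np.clip(within, 1e-12, None)

# Synthetic check: feature 0 depends on the author, feature 1 is pure noise.
rng = np.random.default_rng(0)
authors = np.repeat(np.arange(3), 20)
informative = authors * 5.0 + rng.normal(size=60)
noise = rng.normal(size=60)
ratios = fisher_ratio(np.column_stack([informative, noise]), authors)
```

Ranking features by this ratio shows how few dimensions carry the signal, which is why concatenating many low-ratio motif dimensions can hurt a distance-based classifier.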

Figure 3: Scalar features significantly outperform SAX motifs for discriminative power at book level, validating the dominance of global novelty dynamics.

Scale Inversion: Local Motifs Outperform at Window Level

An unexpected reversal arises: in sliding-window analyses (e.g., 20-paragraph windows approximating chapters), SAX motif fingerprints vastly outperform scalar features. Here, top-1 attribution surges to 4.1% (30.5× chance), demonstrating that the rhythm and micro-patterns of novelty are more distinctive locally; scalar slope features, which capture only trend, underperform.
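The window-level setup can be illustrated by cutting a novelty curve into overlapping 20-paragraph windows and computing a least-squares slope for each. The step size of 5 and the function name `window_slopes` are assumptions; the summary specifies only the window length. Each extracted window could equally be passed to a SAX encoder to obtain local motifs.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def window_slopes(curve, window=20, step=5):
    """Overlapping windows of a curve and the linear trend within each."""
    x = np.asarray(curve, dtype=float)
    # One row per window; keep every `step`-th window to limit overlap.
    windows = sliding_window_view(x, window)[::step]
    # Centred time index, so the least-squares slope is sum(t * x) / sum(t^2).
    t = np.arange(window) - (window - 1) / 2
    slopes = windows @ t / np.sum(t ** 2)
    return windows, slopes

# A curve rising by 0.01 per paragraph has slope 0.01 in every window.
windows, slopes = window_slopes(0.01 * np.arange(100))
```

A slope summarises each window with a single trend value, which is consistent with the observation that it discards the within-window rhythm that motifs retain.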

Figure 4: Multi-scale comparison. Scalar features dominate at book level (blue band), but SAX motifs dominate at the finer window level (green band). This is the central empirical finding.

Genre Disentanglement and Survival of Fingerprints

Genre confounding is intrinsic to stylometry. The paper clusters books in PAA profile space, without using genre metadata, and tests for fingerprints within each cluster. Formulaic clusters exhibit sharply reduced fingerprint rates (7.6%), but literary clusters retain substantially higher rates (25%), indicating that a meaningful fraction of the fingerprint signal survives genre control.
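The clustering step can be sketched with a plain k-means over PAA profiles. The summary does not name the clustering algorithm, so k-means, the number of clusters, and the synthetic 16-segment profiles below are all assumptions chosen for illustration.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Minimal k-means: returns a cluster label per point and the centroids."""
    rng = np.random.default_rng(seed)
    X = np.asarray(points, dtype=float)
    # Initialise centroids from k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each profile to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned profiles.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Synthetic stand-in: two groups of 16-segment PAA profiles with different levels.
rng = np.random.default_rng(1)
profiles = np.vstack([
    rng.normal(loc=0.0, size=(30, 16)),
    rng.normal(loc=5.0, size=(30, 16)),
])
labels, centroids = kmeans(profiles, k=2)
```

Once books carry cluster labels, the fingerprint test is rerun separately inside each cluster, so an author is compared only against books with a similar overall novelty profile.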

Figure 5: Genre disentangling. Within formulaic clusters, fingerprint rates drop to 7.6%; within literary clusters they remain markedly higher (25%), supporting partial survival of the fingerprint after genre control.

Cross-Era Validation and Authorial Strategy

Classical authors (Twain, Austen, Kipling) exhibit fingerprint strengths comparable to those of contemporary writers, which argues against the hypothesis that the effect arises solely from modern publishing conventions. Authors with tightly templated narrative structures (Warner, Hunter) and distinctive ground-traversing narratives (Twain, Austen) are strongly fingerprinted.

Multi-Scale Case Study: James–Dickens Paradox

Dickens and James form a critical pair of cases. Dickens shows a weak fingerprint at book level but a strong fingerprint at window level, reflecting consistent local pacing irrespective of global narrative diversity. James exhibits an anti-fingerprint at book level, reflecting deliberate stylistic evolution across his oeuvre: his books are more diverse in novelty dynamics than random samples of books.
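The distinction between a fingerprint and an anti-fingerprint can be made concrete with a permutation test. The sketch below assumes the test statistic is the mean pairwise distance among an author's books in feature space, compared against random book sets of the same size. The summary does not state the paper's exact statistic, so this is one plausible formulation, and all names and data are hypothetical.

```python
import numpy as np

def mean_pairwise_distance(vectors):
    """Average Euclidean distance over all unordered pairs of rows."""
    X = np.asarray(vectors, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return d[np.triu_indices(len(X), k=1)].mean()

def consistency_test(all_books, author_idx, n_perm=500, seed=0):
    """Compare an author's internal spread with random same-size book sets."""
    rng = np.random.default_rng(seed)
    X = np.asarray(all_books, dtype=float)
    observed = mean_pairwise_distance(X[author_idx])
    m = len(author_idx)
    null = np.array([
        mean_pairwise_distance(X[rng.choice(len(X), size=m, replace=False)])
        for _ in range(n_perm)
    ])
    # Fingerprint: the author's books are tighter than random sets.
    p_fingerprint = (np.sum(null <= observed) + 1) / (n_perm + 1)
    # Anti-fingerprint: the author's books are more spread out than random sets.
    p_anti = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p_fingerprint, p_anti

# Synthetic corpus: 200 books in a 5-d feature space; books 0-7 share one tight style.
rng = np.random.default_rng(2)
books = rng.normal(size=(200, 5))
books[:8] = 1.0 + 0.1 * rng.normal(size=(8, 5))
observed, p_fingerprint, p_anti = consistency_test(books, np.arange(8))
```

Under this formulation, a Dickens-like author would show a small `p_fingerprint` only when the features are computed at window level, while a James-like author would show a small `p_anti` at book level.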

Figure 6: The James–Dickens Paradox: Dickens's window-level fingerprint persists despite global diversity; James's anti-fingerprint is evident at book level.

Limitations and Implications

The fingerprint signal is moderate: attribution rates, while well above chance, are lower than those achieved by lexical stylometry. Genre confounding, embedding model bias, corpus representativeness, and the presence of ghostwriters and pseudonyms are acknowledged limitations. Nevertheless, the findings suggest practical implications for AI-generated text detection, editorial analytics, and disputed authorship attribution. Specifically, local novelty rhythms in chapter-sized windows could enhance the detection of synthetic narratives, given the tendency of LLMs to lack consistent pacing patterns.

Conclusion

The study formalizes narrative fingerprints as multi-scale, information-theoretic signals. Authors display quantifiable patterns in both global novelty intensity and local motif rhythm. These patterns are persistent, partially survive genre control, and are detectable across historical epochs. Attribution via novelty dynamics is complementary to lexical methods, and its scale-sensitive nature uncovers fingerprints invisible to conventional approaches. Promising directions for future work include integrating narrative fingerprinting with lexical stylometry, applying it to forensic analysis of disputed authorship, and using it to detect synthetic text. Ultimately, the temporal dynamics of semantic surprisal encode core authorial strategies and provide a new axis of quantifiable literary analysis.
