Quantifying origin and character of long-range correlations in narrative texts (1412.8319v2)

Published 29 Dec 2014 in cs.CL and physics.soc-ph

Abstract: In natural language using short sentences is considered efficient for communication. However, a text composed exclusively of such sentences looks technical and reads boring. A text composed of long ones, on the other hand, demands significantly more effort for comprehension. Studying characteristics of the sentence length variability (SLV) in a large corpus of world-famous literary texts shows that an appealing and aesthetic optimum appears somewhere in between and involves selfsimilar, cascade-like alternation of various lengths sentences. A related quantitative observation is that the power spectra $S(f)$ of thus characterized SLV universally develop a convincing '$1/f^\beta$' scaling with the average exponent $\beta \approx 1/2$, close to what has been identified before in musical compositions or in the brain waves. An overwhelming majority of the studied texts simply obeys such fractal attributes but especially spectacular in this respect are hypertext-like, "stream of consciousness" novels. In addition, they appear to develop structures characteristic of irreducibly interwoven sets of fractals called multifractals. Scaling of $S(f)$ in the present context implies existence of the long-range correlations in texts and appearance of multifractality indicates that they carry even a nonlinear component. A distinct role of the full stops in inducing the long-range correlations in texts is evidenced by the fact that the above quantitative characteristics on the long-range correlations manifest themselves in variation of the full stops recurrence times along texts, thus in SLV, but to a much lesser degree in the recurrence times of the most frequent words. In this latter case the nonlinear correlations, thus multifractality, disappear even completely for all the texts considered. Treated as one extra word, the full stops at the same time appear to obey the Zipfian rank-frequency distribution, however.

Summary

  • The paper demonstrates that sentence length variability exhibits robust 1/f scaling with an average beta of about 1/2, indicating universal long-range linear correlations.
  • The paper finds that multifractality is especially pronounced in stream-of-consciousness narratives, highlighting complex nonlinear text structures.
  • The paper reveals that full stops act as critical framing elements that drive these correlations, distinguishing sentence-level dynamics from word recurrence patterns.

This paper (1412.8319) investigates the quantitative properties of sentence length variability (SLV) in a large corpus of 113 world-famous literary texts across multiple languages (English, French, German, Italian, Polish, Russian, Spanish). The core idea is that while efficient communication might favor short sentences, aesthetic and engaging narrative involves variation in sentence length, creating rhythm and flow. The authors hypothesize that this variation might exhibit scale-free characteristics and long-range correlations, similar to other complex natural systems.

To test this, they treat the sequence of sentence lengths in a text, $l(j)$ (where $j$ is the sentence index and $l(j)$ is the number of words in sentence $j$), as a time series. A sentence is defined primarily by ending in a full stop, with manual correction for non-sentence-ending punctuation. Texts with at least 5000 sentences were selected for reliable statistical analysis.
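As a minimal sketch of how such a series might be built, the snippet below splits a text on full stops and counts whitespace-separated tokens per sentence. The function name `sentence_lengths` is hypothetical, and the naive split is an assumption: it does not reproduce the manual corrections the authors apply for full stops that do not end a sentence (abbreviations, for instance).

```python
def sentence_lengths(text: str) -> list[int]:
    """Return l(j): the number of words in each full-stop-terminated sentence."""
    lengths = []
    # Naive split on full stops; no handling of abbreviations or ellipses.
    for sentence in text.split("."):
        words = sentence.split()  # whitespace-separated tokens count as words
        if words:                 # skip empty fragments (e.g. after the last stop)
            lengths.append(len(words))
    return lengths


sample = "Call me Ishmael. Some years ago I went to sea. It was cold."
print(sentence_lengths(sample))
```

The resulting list can then be treated as a time series indexed by sentence number.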

The analysis employs two main methods from complex systems physics:

  1. Power Spectrum Analysis ($S(f)$): The power spectrum of the SLV series is calculated using the Fourier Transform. Scale-free long-range linear correlations are indicated by a power-law relationship $S(f) \sim 1/f^\beta$.
  2. Multifractal Detrended Fluctuation Analysis (MFDFA): This method is used to detect and quantify nonlinear or heterogeneous scale-free correlations, characteristic of multifractal systems. It involves calculating the fluctuation function $F_q(s)$ for different scales $s$ and moment orders $q$. A power-law scaling $F_q(s) \sim s^{h(q)}$ reveals the generalized Hurst exponent $h(q)$. If $h(q)$ is dependent on $q$, the series is multifractal. The degree of multifractality is quantified by the width of the singularity spectrum, $\Delta \alpha$, derived from $h(q)$.
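The power-spectrum step can be illustrated with a short NumPy sketch. It estimates $\beta$ by a least-squares line through the raw periodogram on log-log axes, which is a simple assumed estimator and not necessarily the fitting procedure used in the paper. The names `spectral_exponent` and `synthetic_series` are hypothetical; the latter only generates a test signal with a known exponent.

```python
import numpy as np


def spectral_exponent(series):
    """Estimate beta in S(f) ~ 1/f^beta from a log-log fit to the periodogram."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                      # remove the constant offset
    power = np.abs(np.fft.rfft(x)) ** 2   # raw periodogram
    freqs = np.fft.rfftfreq(x.size)       # frequency in cycles per sentence
    keep = (freqs > 0) & (power > 0)      # drop f = 0 and any exact zeros
    slope, _ = np.polyfit(np.log(freqs[keep]), np.log(power[keep]), 1)
    return -slope


def synthetic_series(beta, n, rng):
    """Test signal (illustration only): random phases, |X(f)|^2 = f^(-beta)."""
    freqs = np.fft.rfftfreq(n)
    amp = np.zeros_like(freqs)
    amp[1:] = freqs[1:] ** (-beta / 2.0)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=freqs.size)
    return np.fft.irfft(amp * np.exp(1j * phases), n=n)


rng = np.random.default_rng(0)
x = synthetic_series(0.5, 4096, rng)
print("original:", round(spectral_exponent(x), 2))
# Shuffling keeps the values but destroys their ordering.
print("shuffled:", round(spectral_exponent(rng.permutation(x)), 2))
```

Comparing the estimate for a series with its shuffled counterpart is a quick check that a fitted exponent reflects the ordering of the values and not merely their distribution.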

Key Findings for Sentence Length Variability (SLV):

  • Universal $1/f^\beta$ Scaling: The power spectra of SLV for nearly all texts exhibit a clear $S(f) \sim 1/f^\beta$ scaling, with the average exponent $\beta \approx 1/2$. Individual texts show $\beta$ values typically ranging from 1/4 to 3/4. This $1/f$ characteristic is found in diverse natural phenomena like music, speech, heart rate, and brain waves, suggesting shared underlying organizational principles. Randomly shuffling sentences eliminates this scaling, resulting in a flat spectrum ($\beta = 0$). The relationship $\beta = 2H - 1$, where $H$ is the Hurst exponent from MFDFA for $q = 2$, is shown to hold for the analyzed texts.
  • Multifractality and Narrative Style: While most texts show limited multifractality in SLV (small $\Delta \alpha$), a distinct group of texts, primarily those classified as "stream of consciousness" (SoC) narratives (e.g., Finnegans Wake, Rayuela, The Waves), exhibit significant multifractality (larger $\Delta \alpha$). This indicates more complex, heterogeneous, and nonlinear correlations in their sentence arrangement. Finnegans Wake stands out with an exceptionally broad and symmetric multifractal spectrum. Ulysses shows a bipartite structure: the first half is essentially monofractal, while the second half is multifractal.
  • Singularity Spectrum Asymmetry: The shape of the multifractal spectrum $f(\alpha)$ provides further insight. Asymmetry indicates non-uniform scaling properties for long vs. short sentences. For example, Rayuela is more multifractal due to the arrangement of long sentences, while The Waves shows asymmetry related to shorter sentences.
  • Sentence Length Distribution: The tails of the complementary cumulative distribution of sentence lengths ($F(\ell) = \Pr(l \ge \ell)$) follow a stretched exponential distribution $F(\ell) = \exp(-\mu \ell^b)$. SoC texts, characterized by higher multifractality, tend to have thicker tails (smaller $b$), meaning they more frequently contain very long sentences compared to other texts.
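The MFDFA quantities referred to in these findings, $h(q)$ and $\Delta \alpha$, can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses only forward, non-overlapping segments, a second-order polynomial detrend, and a numerical Legendre transform via $\tau(q) = q\,h(q) - 1$ and $\alpha = d\tau/dq$. The function names, scales, and $q$ range are all hypothetical choices.

```python
import numpy as np


def mfdfa_hurst(series, scales, qs, order=2):
    """Return h(q): slopes of log F_q(s) against log s."""
    x = np.asarray(series, dtype=float)
    profile = np.cumsum(x - x.mean())              # integrated, mean-removed series
    log_f = np.empty((len(qs), len(scales)))
    for j, s in enumerate(scales):
        n_seg = profile.size // s
        segs = profile[: n_seg * s].reshape(n_seg, s).T   # shape (s, n_seg)
        t = np.linspace(-1.0, 1.0, s)              # rescaled time for a stable fit
        coeffs = np.polyfit(t, segs, order)        # one polynomial per segment
        trend = np.vander(t, order + 1) @ coeffs   # fitted local trends
        var = np.mean((segs - trend) ** 2, axis=0) # F^2(nu, s) per segment
        for i, q in enumerate(qs):
            if q == 0:                             # logarithmic average for q = 0
                log_f[i, j] = 0.5 * np.mean(np.log(var))
            else:
                log_f[i, j] = np.log(np.mean(var ** (q / 2.0))) / q
    log_s = np.log(scales)
    return np.array([np.polyfit(log_s, row, 1)[0] for row in log_f])


def singularity_width(qs, h):
    """Width of the singularity spectrum, Delta alpha = alpha_max - alpha_min."""
    qs = np.asarray(qs, dtype=float)
    tau = qs * h - 1.0
    alpha = np.gradient(tau, qs)                   # alpha = d tau / d q
    return alpha.max() - alpha.min()


rng = np.random.default_rng(0)
noise = rng.normal(size=8192)                      # uncorrelated reference series
qs = np.arange(-4, 5)
h = mfdfa_hurst(noise, scales=[16, 32, 64, 128, 256, 512], qs=qs)
print("h(2) =", round(float(h[qs == 2][0]), 2))
print("width =", round(float(singularity_width(qs, h)), 2))
```

For an uncorrelated series, $h(2)$ is expected to lie near 0.5 and the width to be small, though finite-size effects keep it from being exactly zero. That is why a width is usually judged against a shuffled or surrogate series instead of against zero.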

Origin of Correlations (SLV vs. Word Recurrence Times):

The authors also investigate if similar long-range correlations exist in the recurrence times of specific words (measured as the number of words between consecutive occurrences of the same word). Analyzing frequent words like "the," "and," or "of" reveals:

  • Multifractality Disappears: Unlike SLV, the recurrence times of frequent words show little to no multifractality ($\Delta \alpha$ is small or zero), even in texts highly multifractal in SLV like Finnegans Wake. This strongly suggests that the nonlinear long-range correlations and multifractality in texts originate primarily from the structural arrangement determined by sentence boundaries (full stops).
  • Weaker Linear Correlations: Word recurrence times do show some trace of $1/f^\beta$ scaling, indicating linear correlations, but these are significantly weaker (smaller $\beta^w$) and less consistent than those observed for SLV ($\beta^s$). However, texts that are multifractal in SLV also tend to show stronger linear correlations in word recurrence times compared to monofractal texts.
  • Full Stops as a Framing Element: Full stops, while following Zipf's law similarly to words when treated as a token, appear to form a critical "frame" that dictates the dominant long-range, particularly nonlinear, correlation structure in narrative texts. Words seem to have more flexibility within this frame.
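The recurrence-time series compared above can be sketched in a few lines. The snippet assumes the text is already tokenized with the full stop as its own token, and it counts the tokens strictly between consecutive occurrences; both the function name `recurrence_times` and that counting convention are assumptions made for illustration.

```python
def recurrence_times(tokens, target):
    """Number of tokens between consecutive occurrences of `target`."""
    positions = [i for i, tok in enumerate(tokens) if tok == target]
    # Count the tokens strictly between each pair of neighbouring occurrences.
    return [b - a - 1 for a, b in zip(positions, positions[1:])]


tokens = "the cat sat . the dog ran off . the end .".split()
print(recurrence_times(tokens, "the"))  # recurrence times of a frequent word
print(recurrence_times(tokens, "."))    # recurrence times of the full stop
```

Under this convention, the recurrence times of the full stop coincide with the sentence lengths from the second sentence onward, which is why SLV can be read as the recurrence-time series of full stops.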

Practical Implications and Applications:

  • Stylometry: The quantitative measures derived from SLV, particularly the exponent $\beta$ and the multifractal width $\Delta \alpha$, offer objective, scale-free characteristics that can serve as robust stylometric markers. They can potentially aid in authorship attribution, analysis of stylistic evolution, or classification of literary genres based on their complexity and sentence structure. For instance, $\beta$ might quantify the "rhythm" or flow, while $\Delta \alpha$ could capture the complexity or heterogeneity of sentence organization.
  • Understanding Language and Cognition: The observed $1/f$ scaling and multifractality link language structure to other biological and cognitive processes, suggesting common underlying principles or constraints in human information processing and production. This could inform models of language generation, perception, and the cognitive effort associated with reading different styles. The finding that SoC texts exhibit higher complexity (multifractality) resonates with their perceived demanding nature and potential engagement of diverse cognitive areas.
  • Text Generation and Evaluation: The identified scale-free patterns and correlation structures could be used as target metrics for evaluating or guiding the generation of synthetic narrative texts, aiming for characteristics perceived as natural, aesthetic, or complex. Generative models could be trained or fine-tuned to reproduce specific $\beta$ or $f(\alpha)$ profiles.
  • Information Theory and Complexity: The paper contributes to understanding information encoding in language beyond simple word frequencies, highlighting the role of structural patterns like sentence arrangement in creating complex, correlated sequences. The analogy to the World Wide Web structure suggests possible efficiencies in information flow associated with such self-similar, scale-free organization.

In summary, the paper demonstrates that sentence length variability in literary texts is not random but exhibits robust scale-free correlations. Linear correlations universally follow a $1/f^\beta$ pattern ($\beta \approx 1/2$), reminiscent of $1/f$ noise in other natural systems. More complex nonlinear correlations, leading to multifractality, are particularly pronounced in "stream of consciousness" narratives. These correlations originate primarily from the arrangement of sentence boundaries (full stops), suggesting they provide a fundamental structural frame for long-range dependencies in text, distinct from word-level correlations. The quantitative measures derived offer new tools for analyzing literary style and provide insights into the complex systems nature of human language and its connection to cognitive processes.