Multifractal analysis of sentence lengths in English literary texts
(1212.3171v1)
Published 13 Dec 2012 in physics.data-an, cs.CL, and physics.soc-ph
Abstract: This paper presents an analysis of 30 literary texts written in English by different authors. For each text, a time series representing the lengths of sentences in words was created, and its fractal properties were analyzed using two methods of multifractal analysis: MFDFA and WTMM. Both methods showed that there are texts which can be considered multifractal in this representation, but a majority of texts are not multifractal or even not fractal at all. Out of 30 books, only a few have lengths of consecutive sentences so strongly correlated that the analyzed signals can be interpreted as real multifractals. An interesting direction for future investigations would be to identify the specific features that cause certain texts to be multifractal and others to be monofractal or not fractal at all.
The paper demonstrates that multifractal analysis using MFDFA and WTMM quantifies the scaling behavior of sentence lengths in literary texts.
It finds that only a few texts show true multifractality while most are monofractal or non-fractal, indicating varied structural complexity.
The study highlights applications in stylometry, including author attribution and genre classification, underscoring the importance of robust data preparation and validation.
This paper (Grabska-Gradzińska et al., 2012) investigates the statistical properties of sentence lengths in English literary texts using methods from complex systems analysis, specifically focusing on multifractal characteristics. The core idea is to treat the sequence of sentence lengths within a text as a "time series" and analyze its scaling properties.
Data Representation
The first step in applying this research is data preparation. This involves:
Text Acquisition: Obtaining the raw text of literary works. The paper used texts from Project Gutenberg.
Sentence Segmentation: Identifying individual sentences within the text. The paper defines sentences based on standard punctuation marks: full stop (.), colon (:), semicolon (;), question mark (?), and exclamation mark (!). Commas are excluded as they don't reliably delineate distinct pieces of information.
Sentence Length Measurement: For each identified sentence, counting the number of words it contains. A "word" can be defined based on whitespace separation, though more sophisticated tokenization might be considered depending on the specific application (e.g., handling hyphenated words).
Time Series Creation: Constructing a sequence in which each element is the length of a sentence, ordered by appearance in the text. If a text has $N$ sentences with lengths $l_1, l_2, \ldots, l_N$, the time series is $x(i) = l_i$ for $i = 1, \ldots, N$. The paper notes that they used texts with at least 2,000 sentences to ensure statistical significance.
This process transforms a literary text into a one-dimensional numerical sequence, which can then be analyzed using signal processing techniques.
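The preparation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `sentence_lengths` is hypothetical, words are assumed to be whitespace-separated tokens, and abbreviations such as "Mr." are not handled and would be split as sentence ends.

```python
import re

# Sentence terminators named in the text: . : ; ? !
SENTENCE_END = re.compile(r"[.:;?!]+")

def sentence_lengths(text):
    """Return sentence lengths in words, in order of appearance."""
    lengths = []
    for chunk in SENTENCE_END.split(text):
        n_words = len(chunk.split())  # whitespace tokenization (an assumption)
        if n_words > 0:               # skip empty fragments, e.g. after "..."
            lengths.append(n_words)
    return lengths

sample = "Call me Ishmael. Some years ago; never mind how long precisely! Why?"
print(sentence_lengths(sample))  # [3, 3, 5, 1]
```

The resulting list is the series x(i) that the analysis methods take as input. A real pipeline would also need to strip the licence header and footer that Project Gutenberg files carry.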
Analysis Methods
The paper employs two primary methods for multifractal analysis:
Multifractal Detrended Fluctuation Analysis (MFDFA): This is the principal method used. The steps for implementation are:
Profile Creation: Calculate the integrated profile $Y(i) = \sum_{k=1}^{i} \left( x(k) - \langle x \rangle \right)$, where $\langle x \rangle$ is the mean sentence length over the entire text. This removes the average value.
Segmentation: Divide the profile Y(i) into M non-overlapping segments of length n. To avoid losing data at the end, repeat the process starting from the end of the series, resulting in $2M$ segments. The length n is varied over a range of scales (e.g., from small values up to N/4 or N/10).
Detrending: For each segment $v$, fit a polynomial $P_v^{(l)}(j)$ of order $l$ to the data within that segment. The paper uses $l=1$ (linear) or $l=2$ (quadratic). This step removes local trends.
Variance Calculation: Calculate the variance $F^2(v,n) = \frac{1}{n} \sum_{j=1}^{n} \left\{ Y[(v-1)n+j] - P_v^{(l)}(j) \right\}^2$ for each segment $v$.
Averaging and Fluctuation Function: Compute the $q$-th order fluctuation function $F_q(n) = \left\{ \frac{1}{2M} \sum_{v=1}^{2M} \left[ F^2(v,n) \right]^{q/2} \right\}^{1/q}$. This is done for various values of $q$ (e.g., from -3 to 3 as used in the paper).
Scaling Analysis: If the time series has fractal properties, $F_q(n)$ will scale with $n$ according to a power law: $F_q(n) \sim n^{h(q)}$. By plotting $\log F_q(n)$ against $\log n$ for different $q$ values, one can estimate the generalized Hurst exponent $h(q)$ as the slope of the linear regression in the scaling region.
Multifractality Assessment: If $h(q)$ is constant across all $q$, the series is monofractal. If $h(q)$ depends on $q$, the series is multifractal, and a stronger dependence indicates richer multifractality.
Singularity Spectrum: The multifractal spectrum $f(\alpha)$ and the Hölder exponent $\alpha$ can be calculated from $h(q)$ using Legendre transform-like relations: $\alpha = h(q) + q\,h'(q)$ and $f(\alpha) = q\,[\alpha - h(q)] + 1$. A wider $f(\alpha)$ spectrum corresponds to a richer multifractal structure.
Wavelet Transform Modulus Maxima (WTMM): This method is used as an auxiliary tool. It involves performing a wavelet transform on the signal, identifying the local maxima of the transform coefficients across different scales, and calculating a partition function based on the moduli of these maxima. Similar to MFDFA, the scaling of the partition function $Z(q,s) \sim s^{\tau(q)}$ yields exponents $\tau(q)$, from which $\alpha$ and $f(\alpha)$ can be derived: $\alpha = \tau'(q)$ and $f(\alpha) = q\,\alpha - \tau(q)$. The paper notes that WTMM results generally agree with MFDFA, especially for strongly multifractal data, but MFDFA is preferred for shorter signals.
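The MFDFA steps listed above (profile, segmentation into 2M windows, polynomial detrending, q-th order averaging, log-log fit, Legendre relations) can be sketched with NumPy as follows. This is an illustrative sketch rather than the paper's implementation: the names `mfdfa` and `singularity_spectrum` are hypothetical, the q = 0 case uses the usual logarithmic average (a standard convention that the summary does not state), and the derivative h'(q) is approximated by finite differences.

```python
import numpy as np

def mfdfa(x, scales, qs, order=1):
    """Estimate h(q) by MFDFA; also return the fluctuation functions F_q(n)."""
    x = np.asarray(x, dtype=float)
    scales = np.asarray(scales)
    qs = np.asarray(qs, dtype=float)
    N = len(x)
    Y = np.cumsum(x - x.mean())                      # integrated profile
    Fq = np.zeros((len(qs), len(scales)))
    for si, n in enumerate(scales):
        M = N // n
        # M windows counted from the start plus M counted from the end
        segs = np.vstack([Y[:M * n].reshape(M, n),
                          Y[N - M * n:].reshape(M, n)])
        j = np.arange(n, dtype=float)
        coeffs = np.polyfit(j, segs.T, order)        # one fit per window
        trend = np.vander(j, order + 1) @ coeffs     # shape (n, 2M)
        F2 = np.mean((segs.T - trend) ** 2, axis=0)  # variance per window
        # Assumes F2 > 0 in every window; exact zeros break negative q.
        for qi, q in enumerate(qs):
            if np.isclose(q, 0.0):
                Fq[qi, si] = np.exp(0.5 * np.mean(np.log(F2)))
            else:
                Fq[qi, si] = np.mean(F2 ** (q / 2)) ** (1 / q)
    log_n = np.log(scales)
    h = np.array([np.polyfit(log_n, np.log(row), 1)[0] for row in Fq])
    return h, Fq

def singularity_spectrum(qs, h):
    """alpha = h + q h'(q), f = q (alpha - h) + 1, with a numerical h'."""
    qs = np.asarray(qs, dtype=float)
    dh = np.gradient(h, qs)
    alpha = h + qs * dh
    return alpha, qs * (alpha - h) + 1

# Synthetic check on uncorrelated Gaussian noise (not literary data):
rng = np.random.default_rng(0)
noise = rng.standard_normal(8192)
scales = np.unique(np.logspace(np.log10(16), np.log10(len(noise) // 8), 15).astype(int))
qs = np.linspace(-3, 3, 13)
h, Fq = mfdfa(noise, scales, qs, order=1)
alpha, f = singularity_spectrum(qs, h)
print("h(2) =", h[np.argmin(np.abs(qs - 2))])        # theory: close to 0.5
print("spectrum width =", alpha.max() - alpha.min())
```

Fitting over all supplied scales is a simplification; in practice the scaling region should be chosen by inspecting the log-log plots. Under the standard relation $\tau(q) = q\,h(q) - 1$ (an assumption not stated in this summary), the WTMM relations $\alpha = \tau'(q)$ and $f(\alpha) = q\,\alpha - \tau(q)$ reduce to the MFDFA relations used in `singularity_spectrum`, so both methods describe the same spectrum.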
Implementation Considerations
Software Libraries: Implementing MFDFA from scratch requires careful coding. In Python, nolds provides an implementation of DFA, and dedicated packages such as MFDFA implement the multifractal variant; either can be applied directly to the sentence length time series. WTMM implementations are less common in standard packages.
Parameter Selection: Choosing the range of n for scaling analysis is crucial. The paper suggests that reliable multifractal scaling should extend over a significant range of n. Choosing the polynomial order l for detrending in MFDFA is also important; l=1 or l=2 are common choices. The range of q values used affects the calculated h(q) and f(α).
Statistical Significance and Interpretation: As highlighted in the paper, interpreting results from real data is subtle.
Visually inspecting logFq(n) vs. logn plots for clear linear scaling regions is necessary (see Figure 1).
Comparing the results (e.g., h(q) and f(α) width) for the original time series with those from surrogate data (like shuffled versions of the original series) is critical to distinguish true correlations from artifacts due to the data's probability distribution or non-stationarity (see Figure 2). A substantially wider f(α) for the original data compared to shuffled data is strong evidence for multifractality.
The paper uses a criterion based on experience, suggesting that texts where the width of f(α) substantially exceeds 0.1 can be considered multifractal.
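The surrogate comparison described above can be sketched as follows. The helper `shuffled_surrogate_values` is a hypothetical name, and the statistic is passed in as a function so that a spectrum-width estimator from any MFDFA implementation can be plugged in. To keep the example self-contained, the demo uses lag-1 autocorrelation as a stand-in statistic and a synthetic block-structured series, neither of which comes from the paper.

```python
import random

def shuffled_surrogate_values(series, statistic, n_surrogates=100, seed=0):
    """Evaluate `statistic` on the series and on randomly shuffled copies."""
    rng = random.Random(seed)
    original = statistic(series)
    surrogate_values = []
    for _ in range(n_surrogates):
        shuffled = list(series)
        rng.shuffle(shuffled)  # destroys ordering, keeps the value distribution
        surrogate_values.append(statistic(shuffled))
    return original, surrogate_values

def lag1_autocorrelation(series):
    """Stand-in statistic for the demo; swap in an f(alpha) width estimator."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1))
    return cov / var

# Synthetic series: alternating blocks of short and long "sentences".
demo = [10 + 5 * ((i // 20) % 2) for i in range(400)]
original, surrogates = shuffled_surrogate_values(demo, lag1_autocorrelation)
print("original:", original)
print("largest surrogate value:", max(surrogates))
```

If the original value lies well outside the range of surrogate values, the effect depends on sentence order rather than on the distribution of lengths alone. Shuffling tests only for correlations; it does not by itself rule out artifacts of non-stationarity.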
Results and Practical Implications
The paper found that:
Only a minority of the 30 literary texts showed clear evidence of being true multifractals in their sentence length sequences. Most were either not fractal or monofractal.
Examples were provided for texts exhibiting no fractal structure, monofractal structure, spurious multifractal structure, and real multifractal structure (Figure 1).
Multifractality was characterized by a q-dependent h(q) and a wider f(α) spectrum (Figure 3).
For some authors (like Twain and Conan Doyle), the fractal properties seemed somewhat consistent across their works, while for others (like Austen), they varied (Figure 3 shows variation even within Austen's works).
The autocorrelation function (ACF) showed power-law decay for the text identified as strongly multifractal, but not for others (Figure 5), suggesting a potential, but not universal, link between power-law ACF and multifractality in this context.
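A rough way to probe the autocorrelation behaviour mentioned above is to compute the sample ACF and fit a straight line in log-log coordinates. The sketch below assumes NumPy; the function names are hypothetical, and a least-squares fit on log-log axes is only a crude diagnostic of power-law decay, not a rigorous test.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation C(k) for lags k = 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / var
                     for k in range(1, max_lag + 1)])

def power_law_exponent(acf):
    """Return gamma from a fit C(k) ~ k**(-gamma), using lags with C(k) > 0."""
    lags = np.arange(1, len(acf) + 1, dtype=float)
    mask = acf > 0                       # the logarithm needs positive values
    slope, _ = np.polyfit(np.log(lags[mask]), np.log(acf[mask]), 1)
    return -slope

# Sanity check on an exact power law (synthetic values, not text data):
lags = np.arange(1, 51, dtype=float)
print(power_law_exponent(lags ** -0.5))  # close to 0.5
```

For a real sentence-length series, the fitted exponent is meaningful only if the log-log points fall near a straight line over a reasonable range of lags.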
Applications:
This research has practical implications in the field of stylometry (the study of linguistic style):
Author Attribution: Since the fractal properties of sentence lengths can vary between authors and sometimes be consistent within an author's work, these properties could potentially be used as features for identifying the author of an anonymous text. An implementation could involve:
Building a dataset of sentence length sequences from known authors.
Calculating multifractal features (h(q) for a range of q, the width of f(α), etc.) for each text.
Using these features to train a machine learning model (e.g., a classifier) to associate feature sets with authors.
Applying the trained model to the features extracted from an unknown text.
Genre Classification or Style Analysis: Differences in sentence length distribution and correlation structure might also distinguish genres or writing styles. This could be explored by building datasets representative of different categories and applying similar feature extraction and classification techniques.
Identifying Structural Complexity: The degree of multifractality (e.g., measured by the width of f(α)) could serve as a quantitative metric for the structural complexity of sentence organization within a text.
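The author-attribution workflow outlined above can be illustrated with a deliberately simple nearest-centroid classifier. Everything in this sketch is an assumption: the function names are hypothetical, the feature vectors (h(2) and the width of f(α)) are made-up placeholders rather than values from the paper, and a real study would use many more texts, standardized features, and cross-validation.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def train_nearest_centroid(features_by_author):
    """Map each author to the centroid of that author's feature vectors."""
    return {author: centroid(vecs) for author, vecs in features_by_author.items()}

def predict(model, features):
    """Return the author whose centroid is closest in Euclidean distance."""
    return min(model, key=lambda author: math.dist(model[author], features))

# Placeholder features [h(2), width of f(alpha)]; NOT results from the paper.
training = {
    "author_A": [[0.55, 0.05], [0.58, 0.08]],
    "author_B": [[0.75, 0.30], [0.72, 0.35]],
}
model = train_nearest_centroid(training)
print(predict(model, [0.70, 0.28]))  # author_B
```

Because the paper reports that fractal properties are consistent across works for only some authors, features like these would probably need to be combined with other stylometric measures to be useful.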
Limitations and Future Work:
The paper notes that it doesn't identify what specific features of the text cause some sentence length sequences to be multifractal while others are not. This remains an open question. Future research could explore correlations between multifractal properties and other linguistic features (e.g., grammatical structure, vocabulary use, narrative pace) to gain a deeper understanding of the origin of this structure in language.
In summary, the paper provides a methodology based on MFDFA and WTMM to analyze the fractal structure of sentence lengths in literary texts, demonstrating that while some texts exhibit multifractality, it is not a universal property. The techniques described offer potential avenues for quantitative analysis of literary style and complexity, applicable in areas like computational stylometry. Implementing these methods requires careful data preparation, robust implementation of fractal analysis algorithms, and rigorous statistical validation using surrogate data.