Generalized Hurst exponent and multifractal function of original and translated texts mapped into frequency and length time series

Published 30 Aug 2012 in physics.data-an, cond-mat.stat-mech, nlin.PS, and physics.soc-ph | (1208.6174v1)

Abstract: A nonlinear dynamics approach can be used in order to quantify complexity in written texts. As a first step, a one-dimensional system is examined: two written texts by one author (Lewis Carroll), together with one translation into an artificial language, i.e. Esperanto, are mapped into time series. Their corresponding shuffled versions are used for obtaining a "base line". Two different one-dimensional time series are used here: (i) one based on word lengths (LTS), (ii) the other on word frequencies (FTS). It is shown that the generalized Hurst exponent $h(q)$ and the derived $f(\alpha)$ curves of the original and translated texts show marked differences. The original "texts" are far from giving a parabolic $f(\alpha)$ function, in contrast to the shuffled texts. Moreover, the Esperanto text has more extreme values. This suggests cascade model-like, with multiscale time asymmetric features as finally written texts. A discussion of the difference and complementarity of mapping into a LTS or FTS is presented. The FTS $f(\alpha)$ curves are more opened than the LTS ones.


Authors (1)

Marcel Ausloos

Summary

The paper applies multifractal analysis via the MF-BOX method to reveal that original texts exhibit significant multifractality, unlike their shuffled counterparts.
It demonstrates that frequency-based time series are more sensitive to translation and language structure differences than word length series.
The study highlights practical applications in author attribution and translation evaluation by linking scaling exponents and multifractal spectra to text complexity.

This paper explores the application of multifractal analysis, a technique from nonlinear dynamics and statistical physics, to quantify the complexity and structural features of written texts. The core idea is to map a linear text sequence into one-dimensional time series and then analyze their scaling properties. This approach aims to provide quantitative indicators that can differentiate original texts, their translations, and shuffled versions serving as a random baseline.

The study focuses on two texts by Lewis Carroll: "Alice's Adventures in Wonderland" (AWL) and "Through the Looking Glass" (TLG), along with an Esperanto translation of AWL (ESP). These texts are preprocessed by removing chapter headings and treating words based on simple rules (e.g., disregarding punctuation, splitting contractions).

Two types of time series are constructed from the texts:

Length Time Series (LTS): Each word in the text sequence is replaced by its length (number of letters).
Frequency Time Series (FTS): Words are ranked by their overall frequency in the entire document. Each word in the text sequence is then replaced by its frequency rank.
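The two mappings above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the tokenization rule (lowercasing, keeping runs of letters so that punctuation is dropped and contractions split at the apostrophe) and the alphabetical tie-breaking between equally frequent words are assumptions, and the names `tokenize`, `length_series` and `frequency_rank_series` are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters only (assumed rule).

    The pattern matches Unicode letters, so accented Esperanto letters stay
    inside their word; apostrophes and other punctuation act as separators.
    """
    return re.findall(r"[^\W\d_]+", text.lower())

def length_series(words):
    """LTS: replace each word by its number of letters."""
    return [len(w) for w in words]

def frequency_rank_series(words):
    """FTS: replace each word by its frequency rank (1 = most frequent)."""
    counts = Counter(words)
    # Assumption: words with equal counts are ordered alphabetically.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    rank = {w: r for r, w in enumerate(ordered, start=1)}
    return [rank[w] for w in words]

words = tokenize("The cat saw the dog; the dog saw the cat.")
lts = length_series(words)          # word lengths, in reading order
fts = frequency_rank_series(words)  # frequency ranks, in reading order
```

Both series keep the reading order of the text, which is what allows them to be treated as time series.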

To establish a baseline for comparison, shuffled versions of the original texts are created by randomly permuting the words. This destroys word correlations while preserving the original distributions of word lengths and frequencies.
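A shuffled baseline of this kind can be produced with a seeded permutation of the word list. The sketch below is an assumption about how one might do it (the function name `shuffled_copy` and the fixed seed are illustrative choices); the check at the end shows that the word counts, and hence the length and frequency distributions, are unchanged.

```python
import random
from collections import Counter

def shuffled_copy(words, seed=0):
    """Return a randomly permuted copy of the word list (input is untouched)."""
    rng = random.Random(seed)   # seeded so the surrogate is reproducible
    out = list(words)
    rng.shuffle(out)
    return out

original = "the cat saw the dog and the dog saw the cat".split()
surrogate = shuffled_copy(original, seed=1)

# Same multiset of words, so the same length and frequency distributions.
same_distribution = Counter(original) == Counter(surrogate)
```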

A crucial step in the analysis involves transforming these raw time series into "fluctuations". For a series $x_i$, a new series $M_i$ is created based on the sign of the difference between consecutive elements: $M_i = 2$ if $x_{i+1} > x_i$, $M_i = 1$ if $x_{i+1} < x_i$, and $M_i = 0$ if $x_{i+1} = x_i$. The resulting series $M_i$ (for $1 \le i \le N-1$, i.e. $N-1$ values for a text of $N$ words) is then subjected to multifractal analysis.
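The sign-based mapping described here can be written compactly with NumPy. This is a minimal sketch; the function name `fluctuation_series` is hypothetical, and the output has one element fewer than the input, matching the length $N-1$ used in the box-counting step.

```python
import numpy as np

def fluctuation_series(x):
    """Map a series x to M: 2 where it rises, 1 where it falls, 0 where it is flat."""
    d = np.diff(np.asarray(x))        # d[i] = x[i+1] - x[i], length N - 1
    m = np.zeros(d.shape, dtype=int)  # 0 marks x[i+1] == x[i]
    m[d > 0] = 2                      # increase
    m[d < 0] = 1                      # decrease
    return m

m = fluctuation_series([3, 5, 5, 1])  # rises, stays flat, falls
```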

The chosen method for multifractal analysis is the classical Box Counting (MF-BOX) technique. While other methods like Wavelet Transform Modulus Maxima (WTMM) or Multifractal Detrended Fluctuation Analysis (MF-DFA) exist and are often preferred for non-stationary signals, the author argues that MF-BOX is suitable here because the text-derived series do not exhibit significant trends.

The MF-BOX method proceeds as follows:

Divide the fluctuation series $M_i$ of length $N-1$ into $N_s$ non-overlapping boxes (subseries) of size $s$.
For each box $v$, calculate a "probability" measure $P(s, v)$ by summing the values within the box (Eq. 1).
Compute the partition function $\chi(s, q)$ by summing the $q$-th powers of $P(s, v)$ over all boxes of size $s$ (Eq. 2).
Identify scaling behavior: $\chi(s, q) \sim s^{\tau(q)}$ (Eq. 3). The exponent $\tau(q)$ is estimated from the slope of a log-log plot of $\chi(s, q)$ versus $s$ (Figs. 1-2).
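Calculate the generalized Hurst exponent $h(q)$ from $\tau(q)$ using the relationship $h(q) = (1 + \tau(q))/q$, equivalently $\tau(q) = q\,h(q) - 1$ (Eq. 4). With this relation a uniform measure, for which $\tau(q) = q - 1$, gives the flat value $h(q) = 1$. For $q=2$, $h(2)$ relates to the standard Hurst exponent.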
Calculate the generalized fractal dimension $D(q)$ from $\tau(q)$ using $D(q) = \tau(q) / (q-1)$ (Eq. 5).
Compute the singularity spectrum $f(\alpha)$ and the singularity exponent $\alpha$ using a Legendre transform-like approach: $\alpha = d\tau(q)/dq$ (Eq. 6) and $f(\alpha) = q\alpha - \tau(q)$ (Eq. 7).
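The box-counting steps above can be sketched as a single NumPy function. This is an illustration under stated assumptions rather than the paper's implementation: the box sums are normalised by the total of the series (which rescales $\chi$ but does not change the fitted slope), boxes whose sum is zero are dropped so that negative $q$ stays finite, trailing elements that do not fill a box are discarded, and the derivative in Eq. 6 is taken by finite differences. The name `mf_box` is hypothetical. The conversion to $h(q)$ (Eq. 4) is a one-line step on the returned `tau` and is left out.

```python
import numpy as np

def mf_box(m, q_values, box_sizes):
    """Box-counting estimates of tau(q), D(q), alpha(q) and f(alpha)."""
    m = np.asarray(m, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    total = m.sum()
    log_s, log_chi = [], []
    for s in box_sizes:
        n_boxes = len(m) // s
        if n_boxes < 2:
            continue                                   # too few boxes to be useful
        boxes = m[: n_boxes * s].reshape(n_boxes, s)   # non-overlapping boxes
        p = boxes.sum(axis=1) / total                  # P(s, v); normalisation assumed
        p = p[p > 0]                                   # assumption: skip empty boxes
        chi = np.array([np.sum(p ** q) for q in q_values])  # partition function
        log_s.append(np.log(s))
        log_chi.append(np.log(chi))
    # One straight-line fit per q: polyfit accepts one data set per column.
    tau = np.polyfit(np.array(log_s), np.array(log_chi), 1)[0]
    d = np.full_like(tau, np.nan)                      # D(q) is undefined at q = 1
    ok = q_values != 1
    d[ok] = tau[ok] / (q_values[ok] - 1)
    alpha = np.gradient(tau, q_values)                 # alpha = d tau / d q
    f = q_values * alpha - tau                         # f(alpha) = q alpha - tau
    return tau, d, alpha, f

# Usage on a synthetic random series with values in {0, 1, 2}.
rng = np.random.default_rng(0)
demo = rng.integers(0, 3, size=5000)
q = np.arange(-5, 5.5, 0.5)
tau, d, alpha, f = mf_box(demo, q, range(3, 200))
```

A quick sanity check is a constant series: its measure is uniform, so the fit should return $\tau(q) = q - 1$, with $\alpha$ and $f(\alpha)$ both equal to 1.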

For practical implementation, the range of box sizes $s$ is selected (e.g., $2 < s < 200$) to ensure a good power-law fit for $\chi(s, q)$, and $\tau(q)$ is obtained via linear regression on the log-log plots. The analysis is performed for a range of $q$ values to probe different moments of the measure distribution, corresponding to different fluctuation scales: negative $q$ emphasizes small fluctuations, positive $q$ emphasizes large fluctuations.

The results show that the original texts exhibit multifractality, indicated by $h(q)$ varying with $q$ (Figs. 3-4). Shuffled texts, in contrast, show $h(q)$ curves that are much flatter and closer to $h(q) = 1.0$, characteristic of monofractal or uncorrelated series. This confirms that the multifractality in original texts is not solely due to the distribution of word properties but reflects underlying correlations.

Comparing original texts, the $h(q)$ and $D(q)$ curves show distinctions, particularly between the English texts (AWL, TLG) and the Esperanto translation (ESP) when analyzed via FTS. This suggests that FTS might be more sensitive to differences introduced by translation or language structure compared to LTS, which seems to group AWL and ESP more closely while separating TLG.

The $f(\alpha)$ spectra (Figs. 5-7) provide further insight. Original texts produce non-parabolic $f(\alpha)$ curves, contrasting sharply with the narrow, symmetric (though not perfectly parabolic due to finite size effects) curves of the shuffled texts. The non-parabolic shape signifies non-uniformity and strong long-range order correlations (LROC). The width of the $f(\alpha)$ spectrum ($\alpha_+ - \alpha_-$) and the extreme values ($\alpha_-, \alpha_+$) quantify the range of scaling exponents present in the text. The Esperanto text exhibits a larger $\alpha_+$ value than the English texts (Table III), indicating that its FTS contains more extreme fluctuations (large values/ranks). The FTS spectra are generally wider than LTS spectra, suggesting richer multifractality in word frequency patterns compared to word length patterns.
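The spectrum descriptors mentioned here (extreme values, width) are easy to extract once $\alpha$ and $f(\alpha)$ are available as arrays. The sketch below also adds two quantities that are assumptions of this example, not definitions from the paper: the left and right half-widths around the peak of $f(\alpha)$ as a simple asymmetry indicator, and the RMS residual of a least-squares parabola as a rough measure of how far the curve is from parabolic. The name `spectrum_summary` is hypothetical, and at least three distinct $\alpha$ values are needed for the parabola fit.

```python
import numpy as np

def spectrum_summary(alpha, f):
    """Summarise an f(alpha) curve given as two equal-length arrays."""
    alpha = np.asarray(alpha, dtype=float)
    f = np.asarray(f, dtype=float)
    a_minus, a_plus = alpha.min(), alpha.max()
    a_peak = alpha[np.argmax(f)]              # position of the maximum of f
    coeffs = np.polyfit(alpha, f, 2)          # best-fitting parabola (assumed check)
    resid = f - np.polyval(coeffs, alpha)
    return {
        "alpha_minus": float(a_minus),
        "alpha_plus": float(a_plus),
        "width": float(a_plus - a_minus),
        "left": float(a_peak - a_minus),      # half-width below the peak
        "right": float(a_plus - a_peak),      # half-width above the peak
        "parabola_rms": float(np.sqrt(np.mean(resid ** 2))),
    }

# Synthetic, exactly parabolic spectrum for illustration (not data from the paper).
alpha_demo = np.linspace(0.5, 1.5, 11)
summary = spectrum_summary(alpha_demo, 1.0 - (alpha_demo - 1.0) ** 2)
```

For the synthetic parabola the two half-widths are equal and the residual is numerically zero; a skewed or flattened spectrum would show up as unequal half-widths and a larger residual.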

Practical Applications and Implementation Considerations:

Author Attribution/Style Analysis: The shape and parameters of the h(q) and f(α) curves (e.g., width, asymmetry, $\alpha_-, \alpha_+$ values, h(2) value) can potentially serve as quantitative fingerprints of an author's style or the structural complexity of a text. Implementing this involves calculating these features for a corpus of texts and using them as features in a classification or clustering task.
Translation Analysis/Evaluation: The marked differences observed between original English texts and the Esperanto translation's multifractal properties (especially in FTS) suggest this method could be used to quantitatively compare a translation to its original source. It might provide indicators for evaluating how well the translated text retains the structural complexity or "style" of the original. An implementation could calculate the multifractal features for both original and translated versions and quantify the distance or difference between their h(q) and f(α) curves.
Text Complexity Assessment: The degree of multifractality (measured by the width of f(α) or the variation in h(q)) is proposed as an indicator of text complexity. Original texts are shown to be far from the random, less complex shuffled versions. This could potentially be used in natural language processing tasks requiring text complexity scoring.
Distinguishing Text Types: The paper suggests the method could distinguish natural language from computer code or other symbolic sequences. This could be applied in data cleaning, source code analysis, or even security contexts (e.g., detecting generated text).
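One simple way to quantify the difference between two texts' curves, as suggested for the translation and attribution use cases above, is an RMS distance between $h(q)$ (or $f(\alpha)$) values sampled on a shared $q$ grid. The choice of RMS distance and the names `curve_distance` and `pairwise_distances` are assumptions of this sketch; the paper does not prescribe a metric. The curves in the usage lines are synthetic placeholders, not results from the paper.

```python
import numpy as np

def curve_distance(y_a, y_b):
    """RMS difference between two curves sampled on the same q grid."""
    y_a = np.asarray(y_a, dtype=float)
    y_b = np.asarray(y_b, dtype=float)
    if y_a.shape != y_b.shape:
        raise ValueError("curves must be sampled on the same grid")
    return float(np.sqrt(np.mean((y_a - y_b) ** 2)))

def pairwise_distances(curves):
    """Distance for every unordered pair of labelled curves."""
    labels = list(curves)
    return {
        (a, b): curve_distance(curves[a], curves[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    }

# Synthetic curves only: a flat curve and a gently tilted one.
q = np.linspace(-5, 5, 21)
toy = {"flat": np.ones_like(q), "tilted": 1.0 - 0.02 * q}
distances = pairwise_distances(toy)
```

The resulting distances, or the spectrum descriptors themselves, could then feed a standard clustering or classification step.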

Implementation Details & Trade-offs:

Data Preprocessing: Implementing the text-to-"fluctuation" series mapping requires careful handling of tokenization, punctuation, capitalization, and defining what constitutes a "word" or "element" in the series. The paper mentions disregarding punctuation and simplifying contractions, which are specific choices affecting the series construction.
Series Type Choice (FTS vs. LTS): The choice depends on the specific analysis goal. FTS captures patterns related to vocabulary usage and repetition structure (word ranks), while LTS captures patterns related to sentence structure and word-level rhythm (word lengths). The study shows they reveal different aspects of text structure.
Multifractal Algorithm: While MF-BOX was used here due to perceived lack of trend, MF-DFA is generally more robust to non-stationarities and might be preferred for other text types or signals. Implementing MF-DFA would involve detrending the series locally before fluctuation analysis.
Computational Cost: Calculating $\chi(s, q)$ for a wide range of $s$ and $q$ and performing linear fits for each $q$ can be computationally intensive, especially for long texts. Optimized numerical libraries for linear algebra and signal processing would be beneficial. The choice of $s$ and $q$ ranges impacts both accuracy and computation time.
Validation with Shuffling: The use of shuffled surrogates is critical for confirming that observed multifractality arises from correlations rather than just the static distribution of word properties. This adds a necessary layer of computation.
Statistical Significance: Evaluating the statistical significance of differences between multifractal spectra requires comparing results across multiple texts or using bootstrapping/permutation tests, which increases computational load.
Choice of "Fluctuation" Mapping: The paper uses a specific $\{0, 1, 2\}$ mapping based on increasing/decreasing values. Other ways to define fluctuations (e.g., deviations from a moving average, absolute differences) could also be explored and might yield different results.
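The shuffling and significance points above can be combined into a small permutation test: compute a statistic on the original series, recompute it on many shuffled copies, and report how often the shuffled value is at least as large. This is a generic sketch, not a procedure from the paper. The name `surrogate_pvalue` is hypothetical, and the lag-1 autocorrelation used in the example is only a cheap stand-in; in practice the `statistic` callable would build the fluctuation series and return, for instance, the width of $f(\alpha)$.

```python
import numpy as np

def surrogate_pvalue(series, statistic, n_surrogates=200, seed=0):
    """One-sided permutation test of `statistic` against shuffled surrogates."""
    rng = np.random.default_rng(seed)
    series = np.asarray(series)
    observed = statistic(series)
    # Each surrogate keeps the value distribution but destroys the ordering.
    null = np.array([statistic(rng.permutation(series))
                     for _ in range(n_surrogates)])
    # The +1 terms count the observed value itself, so p is never exactly 0.
    p = (1 + np.sum(null >= observed)) / (n_surrogates + 1)
    return observed, null, float(p)

def lag1_autocorrelation(x):
    """Stand-in statistic: correlation between consecutive elements."""
    x = np.asarray(x, dtype=float)
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

# Strongly ordered synthetic series, so the original stands far from the shuffles.
ordered = np.sin(np.linspace(0, 6 * np.pi, 300))
observed, null, p = surrogate_pvalue(ordered, lag1_autocorrelation, n_surrogates=99)
```

The cost scales linearly with the number of surrogates, which is the extra computational load noted above.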

In summary, this research provides a physics-inspired methodology to analyze texts as complex systems, offering practical quantitative measures (h(q), f(α)) for text structure, style, and translation effects. Implementing this requires careful text preprocessing, construction of appropriate time series (FTS/LTS), applying a multifractal analysis algorithm (like MF-BOX), and comparing results against shuffled baselines. The method holds potential for various applications in computational linguistics and digital humanities.
