
Unifying concepts in information-theoretic time-series analysis (2505.13080v2)

Published 19 May 2025 in cs.IT and math.IT

Abstract: Information theory is a powerful framework for quantifying complexity, uncertainty, and dynamical structure in time-series data, with widespread applicability across disciplines such as physics, finance, and neuroscience. However, the literature on these measures remains fragmented, with domain-specific terminologies, inconsistent mathematical notation, and disparate visualization conventions that hinder interdisciplinary integration. This work addresses these challenges by unifying key information-theoretic time-series measures through shared semantic definitions, standardized mathematical notation, and cohesive visual representations. We compare these measures in terms of their theoretical foundations, computational formulations, and practical interpretability -- mapping them onto a common conceptual space through an illustrative case study with functional magnetic resonance imaging time series in the brain. This case study exemplifies the complementary insights these measures offer in characterizing the dynamics of complex neural systems, such as signal complexity and information flow. By providing a structured synthesis, our work aims to enhance interdisciplinary dialogue and methodological adoption, which is particularly critical for reproducibility and interoperability in computational neuroscience. More broadly, our framework serves as a resource for researchers seeking to navigate and apply information-theoretic time-series measures to diverse complex systems.

Summary

  • The paper unifies fragmented information‐theoretic measures into a clear taxonomy, categorizing them by process type, temporal order, and directionality.
  • The paper demonstrates how these measures quantify complexity and dynamic interactions in time series, with a detailed fMRI case study in computational neuroscience.
  • The paper provides practical implementation details using open-source tools like JIDT and pyspi, aiding reproducible and interdisciplinary research.

This paper provides a unified framework and guide to eleven key information-theoretic measures for analyzing time series data, aiming to bridge the gaps between fragmented literature, inconsistent notation, and disparate terminologies across disciplines. It focuses on clarifying the conceptual underpinnings, computational formulations, and practical interpretations of these measures, particularly illustrating their application in computational neuroscience using functional magnetic resonance imaging (fMRI) data. The work highlights how information theory offers a model-free approach to quantify complexity, uncertainty, and dynamical structure in time series, applicable across diverse scientific domains. (2505.13080)

The authors begin by defining a time series as a realization of a stochastic process and emphasize the importance of the ergodic assumption, which allows treating samples over time from a single realization as multiple samples from the underlying process distribution. This is crucial for estimating probability distributions from limited empirical data. They discuss handling both discrete and continuous variables, noting that continuous data often requires probability density function (PDF) estimation using methods like Gaussian, box kernel, Kozachenko-Leonenko, or Kraskov-Stögbauer-Grassberger (KSG) estimators.

The core of the paper is a taxonomy that categorizes information-theoretic time-series measures based on three characteristics:

  1. Single-process vs. Pairwise: Whether the measure applies to a single time series or quantifies relationships between two.
  2. Order-independent vs. Order-dependent: Whether the measure is based on the distribution of values regardless of temporal order (static) or is sensitive to temporal dynamics and time-lags.
  3. Undirected vs. Directed: For pairwise measures, whether the relationship captured is symmetric or directional.

This leads to six categories: Single-process Order-independent, Pairwise Order-independent (Undirected and Directed), Single-process Order-dependent, and Pairwise Order-dependent (Undirected and Directed).

The paper provides a detailed breakdown of eleven measures within this taxonomy:

1. Single-process, Order-independent Measures:

  • Entropy (H(X)):
    • What: Quantifies the average uncertainty or surprisal in the values of a single time series. Higher entropy means more unpredictable values.
    • How: Calculated as $H(X) = -\sum_{x \in X} p(x) \log p(x)$ for discrete variables, or the differential entropy $H(X) = -\int f(x) \log f(x)\,dx$ for continuous variables. Estimation from empirical continuous data involves estimating the PDF (e.g., using Kozachenko-Leonenko). The base of the logarithm determines units (bits for base 2, nats for base e).
    • Why: Useful for characterizing the intrinsic variability or complexity of a single signal. In neuroscience, it can indicate how "spread out" or diverse the activity values of a brain region are. A higher value for a region suggests less deterministic, more varied activity patterns.
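
As a concrete (and deliberately simple) illustration of the estimation step, the sketch below computes a binned plug-in entropy in bits and the Gaussian closed-form differential entropy in nats for a toy signal. The random series, bin count, and variable names are assumptions for illustration only; the paper itself points to nonparametric estimators such as Kozachenko-Leonenko via JIDT/pyspi.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)  # stand-in for a single time series (illustrative)

# Discrete plug-in estimate: bin the values and apply H = -sum p log2 p (bits).
counts, _ = np.histogram(x, bins=16)
p = counts / counts.sum()
p = p[p > 0]
h_binned_bits = -np.sum(p * np.log2(p))

# Gaussian closed form for differential entropy (nats): 0.5 * ln(2*pi*e*var).
h_gaussian_nats = 0.5 * np.log(2 * np.pi * np.e * np.var(x))

print(f"binned plug-in entropy: {h_binned_bits:.3f} bits")
print(f"Gaussian differential entropy: {h_gaussian_nats:.3f} nats")
```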

2. Pairwise Order-independent Measures (Undirected):

  • Joint Entropy (H(X,Y)):
    • What: Quantifies the total uncertainty associated with simultaneously observing two time series, X and Y.
    • How: Calculated as $H(X,Y) = -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(x,y)$ for discrete variables, or the multivariate differential entropy for continuous variables. Estimated using methods like Kozachenko-Leonenko. For Gaussian distributions, a closed-form solution exists based on the covariance matrix.
    • Why: Measures the combined variability of two processes. A joint entropy noticeably lower than the sum of the individual entropies indicates shared structure or redundancy between them. In neuroscience, it can inform about the degree of statistical dependence or coupling between two brain regions.
  • Mutual Information (I(X;Y)):
    • What: Quantifies the amount of information that one time series provides about another (and vice versa). It measures the reduction in uncertainty about one variable given knowledge of the other.
    • How: Calculated as $I(X;Y) = H(X) + H(Y) - H(X,Y)$, or equivalently $I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}$. Estimated using methods like KSG, which can capture nonlinear dependencies. For Gaussian variables, MI is directly related to the Pearson correlation coefficient: $MI_{\text{Gaussian}} = -\frac{1}{2} \log(1-r^2)$.
    • Why: A standard measure of association between two variables, sensitive to both linear and nonlinear dependencies. In neuroscience, it can quantify functional connectivity between brain regions, indicating how much their activity patterns are statistically related.
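
A minimal sketch of the relations above under a Gaussian-model assumption, where joint entropy has a closed form based on the covariance matrix and MI reduces to the correlation form noted in the text. The coupled toy series and coefficients are invented for illustration; the paper's recommended KSG estimator would replace these closed forms for nonlinear data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.standard_normal(n)
y = 0.7 * x + 0.3 * rng.standard_normal(n)   # toy coupled pair (illustrative)

# Joint differential entropy of a bivariate Gaussian:
# H(X,Y) = 0.5 * ln((2*pi*e)^2 * det(Sigma)), Sigma the 2x2 covariance matrix.
cov = np.cov(x, y)
h_xy = 0.5 * np.log(((2 * np.pi * np.e) ** 2) * np.linalg.det(cov))

# Marginal entropies and I(X;Y) = H(X) + H(Y) - H(X,Y).
h_x = 0.5 * np.log(2 * np.pi * np.e * cov[0, 0])
h_y = 0.5 * np.log(2 * np.pi * np.e * cov[1, 1])
mi = h_x + h_y - h_xy

# Cross-check against MI_Gaussian = -0.5 * ln(1 - r^2).
r = np.corrcoef(x, y)[0, 1]
mi_from_r = -0.5 * np.log(1 - r ** 2)

print(f"H(X,Y) = {h_xy:.3f} nats, I(X;Y) = {mi:.3f} nats (via r: {mi_from_r:.3f})")
```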

3. Pairwise Order-independent Measures (Directed):

  • Conditional Entropy (H(Y|X)):
    • What: Quantifies the remaining uncertainty in one time series (Y) after observing another (X). It is a directed measure; $H(Y|X)$ is not necessarily equal to $H(X|Y)$.
    • How: Calculated as $H(Y|X) = H(X,Y) - H(X)$, or equivalently $H(Y|X) = -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(y|x)$. Estimated as the difference between joint and marginal differential entropies for continuous data.
    • Why: Reveals how much information is not shared between two processes. In neuroscience, $H(Y|X)$ shows how unpredictable region Y's activity is even when region X's activity is known. A smaller value means knowing X significantly reduces uncertainty about Y.
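
Continuing the Gaussian-model sketch (an assumption, not the paper's estimator of choice), conditional entropy follows directly from the chain rule $H(Y|X) = H(X,Y) - H(X)$, and the asymmetry between the two directions falls out of the differing marginal entropies.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.standard_normal(n)
y = 0.7 * x + 0.3 * rng.standard_normal(n)   # toy pair with unequal variances

cov = np.cov(x, y)
h_x  = 0.5 * np.log(2 * np.pi * np.e * cov[0, 0])
h_y  = 0.5 * np.log(2 * np.pi * np.e * cov[1, 1])
h_xy = 0.5 * np.log(((2 * np.pi * np.e) ** 2) * np.linalg.det(cov))

# Chain rule: H(Y|X) = H(X,Y) - H(X); the two directions generally differ.
h_y_given_x = h_xy - h_x
h_x_given_y = h_xy - h_y
print(f"H(Y|X) = {h_y_given_x:.3f} nats, H(X|Y) = {h_x_given_y:.3f} nats")
```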

4. Single-process, Order-dependent Measures:

  • Active Information Storage (A(X)):
    • What: Quantifies the amount of information from a time series's past (up to memory length k) that is used to predict its present state. It measures temporal predictability or memory within a single process.
    • How: Defined as the mutual information between the past $X_t^{(k)} = \{X_t, X_{t-1}, \dots, X_{t-k+1}\}$ and the present $X_{t+1}$: $A(X)(k) = I(X_t^{(k)}; X_{t+1})$. Estimated using methods like KSG. Choosing the memory length k involves trade-offs between capturing long-range dependencies and managing computational complexity and sample-size requirements. The ergodic assumption is used to pool samples across time.
    • Why: Characterizes the intrinsic dynamics of a single system. In neuroscience, high AIS for a region indicates that its current activity is strongly predictable from its own past, suggesting complex yet deterministic internal dynamics. Low AIS suggests less self-dependence or greater external influence.
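
A sketch of AIS with a linear-Gaussian estimator on a toy AR(1) process (the process, the memory length, and the `h_gauss` helper are assumptions for illustration; the paper computes AIS with KSG-style estimators via JIDT). The embedding pools samples across time, relying on the ergodic assumption described earlier.

```python
import numpy as np

def h_gauss(*cols):
    """Differential entropy (nats) of a multivariate Gaussian fit to the columns."""
    Z = np.column_stack(cols)
    cov = np.atleast_2d(np.cov(Z, rowvar=False))
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (Z.shape[1] * np.log(2 * np.pi * np.e) + logdet)

rng = np.random.default_rng(0)
n, k = 2000, 3
x = np.zeros(n)
for t in range(1, n):                     # AR(1): a process with genuine memory
    x[t] = 0.8 * x[t - 1] + rng.standard_normal()

# Rows are time points; columns hold the k-step past X_t^{(k)} = {X_t, ..., X_{t-k+1}}.
past = np.column_stack([x[k - 1 - j : n - 1 - j] for j in range(k)])
present = x[k:]                           # X_{t+1}, aligned with each past row

# A(X)(k) = I(X_t^{(k)}; X_{t+1}) = H(past) + H(present) - H(past, present).
ais = h_gauss(past) + h_gauss(present) - h_gauss(past, present)
print(f"Active information storage (k={k}): {ais:.3f} nats")
```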

5. Pairwise Order-dependent Measures (Undirected):

  • Stochastic Interaction (SI(X,Y)):
    • What: Quantifies the integrated information between two processes, measuring how much more uncertain their future becomes when they are modeled independently rather than as a coupled system.
    • How: Calculated as the sum of conditional entropies of each process given its own past, minus the joint entropy of both processes given their combined past: $SI(X,Y) = H(X_{t+1}|X_t) + H(Y_{t+1}|Y_t) - H(X_{t+1}, Y_{t+1}|X_t, Y_t)$, typically using a memory length of k=1 for simplicity. Estimated using methods like Kozachenko-Leonenko.
    • Why: Measures the degree to which two systems evolve together in a way that cannot be explained by their individual dynamics. In neuroscience, high SI between two regions implies strong shared information between their past and present states, reflecting integrated dynamics beyond their separate internal processes.
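
A sketch of stochastic interaction with memory length k=1 under the same Gaussian-model assumption, writing each conditional entropy as a difference of joint entropies. The coupled AR processes and coefficients are invented, and the `h_gauss` helper is an illustrative stand-in for the Kozachenko-Leonenko estimator mentioned in the text.

```python
import numpy as np

def h_gauss(*cols):
    """Differential entropy (nats) of a Gaussian fit to the stacked columns."""
    Z = np.column_stack(cols)
    cov = np.atleast_2d(np.cov(Z, rowvar=False))
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (Z.shape[1] * np.log(2 * np.pi * np.e) + logdet)

rng = np.random.default_rng(0)
n = 2000
x, y = np.zeros(n), np.zeros(n)
for t in range(1, n):                     # two mutually coupled AR(1) processes
    x[t] = 0.6 * x[t - 1] + 0.3 * y[t - 1] + rng.standard_normal()
    y[t] = 0.6 * y[t - 1] + 0.3 * x[t - 1] + rng.standard_normal()

x_now, x_prev = x[1:], x[:-1]             # X_{t+1}, X_t
y_now, y_prev = y[1:], y[:-1]             # Y_{t+1}, Y_t

# Conditional entropies as differences of joint entropies, then SI(X,Y).
h_x_cond  = h_gauss(x_now, x_prev) - h_gauss(x_prev)                  # H(X_{t+1}|X_t)
h_y_cond  = h_gauss(y_now, y_prev) - h_gauss(y_prev)                  # H(Y_{t+1}|Y_t)
h_xy_cond = h_gauss(x_now, y_now, x_prev, y_prev) - h_gauss(x_prev, y_prev)

si = h_x_cond + h_y_cond - h_xy_cond
print(f"Stochastic interaction (k=1): {si:.3f} nats")
```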

6. Pairwise Order-dependent Measures (Directed):

  • Time-lagged Mutual Information (TLMI(X;Y)):
    • What: Quantifies the statistical dependence between a past value of one process (X) and a present or future value of another (Y). It captures time-lagged correlation.
    • How: Calculated as the mutual information between $X_t$ and $Y_{t+1}$: $I(X_t; Y_{t+1}) = H(X_t) + H(Y_{t+1}) - H(X_t, Y_{t+1})$. It's directed in the sense that $I(X_t; Y_{t+1}) \neq I(X_{t+1}; Y_t)$ due to the asymmetric time lag. Estimated using methods like KSG.
    • Why: Identifies temporal dependencies between processes where one influences the other with a delay. In neuroscience, TLMI from region X to Y indicates how predictable region Y's current activity is based on region X's past activity.
  • Causally Conditioned Entropy (CCE(Y|X)):
    • What: Quantifies the uncertainty remaining in the present value of one process (Y) after observing its own past and the past and present of another process (X).
    • How: Defined as the conditional entropy $H(Y_{t+1} | Y_t^{(k)}, X_t^{(k+1)})$, conditioning on Y's past (length k) and X's past and present (length k+1). Estimated using methods like Kozachenko-Leonenko, typically with a fixed maximum window length K due to computational constraints.
    • Why: Useful for assessing how much uncertainty about a target process remains after accounting for both its own history and that of a potential source. In neuroscience, high CCE from X to Y suggests that knowing the past of X (and the past of Y) does not significantly reduce the uncertainty in Y's present activity, indicating a weak dependency of Y on X in this context.
  • Directed Information (DI(X->Y)):
    • What: Quantifies the total directed influence of a source process (X) on the present of a target process (Y), beyond what is explained by Y's own past. Includes both lagged and contemporaneous influence from X.
    • How: Sums the conditional mutual information between the present of Y and the past and present of X, conditioned on the past of Y, over time: $DI(X \rightarrow Y) = \sum_{t=0}^{T-1} I(Y_{t+1}; X_0^{t+1} | Y_0^t)$. Empirically, for a fixed window length k, it is approximated as $I(Y_{t+1}; X_t^{(k+1)} | Y_t^{(k)})$, summed over possible window lengths or across time points using the ergodic assumption. Estimated using methods like Kozachenko-Leonenko. Notably, it includes the present of X ($X_{t+1}$) alongside X's past when predicting $Y_{t+1}$.
    • Why: A comprehensive measure of directed information flow. In neuroscience, DI(X -> Y) reveals how much knowing the past and present of region X helps predict the present of region Y, above and beyond Y's own history.
  • Transfer Entropy (TE(X->Y)):
    • What: Quantifies the directed dependence from one process (X) to another (Y), specifically how much the past of X improves the prediction of Y's present state beyond what can be predicted from Y's own past. It excludes the contemporaneous value of X from the conditioning.
    • How: Defined as the conditional mutual information between the present of Y and the past of X (length l), conditioned on the past of Y (length k): $TE(X \rightarrow Y)(k,l) = I(Y_{t+1}; X_t^{(l)} | Y_t^{(k)})$. Estimated using methods like KSG. The paper highlights that TE is equivalent to Granger Causality under a Gaussian assumption [109]. Memory lengths k and l can be optimized or fixed.
    • Why: A widely used measure of directed information flow (predictability). In neuroscience, TE(X -> Y) quantifies how much region X's past activity adds to the prediction of region Y's current activity, after accounting for Y's own history. It's often interpreted as a measure of predictive "causality" (though not necessarily true intervention-based causality). It is particularly suited for high temporal resolution data.
  • Granger Causality (GC(X->Y)):
    • What: Quantifies the additional predictive power that the past of one process (X) contributes to predicting the present of another (Y), within a linear autoregressive modeling framework.
    • How: Calculated as the log-ratio of the residual variance of an autoregressive model of Y based only on Y's past ("reduced model") versus a model based on both Y's past and X's past ("full model"): $GC(X \rightarrow Y)(k,l) = \log \frac{E[(\hat{Y}_{t+1}^{\text{reduced}} - Y_{t+1})^2]}{E[(\hat{Y}_{t+1}^{\text{full}} - Y_{t+1})^2]}$. Computed using linear regression. Due to its equivalence to TE with a Gaussian estimator, it can be implemented using information-theoretic toolkits configured for this specific case.
    • Why: A classic measure of directed functional connectivity, based on linear predictability. High GC(X -> Y) means X's past linearly helps predict Y's future beyond Y's own history. Often applied to fMRI, though the Gaussian assumption means it only captures linear dependencies, in contrast to the general-purpose information-theoretic measures.
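
The five directed measures above can all be written as (conditional) entropy differences over time-lagged embeddings, which the sketch below illustrates under a Gaussian (linear-model) assumption on a toy pair where X drives Y with a one-step lag. The embedding helpers, the reading of "X's past and present" as $\{X_{t+1}\} \cup X_t^{(k)}$, and the choice $k = l = 2$ are assumptions for illustration; in practice the paper computes these quantities with KSG or Kozachenko-Leonenko estimators via JIDT/pyspi. As a consistency check, the OLS-based Granger causality should come out close to twice the Gaussian transfer entropy.

```python
import numpy as np

def h_gauss(*cols):
    """Differential entropy (nats) of a Gaussian fit to the stacked columns."""
    Z = np.column_stack(cols)
    cov = np.atleast_2d(np.cov(Z, rowvar=False))
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (Z.shape[1] * np.log(2 * np.pi * np.e) + logdet)

def cmi_gauss(a, b, c):
    """I(a; b | c) under a Gaussian model, as a difference of joint entropies."""
    return h_gauss(a, c) + h_gauss(b, c) - h_gauss(a, b, c) - h_gauss(c)

def embed(x, k, lo, hi):
    """Rows = time points t in [lo, hi); columns = {x_t, x_{t-1}, ..., x_{t-k+1}}."""
    return np.column_stack([x[lo - j : hi - j] for j in range(k)])

rng = np.random.default_rng(0)
n, k = 3000, 2
x = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):                    # X drives Y with a one-step lag
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + 0.5 * rng.standard_normal()

lo, hi = k - 1, n - 1                    # valid prediction times t
y_next = y[lo + 1 : hi + 1]              # Y_{t+1}
y_past = embed(y, k, lo, hi)             # Y_t^{(k)}
x_past = embed(x, k, lo, hi)             # X_t^{(k)}
x_now  = x[lo + 1 : hi + 1]              # X_{t+1}, contemporaneous with Y_{t+1}

# Time-lagged mutual information I(X_t; Y_{t+1}).
x_t = x[lo:hi]
tlmi = h_gauss(x_t) + h_gauss(y_next) - h_gauss(x_t, y_next)

# Transfer entropy TE(X->Y)(k, l=k) = I(Y_{t+1}; X_t^{(l)} | Y_t^{(k)}).
te = cmi_gauss(y_next, x_past, y_past)

# Causally conditioned entropy and the windowed directed-information term,
# reading "X's past and present" as {X_{t+1}} plus X_t^{(k)} (an assumption).
x_pp = np.column_stack([x_now, x_past])
cce = h_gauss(y_next, y_past, x_pp) - h_gauss(y_past, x_pp)
di  = cmi_gauss(y_next, x_pp, y_past)

# Granger causality as the log-ratio of reduced vs. full OLS residual variances.
def resid_var(target, design):
    A = np.column_stack([np.ones(len(target)), design])
    beta, *_ = np.linalg.lstsq(A, target, rcond=None)
    return np.var(target - A @ beta)

gc = np.log(resid_var(y_next, y_past) /
            resid_var(y_next, np.column_stack([y_past, x_past])))

print(f"TLMI={tlmi:.3f}  TE={te:.3f}  CCE={cce:.3f}  DI={di:.3f}  "
      f"GC={gc:.3f} (2*TE={2 * te:.3f})")
```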

Illustrative Case Study (fMRI):

The paper demonstrates these measures using resting-state fMRI data from a single participant, focusing on connectivity from a seed region (Lateral Occipital Cortex - LOC) to other regions in the left hemisphere. Figure 4 visually presents the computed values for each measure projected onto the cortical surface. This case study serves as a practical example of how each measure provides a distinct perspective on neural dynamics:

  • Entropy and AIS show regional intrinsic properties (variability and self-predictability).
  • Order-independent pairwise measures (JE, MI, CE) reveal static statistical dependencies.
  • Order-dependent pairwise measures (SI, TLMI, CCE, DI, TE, GC) capture dynamic, time-lagged, or directed interactions.

The results show how different regions exhibit different patterns across measures (e.g., a region might have high contemporaneous MI with the seed but low time-lagged TE). This highlights the complementary nature of these measures and the importance of choosing the appropriate one based on the specific research question (e.g., interested in linear vs. nonlinear coupling, static correlation vs. dynamic flow, including/excluding contemporaneous effects).

Implementation and Software:

The paper explicitly states that all measures can be computed from empirical data using the open-source Java Information Dynamics Toolkit (JIDT) [15] and the Python package pyspi [16, 17]. pyspi wraps JIDT functionalities and provides a unified interface for computing a wide range of pairwise measures.

Practical implementation details mentioned include:

  • Estimator Choice: For continuous data, the choice of estimator (Gaussian, kernel, Kozachenko-Leonenko, KSG) is critical as it affects sensitivity to linear vs. nonlinear dependencies and computational performance. Gaussian estimators are fast but only capture linear relationships. Non-parametric estimators like KSG are more flexible but can be computationally more expensive and require careful parameter tuning (like number of nearest neighbors).
  • Memory Length (k, l): Selecting the appropriate history length for order-dependent measures is a practical challenge. Too short a history might miss relevant dependencies, while too long increases dimensionality and estimation difficulty, especially with limited data. The paper mentions optimization strategies exist [50, 99] and that typical implementations like pyspi use default maximum lengths (e.g., k=5).
  • Ergodic Assumption: Estimating probabilities from a single time series realization relies on the ergodic assumption, which may not hold perfectly for all real-world data.
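
To make the estimator-choice point concrete, the toy comparison below (an illustration, not the paper's benchmark) shows a correlation-based Gaussian MI estimate missing a purely nonlinear dependence that even a crude binned plug-in estimator detects; plug-in estimates carry their own bias, which is one reason the paper points to KSG-style estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.standard_normal(n)
y = x ** 2 + 0.1 * rng.standard_normal(n)    # purely nonlinear, nearly uncorrelated

# Gaussian (linear) estimator: MI = -0.5 * ln(1 - r^2) is ~0 because r ~ 0 here.
r = np.corrcoef(x, y)[0, 1]
mi_gaussian = -0.5 * np.log(1 - r ** 2)

# Crude binned plug-in estimator of I(X;Y) from a 2-D histogram.
joint, _, _ = np.histogram2d(x, y, bins=12)
p_xy = joint / joint.sum()
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)
mask = p_xy > 0
mi_binned = np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask]))

print(f"Gaussian-estimator MI: {mi_gaussian:.4f} nats, binned MI: {mi_binned:.4f} nats")
```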

The authors conclude by emphasizing that this unified perspective serves as a valuable resource for researchers across disciplines, enabling more informed methodological choices and fostering interdisciplinary dialogue. The provided code repository [16] further facilitates the practical application of these concepts.