Panda: A pretrained forecast model for universal representation of chaotic dynamics (2505.13755v1)
Abstract: Chaotic systems are intrinsically sensitive to small errors, challenging efforts to construct predictive data-driven models of real-world dynamical systems such as fluid flows or neuronal activity. Prior efforts comprise either specialized models trained separately on individual time series, or foundation models trained on vast time series databases with little underlying dynamical structure. Motivated by dynamical systems theory, we present Panda, Patched Attention for Nonlinear DynAmics. We train Panda on a novel synthetic, extensible dataset of $2 \times 10^4$ chaotic dynamical systems that we discover using an evolutionary algorithm. Trained purely on simulated data, Panda exhibits emergent properties: zero-shot forecasting of unseen real world chaotic systems, and nonlinear resonance patterns in cross-channel attention heads. Despite having been trained only on low-dimensional ordinary differential equations, Panda spontaneously develops the ability to predict partial differential equations without retraining. We demonstrate a neural scaling law for differential equations, underscoring the potential of pretrained models for probing abstract mathematical domains like nonlinear dynamics.
Summary
- The paper introduces Panda, a pretrained transformer that achieves zero-shot forecasting of unseen chaotic ODEs and even PDEs using synthetic chaotic datasets.
- It employs an evolutionary approach for dataset generation and a dynamics embedding with polynomial and Fourier features to enhance model generalization.
- Ablation studies emphasize the critical role of channel attention in modeling coupled dynamics, with Panda outperforming baselines on multiple experimental datasets.
This paper introduces Panda (Patched Attention for Nonlinear DynAmics), a pretrained transformer model designed for universal representation and forecasting of chaotic dynamical systems. The core challenge addressed is the difficulty of creating data-driven models that can generalize to unseen chaotic systems, primarily due to their inherent sensitivity to initial conditions and errors.
Dataset Generation: A Novel Evolutionary Approach
A key contribution is the creation of a large-scale, synthetic dataset of chaotic dynamics. This dataset is crucial for pretraining Panda and enabling its generalization capabilities.
- Founding Population: The process starts with 135 known, human-curated low-dimensional chaotic ordinary differential equations (ODEs), such as the Lorenz attractor. Parameters and initial conditions for these systems are pre-tuned to ensure chaotic behavior.
- Evolutionary Algorithm:
- Mutation: Pairs of systems are randomly selected, and their parameters are perturbed by adding Gaussian noise.
- Recombination (Skew Products): Mutated parent systems $(f_a, f_b)$ are combined using a skew product, $\dot{x}(t) = \kappa_a f_a(x) + \kappa_b \dot{x}_b$, where $x_b$ is the state of the driver system evolving under $f_b$. The scale factors $\kappa_a, \kappa_b$ are derived from the inverse root mean square (RMS) amplitudes of the parent flow fields. This construction tends to preserve chaoticity (sketched after this list).
- Selection for Chaoticity: Candidate systems undergo rigorous selection:
- Elimination of systems converging to fixed points or diverging.
- Application of the 0-1 test to distinguish chaos from periodic/quasiperiodic behavior.
- Further tests including near recurrences (to reject limit cycles), power spectrum analysis (to reject trajectories with few distinct peaks), and a data-driven Lyapunov exponent estimator.
- Data Augmentation: To increase dataset size and diversity while preserving dynamical properties:
- Random Time-Delay Embedding: Based on Takens' theorem, $x_i(t) \to x_i(t - \tau_i)$, where $\tau_i$ is a random delay.
- Convex Combinations: Linearly combining channels: $X \leftarrow CX$.
- Affine Transforms: Applying affine transformations, $X \leftarrow AX + b$ (sketched below). This process yields $2 \times 10^4$ novel chaotic ODEs, forming the training set. A separate set of $9.3 \times 10^3$ systems is held out for zero-shot evaluation, generated so that the founding systems used to create them do not overlap with those behind the training set.
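A minimal sketch of the mutation and skew-product recombination steps described above, assuming both parent vector fields are scipy-style callables f(t, x) of the same dimension; `mutate`, `make_skew_product`, and the Gaussian noise scale are illustrative names and choices, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mutate(params, sigma=0.1):
    """Perturb a parent system's parameter vector with Gaussian noise."""
    params = np.asarray(params, dtype=float)
    return params + sigma * rng.standard_normal(params.shape)

def make_skew_product(f_a, f_b, rms_a, rms_b):
    """Build the right-hand side of a skew-product system from two parent vector fields.

    Assumes both parents share the same state dimension. The driver x_b evolves
    under f_b alone; its velocity, rescaled by the inverse RMS amplitudes of the
    parent flow fields, drives the response system's flow.
    """
    kappa_a, kappa_b = 1.0 / rms_a, 1.0 / rms_b

    def rhs(t, state):
        dim = state.size // 2
        x, x_b = state[:dim], state[dim:]
        dx_b = f_b(t, x_b)                                      # driver: independent dynamics
        dx = kappa_a * f_a(t, x) + kappa_b * np.asarray(dx_b)   # response: driven by the driver's velocity
        return np.concatenate([dx, dx_b])

    return rhs
```

Candidates built this way would then be integrated (e.g., with scipy.integrate.solve_ivp) and passed through the chaoticity filters listed above before being admitted to the dataset.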
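Likewise, a minimal sketch of the three augmentations, assuming trajectories are stored as NumPy arrays of shape (channels, timesteps); the delay range and the random matrix scales are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_time_delay(X, max_delay=32):
    """Delay each channel by a random amount, then crop to a common window (Takens-style)."""
    delays = rng.integers(0, max_delay + 1, size=X.shape[0])
    T = X.shape[1] - max_delay
    return np.stack([x[tau:tau + T] for x, tau in zip(X, delays)])

def convex_combination(X):
    """Mix channels with a random row-stochastic matrix: X <- C X."""
    C = rng.random((X.shape[0], X.shape[0]))
    C /= C.sum(axis=1, keepdims=True)
    return C @ X

def affine_transform(X, scale=1.0):
    """Apply a random affine map to the channels: X <- A X + b."""
    A = rng.standard_normal((X.shape[0], X.shape[0])) * scale
    b = rng.standard_normal((X.shape[0], 1)) * scale
    return A @ X + b
```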
Panda Model Architecture
Panda builds upon the PatchTST architecture, adapting it for the specific demands of multivariate dynamical systems. It is an encoder-only, non-autoregressive, fixed-horizon forecaster.
- Input: A trajectory $\mathbf{T} \in \mathbb{R}^{C \times T}$ ($C$ channels, $T$ timesteps).
- Patching: The trajectory is divided into non-overlapping patches of length $P$. Each patch forms a token in $\mathbb{R}^{C \times P}$. This leverages Takens' embedding theorem as an inductive bias.
- Dynamics Embedding: Each patch token is embedded into a higher-dimensional space ($d_{\text{model}} = 512$). This embedding is a concatenation of:
- The raw patch values.
- Random Polynomial Features: Products of elements within the patch, e.g., $P_{c,i_1} \cdot P_{c,i_2} \cdots P_{c,i_d}$ for degree $d$; degrees $d \in \{2, 3\}$ are used.
- Random Fourier Features: $[\sin(PW + b),\ \cos(PW + b)]$. This embedding is inspired by Koopman operator theory and methods like extended Dynamic Mode Decomposition (eDMD). A sketch of both feature maps follows the architecture list.
- Temporal Attention: Standard self-attention is applied across the sequence of patches, with the channel dimension treated as a batch dimension. No positional encoding is used (NoPE).
- Channel Attention: Crucially, channel attention layers are interleaved with temporal attention layers. The token tensor is transposed so that attention is computed across channels at each patch position, $\mathrm{SelfAttention}(\mathbf{T}_P^{\top})$. This explicitly models the strong coupling between variables in dynamical systems.
- Feed-Forward Networks (FFN): Standard FFN blocks with GeLU activations and RMSNorm follow the attention layers.
- Prediction Head: The output tokens from the encoder are aggregated (e.g., by mean or max pooling along the patch sequence dimension) and then passed through a linear layer to produce a forecast of a fixed horizon H for each channel.
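For concreteness, a minimal sketch of the two random feature maps in the dynamics embedding, applied to a single patch token of shape (channels, patch_length); the number of features, the index-tuple sampling, and the Fourier weight scale are illustrative assumptions rather than the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_polynomial_features(patch, n_features=64, degrees=(2, 3)):
    """Products of randomly chosen entries within each channel of the patch."""
    C, P = patch.shape
    feats = []
    for d in degrees:
        idx = rng.integers(0, P, size=(n_features, d))    # random index tuples of length d
        feats.append(np.prod(patch[:, idx], axis=-1))     # (C, n_features) for this degree
    return np.concatenate(feats, axis=-1)

def random_fourier_features(patch, n_features=64, scale=1.0):
    """Random Fourier features [sin(PW + b), cos(PW + b)] applied per channel."""
    C, P = patch.shape
    W = rng.standard_normal((P, n_features)) * scale
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    proj = patch @ W + b                                  # (C, n_features)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)
```

The raw patch values, the polynomial features, and the Fourier features are concatenated and projected to $d_{\text{model}}$ before entering the transformer.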
Implementation Details for Training:
- Trajectories of 4096 timesteps are used.
- During training, input trajectories are fixed to 3 dimensions by randomly sampling channels from multivariate trajectories to allow efficient batching. At inference, full multivariate trajectories are processed.
- Panda has approximately 21.3M parameters.
- Patch size P=16, context length 512 timesteps (32 patches).
- The model has 8 layers, $d_{\text{model}} = 512$, and 8 attention heads.
- Training uses an MSE loss. Models can optionally be pretrained with a Masked Language Modeling (MLM) objective (infilling masked patches) before fine-tuning for forecasting.
```python
import torch

def panda_forward(trajectory_batch):
    # Pseudocode forward pass; patchify, the embedding/attention layers, and the
    # prediction head are assumed helpers, and P, L_layers, d_model are model constants.
    # trajectory_batch: (batch_size, num_channels, sequence_length)
    batch_size, num_channels, _ = trajectory_batch.shape

    # 1. Patching: split each channel into non-overlapping patches of length P
    patches = patchify(trajectory_batch, patch_length=P, stride=P)
    # patches: (batch_size, num_channels, num_patches, patch_length)
    num_patches = patches.shape[2]

    # 2. Dynamics embedding: concatenate raw patch values with random polynomial
    #    and Fourier features, then project the result to d_model
    poly_features = compute_polynomial_features(patches)
    fourier_features = compute_fourier_features(patches)
    x = torch.cat([patches, poly_features, fourier_features], dim=-1)
    x = embedding_projection(x)  # (batch_size, num_channels, num_patches, d_model)

    # 3. Transformer encoder: L layers of interleaved temporal and channel attention
    for _ in range(L_layers):
        # Temporal attention: fold channels into the batch dimension
        x = x.reshape(batch_size * num_channels, num_patches, d_model)
        x = temporal_attention_layer(x)  # self-attention + FFN + norm + residuals
        x = x.reshape(batch_size, num_channels, num_patches, d_model)

        # Channel attention: attend across channels at each patch position
        x = x.permute(0, 2, 1, 3).reshape(batch_size * num_patches, num_channels, d_model)
        x = channel_attention_layer(x)   # self-attention + FFN + norm + residuals
        x = x.reshape(batch_size, num_patches, num_channels, d_model).permute(0, 2, 1, 3)

    # 4. Prediction head: pool over the patch dimension, then map to the forecast horizon
    aggregated = x.mean(dim=2)                     # (batch_size, num_channels, d_model)
    forecast = linear_prediction_head(aggregated)  # (batch_size, num_channels, forecast_horizon)
    return forecast
```
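To illustrate the training details above (random sampling of 3 channels per trajectory, a 512-timestep context, fixed-horizon MSE loss), here is a hedged sketch of a single training step; panda_model, the horizon default, and the data layout are assumptions for illustration, not the released training code:

```python
import torch

def training_step(panda_model, optimizer, trajectories,
                  context_len=512, horizon=128, n_channels=3):
    """One MSE training step on a batch of simulated trajectories.

    trajectories: (batch_size, total_channels, 4096) tensor of ODE solutions.
    """
    B, C_total, T = trajectories.shape

    # Randomly sample 3 channels per trajectory so every example has the same width
    channel_idx = torch.stack([torch.randperm(C_total)[:n_channels] for _ in range(B)])
    batch = torch.stack([traj[idx] for traj, idx in zip(trajectories, channel_idx)])

    # Split into context (model input) and fixed-horizon target
    start = torch.randint(0, T - context_len - horizon, (1,)).item()
    context = batch[:, :, start:start + context_len]
    target = batch[:, :, start + context_len:start + context_len + horizon]

    forecast = panda_model(context)                 # (B, n_channels, horizon)
    loss = torch.nn.functional.mse_loss(forecast, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```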
Key Results and Contributions:
- Zero-Shot Forecasting of Unseen ODEs: Panda, trained purely on synthetic ODEs, significantly outperforms baseline models (Chronos, Time-MoE, TimesFM), even when the baselines are fine-tuned on Panda's training dataset. It shows strong performance across error metrics (sMAPE, MAE) and forecast horizons on the $9.3 \times 10^3$ held-out chaotic systems.
- Ablation Studies:
- Channel Attention: Proves critical for performance, confirming its importance for modeling coupled dynamics.
- Dynamics Embedding: The polynomial and Fourier features in the embedding improve performance, especially for long-horizon autoregressive rollouts.
- MLM Pretraining: Helps direct forecasting but can degrade performance on long autoregressive rollouts, suggesting a trade-off or need for alternative pretraining tasks.
- Zero-Shot Forecasting of Experimental Data: Panda generalizes to real-world experimental datasets not seen during training, including:
- Double pendulum dynamics.
- C. elegans (Eigenworms) movement.
- Coupled electronic oscillator networks. Panda outperforms Chronos-SFT on these tasks. Notably, for the electronic circuit data, Panda's relative advantage increases with the coupling strength between oscillators, further validating the utility of its channel attention mechanism.
- Neural Scaling Law for Dynamical Systems: When keeping the total number of training timepoints constant, performance on held-out systems scales positively with the number of unique dynamical systems ($N_{\text{sys}}$) encountered during training, rather than just the number of trajectories from fewer systems. This indicates that exposure to diverse dynamical attractors is more beneficial for generalization than more data from the same attractors.
- Interpretable Internal Representations:
- Analysis of attention maps reveals that Panda learns complex, non-local temporal dependencies. Some maps resemble recurrence plots (classical tools in nonlinear dynamics), while others show structures indicative of global transforms.
- Probing the model with sinusoidal inputs $[\sin(f_1 t), \sin(f_2 t)]$ reveals that channel attention enables Panda to exhibit nonlinear resonance patterns, similar to bispectra observed in turbulent flows. This behavior is absent in a univariate ablation of the model (a probing sketch follows this list).
- Spontaneous Forecasting of PDEs: Remarkably, Panda, despite being trained only on low-dimensional ODEs, demonstrates the ability to perform zero-shot forecasting of partial differential equations (PDEs) like the Kuramoto-Sivashinsky equation and the von Kármán vortex street. It outperforms baselines and can predict complex nonlinear phenomena (e.g., merging flame fronts, vortex pinchoff). This emergent capability suggests that the learned representations of dynamics are highly general.
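As a rough illustration of the sinusoidal probing mentioned in the interpretability results, a sketch that sweeps a grid of frequency pairs and records a crude response magnitude; the model interface and the scalar summary are assumptions, and the paper's actual analysis examines cross-channel attention-head activations rather than this simple statistic:

```python
import numpy as np

def resonance_scan(panda_model, freqs, n_steps=512, dt=0.02):
    """Probe the model with two-channel sinusoids [sin(f1 t), sin(f2 t)] and record
    a scalar response for every frequency pair (f1, f2)."""
    t = np.arange(n_steps) * dt
    response = np.zeros((len(freqs), len(freqs)))
    for i, f1 in enumerate(freqs):
        for j, f2 in enumerate(freqs):
            probe = np.stack([np.sin(f1 * t), np.sin(f2 * t)])  # (2, n_steps) probe input
            forecast = panda_model(probe)                        # assumed (channels, horizon) output
            response[i, j] = np.abs(forecast).mean()             # crude response magnitude
    return response  # structure over (f1, f2) hints at nonlinear resonances
```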
Practical Implications and Limitations:
- Applications: Panda offers a promising approach for building general-purpose forecasting models in scientific and engineering domains where chaotic dynamics are prevalent (e.g., fluid dynamics, neuroscience, climate modeling). Its zero-shot capabilities reduce the need for extensive task-specific data and retraining for each new system.
- Deployment: The model size (21.3M parameters) is moderate. Inference can be performed on a single GPU. The patch-based nature and fixed-horizon forecasting can be efficient.
- Limitations:
- The current work focuses on low-dimensional ODEs. Scaling to very high-dimensional systems (common in PDEs) might require architectural modifications, such as sparse channel attention, to handle the typically sparse coupling in such systems.
- The MLM pretraining task, while beneficial for some aspects, was found to degrade performance on autoregressive rollouts. Investigating alternative pretraining objectives tailored for long-term chaotic forecasting is a direction for future work.
Future Directions:
- Exploring sparse channel attention for high-dimensional systems.
- Developing more effective pretraining tasks for chaotic forecasting, especially for improving long-term rollout stability.
- Extending the evolutionary dataset generation to discover a wider range of dynamical behaviors or specific types of systems.
In summary, Panda represents a significant step towards building foundation models for scientific machine learning, particularly in the domain of nonlinear dynamics. Its novel dataset generation strategy, dynamics-informed architecture (especially channel attention and Koopman-inspired embeddings), and impressive zero-shot generalization to both unseen ODEs and qualitatively different PDEs showcase the potential of pretraining on diverse synthetic data to learn fundamental principles of complex systems.