
Chaotic Oscillatory Transformer Network

Updated 16 November 2025
  • Chaotic Oscillatory Transformer Network (COTN) is a deep neural forecasting model that integrates a Transformer backbone with chaotic Lee Oscillator activations to manage extreme volatility.
  • It employs innovative components like Max-over-Time pooling, lambda gating, and an Autoencoder Self-Regressive module to enhance anomaly detection and prediction accuracy.
  • Experimental evaluations show up to 17% lower forecasting error than baselines such as Informer and GARCH, demonstrating robustness in volatile environments.

The Chaotic Oscillatory Transformer Network (COTN) is a neural forecasting architecture designed for highly volatile, nonlinear time-series systems encountered in domains such as financial markets and electricity trading. It combines a Transformer backbone with a chaotic Lee Oscillator activation, Max-over-Time pooling, a lambda (λ) gating mechanism, and an Autoencoder Self-Regressive (ASR) module for anomaly isolation. COTN addresses the limitations of standard activation functions (e.g., ReLU, GELU) under extreme fluctuations, enabling accurate prediction and robust anomaly handling during abrupt systemic changes.

1. Architectural Framework

COTN architecture consists of multiple sequential processing stages, each tailored for stability and adaptivity in chaotic environments:

  • Input Preprocessing: Missing timestamps are forward-filled; statistical outliers (e.g., ±20% returns) are removed; derived features include log-returns, moving averages, and volatilities. The Autoencoder Self-Regressive (ASR) module precomputes anomaly scores on the raw input (a sketch of this step follows this list).
  • Embedding Layer: Maps each time-step feature vector to a continuous latent representation.
  • Encoder Blocks: Each Transformer encoder contains:
    • Distilled Multi-Head Self-Attention (adapted from DAT) for efficient context aggregation ($\mathcal{O}(L \log L)$ complexity with strided/pooling options).
    • Add & Norm.
    • Feed-forward sub-layer: Linear projection, Lee Oscillator Activation, Max-over-Time pooling, λ-Gating with GELU, Linear projection.
    • Add & Norm.
  • Decoder or Prediction Head: Predicts all future steps (H) in one pass.
  • Output Layer: Final linear projection to produce raw forecasted values.
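As a minimal sketch of the preprocessing step in the first bullet above, the following pandas snippet forward-fills missing timestamps, drops ±20% outlier returns, and derives log-returns, a moving average, and rolling volatility. The 1-minute frequency, column names, and 20-step windows are illustrative assumptions, not values fixed by the source.

import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Assumes a DatetimeIndex and an OHLCV-style 'close' column (illustrative names).
    df = df.asfreq("1min").ffill()                          # forward-fill missing timestamps
    ret = df["close"].pct_change()
    df = df[~(ret.abs() > 0.20)]                            # remove +/-20% outlier returns
    feats = pd.DataFrame(index=df.index)
    feats["log_ret"] = np.log(df["close"]).diff()           # log-returns
    feats["ma_20"] = df["close"].rolling(20).mean()         # moving average
    feats["vol_20"] = feats["log_ret"].rolling(20).std()    # rolling volatility
    return feats.dropna()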

A key novelty is the replacement of conventional nonlinearities in feed-forward layers with a hybrid Lee Oscillator + GELU activation pipeline, mediated by pooling and gating mechanisms to modulate responsiveness and stability.
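A compact module skeleton may help fix the block layout. This is a sketch only: standard nn.MultiheadAttention stands in for the distilled attention variant, and hybrid_act is assumed to implement the λ-gated Lee/GELU activation pipeline detailed in Sections 2 and 3.

import torch
import torch.nn as nn

class COTNEncoderBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, hybrid_act):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff_in = nn.Linear(d_model, d_ff)
        self.ff_out = nn.Linear(d_ff, d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.hybrid_act = hybrid_act            # λ-gated Lee/GELU activation (assumed callable)

    def forward(self, x):                       # x: [batch, length, d_model]
        a, _ = self.attn(x, x, x)               # self-attention (stand-in for distilled MHSA)
        x = self.norm1(x + a)                   # Add & Norm
        h = self.hybrid_act(self.ff_in(x))      # feed-forward with hybrid activation
        x = self.norm2(x + self.ff_out(h))      # Add & Norm
        return x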

2. Lee Oscillator Activation and Dynamics

The Lee Oscillator is a discrete-time dynamical system originally formulated with excitatory (E), inhibitory (I), input (Ω), and output (L) variables:

$$
\begin{aligned}
E(t+1) &= \operatorname{Sig}(e_1\,E(t) - e_2\,I(t) + S(t) - \xi_E) \\
I(t+1) &= \operatorname{Sig}(i_1\,E(t) - i_2\,I(t) - \xi_I) \\
\Omega(t+1) &= \operatorname{Sig}(S(t)) \\
L(t) &= [E(t) - I(t)]\,e^{-k S^2(t)} + \Omega(t)
\end{aligned}
$$

where $\operatorname{Sig}(z)$ is typically either the sigmoid or $\tanh(\mu z)$; $S(t)$ is the external stimulus; $\xi_E$ and $\xi_I$ are thresholds.

COTN employs the extended LORS variant, incorporating retrograde signaling for additional dynamical richness:

$$
\begin{aligned}
E(t+1) &= \tanh(a_1 L(t) + a_2 E(t) - a_3 I(t) + a_4 S(t) - \xi_E) \\
I(t+1) &= \tanh(b_1 L(t) - b_2 E(t) - b_3 I(t) + b_4 S(t) - \xi_I) \\
S(t) &= i + e \cdot \operatorname{sgn}(i) \\
\Omega(t+1) &= \tanh(S(t)) \\
\mathrm{LORS}(t) &= [E(t)-I(t)]\,e^{-k S^2(t)} + \Omega(t)
\end{aligned}
$$

The Lee Oscillator demonstrates a progressive chaotic growth regime, amplifying responsiveness to small input perturbations without causing gradient blow-up. It combines regions of smooth (tanh-like) and chaotic responses, outperforming traditional activations in capturing rapid, sub-cycle volatility, especially when subjected to extreme system shocks. Eight pre-tuned parameter sets ($T_1, \dots, T_8$) govern these dynamics, with the optimal set validated per dataset.
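The recurrence above translates directly into a short simulation routine. The sketch below iterates the LORS equations for a fixed number of steps; the parameter dictionary layout is an assumption, and the concrete coefficient values of the pre-tuned sets $T_1, \dots, T_8$ are not reproduced here.

import numpy as np

def run_LORS(i, param, steps=100):
    # param bundles one pre-tuned coefficient set (a1..a4, b1..b4, xi_E, xi_I, k, e).
    a1, a2, a3, a4 = param["a"]
    b1, b2, b3, b4 = param["b"]
    xi_E, xi_I, k, e = param["xi_E"], param["xi_I"], param["k"], param["e"]
    S = i + e * np.sign(i)                  # external stimulus derived from the input i
    E = I = L = 0.0
    traj = np.empty(steps)
    for t in range(steps):
        E_next = np.tanh(a1 * L + a2 * E - a3 * I + a4 * S - xi_E)
        I_next = np.tanh(b1 * L - b2 * E - b3 * I + b4 * S - xi_I)
        Omega = np.tanh(S)
        L = (E_next - I_next) * np.exp(-k * S ** 2) + Omega   # LORS output at this step
        E, I = E_next, I_next
        traj[t] = L
    return traj                             # 100-step LORS trajectory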

3. Max-over-Time Pooling and Lambda-Gating

After the internal 100-step simulation of the Lee Oscillator per scalar input, Max-over-Time (MoT) pooling selects the most salient response:

$$
f_T(x) = \max_{1 \leq t \leq 100} \mathrm{LORS}_T(x, t)
$$

For each batch and feature dimension, the optimal oscillator type ($T^*$) is selected by validation. The final activation is then computed as a convex fusion of the chaotic oscillator's peak and the GELU activation:

$$
\mathrm{Act}(x) = \lambda\,\mathrm{GELU}(x) + (1-\lambda)\,f_{\mathrm{Lee}}(x), \quad \lambda \in [0, 1]
$$

Here, λ regulates the trade-off between smoothness (large λ) and sensitivity to chaotic patterns (small λ). MoT preserves the highest-magnitude, potentially rare, oscillatory state, while λ-gating allows explicit control over the chaos-predictability spectrum.
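Expressed tensor-wise, MoT pooling and λ-gating reduce to a max over the simulated trajectory followed by a convex combination. The sketch below assumes the 100-step LORS responses have already been computed for every pre-activation entry; how that simulation is batched is an implementation choice not fixed by the source.

import torch
import torch.nn.functional as F

def mot_lambda_gate(U: torch.Tensor, lors_traj: torch.Tensor, lam: float) -> torch.Tensor:
    # U: pre-activations [batch, length, d_ff]
    # lors_traj: LORS responses per entry [batch, length, d_ff, 100] (assumed precomputed)
    f_lee = lors_traj.max(dim=-1).values            # Max-over-Time pooling over the 100 steps
    return lam * F.gelu(U) + (1.0 - lam) * f_lee    # λ-gated convex fusion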

4. Autoencoder Self-Regressive (ASR) Module for Anomaly Handling

The ASR module serves dual functions: feature denoising and anomaly detection. It is structured as:

  • Encoder: Maps a window of recent inputs to a latent state $z_t = f_{\theta_e}(x_{t-K+1:t})$.
  • Decoder: Attempts to reconstruct the input window $\hat{x}_{t-K+1:t} = g_{\theta_d}(z_t)$.
  • Autoregressive Head: Predicts the next value $\tilde{x}_{t+1} = h_{\theta_a}(z_t, \dots, z_{t-m+1})$.

Training minimizes the combined objective:

$$
\mathcal{L} = \mathcal{L}_{\rm recon} + \alpha\,\mathcal{L}_{\rm AR}
$$

where

$$
\mathcal{L}_{\rm recon} = \frac{1}{K}\sum_{i=1}^K\|x_{t-K+i} - \hat{x}_{t-K+i}\|^2, \qquad \mathcal{L}_{\rm AR} = \|x_{t+1} - \tilde{x}_{t+1}\|^2
$$

Anomaly points are identified where the pointwise reconstruction error $e_t = \|x_t - \hat{x}_t\|$ exceeds a threshold τ (typically a high quantile of the empirical error distribution). During both training and inference, anomalous points are down-weighted or masked, preventing their propagation into the Lee Oscillator dynamics and ensuring prediction robustness.
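A minimal PyTorch sketch of the ASR module, its combined objective, and the anomaly mask follows. The MLP encoder/decoder sizes, the GRU autoregressive head, and the 0.99 quantile threshold are illustrative choices rather than details specified by the source.

import torch
import torch.nn as nn

class ASRModule(nn.Module):
    def __init__(self, d_in: int, d_latent: int, window_K: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in * window_K, 64), nn.ReLU(),
                                     nn.Linear(64, d_latent))
        self.decoder = nn.Sequential(nn.Linear(d_latent, 64), nn.ReLU(),
                                     nn.Linear(64, d_in * window_K))
        self.ar_head = nn.GRU(d_latent, d_in, batch_first=True)   # autoregressive head

    def forward(self, x_window: torch.Tensor, z_history: torch.Tensor):
        # x_window: [batch, K, d_in]; z_history: [batch, m, d_latent]
        b, K, d = x_window.shape
        z = self.encoder(x_window.reshape(b, K * d))               # latent state z_t
        x_hat = self.decoder(z).reshape(b, K, d)                   # reconstruction
        _, h = self.ar_head(z_history)
        x_next = h[-1]                                             # next-step prediction
        return z, x_hat, x_next

def asr_loss(x_window, x_hat, x_next_true, x_next_pred, alpha: float):
    recon = ((x_window - x_hat) ** 2).mean()                       # L_recon
    ar = ((x_next_true - x_next_pred) ** 2).mean()                 # L_AR
    return recon + alpha * ar

def anomaly_mask(x_window, x_hat, q: float = 0.99):
    err = (x_window - x_hat).norm(dim=-1)                          # pointwise error e_t
    tau = torch.quantile(err, q)                                   # high-quantile threshold
    return err <= tau                                              # False flags anomalous points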

5. Implementation, Training, and Stability

COTN’s implementation leverages the following components, with sample pseudocode for key layers:

def COTN_FeedForward(X, λ, T_star):
    # X: [batch, length, d_model]; T_star: validated oscillator parameter set
    U = Linear1(X)                              # [batch, length, d_ff]
    U_flat = U.flatten()
    activated = torch.zeros_like(U_flat)
    for idx, u in enumerate(U_flat):
        LORS_traj = run_LORS(u, param=T_star)   # 100-step oscillator trajectory
        f_lee = max(LORS_traj)                  # Max-over-Time pooling (scalar)
        activated[idx] = λ * GELU(u) + (1 - λ) * f_lee   # λ-gated fusion
    V = Linear2(activated.reshape_as(U))        # [batch, length, d_model]
    return V
A typical forward pass incorporates ASR preprocessing, embedding, N encoder blocks with the modified feed-forward sub-layer, and a single-step or parallel prediction head.

Training methodology:

  • Initial warm-start phase using GELU activation (~20 epochs) before enabling the Lee activation, reducing convergence time by roughly 40% (a sketch of this schedule follows this list).
  • Hyperparameter selection: typically λ = 0.5, batch size 64, learning rate 1e-4; oscillator type ($T^*$) validated per dataset.
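One way to realize the warm-start schedule mentioned above is to drive λ directly: keep it at 1 (pure GELU branch) during the first epochs, then drop it to the validated value. The attribute name, epoch counts, and loop structure below are illustrative assumptions.

def train_cotn(model, loader, optimizer, loss_fn,
               warm_epochs=20, total_epochs=100, lam=0.5):
    for epoch in range(total_epochs):
        # Warm start: λ = 1 disables the Lee branch, leaving pure GELU.
        model.lam = 1.0 if epoch < warm_epochs else lam
        for X, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(X), y)
            loss.backward()
            optimizer.step()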

Stability: The contraction-mapping condition is satisfied if the λ-gated composite activation has a total Lipschitz constant $c_{\rm tot} < 1$:

$$
|\mathrm{Act}(x)-\mathrm{Act}(y)| \leq \lambda\, c_{\rm GELU}\,|x-y| + (1-\lambda)\, c_{\rm Lee}\,|x-y|
$$

This property ensures fixed-point convergence within each residual feed-forward block.
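As a quick numeric check of this condition (the constants below are placeholders: GELU's Lipschitz constant is only slightly above 1, and the Lee-branch constant depends on the chosen oscillator parameter set):

def total_lipschitz(lam: float, c_gelu: float = 1.13, c_lee: float = 0.8) -> float:
    # c_tot = λ * c_GELU + (1 - λ) * c_Lee; contraction requires c_tot < 1.
    return lam * c_gelu + (1.0 - lam) * c_lee

# Example with illustrative constants: total_lipschitz(0.5) = 0.5*1.13 + 0.5*0.8 = 0.965 < 1.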

6. Experimental Evaluation

Extensive benchmarking was performed on both synthetic and real-world volatility datasets:

  • Datasets: ETT-H and ETT-M (electricity; 8,640 and 69,120 samples), 1-minute A-share stock data (17,000+ samples).
  • Preprocessing: Missing data imputation, Z-score outlier removal, ±20% return truncation, feature construction (OHLCV returns, moving averages, volatility bands).
  • Baselines: Informer (deep-learning, Transformer-based), GARCH (statistical, volatility modeling).
  • Performance Metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE).
Dataset          Informer MAE   GARCH MAE   COTN MAE   Informer MSE   GARCH MSE   COTN MSE
ETTh1 (24h)      0.549          0.712       0.515      0.577          0.810       0.530
ETTm2 (48h)      0.614          0.830       0.571      0.689          0.995       0.635
A-share (96m)    1.567          1.892       1.427      3.608          4.212       3.394

COTN achieves up to 17% lower error than Informer and up to 40% lower error than GARCH, validating its effectiveness in capturing nonstationary, high-volatility dynamics.

7. Practical Considerations and Extensions

  • Tuning Recommendations:
    • Begin with GELU-only training; fine-tune with λ-gated Lee activation.
    • Select the optimal oscillator type ($T_1$–$T_8$) using a hold-out validation set.
    • Adjust λ within [0.3, 0.7] according to volatility levels; extreme cases may benefit from a lower λ.
  • Limitations:
    • Increased computational demand and memory footprint due to internal 100-step oscillator simulation, partially alleviated by MoT pooling.
    • Additional complexity stemming from oscillator type and λ as hyperparameters, necessitating careful validation.
  • Future Directions:
    • Learnable, data-driven λ rather than a fixed hyperparameter.
    • End-to-end trainable oscillator parameters or integration into neural ODE frameworks.
    • Application to broader classes of volatile systems (e.g., climate, traffic, cyber-attacks) is plausible based on current robustness results.

The central innovation of COTN lies in the seamless fusion of chaos-theoretic dynamical activation, real-time anomaly isolation, and Transformer-based deep sequence modeling, offering a substantially more responsive and robust tool for time-series forecasting in complex, nonstationary contexts.
