ParallelTime Weighter: Adaptive Fusion for Forecasting

Updated 21 July 2025
  • ParallelTime Weighter is a dynamic mechanism that adaptively assigns weights to integrate short-term local attention and long-term state-space signals.
  • It enhances forecasting accuracy and reduces computational overhead by learning per-token fusion in a parallel architecture.
  • By combining windowed attention outputs with Mamba state-space representations, it achieves robust multivariate time series predictions across diverse datasets.

A ParallelTime Weighter is a dynamic mechanism and architectural concept employed in modern time series analysis—especially in multivariate time series forecasting—that explicitly assigns adaptive, per-token weights to multiple forms of temporal dependencies (short-term and long-term). This mechanism was introduced within the ParallelTime architecture to optimize the aggregation of outputs from local attention mechanisms and state-space models (notably Mamba), achieving superior forecasting accuracy, parameter efficiency, and scalability to long prediction horizons (Katav et al., 18 Jul 2025).

1. Dynamic Dependency Weighting Mechanism

The ParallelTime Weighter addresses the challenge that, in sequential data, the relative importance of short-term (local) and long-term dependencies can vary significantly across positions (tokens) within the input sequence. Prior approaches, such as those used in natural language processing, often aggregate the outputs of a windowed attention mechanism and a state-space model (like Mamba) by simple averaging, thereby assigning a fixed 50–50 weighting to short- and long-term signals. However, empirical findings indicate that equal weighting is suboptimal in time series forecasting.
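To make the contrast concrete, the following minimal PyTorch sketch shows the static baseline that the weighter replaces: every token receives the same fixed 50-50 blend of the two signals. The tensor shapes and variable names are illustrative, not taken from the paper's code.

```python
import torch

# Illustrative shapes: P patch tokens, embedding dimension d.
# x_att and x_mamba stand in for the windowed-attention and Mamba outputs.
P, d = 64, 128
x_att = torch.randn(P, d)
x_mamba = torch.randn(P, d)

# Static fusion used by prior attention-Mamba hybrids: a fixed 50-50 average,
# i.e. every token weighs short- and long-term context identically.
x_fused_static = 0.5 * (x_att + x_mamba)
```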

The core innovation of the ParallelTime Weighter is its dynamic, input-dependent computation of the weights assigned to each dependency type at each token. For each patch or token, the mechanism generates interdependent weights determining the blend between the local-attention response (capturing recent, short-horizon context) and the Mamba state-space response (capturing extended, long-horizon context).

Mathematically, for a given token, let $x_{\text{att}}$ and $x_{\text{mamba}}$ denote its windowed attention and Mamba outputs. The process is as follows:

  1. Normalization and Compression:

$$x_{\text{att}}' = \text{RMSNorm}(x_{\text{att}}) \cdot W_{\text{att}}$$

$$x_{\text{mamba}}' = \text{RMSNorm}(x_{\text{mamba}}) \cdot W_{\text{mamba}}$$

where $W_{\text{att}}$ and $W_{\text{mamba}}$ are learned projections mapping from the embedding dimension $d$ down to $\sqrt{d}$.

  2. Concatenation and Two-Layer Transformation:

$$x_{\text{cat}}' = \text{Concat}(x_{\text{att}}', x_{\text{mamba}}') \in \mathbb{R}^{P \times 2\sqrt{d}}$$

$$x_{\text{weights}} = \sigma\!\left( \text{ReLU}(x_{\text{cat}}' W_1)\, W_2 \right)$$

with $W_1 \in \mathbb{R}^{2\sqrt{d} \times d_h}$, $W_2 \in \mathbb{R}^{d_h \times 2}$, $d_h$ a hyperparameter, and $\sigma$ the sigmoid activation.

  3. Output Fusion:

$$x_{\text{out}} = x_{\text{att}} \cdot x_{\text{weight}}^{(\text{att})} + x_{\text{mamba}} \cdot x_{\text{weight}}^{(\text{mamba})}$$

where $[x_{\text{weight}}^{(\text{att})}, x_{\text{weight}}^{(\text{mamba})}]$ are the two components of $x_{\text{weights}}$.

The choice of sigmoid over softmax for activation was found empirically to perform better for dependency fusion in this context.
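The three steps above can be put together in a short PyTorch sketch. This is an illustrative reimplementation of the formulas, not the authors' released code; the class name ParallelTimeWeighter, the bias-free linear layers, the hidden width d_h, and the gain-only RMSNorm variant are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Gain-only RMS normalization (the exact variant is an assumption)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class ParallelTimeWeighter(nn.Module):
    """Per-token fusion of windowed-attention and Mamba outputs (sketch)."""
    def __init__(self, d: int, d_h: int = 32):
        super().__init__()
        d_c = max(1, math.isqrt(d))                    # compressed width ~ sqrt(d)
        self.norm_att, self.norm_mamba = RMSNorm(d), RMSNorm(d)
        self.w_att = nn.Linear(d, d_c, bias=False)     # W_att:   d -> sqrt(d)
        self.w_mamba = nn.Linear(d, d_c, bias=False)   # W_mamba: d -> sqrt(d)
        self.w1 = nn.Linear(2 * d_c, d_h, bias=False)  # W_1: 2*sqrt(d) -> d_h
        self.w2 = nn.Linear(d_h, 2, bias=False)        # W_2: d_h -> 2 weights

    def forward(self, x_att: torch.Tensor, x_mamba: torch.Tensor) -> torch.Tensor:
        # 1. Normalization and compression of both dependency signals.
        a = self.w_att(self.norm_att(x_att))
        m = self.w_mamba(self.norm_mamba(x_mamba))
        # 2. Concatenation and two-layer transformation; sigmoid (not softmax)
        #    yields two per-token weights in (0, 1).
        w = torch.sigmoid(self.w2(F.relu(self.w1(torch.cat([a, m], dim=-1)))))
        # 3. Output fusion: blend the original, uncompressed outputs.
        return x_att * w[..., 0:1] + x_mamba * w[..., 1:2]


# Usage: a batch of 8 series, 64 patch tokens, embedding dimension d = 128.
weighter = ParallelTimeWeighter(d=128)
x_out = weighter(torch.randn(8, 64, 128), torch.randn(8, 64, 128))
```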

2. Integration into the ParallelTime Architecture

As implemented, the ParallelTime Weighter is a pivotal component within the broader ParallelTime architecture—a decoder-only sequence model for time series forecasting. The core architectural strategy is to run both a windowed self-attention block and a Mamba (state-space) block in parallel for each input patch. Their outputs are subsequently fused not by a static operation (such as mean or sum), but using the ParallelTime Weighter, which assigns context-adaptive fusion weights on a per-token basis.

Key design elements surrounding the weighter include:

  • Dual Patch Embedding: Inputs are linearly projected for global context and convolved for local trend detection.
  • Global Registers: Special consolidated tokens provide persistent domain-specific information to the attention paths.
  • Efficient Output Layers: The architecture concludes with a dimension reduction and residual connections to minimize parameter count and computational footprint.

Through this design, the model effectively learns which dependency type (short-term or long-term) to trust more for each prediction point, adjusting dynamically in response to both input features and higher-level learned knowledge.
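A block-level sketch of this integration follows. It reuses the ParallelTimeWeighter and RMSNorm sketches above; the windowed attention and Mamba sub-layers are passed in as placeholders, and the residual and normalization placement are plausible simplifications rather than the published layer layout.

```python
import torch
import torch.nn as nn


class ParallelTimeBlock(nn.Module):
    """One decoder block (sketch): attention and Mamba run in parallel on the
    same patch tokens, and the weighter decides their per-token blend."""
    def __init__(self, d: int, local_attention: nn.Module, mamba: nn.Module):
        super().__init__()
        self.norm = RMSNorm(d)                    # from the sketch above
        self.local_attention = local_attention    # short-term, windowed path
        self.mamba = mamba                        # long-term, state-space path
        self.weighter = ParallelTimeWeighter(d)   # adaptive per-token fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        x_att = self.local_attention(h)   # recent, local context
        x_mamba = self.mamba(h)           # extended, global context
        # Adaptive fusion replaces a static mean; the residual keeps the block stable.
        return x + self.weighter(x_att, x_mamba)
```

In this sketch, any two modules that map a (batch, patches, d) tensor to the same shape can serve as the parallel paths, for example a sliding-window attention layer and a Mamba layer from an external package.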

3. Empirical Performance and Robustness

ParallelTime, equipped with the ParallelTime Weighter, was benchmarked on eight standard multivariate time series datasets (including Electricity, Weather, Illness, Traffic, ETTh1, ETTh2, ETTm1, ETTm2) with varying prediction horizons (e.g., 96, 192, 336, 720 for most, shorter for Illness), using fixed look-back windows of $L = 512$.

Performance metrics included Mean Squared Error (MSE) and Mean Absolute Error (MAE); a minimal sketch of how these two metrics are computed appears after the list below. Across all benchmarks, ParallelTime consistently achieved lower error than competing state-of-the-art models such as PatchTST, FEDformer, and DLinear. The model also demonstrated advantages in efficiency:

  • Inference and training FLOPs were reduced by approximately 35–44% compared to PatchTST.
  • Parameter counts were reduced by up to 86% in some scenarios.
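For reference, the two error metrics are computed as in the sketch below; the specific numbers reported above come from the paper, not from this code, and the ETT-style tensor shapes are illustrative.

```python
import torch


def mse(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    """Mean Squared Error averaged over batch, horizon, and variables."""
    return torch.mean((y_pred - y_true) ** 2)


def mae(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    """Mean Absolute Error averaged over batch, horizon, and variables."""
    return torch.mean(torch.abs(y_pred - y_true))


# Example: a 96-step horizon over 7 variables (an ETT-style shape).
y_true = torch.randn(32, 96, 7)
y_pred = y_true + 0.1 * torch.randn_like(y_true)
print(f"MSE={mse(y_pred, y_true):.4f}  MAE={mae(y_pred, y_true):.4f}")
```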

Extensive ablation and robustness checks (including altering model depth, patch sizes, and random seeds) confirmed both the practical gains of the dynamic weighting approach and the model’s generalization stability.

4. Comparison to Prior and Contemporary Approaches

The primary distinction of the ParallelTime Weighter, relative to prior architectures, is its context-driven, input-adaptive balancing of dependency signals. Earlier models that aggregate attention and Mamba outputs typically used a fixed (e.g., mean) or globally static fusion, making them less adaptable to variability across positions in the input.

Advantages evidenced by the ParallelTime approach include:

  • Granular Adaptivity: Weightings are computed per token, allowing the model to privilege local or global patterns as the signal warrants.
  • Resource Efficiency: The use of compression layers and efficient fusion yields superior or equivalent prediction performance using significantly fewer parameters and computations.
  • Long-Horizon Scalability: The method retains forecasting accuracy as windows and prediction horizons increase, a critical attribute for wide-ranging applications.

This dynamic weighting mechanism enables nuanced temporal modeling that is less likely to overfit to patterns seen during training and more robust to regime shifts in time series data.

5. Practical Applications and Usage

The ParallelTime Weighter supplies direct utility for practitioners in domains requiring high-quality, low-latency multivariate time series predictions. Application scenarios include:

  • Financial forecasting: Adaptive dependency weighting can help in settings where either recent events (short-term) or economic cycles (long-term) episodically dominate.
  • Energy load and grid monitoring: Locally auto-correlated fluctuations (e.g., daily demand cycles) and global events (seasons, policy) are captured via the dynamic fusion.
  • Healthcare and epidemiology: Time series with heterogeneous granularity (e.g., symptom spikes vs. slow-moving trends) benefit from token-wise adjusted dependency weights.

The low computational cost and parameter count make the architecture suitable for deployment in resource-constrained environments and real-time systems.

6. Future Directions and Prospects

Several forward-looking extensions are discussed:

  • Architectural scaling: Increasing the number of layers or adapting the model for higher-dimensional or multi-task settings.
  • Task expansion: Adapting the ParallelTime Weighter for anomaly detection, sequence classification, and multi-step or probabilistic forecasting.
  • Foundation model potential: The architecture’s design may serve as a core for future generalized time series models, supporting downstream fine-tuning for a variety of real-world analytics tasks.
  • Real-time systems: Given its efficiency, deploying ParallelTime Weighter-empowered models in streaming or latency-critical contexts is considered a viable direction.

7. Summary Table of Core Mechanism

| Component | Role in Weighter | Mathematical Operation |
| --- | --- | --- |
| $x_{\text{att}}$ / $x_{\text{mamba}}$ | Short-/long-term representations | Produced by windowed attention / Mamba |
| RMSNorm + $W_{\text{att}}$ / $W_{\text{mamba}}$ | Normalize & compress | $x' = \mathrm{RMSNorm}(x)\, W$ |
| Concatenation + two-layer network | Fuse & adaptively weigh | $\sigma(\mathrm{ReLU}(\mathrm{Concat}(x_{\text{att}}', x_{\text{mamba}}')\, W_1)\, W_2)$ |
| Output fusion | Final mixed signal | $x_{\text{out}} = x_{\text{att}} \cdot w_{\text{att}} + x_{\text{mamba}} \cdot w_{\text{mamba}}$ |

In summary, the ParallelTime Weighter is an adaptive, dynamically computed fusion mechanism central to achieving state-of-the-art results and resource efficiency in time series forecasting. Its per-token weighting strategy for short- and long-term dependencies is empirically validated across diverse datasets, architectures, and operating conditions, and opens avenues for further innovation in sequential, heterogeneous, and resource-constrained temporal modeling contexts (Katav et al., 18 Jul 2025).
