
PENGUIN: Periodic-Nested Group Attention

Updated 23 January 2026
  • The paper introduces a periodic-nested relative attention bias that wraps temporal distances by known periods, ensuring seasonally relevant lags are prioritized.
  • It employs a grouped multi-query attention structure that assigns head groups to distinct periodicities and non-periodic trends, which disentangles overlapping cycles.
  • PENGUIN achieves state-of-the-art reductions in MSE and MAE across nine LTSF benchmarks, demonstrating significant improvements over prior Transformer and MLP models.

Periodic-Nested Group Attention (PENGUIN) is a Transformer self-attention mechanism designed to enhance long-term time series forecasting (LTSF) by modeling the strong periodic structure and multiple overlapping cycles found in real-world time series. PENGUIN introduces a periodic-nested relative attention bias and assigns groups of attention heads to distinct periodicities within a multi-query attention framework, disentangling periodic from non-periodic temporal dependencies and improving predictive performance on standard LTSF benchmarks (Sun et al., 19 Aug 2025).

1. Motivation and Problem Setting

Long-term time series forecasting (LTSF) aims to predict future values in a window of length H from a historical segment of length L. Conventional Transformer architectures, when applied to LTSF, have several limitations: they tend to treat all relative lags uniformly, or they employ simple decaying bias schemes such as ALiBi, failing to adequately learn or separate multiple coexisting periodicities (e.g., daily and weekly cycles). This results in two predominant issues:

  • Insufficient emphasis on seasonally relevant lags (e.g., a point 24h ago for hourly data).
  • Poor disentanglement of overlapping cycles, which hinders accurate modeling of both long-term and short-term patterns.

PENGUIN addresses these limitations by introducing a periodic-nested relative attention bias, which wraps relative distances by known periods, and by grouping attention heads such that each group focuses on a separate period or a non-periodic trend. This design is tailored to both capture strong seasonal structure and preserve the ability to model local, non-periodic dependencies.

2. Periodic-Nested Relative Attention Bias

Let N denote the number of input tokens (after patching). The dataset is assumed to have p known natural periods, collected as P = {P_1, ..., P_p}. After patching the sequence with stride S, each period P_r induces a token-level period P_S^(r) = P_r / S. One additional group is reserved for modeling the non-periodic trend, giving g = p + 1 groups in total; the h attention heads are split evenly across groups, with n = h/g heads per group.
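As a concrete illustration of this bookkeeping (the specific values here are hypothetical: hourly data with daily and weekly periods, stride S = 8, h = 12 heads):

```python
# Hypothetical setup: hourly data with known daily (24) and weekly (168) periods.
periods = [24, 168]      # natural periods P_1, ..., P_p, in time steps
S = 8                    # patching stride
h = 12                   # total number of attention heads

token_periods = [P / S for P in periods]  # token-level periods P_S^(r)
g = len(periods) + 1     # one extra group reserved for the non-periodic trend
n = h // g               # heads per group

print(token_periods)     # [3.0, 21.0]
print(g, n)              # 3 4
```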

For both periodic and non-periodic groups, the bias matrix is constructed as follows:

Non-Periodic (Linear) Bias

For the non-periodic group, the bias between tokens i and j for head k is

B_ij^(k) = -m_k |i - j|,  with slope m_k = 2^(-8/k),  k = 1, ..., n.

This recapitulates the ALiBi-style relative bias, prioritizing locality.

Periodic Bias

For each periodic group associated with token-level period P_S^(r), define the wrapped distance

u = |i - j| mod P_S^(r).

The nested bias folds this distance about the half-period,

B̂_ij^(k) = u             if u < P_S^(r) / 2,
B̂_ij^(k) = P_S^(r) - u   if u ≥ P_S^(r) / 2,

and the final bias is

B_ij^(k) = -m_k B̂_ij^(k).

This folds temporal distances modulo the period length, ensuring that positions separated by an integer number of cycles receive zero penalty, thus promoting attention across identical phases of the cycle.
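A minimal NumPy sketch of both bias constructions (function names and the example values are my own; the paper's exact implementation may differ):

```python
import numpy as np

def linear_bias(N, m):
    """ALiBi-style bias for the non-periodic group: -m * |i - j|."""
    idx = np.arange(N)
    return -m * np.abs(idx[:, None] - idx[None, :])

def periodic_bias(N, m, period):
    """Periodic-nested bias: fold |i - j| modulo the token-level period,
    so positions a whole number of cycles apart incur zero penalty."""
    idx = np.arange(N)
    u = np.abs(idx[:, None] - idx[None, :]) % period
    folded = np.where(u < period / 2, u, period - u)
    return -m * folded

B = periodic_bias(8, 1.0, 4)
print(B[0, 4])  # 0.0  -> exactly one period apart: no penalty
print(B[0, 2])  # -2.0 -> half a period apart: maximum penalty
```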

3. Grouped Query Attention Architecture

PENGUIN adopts a grouped multi-query attention structure over token embeddings X ∈ R^(N×d). The h attention heads are partitioned into g groups of n heads each.

Attention Computation Within Each Group:

  • Compute group-specific key and value projections K^r, V^r using dedicated linear transformations for each group r.
  • Each head H = 1, ..., h belongs to group r = ⌈H/n⌉ and uses slope m_k, where k ∈ {1, ..., n} is the head's index within its group.
  • The attention logits for head HH are:

A_H = Q^H (K^r)^⊤ / √(d_h) + B_r^(k)

  • Softmaxed attention weights α_H are applied to the shared values V^r:

α_H = Softmax(A_H)

head_H = α_H V^r

Aggregation:

  • Within each group r, the output representations from all n heads are concatenated:

Y^r = [head_((r-1)n+1) ‖ ... ‖ head_(rn)]

  • Group outputs are concatenated and projected with W_O:

PENGUIN(X) = Concat(Y^1, ..., Y^g) W_O

Transformer Layer Update:

  • A residual and normalization scheme is applied:

X' = X + RMSNorm(PENGUIN(X; P_S))

X'' = X' + RMSNorm(FeedForward(X'))

Remark on efficiency: the group-wise multi-query (shared key/value) structure reduces parameter count and computation relative to fully independent heads, while maintaining group-specific specialization.
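The grouped computation above can be sketched in NumPy as follows (a minimal illustration with random weights and an assumed precomputed bias tensor; dimensions are small placeholder values, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 16, 32            # tokens, model dimension (toy values)
h, g = 12, 3             # heads, groups (e.g. 2 periods + 1 trend group)
n, d_h = h // g, d // h  # heads per group, per-head dimension

def softmax(a):
    a = a - a.max(-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(-1, keepdims=True)

X  = rng.standard_normal((N, d))
Wq = rng.standard_normal((h, d, d_h))  # per-head query projections
Wk = rng.standard_normal((g, d, d_h))  # key projection shared within each group
Wv = rng.standard_normal((g, d, d_h))  # value projection shared within each group
Wo = rng.standard_normal((h * d_h, d))
bias = rng.standard_normal((h, N, N))  # stands in for the periodic/linear biases

heads = []
for H in range(h):
    r = H // n                              # group index of this head
    Q = X @ Wq[H]                           # (N, d_h)
    K, V = X @ Wk[r], X @ Wv[r]             # shared K, V within group r
    A = Q @ K.T / np.sqrt(d_h) + bias[H]    # logits with relative bias
    heads.append(softmax(A) @ V)            # (N, d_h)

out = np.concatenate(heads, axis=-1) @ Wo   # (N, d)
```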

4. Implementation and Training Protocol

PENGUIN is evaluated on nine standard LTSF benchmarks: ETTh1, ETTh2, ETTm1, ETTm2, Electricity, Exchange, Weather, Solar, and Traffic. Each dataset supplies known periodicities, typically daily (24 hours) and weekly (168 hours).

Data Processing:

  • Reversible Instance Normalization (RevIN) stabilizes non-stationary inputs per feature channel.
  • Patches of length P (default 16) with stride S (default 8) yield N tokens.
  • Learnable positional embeddings are added.
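A minimal sketch of this preprocessing, assuming a standard RevIN-style per-channel normalization and simple strided patching (helper names and the example shape are my own):

```python
import numpy as np

def revin_normalize(x, eps=1e-5):
    """Per-channel instance normalization; stats are returned so the
    transform can be inverted on the model's predictions."""
    mean = x.mean(0, keepdims=True)
    std = x.std(0, keepdims=True) + eps
    return (x - mean) / std, (mean, std)

def patch(x, P=16, S=8):
    """Slice an (L, C) series into patches of length P with stride S."""
    L = x.shape[0]
    starts = range(0, L - P + 1, S)
    return np.stack([x[s:s + P] for s in starts])  # (N, P, C)

x = np.random.default_rng(1).standard_normal((96, 7))  # L=96, 7 channels
z, stats = revin_normalize(x)
tokens = patch(z)
print(tokens.shape)  # (11, 16, 7): N=11 tokens of length P=16
```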

Model Hyperparameters:

  • Hidden dimension d = 256.
  • h = 12 heads, g = p + 1 groups, n = h/g heads per group.
  • Encoder depth of 3–4 layers.
  • Adam optimizer, learning rate 1 × 10^-4, batch size 32.
  • Loss: mean squared error (MSE):

L_MSE = (1/H) Σ_{t=1}^{H} ‖x̂_t - x_t‖_2^2

  • RMSNorm is used in place of LayerNorm for increased stability.
  • Attention bias slopes m_k are fixed at initialization rather than learned.
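As a minimal numeric check of the loss, averaged over the horizon H (toy values of my own):

```python
import numpy as np

H = 4
x_hat = np.array([[1.0], [2.0], [3.0], [4.0]])  # predictions over horizon H
x     = np.array([[1.0], [2.0], [3.0], [6.0]])  # targets

# (1/H) * sum_t ||x_hat_t - x_t||_2^2
mse = np.sum((x_hat - x) ** 2) / H
print(mse)  # 1.0  (only the last step errs, by 2, so 4/4 = 1)
```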

Masking:

  • Decoder applies a causal mask (positions may attend only to earlier positions).
  • Encoder is unmasked.
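The causal mask can be realized as an additive term on the attention logits; a standard sketch (not specific to the paper):

```python
import numpy as np

def causal_mask(N):
    """Additive mask: position i may attend only to positions j <= i.
    Blocked entries get -inf, so they vanish after the softmax."""
    allowed = np.tril(np.ones((N, N), dtype=bool))
    return np.where(allowed, 0.0, -np.inf)

M = causal_mask(4)
print(M[0, 1])  # -inf (future position blocked)
print(M[2, 1])  # 0.0  (past position allowed)
```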

5. Performance and Comparative Analysis

Tables in (Sun et al., 19 Aug 2025) report that PENGUIN achieves state-of-the-art results across all benchmark datasets and forecast horizons (H ∈ {96, 192, 336, 720}); the figures below are mean squared error (MSE) and mean absolute error (MAE) averaged over all nine datasets:

| Model      | MSE   | MAE   |
|------------|-------|-------|
| PENGUIN    | 0.300 | 0.330 |
| CATS       | 0.319 | 0.339 |
| CycleNet   | 0.317 | 0.339 |
| PatchTST   | 0.321 | 0.344 |
| DLinear    | 0.331 | 0.359 |
| FEDformer  | 0.437 | 0.436 |
| Autoformer | 0.523 | 0.481 |

PENGUIN demonstrates a 5.4% reduction in MSE compared to CycleNet (best prior MLP-based model) and a 6.0% reduction compared to CATS (best prior Transformer-based model). On individual datasets (e.g., Traffic), PENGUIN reduces MSE from 0.399 to 0.387 and MAE from 0.276 to 0.262 relative to CycleNet.

A plausible implication is that explicitly incorporating multi-periodicity through dedicated attention-head groups is what drives the gains in long-horizon forecasting accuracy over both standard MLP and Transformer baselines.

6. Key Innovations and Contributions

PENGUIN introduces two primary methodological advances for LTSF:

  • The periodic-nested relative attention bias enables the model to learn and prioritize seasonally relevant dependencies (e.g., strong daily, weekly recurrences) by folding relative lags modulo the natural periods.
  • The grouped multi-query attention structure assigns subsets of heads to specific periods (including one linear "expert"), enabling explicit modeling of multiple coexisting cycles while efficiently sharing key/value projections within groups.

This construction not only improves interpretability (by aligning head groups with known periodicities) but also provides parameter efficiency and improved generalization on non-stationary, seasonally-structured inputs.

7. Context and Significance

PENGUIN's design responds to shortcomings of prior attention mechanisms, which have struggled to explicitly leverage and disentangle multiple periodic patterns inherent in practical LTSF settings. Its improvements are substantiated by benchmark results across heterogeneous datasets with diverse periodic structures.

By providing a mechanism to directly encode domain-driven cyclic structure into the Transformer attention bias, PENGUIN bridges the methodological gap between general-purpose sequence models and the distinctive needs of time series forecasting. Its empirical superiority over leading MLP (CycleNet) and Transformer (CATS) architectures is consistently observed across standardized metrics and multiple benchmarks (Sun et al., 19 Aug 2025).
