PENGUIN: Periodic-Nested Group Attention
- The paper introduces a periodic-nested relative attention bias that wraps temporal distances by known periods, ensuring seasonally relevant lags are prioritized.
- It employs a grouped multi-query attention structure that assigns head groups to distinct periodicities and non-periodic trends, which disentangles overlapping cycles.
- PENGUIN achieves state-of-the-art MSE and MAE across nine LTSF benchmarks, with significant improvements over prior Transformer and MLP models.
Periodic-Nested Group Attention (PENGUIN) is a Transformer-based self-attention mechanism specifically designed to enhance long-term time series forecasting (LTSF) by effectively modeling strong periodic structures and multiple overlapping cycles within real-world time series data. PENGUIN introduces a periodic-nested relative attention bias and assigns groups of attention heads to distinct periodicities, utilizing a multi-query attention framework to disentangle and capture both periodic and non-periodic temporal dependencies, leading to improved predictive performance on standard LTSF benchmarks (Sun et al., 19 Aug 2025).
1. Motivation and Problem Setting
Long-term time series forecasting (LTSF) aims to predict future values in a window of length $T$ from a historical segment of length $L$. Conventional Transformer architectures, when applied to LTSF, have several limitations: they tend to treat all relative lags uniformly or to employ simple decaying bias schemes such as ALiBi, failing to adequately learn or separate multiple coexisting periodicities (e.g., daily and weekly cycles). This results in two predominant issues:
- Insufficient emphasis on seasonally relevant lags (e.g., a point 24h ago for hourly data).
- Poor disentanglement of overlapping cycles, which hinders accurate modeling of both long-term and short-term patterns.
PENGUIN addresses these limitations by introducing a periodic-nested relative attention bias, which wraps relative distances by known periods, and by grouping attention heads such that each group focuses on a separate period or a non-periodic trend. This design is tailored to both capture strong seasonal structure and preserve the ability to model local, non-periodic dependencies.
2. Periodic-Nested Relative Attention Bias
Let $n$ denote the number of input tokens (after patching). The dataset is assumed to have $m$ known natural periods, collected as $\mathcal{P} = \{P_1, \dots, P_m\}$. After patching the sequence with stride $S$, each period $P_k$ induces a token-level period $p_k = P_k / S$. One additional group is reserved for modeling a non-periodic trend ($G = m + 1$ total groups); the $H$ attention heads are split evenly across groups, with $H/G$ heads per group.
For both periodic and non-periodic groups, the bias matrix is constructed as follows:
Non-Periodic (Linear) Bias
For the non-periodic group, the bias between tokens $i$ and $j$ for head $h$ is

$$b^{(h)}_{i,j} = -\,s_h\,|i - j|,$$

where $s_h$ is a head-specific slope. This recapitulates the ALiBi-style relative bias, which prioritizes locality.
Periodic Bias
For each periodic group associated with token-level period $p_k$, the bias takes the folded form

$$b^{(h)}_{i,j} = -\,s_h\,\min\!\big(d_{ij} \bmod p_k,\; p_k - (d_{ij} \bmod p_k)\big), \qquad d_{ij} = |i - j|.$$

This folds temporal distances modulo the period length, ensuring that positions separated by an integer number of cycles receive zero penalty, thus promoting attention across identical phases of the cycle.
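The two bias forms can be sketched in NumPy. The slope value and the circular folding (taking the minimum of forward and backward phase distance) are illustrative assumptions consistent with the zero-penalty-at-whole-cycles property described above, not necessarily the paper's exact parameterization:

```python
import numpy as np

def linear_bias(n, slope):
    """ALiBi-style bias: penalize tokens by absolute relative distance."""
    d = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    return -slope * d

def periodic_bias(n, period, slope):
    """Fold relative distances modulo the period: zero penalty at whole cycles."""
    d = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    folded = d % period
    circ = np.minimum(folded, period - folded)  # distance to the nearest in-phase lag
    return -slope * circ

# Example: 12 tokens with an assumed token-level period of 3.
B = periodic_bias(12, period=3, slope=0.5)
assert B[0, 3] == 0.0 and B[0, 6] == 0.0  # whole cycles: no penalty
assert B[0, 1] < 0.0                      # off-phase lags are penalized
```

Positions one or two whole cycles apart receive identical (zero) bias, so attention can flow freely between in-phase lags regardless of how many cycles separate them.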
3. Grouped Query Attention Architecture
PENGUIN adopts a grouped multi-query attention structure over token embeddings $X \in \mathbb{R}^{n \times d}$. The $H$ attention heads are partitioned into $G$ groups of $H/G$ heads each.
Attention Computation Within Each Group:
- Compute group-specific key and value projections $K_g = X W_K^{(g)}$ and $V_g = X W_V^{(g)}$, using dedicated linear transformations for each group $g$.
- Each head $h$ belongs to a group $g(h)$, uses slope $s_h$, and has its own query projection $Q_h = X W_Q^{(h)}$.
- The attention logits for head $h$ are $A_h = Q_h K_{g(h)}^{\top} / \sqrt{d_k} + B^{(h)}$, where $B^{(h)}$ is the group's relative bias matrix with slope $s_h$.
- Softmaxed attention weights select the shared values $V_{g(h)}$: $O_h = \operatorname{softmax}(A_h)\, V_{g(h)}$.
Aggregation:
- Within each group $g$, the output representations of its heads are concatenated: $O_g = [\,O_{h_1}; \dots; O_{h_{H/G}}\,]$.
- Group outputs are concatenated and projected with $W_O$: $O = [\,O_1; \dots; O_G\,]\, W_O$.
Transformer Layer Update:
- A residual connection and normalization are applied: $X \leftarrow \operatorname{Norm}(X + O)$, followed by the position-wise feed-forward sublayer.
Remark on Efficiency: The group-wise multi-query (shared key/value) structure improves computational efficiency while maintaining group-specific specialization.
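Putting the pieces together, a minimal NumPy sketch of one grouped multi-query attention layer follows. The sizes, the round-robin head-to-group assignment, and the placeholder bias slopes and periods are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d, H, G = 16, 32, 8, 4          # tokens, width, heads, groups (illustrative)
dk = d // H                         # per-head dimension
X = rng.normal(size=(n, d))

# One K/V projection per group (multi-query sharing), one Q projection per head.
WK = rng.normal(size=(G, d, dk)) / np.sqrt(d)
WV = rng.normal(size=(G, d, dk)) / np.sqrt(d)
WQ = rng.normal(size=(H, d, dk)) / np.sqrt(d)

def group_bias(g, n):
    """Group 0 carries a linear (trend) bias; others fold by assumed periods."""
    i = np.arange(n)
    dmat = np.abs(i[:, None] - i[None, :])
    if g == 0:
        return -0.1 * dmat
    p = [None, 3, 4, 6][g]          # hypothetical token-level periods
    f = dmat % p
    return -0.1 * np.minimum(f, p - f)

heads = []
for h in range(H):
    g = h % G                       # assign heads round-robin to groups
    K, V = X @ WK[g], X @ WV[g]     # key/value shared within the group
    Q = X @ WQ[h]                   # query is head-specific
    A = Q @ K.T / np.sqrt(dk) + group_bias(g, n)
    heads.append(softmax(A) @ V)

O = np.concatenate(heads, axis=-1)  # (n, d); a final W_O projection would follow
```

Because all heads of a group read the same $K_g, V_g$, the layer stores $G$ rather than $H$ key/value projections, which is the source of the efficiency noted above.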
4. Implementation and Training Protocol
PENGUIN is evaluated on nine standard LTSF benchmarks: ETTh1, ETTh2, ETTm1, ETTm2, Electricity, Exchange, Weather, Solar, and Traffic. Each dataset supplies known periodicities, typically daily (24 hours) and weekly (168 hours).
Data Processing:
- Reversible Instance Normalization (RevIN) stabilizes non-stationarity per feature channel.
- Patches of length $P$ (default 16) with stride $S$ (default 8) yield $n = \lfloor (L - P)/S \rfloor + 1$ tokens; with $S < P$, adjacent patches overlap.
- Learnable positional embeddings are provided.
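The preprocessing steps above can be sketched as follows, assuming standard RevIN statistics (per-channel mean and standard deviation) and simple strided slicing; function names are illustrative:

```python
import numpy as np

def revin_normalize(x, eps=1e-5):
    """Per-instance, per-channel normalization; stats are kept to invert later."""
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True) + eps
    return (x - mean) / std, (mean, std)

def revin_denormalize(y, stats):
    """Map model outputs back to the original scale."""
    mean, std = stats
    return y * std + mean

def patch(x, length=16, stride=8):
    """Slice a (L, C) series into (n, length, C) overlapping patches."""
    L = x.shape[0]
    n = (L - length) // stride + 1
    return np.stack([x[i * stride : i * stride + length] for i in range(n)])

x = np.random.default_rng(0).normal(size=(96, 7))  # L=96 steps, 7 channels
z, stats = revin_normalize(x)
tokens = patch(z)                                  # (96 - 16) // 8 + 1 = 11 tokens
```

Forecasts produced in the normalized space are passed back through `revin_denormalize` so that evaluation metrics are computed on the original scale.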
Model Hyperparameters:
- Hidden dimension $d$.
- $H$ attention heads split into $G$ groups ($H/G$ heads per group).
- Encoder depth of $N$ layers.
- Adam optimizer with learning rate $\eta$ and batch size 32.
- Loss: mean squared error (MSE), $\mathcal{L} = \frac{1}{T}\sum_{t=1}^{T} \lVert \hat{y}_t - y_t \rVert_2^2$, averaged over forecast steps and channels.
- RMSNorm is used in place of LayerNorm for increased stability.
- The attention-bias slopes $s_h$ are fixed at initialization rather than learned.
Masking:
- Decoder applies a causal mask (positions may attend only to earlier positions).
- Encoder is unmasked.
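The decoder's causal constraint can be expressed as a standard additive logit mask; this construction is a generic sketch rather than a detail taken from the paper:

```python
import numpy as np

def causal_mask(n):
    """Additive mask: -inf above the diagonal blocks attention to future tokens."""
    m = np.zeros((n, n))
    m[np.triu_indices(n, k=1)] = -np.inf
    return m

# Added to decoder attention logits before the softmax; encoder logits get no mask.
M = causal_mask(4)
assert M[0, 3] == -np.inf and M[3, 0] == 0.0
```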
5. Performance and Comparative Analysis
Tables within (Sun et al., 19 Aug 2025) report that PENGUIN achieves state-of-the-art results across all benchmark datasets and forecast horizons, with mean squared error (MSE) and mean absolute error (MAE) averaged over all nine datasets:
| Model | MSE | MAE |
|---|---|---|
| PENGUIN | 0.300 | 0.330 |
| CATS | 0.319 | 0.339 |
| CycleNet | 0.317 | 0.339 |
| PatchTST | 0.321 | 0.344 |
| DLinear | 0.331 | 0.359 |
| FEDformer | 0.437 | 0.436 |
| Autoformer | 0.523 | 0.481 |
PENGUIN demonstrates a 5.4% reduction in MSE compared to CycleNet (best prior MLP-based model) and a 6.0% reduction compared to CATS (best prior Transformer-based model). On individual datasets (e.g., Traffic), PENGUIN reduces MSE from 0.399 to 0.387 and MAE from 0.276 to 0.262 relative to CycleNet.
A plausible implication is that explicit incorporation of multi-periodicity and dedicated attention head groups yields significant advances in long-horizon forecasting accuracy, marking an advance over both standard MLP and Transformer baselines.
6. Key Innovations and Contributions
PENGUIN introduces two primary methodological advances for LTSF:
- The periodic-nested relative attention bias enables the model to learn and prioritize seasonally relevant dependencies (e.g., strong daily, weekly recurrences) by folding relative lags modulo the natural periods.
- The grouped multi-query attention structure assigns subsets of heads to specific periods (including one linear "expert"), enabling explicit modeling of multiple coexisting cycles while efficiently sharing key/value projections within groups.
This construction not only improves interpretability (by aligning head groups with known periodicities) but also provides parameter efficiency and improved generalization on non-stationary, seasonally structured inputs.
7. Context and Significance
PENGUIN's design responds to shortcomings of prior attention mechanisms, which have struggled to explicitly leverage and disentangle multiple periodic patterns inherent in practical LTSF settings. Its improvements are substantiated by benchmark results across heterogeneous datasets with diverse periodic structures.
By providing a mechanism to directly encode domain-driven cyclic structure into the Transformer attention bias, PENGUIN bridges the methodological gap between general-purpose sequence models and the distinctive needs of time series forecasting. Its empirical superiority over leading MLP (CycleNet) and Transformer (CATS) architectures is consistently observed across standardized metrics and multiple benchmarks (Sun et al., 19 Aug 2025).