
Do Transformers Have the Ability for Periodicity Generalization?

Published 30 Jan 2026 in cs.LG and cs.AI | (2601.22690v1)

Abstract: LLMs based on the Transformer have demonstrated strong performance across diverse tasks. However, current models still exhibit substantial limitations in out-of-distribution (OOD) generalization compared with humans. We investigate this gap through periodicity, one of the basic OOD scenarios. Periodicity captures invariance amid variation; periodicity generalization is a model's ability to extract periodic patterns from training data and generalize them to OOD scenarios. We introduce a unified interpretation of periodicity from the perspective of abstract algebra and reasoning, covering both single and composite periodicity, to explain why Transformers struggle to generalize it. We then construct Coper, a controllable generative benchmark for composite periodicity with two OOD settings, Hollow and Extrapolation. Experiments reveal that periodicity generalization in Transformers is limited: models can memorize periodic data during training but cannot generalize to unseen composite periodicity. We release the source code to support future research.

Summary

  • The paper demonstrates that Transformer models achieve high in-distribution accuracy (~95%) yet fail to generalize composite periodic patterns in out-of-distribution scenarios (dropping to 23%).
  • It employs a group-theoretic framework to differentiate sequence periodicity from rule periodicity, highlighting limitations in RoPE positional encoding for non-commutative invariants.
  • The introduction of the Coper dataset provides a controlled evaluation benchmark, showing that increasing model scale or data density only marginally improves OOD generalization.

The Periodicity Generalization Limits of Transformers

Abstract Algebraic Framework for Periodicity in Reasoning

The paper "Do Transformers Have the Ability for Periodicity Generalization?" (2601.22690) systematically investigates the inability of Transformer-based architectures, including LLMs, to achieve robust periodicity generalization, particularly for composite periodic structures that are straightforward for humans. The work advances a unified interpretation of periodicity grounded in group theory, distinguishing between sequence periodicity (pattern repetition in inputs) and rule periodicity (repeated application of computational rules across positions or values), and generalizing this to composite periodicity where multiple periodic operations interact. The group-theoretic treatment clarifies the types of invariants captured by positional encoding schemes such as RoPE, and rigorously demonstrates why Transformers, without architectural modification, suffer from limitations in generalizing periodic patterns, especially in OOD and compositional settings.

Dataset and Evaluation Design: Coper for Composite Periodicity

To empirically diagnose these limitations, the authors introduce the Coper dataset, a fully generative, controllable benchmark for composite periodicity. Each sample is produced by element-wise modular addition (and, in variants, more general operations such as convolution and alternating add/subtract rules) of periodic sequences with individually parametrized lengths and moduli. The test set construction stratifies samples into in-distribution, Hollow (combinations “interpolating” between seen period pairs), and Extrapolation (outside the training period ranges). Crucially, this allows isolation of models’ ability to interpolate and extrapolate periodic rules, rather than raw memorization of instantiations seen during training.
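The generation scheme described above can be sketched in a few lines. This is an illustrative reconstruction, not the released Coper code: the function names, parameter values, and the choice of NumPy are assumptions; only the core idea (element-wise modular addition of two individually parametrized periodic sequences) is taken from the paper's description.

```python
import numpy as np

def periodic_seq(base, length):
    """Tile one period (`base`) out to a target length."""
    reps = -(-length // len(base))  # ceiling division
    return np.tile(base, reps)[:length]

def coper_sample(period_a, period_b, modulus, length, rng):
    """Illustrative composite-periodicity sample: element-wise
    modular addition of two periodic sequences with distinct periods."""
    a = periodic_seq(rng.integers(0, modulus, period_a), length)
    b = periodic_seq(rng.integers(0, modulus, period_b), length)
    return (a + b) % modulus

rng = np.random.default_rng(0)
x = coper_sample(period_a=3, period_b=5, modulus=10, length=16, rng=rng)
# The composite sequence repeats with period lcm(3, 5) = 15,
# so position 15 must echo position 0.
```

Holding out specific (period_a, period_b) combinations from training then yields the Hollow (interpolation) and Extrapolation splits.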

Analytical and Numerical Evidence: Failure Modes

Evaluation across multiple architectures (vanilla Transformer with RoPE, FANformer, Mamba, RWKV, and TFKAN) exposes a robust inability of existing models to generalize even the simplest composite periodicity outside their training distribution. The numerical results are stark: for the canonical modular addition task, Transformer in-distribution accuracy approaches 95%, while Hollow and Extrapolation accuracies plummet to 30% and 23%, respectively, even for models such as FANformer that were specifically engineered for time-series periodicity. Mamba and RWKV fit the training distribution less well and do not improve OOD generalization. These results are stable across increased data densities (shrinking interpolation hollows), increased model scale (deeper/wider networks), and alternate composite rules (circular convolution, add/subtract alternation); none of these strategies bridges the gap in OOD or compositional periodicity generalization.

Theoretical Underpinning: Relative vs. Rule Invariance

Central to the analysis is the distinction between what rotary positional encodings (RoPE) encode, namely relative compositional invariance, and what periodic generalization in algorithmic or mathematical tasks demands: rule periodicity invariance that is not encoded by relative positional differences alone. The authors formalize, via group actions, that for sequence periodicity (invariance to translation: f(t + T) = f(t)), RoPE allows the model to exploit shift invariance, and thus interpolate/extrapolate seen period patterns. In contrast, for rule periodicity (e.g., position-wise modular operations with an independent period), the group actions involved are non-commutative and cannot be captured by functions of relative positions. Proof by counterexample, together with extensive empirical results, validates this claim: RoPE-augmented Transformers fail in scenarios that do not reduce to relative sequence translation invariance, including all composite periodicity benchmarks advanced in the paper.
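The two notions of invariance can be made concrete with a small numerical check. The arrays and moduli below are illustrative choices, not values from the paper:

```python
import numpy as np

# Sequence periodicity: the *input* itself repeats, so a positional shift
# by the period T maps the sequence onto itself -- a translation invariance
# that a relative positional encoding like RoPE can exploit.
T = 4
seq = np.tile([1, 3, 0, 2], 5)
assert np.array_equal(seq[T:], seq[:-T])  # f(t + T) = f(t)

# Rule periodicity: what repeats is the *computational rule*, here modular
# addition with modulus m. The invariance lives in value space: adding m to
# an operand leaves the output unchanged at every position, and no function
# of relative positional differences expresses this.
m = 7
a, b = np.array([2, 5, 6]), np.array([4, 4, 3])
out = (a + b) % m
assert np.array_equal(((a + m) + b) % m, out)  # invariant under +m, not under shifts
```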

Implications for Model Scaling and Data Regime

Extensive scaling studies show that increasing either the density of observed period pairs (data sampling) or capacity (layers/parameters) yields only incremental improvement in the Hollow regime, and negligible improvement in Extrapolation. This quantitatively demonstrates that periodicity generalization is not a question of effective capacity or data coverage but an architectural limitation. As the combinatorial space of composite period pairs grows, saturating performance through brute-force data or model expansion becomes computationally intractable.
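The intractability claim can be illustrated with a quick count. The period ranges below are arbitrary, chosen only to show the quadratic growth in the number of composite period pairs a brute-force data regime would have to cover:

```python
from math import comb

# Number of unordered (T1, T2) period pairs as the admissible period range
# grows; data coverage of the composite space scales quadratically, before
# even accounting for moduli and base-pattern choices.
pairs = {max_period: comb(max_period, 2) for max_period in (10, 100, 1000)}
```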

Robustness Across Periodicity Task Variations

Demonstration of these failures is not limited to single composition rules or representations. The same limitations hold for tasks involving circular convolution (another canonical signal-processing operation with well-defined periodic signatures), as well as hidden (distributional, not pointwise) periodicity such as learning y = sin(x) over tokenized input representations. Even advanced models such as TFKAN, designed for nonlinear and frequency-domain periodicity, do not overcome the OOD failures in these settings.
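For reference, the circular convolution rule can be sketched as follows. This is a standard signal-processing identity (the DFT convolution theorem), not code from the paper, and the test sequences are arbitrary:

```python
import numpy as np

def circular_convolve(a, b):
    """Circular convolution of two equal-length real sequences via the DFT:
    pointwise multiplication in the frequency domain."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

a = np.array([1.0, 2.0, 0.0, 1.0])
b = np.array([0.0, 1.0, 1.0, 0.0])
c = circular_convolve(a, b)

# Cross-check against the direct definition:
# c[n] = sum_k a[k] * b[(n - k) mod N]
N = len(a)
direct = np.array([sum(a[k] * b[(n - k) % N] for k in range(N))
                   for n in range(N)])
assert np.allclose(c, direct)
```

Like modular addition, the output inherits a composite period from its operands, which is what makes it a natural alternate rule for the benchmark.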

Theoretical and Practical Implications

This research underscores a significant theoretical gap in the inductive biases of current Transformer-based models. While these models are capable of strong in-distribution performance due to memorization and interpolation over seen data, they lack the abstract compositional reasoning necessary for generalizing invariants under group actions corresponding to complex periodic structure. The results suggest that architectural changes are necessary to support rule-based compositionality and OOD generalization, potentially via external reasoning modules, rule-based inductive biases, or more advanced position and transformation encodings.

Practically, any application requiring inference, prediction, or completion in combinatorially novel but algorithmically simple periodic regimes (e.g., in mathematical, physical, or engineering domains) will be fundamentally limited by the use of unmodified Transformer-based systems.

Future Directions

There is substantial room for future work: (1) extending the group-theoretic approach to a broader class of reasoning and compositional generalization tasks; (2) integrating external or inductive reasoning mechanisms such as chain-of-thought or rule induction; (3) designing new architectures capable of explicitly representing and inferring periodic compositional rules, possibly with modifications at the attention or positional encoding level; and (4) evaluating these developments with precise OOD generalization benchmarks as in Coper.

Conclusion

This paper achieves a rigorous, formal and empirical demonstration of the inability of present-day Transformer-based models and their variants to generalize composite periodicity, even when in-distribution generalization is strong. Both theoretical arguments—grounded in group theory and the nature of Transformer positional encodings—and extensive empirical analyses converge on the same conclusion: architectural limitations preclude true periodicity generalization in OOD and compositional settings. Overcoming this limitation will require foundational modification of the underlying model architectures and their inductive biases.
