TimesFM-2.5: Patch-Based Transformer for Forecasting

Updated 29 June 2026

TimesFM-2.5 is a patch-based decoder-only transformer designed for time series forecasting, pretrained via next-patch regression on 100 billion time points.
It processes fixed-length context windows split into non-overlapping patches using linear embedding, causal transformer stacks, and autoregressive decoding for multi-step predictions.
Its performance is enhanced through GITCO, a system that mitigates context poisoning by dynamically intervening on anomalous patches to improve forecasting accuracy.

TimesFM-2.5 is a patch-based decoder-only transformer architecture designed as a foundation model for time series forecasting. It is pretrained on a large, multi-domain corpus comprising both real-world and synthetic time series, and evaluated for both generic forecasting and financial return prediction tasks. TimesFM-2.5 has been the subject of comprehensive benchmarking and is notable for its patching strategy, inductive prior interpretation, and susceptibility to context poisoning, motivating downstream interventions such as GITCO. This entry details TimesFM-2.5’s architecture, pretraining regime, inference properties, empirical performance, context sensitivity, and principal methodological considerations (Pandey et al., 3 Jun 2026, Alonso et al., 25 Jun 2026).

1. Architectural Structure

TimesFM-2.5 operates on fixed-length context windows of size $L=512$, partitioned into $N=16$ non-overlapping patches of length $P=32$, yielding an input tensor

$X = [p_1, \ldots, p_N] \in \mathbb{R}^{N \times P}, \quad p_i \in \mathbb{R}^P.$
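
As a minimal sketch of the patching step described above, the following splits a context window of length 512 into 16 non-overlapping patches of length 32. The function name `split_into_patches` and the toy series are illustrative assumptions, not part of any TimesFM API.

```python
def split_into_patches(context, patch_len=32):
    """Split a 1-D context window into non-overlapping patches."""
    if len(context) % patch_len != 0:
        raise ValueError("context length must be a multiple of patch_len")
    return [context[i:i + patch_len] for i in range(0, len(context), patch_len)]


# Toy series standing in for a real context window (L = 512).
context = [float(t) for t in range(512)]
patches = split_into_patches(context)  # N = 16 patches, each of length P = 32
print(len(patches), len(patches[0]))
```

Each patch is a contiguous slice, so concatenating the patches in order recovers the original window.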

Each patch is processed as follows:

Linear Embedding: Each $p_i$ is mapped into a $d$-dimensional token space.
Patch-Level Positional Encoding: Patch indices are encoded, ensuring order-awareness at the patch granularity.
Input Residual Block: Each patch token is normalized, potentially masked for padding, and passed through an MLP; the output is summed with the positional encoding.
Causal Transformer Stack: A stack of $L_{\mathrm{blocks}}$ transformer layers processes the sequence $(t_1, \ldots, t_N)$. Each layer alternates masked self-attention and feed-forward sublayers with residual connections and normalization. Multi-head attention is used with $H$ heads, head dimension $d_k = d/H$.
Autoregressive Forecasting: The model attends to the entire sequence of $N$ patch tokens and outputs multi-step forecasts for each output patch.
Output Residual Block: The contextualized output token is mapped to next-patch forecasts via a second residual MLP.

Notably, every stage from the causal transformer stack through the output block remains strictly autoregressive: the representation of a patch depends only on that patch and the patches before it.
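
The causal masking in the transformer stack can be illustrated with a small sketch. It computes row-wise softmax attention weights in which patch token $i$ may only attend to tokens $j \le i$. The function name and the toy score matrix are assumptions made for illustration; a real implementation would use learned query and key projections and multiple heads.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], restricted to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                     # token i sees tokens 0..i only
        peak = max(visible)                        # subtract max for stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)   # future tokens get zero weight
        weights.append([e / total for e in exps] + masked_tail)
    return weights


# Hypothetical raw attention scores for 4 patch tokens.
n_patches = 4
scores = [[0.1 * (i + j) for j in range(n_patches)] for i in range(n_patches)]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, and every row sums to one over the visible positions.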

2. Pretraining Regime and Objective

TimesFM-2.5 is pretrained using next-patch regression on a corpus of approximately 100 billion time points:

Real-world Sources: Google Trends and Wikipedia page view dynamics.
Synthetic Sources: Draws from ARMA processes and deterministic seasonal-trend compositions.
No Finance Data: Financial returns or asset data are not directly present in the pretraining set.

The pretraining objective is mean squared error for next-patch prediction:

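$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{P_{\mathrm{out}}} \left\| \hat{y} - y \right\|_2^2, \quad \hat{y},\, y \in \mathbb{R}^{P_{\mathrm{out}}},$

where $\hat{y}$ and $y$ are the predicted and observed next patches and $P_{\mathrm{out}}$ is the output patch length, potentially $P_{\mathrm{out}} > P$ to minimize the number of autoregressive decoding steps needed for long-term forecasting (Alonso et al., 25 Jun 2026).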
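
A minimal sketch of the two quantities discussed above: the mean squared error over one predicted patch, and the number of autoregressive decoding steps needed to cover a forecast horizon for a given output patch length. The function names and the example horizon and patch lengths are hypothetical values chosen for illustration, not settings reported for TimesFM-2.5.

```python
import math


def next_patch_mse(predicted, target):
    """Mean squared error between a predicted patch and the observed patch."""
    if len(predicted) != len(target):
        raise ValueError("patches must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def decoding_steps(horizon, output_patch_len):
    """Autoregressive steps needed to produce `horizon` future points."""
    return math.ceil(horizon / output_patch_len)


print(next_patch_mse([1.0, 2.0], [1.0, 4.0]))
# A longer output patch covers the same horizon in fewer decoding steps.
print(decoding_steps(256, 32), decoding_steps(256, 128))
```

Fewer decoding steps means fewer opportunities for forecast errors to be fed back into the context.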

3. Bayesian and Information-Theoretic Framing

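Pretraining in TimesFM-2.5 facilitates transfer as an inductive prior. With pretrained parameters $\theta_0$, the model defines a prior over parameters centered on the pretrained weights, for example the isotropic Gaussian

$P = \mathcal{N}(\theta_0, \sigma^2 I),$

providing a PAC-Bayes guarantee on out-of-sample risk after adaptation. In a standard form for losses bounded in $[0, 1]$, with probability at least $1-\delta$ over a sample of size $n$,

$\mathbb{E}_{\theta \sim Q}[R(\theta)] \le \mathbb{E}_{\theta \sim Q}[\hat{R}_n(\theta)] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln(2\sqrt{n}/\delta)}{2n}}.$

When TimesFM-2.5 is used zero-shot, the posterior $Q$ is tightly centered on $\theta_0$, making adaptation negligible but leveraging the prior's structure.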
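
To make the role of the KL term concrete, the sketch below evaluates the complexity term of a standard PAC-Bayes bound, $\sqrt{(\mathrm{KL}(Q\|P) + \ln(2\sqrt{n}/\delta)) / (2n)}$, under two assumptions that are not taken from the source: the prior and posterior are isotropic Gaussians with a shared variance, and the loss is bounded in $[0, 1]$. All parameter vectors and sample sizes are hypothetical.

```python
import math


def gaussian_kl_same_variance(mu_q, mu_p, sigma):
    """KL(N(mu_q, sigma^2 I) || N(mu_p, sigma^2 I)) for a shared sigma."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(mu_q, mu_p))
    return squared_distance / (2.0 * sigma ** 2)


def pac_bayes_gap(kl, n, delta=0.05):
    """Complexity term added to the empirical risk in the bound."""
    return math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))


theta_0 = [0.0, 0.0, 0.0]       # hypothetical pretrained parameters
theta_tuned = [0.3, -0.1, 0.2]  # hypothetical posterior mean after adaptation

kl_zero_shot = gaussian_kl_same_variance(theta_0, theta_0, sigma=0.1)
kl_adapted = gaussian_kl_same_variance(theta_tuned, theta_0, sigma=0.1)
print(kl_zero_shot, kl_adapted)
print(pac_bayes_gap(kl_zero_shot, n=1000), pac_bayes_gap(kl_adapted, n=1000))
```

In the zero-shot case the posterior coincides with the prior, so the KL term vanishes and the gap is driven only by the sample size and confidence level.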

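Forecastability bounds are governed by the mutual information $I(X;Y)$ between the context $X$ and the target $Y$. For a Gaussian target, for instance,

$R^2 \le 1 - e^{-2 I(X;Y)}.$

For daily equity returns, $I(X;Y)$ is only a small fraction of a nat, implying that the theoretical ceiling on achievable $R^2$ improvements is exceedingly low (Alonso et al., 25 Jun 2026).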
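
As a rough illustration of how little predictability a small mutual information allows, the sketch below evaluates the ceiling $R^2 \le 1 - e^{-2I}$, which holds when the target is Gaussian. Both the Gaussian assumption and the example values of $I$ are hypothetical; they are not figures from the cited study.

```python
import math


def r2_ceiling(mutual_information_nats):
    """Upper bound on R^2 for a Gaussian target given I(X;Y) in nats."""
    # -expm1(-2I) equals 1 - exp(-2I) but is more accurate for small I.
    return -math.expm1(-2.0 * mutual_information_nats)


for mi in (0.001, 0.01, 0.1):   # hypothetical mutual-information values
    print(mi, round(r2_ceiling(mi), 5))
```

For small $I$ the ceiling is approximately $2I$, so a mutual information of a few thousandths of a nat caps $R^2$ at well under one percent.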

4. Patch-Based Vulnerabilities and Context Sensitivity

The model’s patching strategy exposes it to “context poisoning”: anomalous patches (e.g., local level shifts, volatility bursts) may receive disproportionate attention in the transformer, degrading zero-shot accuracy without overt model malfunction. Such vulnerabilities are quantifiable:

Disruption by Single Patches: Attention geometry permits isolated patches to silently dominate downstream prediction.
No Weight Adaptation in Zero-Shot: As context is fixed, forecast quality is susceptible to localized outliers.

To address this, the GITCO (Gated Inference-Time Context Optimization) system was designed for TimesFM-2.5. GITCO augments inference via a Gate–Router–Critic pipeline:

Gate: A binary classifier (78.0% precision, 57.6% recall) using spectral features flags whether patch intervention is safe.
Router: Selects among three critic MLPs (Shape, Stat, Uni experts) using meta-feature-based logic.
Critic: Computes per-patch disruption scores via an MLP; the patch with the highest score is denoised by a 5-point simple moving average (SMA).

Only the most harmful patch is modified; model weights remain untouched. On 53 GIFT-Eval datasets, GITCO yields a 1.95% mean reduction in MASE (4.30% on the datasets where it intervenes) and a captured improvement ratio (CIR) of 0.899, recovering 89.9% of the oracle gain (Pandey et al., 3 Jun 2026).
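
The intervention step can be sketched as follows. This is not the GITCO implementation: the gate decision and the per-patch disruption scores are passed in as hypothetical inputs rather than computed by a classifier, router, or critic MLP, and the centered window with shrinking edges is an assumption about how the 5-point SMA is applied.

```python
def sma5(patch):
    """Centered 5-point simple moving average; the window shrinks at the edges."""
    smoothed = []
    for i in range(len(patch)):
        lo, hi = max(0, i - 2), min(len(patch), i + 3)
        window = patch[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def intervene(patches, disruption_scores, gate_open):
    """Smooth only the highest-scoring patch, and only if the gate allows it."""
    cleaned = [list(p) for p in patches]
    if not gate_open:
        return cleaned, None                  # context is left untouched
    worst = max(range(len(patches)), key=lambda i: disruption_scores[i])
    cleaned[worst] = sma5(patches[worst])
    return cleaned, worst


# Toy context: the middle patch contains a spike.
patches = [[1.0] * 8, [1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 1.0, 1.0], [1.0] * 8]
scores = [0.1, 0.9, 0.2]                      # hypothetical critic outputs
cleaned, worst = intervene(patches, scores, gate_open=True)
print(worst, cleaned[worst])
```

Because the function returns a modified copy of the context, the forecasting model itself is unchanged, matching the inference-time nature of the method.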

5. Empirical Evaluation and Comparative Results

TimesFM-2.5 has been benchmarked in both generic forecasting and in financial return prediction:

GIFT-Eval Benchmarks (Pandey et al., 3 Jun 2026)

Dataset Coverage: 53 datasets across frequencies from sub-hourly to monthly.
Metric: Mean Absolute Scaled Error (MASE).
Sliding-Window Evaluation: context length $L$, stride 1, up to 300 windows/series, $k$-fold cross-validation.
GITCO Ablations: Demonstrate the importance of high-precision gating and dynamic probe routing.

System Variant	ΣΔ% Improvement	Precision (%)
Always Intervene	+4.41	35.85
Gate Only	+24.83	45.83
Router Only	+42.16	37.74
GITCO (Gate+Router)	+57.33	78.0
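
For reference, MASE can be computed as in the sketch below, which follows the usual definition: the forecast's mean absolute error divided by the in-sample mean absolute error of a (seasonal) naive forecast. The function name, the `season` parameter, and the toy data are illustrative assumptions; the benchmark's exact seasonal settings are not specified here.

```python
def mase(actual, forecast, train, season=1):
    """Mean Absolute Scaled Error relative to an in-sample seasonal-naive forecast."""
    forecast_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    naive_errors = [abs(train[t] - train[t - season]) for t in range(season, len(train))]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("naive forecast has zero in-sample error; MASE is undefined")
    return forecast_mae / scale


train = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy history
actual = [6.0, 7.0]                 # toy future values
forecast = [6.5, 6.5]               # toy model forecast
print(mase(actual, forecast, train))
```

A value below 1 means the forecast beats the naive baseline on average, which is why percentage reductions in MASE are comparable across datasets with different scales.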

Financial Return Forecasting (Alonso et al., 25 Jun 2026)

Assets: AAPL, AMZN, GOOG, JPM, META; both linear and log returns.
Context: All models receive the same window of $L$ past returns and forecast over horizon $h$.
Metrics: Mean absolute error (MAE), skill vs. random walk, Diebold-Mariano hypothesis testing.
Findings: TimesFM-2.5 delivers the best forecasts on AAPL and JPM and a strong average rank (3.1, second only to Moirai-2.0 at 2.9), but is outperformed by iTransformer on META. Gains over the random-walk benchmark are small and only rarely statistically significant.

Ticker	Return	Winner	MAE	Skill vs. RW
AAPL	Linear	TimesFM-2.5	0.01044	-0.0319
AAPL	Log	TimesFM-2.5	0.01040	-0.0288
JPM	Linear	TimesFM-2.5	0.01141	-0.0466
JPM	Log	TimesFM-2.5	0.01146	-0.0510
META	Linear	iTransformer	0.01709	-0.0519
META	Log	iTransformer	0.01693	-0.0464
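
The Diebold-Mariano comparison mentioned above can be sketched for the one-step-ahead case with absolute-error loss. This simplified version assumes no autocorrelation correction (reasonable only at horizon 1) and uses the asymptotic normal distribution for the p-value; the error series are made-up numbers, not results from the study.

```python
import math


def diebold_mariano(errors_a, errors_b):
    """DM statistic and two-sided p-value for absolute-error loss, horizon 1."""
    diffs = [abs(a) - abs(b) for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / n
    if var_diff == 0:
        raise ValueError("loss differential has zero variance")
    stat = mean_diff / math.sqrt(var_diff / n)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))  # two-sided normal p-value
    return stat, p_value


# Hypothetical forecast errors for a model and a random-walk benchmark.
errors_model = [0.010, -0.012, 0.008, -0.011, 0.009, 0.013, -0.007, 0.010]
errors_rw = [0.011, -0.012, 0.009, -0.010, 0.010, 0.012, -0.008, 0.011]
stat, p_value = diebold_mariano(errors_model, errors_rw)
print(round(stat, 3), round(p_value, 3))
```

A negative statistic favors the first model, but with differences this small the p-value stays above conventional thresholds, illustrating how a lower MAE need not be statistically significant.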

A key result is that, despite strong “off-the-shelf” performance and dominance in the rank distribution, TimesFM-2.5 and similar TSFMs yield modest absolute improvements in predictability, commensurate with low information-theoretic limits in asset returns.

6. Practical Implications and Limitations

TimesFM-2.5 establishes itself as a practical, domain-agnostic inductive prior in settings where task-specific data is limited. Its main practical advantages are:

Cost-Scaling: Reduction of per-asset model development when local data is insufficient.
Zero-Shot Adaptivity: Applicability without further fine-tuning or domain adaptation.
Explicit Context Sensitivity Profiles: Predictable improvements when paired with inference-time context optimization such as GITCO, which is model- and meta-feature-specific.

However:

Not Universally Dominant: Asset-specific or locally supervised methods (e.g., iTransformer) may outperform in domains with idiosyncratic patterns.
Performance Ceilings: Empirical and information-theoretic assessments demonstrate tight upper bounds in noisy, weakly persistent settings (e.g., financial returns).
Pretraining Corpus Limits: A plausible implication is that only domain-aligned pretrained priors and/or hybrid fine-tuning can reliably shift the KL term in a PAC-Bayes analysis and unlock consistently robust out-of-sample improvements.

7. Theoretical and Methodological Insights

TimesFM-2.5’s architectural and learning principles present several methodological takeaways:

Attention Geometry & Mixing: Transformer mixing times and the propagation of long-range dependencies are controlled by the spectral properties of the attention kernel (cf. Cheeger’s inequality).
Context Sensitivity Profiles: The mapping $f$ from meta-features to expected improvement under context intervention is compact and model-specific for TimesFM-2.5, but cannot always be induced in alternative architectures (e.g., Chronos2), highlighting nuanced tradeoffs in transferability and robustness (Pandey et al., 3 Jun 2026).
No Free Lunch in Financial Returns: Despite strong multi-domain pretraining, the signal content in financial returns enforces small, regime-dependent practical improvements, clarifying the distinction between model rankings and economic significance (Alonso et al., 25 Jun 2026).

TimesFM-2.5 occupies a central position in time series foundation modeling: it is a scalable, patch-based transformer capable of state-of-the-art zero-shot forecasting under both generic and application-specific conditions, yet bounded by the interplay of architectural choices, information-theoretic constraints, and context sensitivity.


References

Pandey et al. (3 Jun 2026). GITCO: Gated Inference-Time Context Optimization in TSFMs.

Alonso et al. (25 Jun 2026). Pretrained Time-Series Foundation Models for Financial Return Forecasting.
