TimesFM-2.5: Patch-Based Transformer for Forecasting

Updated 29 June 2026
  • TimesFM-2.5 is a patch-based decoder-only transformer designed for time series forecasting, pretrained via next-patch regression on 100 billion time points.
  • It processes fixed-length context windows split into non-overlapping patches using linear embedding, causal transformer stacks, and autoregressive decoding for multi-step predictions.
  • Its performance is enhanced through GITCO, a system that mitigates context poisoning by dynamically intervening on anomalous patches to improve forecasting accuracy.

TimesFM-2.5 is a patch-based decoder-only transformer architecture designed as a foundation model for time series forecasting. It is pretrained on a large, multi-domain corpus comprising both real-world and synthetic time series, and evaluated for both generic forecasting and financial return prediction tasks. TimesFM-2.5 has been the subject of comprehensive benchmarking and is notable for its patching strategy, inductive prior interpretation, and susceptibility to context poisoning, motivating downstream interventions such as GITCO. This entry details TimesFM-2.5’s architecture, pretraining regime, inference properties, empirical performance, context sensitivity, and principal methodological considerations (Pandey et al., 3 Jun 2026, Alonso et al., 25 Jun 2026).

1. Architectural Structure

TimesFM-2.5 operates on fixed-length context windows of size $L=512$, partitioned into $N=16$ non-overlapping patches of length $P=32$, yielding an input tensor

$$X = [p_1, \ldots, p_N] \in \mathbb{R}^{N \times P}, \quad p_i \in \mathbb{R}^P.$$

Each patch is processed as follows:

  • Linear Embedding: Each $p_i$ is mapped into a $d$-dimensional token space.
  • Patch-Level Positional Encoding: Patch indices are encoded, ensuring order-awareness at the patch granularity.
  • Input Residual Block: Each patch token is normalized, potentially masked for padding, and passed through an MLP; the output is summed with the positional code.
  • Causal Transformer Stack: A stack of $L_{\mathrm{blocks}}$ transformer layers processes the sequence $(t_1, \ldots, t_N)$. Each layer alternates masked self-attention and feed-forward sublayers with residual connections and normalization. Multi-head attention is used with $H$ heads, each of dimension $d_k = d/H$.
  • Autoregressive Forecasting: The model attends to the entire sequence of $N$ patch tokens and outputs multi-step forecasts for each output patch.
  • Output Residual Block: The contextualized output token is mapped to next-patch forecasts via a second residual MLP.

Notably, every stage from the causal transformer stack through the output block remains strictly autoregressive: the forecast for a given position never depends on later patches.
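
The patching and causal-masking steps above can be illustrated with a minimal NumPy sketch. It is not the model's implementation: the embedding width `d = 64`, the random matrix `W_embed`, and the single-head attention without query/key/value projections, positional codes, or residual blocks are all simplifying assumptions made here for illustration.

```python
import numpy as np

L, P = 512, 32   # context length and patch length, as stated in the text
N = L // P       # 16 non-overlapping patches
d = 64           # hypothetical embedding width (not given in the text)


def patchify(series: np.ndarray, patch_len: int = P) -> np.ndarray:
    """Reshape a 1-D context into an (N, patch_len) matrix of non-overlapping patches."""
    if series.ndim != 1 or series.size % patch_len != 0:
        raise ValueError("context length must be a multiple of the patch length")
    return series.reshape(-1, patch_len)


def causal_self_attention(tokens: np.ndarray) -> np.ndarray:
    """Single-head attention in which patch i attends only to patches j <= i."""
    n, width = tokens.shape
    scores = tokens @ tokens.T / np.sqrt(width)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)          # mask out future patches
    scores = scores - scores.max(axis=1, keepdims=True) # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ tokens


rng = np.random.default_rng(0)
context = rng.standard_normal(L)
X = patchify(context)                               # shape (16, 32)
W_embed = rng.standard_normal((P, d)) / np.sqrt(P)  # random stand-in for the learned embedding
tokens = X @ W_embed                                # shape (16, 64)
mixed = causal_self_attention(tokens)               # shape (16, 64)
```

Because of the mask, perturbing the last patch leaves the outputs for all earlier patches unchanged, which is the property the causal stack relies on.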

2. Pretraining Regime and Objective

TimesFM-2.5 is pretrained using next-patch regression on a corpus of approximately 100 billion time points:

  • Real-world Sources: Google Trends and Wikipedia page view dynamics.
  • Synthetic Sources: Draws from ARMA processes and deterministic seasonal-trend compositions.
  • No Finance Data: Financial returns or asset data are not directly present in the pretraining set.

The pretraining objective is mean squared error for next-patch prediction:

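$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{P_{\mathrm{out}}} \left\lVert \hat{y}_i - y_i \right\rVert_2^2,$$

where $\hat{y}_i \in \mathbb{R}^{P_{\mathrm{out}}}$ is the forecast emitted after patch $p_i$, $y_i$ is the ground truth over the same span, and $P_{\mathrm{out}}$ is the output patch length, potentially $P_{\mathrm{out}} > P$ to minimize the number of autoregressive decoding steps needed for long-term forecasting (Alonso et al., 25 Jun 2026).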
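
To make the trade-off concrete, the sketch below computes a next-patch mean squared error and counts how many autoregressive decoding calls a forecast horizon requires for a given output patch length. The function names, the horizon of 256, and the output patch length of 128 are hypothetical values chosen for illustration; they are not taken from the source.

```python
import math

import numpy as np


def next_patch_mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error over all predicted patches and all points in each patch."""
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    return float(np.mean((pred - target) ** 2))


def decoding_steps(horizon: int, output_patch_len: int) -> int:
    """Number of autoregressive calls needed to cover `horizon` future points."""
    if horizon <= 0 or output_patch_len <= 0:
        raise ValueError("horizon and output_patch_len must be positive")
    return math.ceil(horizon / output_patch_len)


# Hypothetical example: 16 patch positions, each predicting 128 future points.
target = np.zeros((16, 128))
pred = np.full((16, 128), 0.5)
loss = next_patch_mse(pred, target)

steps_short = decoding_steps(horizon=256, output_patch_len=32)   # output patch as long as the input patch
steps_long = decoding_steps(horizon=256, output_patch_len=128)   # longer output patch
```

With these illustrative numbers, quadrupling the output patch length cuts the number of decoding calls from 8 to 2, which is the motivation for letting the output patch exceed the input patch.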

3. Bayesian and Information-Theoretic Framing

Pretraining in TimesFM-2.5 facilitates transfer by acting as an inductive prior. With pretrained parameters $\theta_0$, the model defines a prior $\pi$ over parameters centered on $\theta_0$. This provides a PAC-Bayes guarantee on out-of-sample risk after adaptation, in which the complexity penalty is the KL divergence between the adapted posterior $\rho$ and the prior $\pi$.

When TimesFM-2.5 is used zero-shot, the posterior $\rho$ is tightly centered on $\theta_0$, making adaptation negligible while still leveraging the prior's structure.

Forecastability bounds are governed by the mutual information $I(X;Y)$ between the context $X$ and the target $Y$. For daily equity returns, $I(X;Y)$ is very small when measured in nats, implying that the theoretical ceiling on achievable accuracy improvements is exceedingly low (Alonso et al., 25 Jun 2026).
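
For orientation, the standard textbook forms of the two kinds of bound this section refers to are sketched below. They are generic statements and not necessarily the exact expressions used by Alonso et al. The symbols $\pi$ (prior), $\rho$ (posterior), $\mathcal{R}$ and $\hat{\mathcal{R}}$ (true and empirical risk), $n$ (sample size), and $\delta$ (failure probability), as well as the bounded-loss and Gaussian-target conditions, are assumptions introduced here.

```latex
% McAllester/Maurer-type PAC-Bayes bound for a loss bounded in [0, 1]:
% with probability at least 1 - \delta over an i.i.d. sample of size n >= 8,
% simultaneously for all posteriors \rho,
\mathcal{R}(\rho) \;\le\; \hat{\mathcal{R}}(\rho)
  + \sqrt{\frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}

% Mutual-information ceiling on explained variance for a Gaussian target Y
% predicted from context X, with I(X;Y) measured in nats:
R^2 \;\le\; 1 - e^{-2\, I(X;Y)}
```

Two readings follow directly. A posterior that stays close to the pretrained prior keeps the KL term small, so the bound is driven mostly by the empirical risk. And since $1 - e^{-2I} \approx 2I$ for small $I$, a mutual information of, say, 0.01 nats (a purely illustrative figure) would cap $R^2$ at roughly 2%.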

4. Patch-Based Vulnerabilities and Context Sensitivity

The model’s patching strategy exposes it to “context poisoning”: anomalous patches (e.g., local level shifts, volatility bursts) may receive disproportionate attention in the transformer, degrading zero-shot accuracy without overt model malfunction. Such vulnerabilities are quantifiable:

  • Disruption by Single Patches: Attention geometry permits isolated patches to silently dominate downstream prediction.
  • No Weight Adaptation in Zero-Shot: As context is fixed, forecast quality is susceptible to localized outliers.

To address this, the GITCO (Gated Inference-Time Context Optimization) system was designed for TimesFM-2.5. GITCO augments inference via a Gate–Router–Critic pipeline:

  • Gate: A binary classifier (78.0% precision, 57.6% recall) that uses spectral features to flag whether patch intervention is safe.
  • Router: Selects among three critic MLPs (Shape, Stat, and Uni experts) using meta-feature-based logic.
  • Critic: Computes per-patch disruption scores via an MLP; the patch with the highest score is denoised with a 5-point simple moving average (SMA).

Only the most harmful patch is modified; model weights remain untouched. On 53 GIFT-Eval datasets, GITCO yields +1.95% mean MASE reduction (4.30% on intervened sets) and a captured improvement ratio (CIR) of 0.899, retrieving 89.9% of the oracle gain (Pandey et al., 3 Jun 2026).
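
The control flow of this pipeline can be sketched as follows. Only the overall shape is taken from the text: gate, then router, then critic, then a 5-point SMA applied to the single highest-scoring patch. Everything else is a hypothetical stand-in: `toy_gate`, `toy_router`, and the three scoring functions in `CRITICS` are simple heuristics replacing the trained classifier and critic MLPs, and the edge padding in `sma_denoise` is an assumption.

```python
import numpy as np


def roughness(patch: np.ndarray) -> float:
    """Variance of first differences; large for noisy or bursty patches."""
    return float(np.var(np.diff(patch)))


# Hypothetical stand-ins for the Shape / Stat / Uni critic MLPs.
CRITICS = {
    "shape": roughness,
    "stat": lambda p: float(np.var(p)),
    "uni": lambda p: float(np.max(np.abs(p - np.median(p)))),
}


def toy_gate(context: np.ndarray, patch_len: int = 32, ratio: float = 10.0) -> bool:
    """Allow intervention only if one patch is far rougher than the typical patch."""
    r = np.array([roughness(p) for p in context.reshape(-1, patch_len)])
    return bool(r.max() > ratio * np.median(r))


def toy_router(context: np.ndarray) -> str:
    """Pick a critic from one meta-feature (lag-1 autocorrelation); never picks "uni"."""
    x = context - context.mean()
    lag1 = float(x[:-1] @ x[1:] / (x @ x))
    return "shape" if lag1 > 0.5 else "stat"


def sma_denoise(patch: np.ndarray, window: int = 5) -> np.ndarray:
    """Centered simple moving average with edge padding; output length equals input length."""
    padded = np.pad(patch, window // 2, mode="edge")
    return np.convolve(padded, np.ones(window) / window, mode="valid")


def intervene(context, patch_len, gate, router, critics):
    """Return (new_context, index of the smoothed patch), or (copy, None) if the gate declines."""
    patches = context.reshape(-1, patch_len).copy()
    if not gate(context):
        return patches.reshape(-1), None
    critic = critics[router(context)]
    worst = int(np.argmax([critic(p) for p in patches]))
    patches[worst] = sma_denoise(patches[worst])  # only this one patch is modified
    return patches.reshape(-1), worst


rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * np.arange(512) / 64)
context = clean.copy()
context[5 * 32:6 * 32] += rng.standard_normal(32)  # inject a noise burst into patch 5

cleaned, worst = intervene(context, 32, toy_gate, toy_router, CRITICS)
```

In this toy setup the burst is placed in patch 5, so that is the patch the sketch smooths; all other patches, and of course the forecasting model's weights, are left untouched.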

5. Empirical Evaluation and Comparative Results

TimesFM-2.5 has been benchmarked in both generic forecasting and in financial return prediction:

Generic forecasting (GIFT-Eval):

  • Dataset Coverage: 53 datasets across frequencies from sub-hourly to monthly.
  • Metric: Mean Absolute Scaled Error (MASE).
  • Sliding-Window Evaluation: Fixed-length context windows, stride 1, up to 300 windows per series, with cross-validation.
  • GITCO Ablations: Demonstrate that high-precision gating and dynamic probe routing are both critical.

| System Variant | ΣΔ% Improvement | Precision (%) |
|---|---|---|
| Always Intervene | +4.41 | 35.85 |
| Gate Only | +24.83 | 45.83 |
| Router Only | +42.16 | 37.74 |
| GITCO (Gate+Router) | +57.33 | 78.0 |

Financial return prediction:

  • Assets: AAPL, AMZN, GOOG, JPM, META; both linear and log returns.
  • Context: All models receive the same fixed-length window of past returns and forecast over the same horizon.
  • Metrics: Mean absolute error (MAE), skill vs. random walk (RW), Diebold-Mariano hypothesis testing.
  • Findings: TimesFM-2.5 delivers the best forecasts on AAPL and JPM and a strong average rank (3.1, behind only Moirai-2.0 at 2.9), but is outperformed by iTransformer on META. Gains over the random-walk benchmark are small and only rarely statistically significant.

| Ticker | Return | Winner | MAE | Skill vs. RW |
|---|---|---|---|---|
| AAPL | Linear | TimesFM-2.5 | 0.01044 | -0.0319 |
| AAPL | Log | TimesFM-2.5 | 0.01040 | -0.0288 |
| JPM | Linear | TimesFM-2.5 | 0.01141 | -0.0466 |
| JPM | Log | TimesFM-2.5 | 0.01146 | -0.0510 |
| META | Linear | iTransformer | 0.01709 | -0.0519 |
| META | Log | iTransformer | 0.01693 | -0.0464 |
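
As a reference for the evaluation protocol, the sketch below implements MASE in its usual form (forecast MAE divided by the in-sample MAE of a naive forecast) together with a stride-1 sliding-window splitter capped at a maximum number of windows. The choice to keep the most recent windows when the cap applies, the seasonal period of 1, and the tiny context and horizon in the example are assumptions made here; the source does not specify them.

```python
import numpy as np


def mase(y_true, y_pred, y_insample, season: int = 1) -> float:
    """Forecast MAE scaled by the in-sample MAE of the seasonal-naive forecast."""
    y_true, y_pred, y_insample = (np.asarray(a, dtype=float) for a in (y_true, y_pred, y_insample))
    scale = np.mean(np.abs(y_insample[season:] - y_insample[:-season]))
    if scale == 0:
        raise ValueError("in-sample naive error is zero; MASE is undefined")
    return float(np.mean(np.abs(y_true - y_pred)) / scale)


def sliding_windows(series, context_len: int, horizon: int, stride: int = 1, max_windows: int = 300):
    """Return (context, target) pairs, keeping at most the last `max_windows` windows."""
    series = np.asarray(series, dtype=float)
    starts = list(range(0, len(series) - context_len - horizon + 1, stride))[-max_windows:]
    return [
        (series[s:s + context_len], series[s + context_len:s + context_len + horizon])
        for s in starts
    ]


# Toy usage: a linear ramp, scored with a last-value (naive) forecast.
series = np.arange(20.0)
pairs = sliding_windows(series, context_len=8, horizon=2, max_windows=5)
ctx, target = pairs[0]
naive_forecast = np.full(target.shape, ctx[-1])
score = mase(target, naive_forecast, ctx)
```

A MASE below 1 means the forecast beats the in-sample naive baseline on average. On the ramp above, the naive two-step forecast scores above 1 because its error grows with the horizon.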

A key result is that, despite strong off-the-shelf performance and consistently high rankings, TimesFM-2.5 and similar time series foundation models (TSFMs) yield only modest absolute improvements in predictability, consistent with the low information-theoretic limits on asset returns.

6. Practical Implications and Limitations

TimesFM-2.5 establishes itself as a practical, domain-agnostic inductive prior in settings where task-specific data is limited. Its main practical advantages are:

  • Cost-Scaling: Reduction of per-asset model development when local data is insufficient.
  • Zero-Shot Adaptivity: Applicability without further fine-tuning or domain adaptation.
  • Explicit Context Sensitivity Profiles: Predictable improvements when paired with inference-time context optimization such as GITCO, which is model- and meta-feature-specific.

However:

  • Not Universally Dominant: Asset-specific or locally supervised methods (e.g., iTransformer) may outperform in domains with idiosyncratic patterns.
  • Performance Ceilings: Empirical and information-theoretic assessments demonstrate tight upper bounds in noisy, weakly persistent settings (e.g., financial returns).
  • Pretraining Corpus Limits: A plausible implication is that only domain-aligned pretrained priors and/or hybrid fine-tuning can reliably shift the KL term in a PAC-Bayes analysis and unlock consistently robust out-of-sample improvements.

7. Theoretical and Methodological Insights

TimesFM-2.5’s architectural and learning principles present several methodological takeaways:

  • Attention Geometry & Mixing: Transformer mixing times and the propagation of long-range dependencies are controlled by the spectral properties of the attention kernel (cf. Cheeger’s inequality).
  • Context Sensitivity Profiles: The mapping from meta-features to expected improvement under context intervention is compact and model-specific for TimesFM-2.5, but cannot always be induced in alternative architectures (e.g., Chronos2), highlighting nuanced tradeoffs in transferability and robustness (Pandey et al., 3 Jun 2026).
  • No Free Lunch in Financial Returns: Despite strong multi-domain pretraining, the signal content in financial returns enforces small, regime-dependent practical improvements, clarifying the distinction between model rankings and economic significance (Alonso et al., 25 Jun 2026).

TimesFM-2.5 occupies a central position in time series foundation modeling—a scalable, patch-based transformer capable of state-of-the-art zero-shot forecasting under both generic and application-specific conditions, yet bounded by the interplay of architectural choices, information-theoretic constraints, and context sensitivity.
