Re(Visiting) Time Series Foundation Models in Finance (2511.18578v1)

Published 23 Nov 2025 in q-fin.CP, cs.AI, cs.LG, q-fin.PM, and q-fin.PR

Abstract: Financial time series forecasting is central to trading, portfolio optimization, and risk management, yet it remains challenging due to noisy, non-stationary, and heterogeneous data. Recent advances in time series foundation models (TSFMs), inspired by LLMs, offer a new paradigm for learning generalizable temporal representations from large and diverse datasets. This paper presents the first comprehensive empirical study of TSFMs in global financial markets. Using a large-scale dataset of daily excess returns across diverse markets, we evaluate zero-shot inference, fine-tuning, and pre-training from scratch against strong benchmark models. We find that off-the-shelf pre-trained TSFMs perform poorly in zero-shot and fine-tuning settings, whereas models pre-trained from scratch on financial data achieve substantial forecasting and economic improvements, underscoring the value of domain-specific adaptation. Increasing the dataset size, incorporating synthetic data augmentation, and applying hyperparameter tuning further enhance performance.

Summary

  • The paper examines TSFMs for financial forecasting by benchmarking zero-shot, fine-tuned, and in-domain pre-training methods.
  • It finds that domain-specific pre-training and data augmentation significantly improve forecasting metrics over generic TSFMs.
  • The study shows that while ensemble methods remain robust, optimized TSFMs excel in long-horizon prediction when given extended input contexts.

Revisiting Time Series Foundation Models in Financial Forecasting

Introduction

The paper "Re(Visiting) Time Series Foundation Models in Finance" (2511.18578) presents a systematic empirical assessment of Time Series Foundation Models (TSFMs) for financial forecasting, focusing on daily cross-sectional excess returns. The authors address three central questions: the out-of-sample predictability of excess returns using TSFMs, the impact of zero-shot inference, fine-tuning, and in-domain pre-training, and the effects of data scale and augmentation. The study benchmarks TSFMs against traditional econometric, ML, and ensemble models using one of the largest financial datasets ever compiled, encompassing over two billion observations across global equity markets.

Empirical Evaluation of Predictive Models

Benchmark Models

The study rigorously compares a broad range of models: linear regressions (OLS, Lasso, Ridge, Elastic Net, PCR), tree-based ensemble methods (XGBoost, CatBoost, LightGBM), and fully connected neural networks (NN-S, NN-L). Across varying rolling-window lengths, the ensemble models, particularly CatBoost, consistently outperform both linear and deep neural models on out-of-sample $R^2_{OOS}$, directional accuracy, and Sharpe ratios. For example, CatBoost achieves an annualized return of 46.50% and a Sharpe ratio of 6.79 at a 252-day window, compared to linear regression's $R^2_{OOS} = -0.47\%$. Predictive performance diminishes with increasing market capitalization: smaller-cap equities exhibit stronger signals.
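The summary does not spell out how $R^2_{OOS}$ is computed; below is a minimal sketch in Python, assuming the standard excess-return convention (Gu, Kelly, and Xiu, 2020) of benchmarking squared forecast errors against a naive zero forecast, together with directional accuracy.

```python
import numpy as np

def r2_oos(returns: np.ndarray, forecasts: np.ndarray) -> float:
    """Out-of-sample R^2 versus a naive zero forecast, the common
    convention for excess-return panels (an assumption here)."""
    ss_res = np.sum((returns - forecasts) ** 2)
    ss_zero = np.sum(returns ** 2)  # benchmark model: predict 0 excess return
    return 1.0 - ss_res / ss_zero

def directional_accuracy(returns: np.ndarray, forecasts: np.ndarray) -> float:
    """Fraction of observations where the forecast sign matches the realized sign."""
    return float(np.mean(np.sign(forecasts) == np.sign(returns)))

# Toy usage: a weak signal gives R^2_OOS near zero and accuracy near 50%.
rng = np.random.default_rng(0)
r = rng.normal(0.0, 0.02, size=100_000)             # daily excess returns
f = 0.05 * r + rng.normal(0.0, 0.02, size=r.shape)  # noisy forecasts
print(r2_oos(r, f), directional_accuracy(r, f))
```

Under this convention, even a slightly negative $R^2_{OOS}$ can coexist with profitable portfolio sorts, which is why the paper reports economic metrics alongside statistical fit.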

Time Series Foundation Models (TSFMs)

The main analysis evaluates TSFMs using three regimes:

  • Zero-shot inference using generic off-the-shelf models.
  • Fine-tuning pre-trained models on financial data.
  • Pre-training from scratch on financial time series.

The primary TSFM baselines are Chronos (Ansari et al., 2024) and TimesFM (Das et al., 2024), with additional comparisons to over a dozen models from the recent literature.
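For orientation, here is a minimal zero-shot sketch using the publicly released chronos-forecasting package; the checkpoint, window length, and one-day horizon are illustrative choices, not necessarily the paper's exact configuration.

```python
import torch
from chronos import ChronosPipeline  # pip install chronos-forecasting

# Load an off-the-shelf pre-trained checkpoint (size chosen for illustration).
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

# One lookback window of daily excess returns (placeholder values).
context = 0.02 * torch.randn(512)

# Zero-shot probabilistic forecast: tensor of shape [1, num_samples, horizon].
samples = pipeline.predict(context, prediction_length=1)
point_forecast = samples.median(dim=1).values  # median across sample paths
print(point_forecast)
```

TimesFM exposes an analogous load-and-forecast interface; in both cases the zero-shot regime means no gradient updates on financial data.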

Main Findings: Limits and Opportunities of TSFMs

Zero-Shot and Fine-Tuned Models

Off-the-shelf pre-trained TSFMs (e.g., Chronos-Large, TimesFM-500M) perform poorly on financial forecasting tasks, with $R^2_{OOS}$ between $-1.37\%$ and $-2.80\%$ using long lookbacks and out-of-sample directional accuracy near 50%. Portfolio returns from such models are far inferior to those of the benchmark ensemble models, and fine-tuning does not close this gap, especially in economic outcomes. Directional realignment yields minimal benefit: most fine-tuned TSFMs show negligible or negative gains in $R^2_{OOS}$, and any improvements are not matched by economic significance.

In-Domain Pre-Training

Pre-training TSFMs entirely on financial data provides a substantial uplift: for instance, Chronos-Small's $R^2_{OOS}$ improves to $-0.59\%$ for a 512-day window (from $-1.27\%$ zero-shot), and annualized return and Sharpe ratio rise to 36.84% and 5.42. This narrows, but does not erase, the gap with ensemble benchmarks (CatBoost at $R^2_{OOS} = -0.03\%$ with annualized return 47.25%, Sharpe 6.46 at 512 days). The improvement is larger for longer lookbacks: TSFMs exploit temporal context more effectively than tabular methods as the context window grows.
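The exact portfolio construction behind these Sharpe ratios is not detailed in this summary; a standard cross-sectional long-short scheme, sorting stocks each day by forecast and trading the extreme deciles, is sketched below as an assumption.

```python
import numpy as np

def long_short_sharpe(forecasts: np.ndarray, realized: np.ndarray,
                      quantile: float = 0.10) -> float:
    """Equal-weighted decile long-short portfolio: long the top decile of
    forecasts, short the bottom decile, rebalanced daily.
    forecasts, realized: arrays of shape [T, N] (days x stocks)."""
    T, N = forecasts.shape
    k = max(1, int(quantile * N))
    daily = np.empty(T)
    for t in range(T):
        order = np.argsort(forecasts[t])           # ascending by forecast
        short_leg = realized[t, order[:k]].mean()  # lowest forecasts
        long_leg = realized[t, order[-k:]].mean()  # highest forecasts
        daily[t] = long_leg - short_leg
    return float(np.sqrt(252) * daily.mean() / daily.std())

# Toy usage: forecasts weakly correlated with returns yield a positive Sharpe.
rng = np.random.default_rng(0)
f = rng.normal(size=(252, 500))
r = 0.02 * f + rng.normal(size=f.shape)
print(long_short_sharpe(f, r))
```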

The evidence is unequivocal: generic LLM-style TSFM pre-training fails to transfer to financial domains. Domain-native pre-training with sufficient data scale is necessary for robust out-of-sample prediction. This trend persists across global financial markets.

Data Scale and Augmentation Effects

Increasing the size and scope of financial training data, whether through international cross-section expansion or augmentation with synthetic data and financial factors, delivers additive gains for TSFMs. Notably, TSFMs pre-trained on global or factor-augmented datasets improve in both $R^2_{OOS}$ and Sharpe ratio, with TimesFM-20M surpassing the ensemble benchmarks at long lookbacks when using JKP-augmented data (Sharpe ratio of 7.69 at a 512-day window).

Surprisingly, synthetic augmentation matches or exceeds the effect of real factor augmentation, indicating that TSFM sample efficiency is driven by heterogeneity and volume rather than strict domain realism, provided the base pre-training set is financial.
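The generators behind the synthetic augmentation are not specified in this summary; two common ways to produce heterogeneous yet finance-like series, block bootstrapping of real returns and a GARCH(1,1)-style volatility simulator, are sketched below as assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def block_bootstrap(series: np.ndarray, block: int = 20,
                    length: int = 2520) -> np.ndarray:
    """Resample contiguous blocks of a real return series, preserving
    short-range dependence while producing a new synthetic series."""
    starts = rng.integers(0, len(series) - block, size=length // block + 1)
    return np.concatenate([series[s:s + block] for s in starts])[:length]

def garch_like(length: int = 2520, omega: float = 1e-6,
               alpha: float = 0.08, beta: float = 0.90) -> np.ndarray:
    """Simulate returns with GARCH(1,1)-style volatility clustering."""
    r = np.empty(length)
    var = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    for t in range(length):
        r[t] = rng.normal(0.0, np.sqrt(var))
        var = omega + alpha * r[t] ** 2 + beta * var
    return r

# Toy usage: mix real-data resamples with simulated series into one panel.
real = rng.normal(0.0, 0.02, size=5000)  # stand-in for a real return series
panel = np.stack([block_bootstrap(real), garch_like()])
print(panel.shape)  # (2, 2520)
```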

Hyperparameter Sensitivity and Portfolio Robustness

TSFM performance is highly sensitive to architectural and tokenizer hyperparameters (e.g., the quantization range in Chronos, the patch size in TimesFM). With proper tuning, even small TSFMs can match or outperform larger models run at their default scale. Importantly, portfolio performance is more resilient for TSFMs at longer lookbacks: while all models show declining Sharpe ratios over 2001–2023, the degradation is slower for finance-specialized TSFMs.
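Neither the search space nor the tuning protocol is given in this summary; the following is a hypothetical grid-sweep skeleton, with pretrain_and_eval standing in for a full pre-train/validate cycle and all grid values purely illustrative.

```python
from itertools import product

def pretrain_and_eval(quant_range: tuple, patch_size: int) -> float:
    """Placeholder for a full pre-train/validate cycle that would return
    validation R^2_OOS; stubbed with a toy score so the sweep runs."""
    return -0.01 * abs(patch_size - 16) - 0.001 * abs(quant_range[1] - 15.0)

# Illustrative grids over the two hyperparameters the paper flags as sensitive.
quant_ranges = [(-5.0, 5.0), (-15.0, 15.0), (-50.0, 50.0)]  # Chronos tokenizer
patch_sizes = [8, 16, 32]                                    # TimesFM patching

best = max(product(quant_ranges, patch_sizes),
           key=lambda cfg: pretrain_and_eval(*cfg))
print("best config:", best)
```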

When transaction costs are introduced, only the largest, augmented TSFMs (e.g., TimesFM-20M) demonstrate resilience relative to tree ensembles at long horizons.
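The paper's cost model is not stated in this summary; a common proportional scheme charges a fixed rate on daily turnover, sketched below (the 10 bps default is illustrative).

```python
import numpy as np

def net_returns(weights: np.ndarray, asset_returns: np.ndarray,
                cost_bps: float = 10.0) -> np.ndarray:
    """Daily portfolio returns net of proportional transaction costs.
    weights: [T, N] target weights at the start of each day;
    asset_returns: [T, N] realized daily returns; cost quoted in basis points.
    Turnover on day 0 is the full cost of entering the initial positions."""
    gross = (weights * asset_returns).sum(axis=1)
    turnover = np.abs(np.diff(weights, axis=0,
                              prepend=np.zeros_like(weights[:1]))).sum(axis=1)
    return gross - (cost_bps / 1e4) * turnover

# Toy usage: two assets over three days.
w = np.array([[0.5, -0.5], [0.6, -0.4], [0.5, -0.5]])
r = np.array([[0.01, -0.02], [0.00, 0.01], [0.02, 0.00]])
print(net_returns(w, r))
```

Because turnover enters the cost linearly, models whose signals decay slowly over long horizons give up less of their gross return, which is consistent with the resilience reported for the long-lookback TSFMs.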

International Generalization

Results across seven major non-US markets confirm the US findings: ensemble models provide the most reliable baseline, yet TSFMs become more competitive as domain-specific pre-training and data scale increase. Predictive power is market-dependent, and cross-sectional ranking ability (for portfolio construction) is more robust to differences in market microstructure than goodness-of-fit.

Implications and Future Directions

The results strongly challenge the transferability hypothesis for foundation models: domain-agnostic pre-training is insufficient for financial forecasting. TSFMs, if pre-trained on large-scale, domain-appropriate data, can reach and, with scale and augmentation, surpass classic ensemble methods for long-range time series forecasting, especially as input context length increases.

From a practical perspective, the computational resources required for pre-training TSFMs are significant but tractable with modern infrastructure. However, there is a sharp performance/cost inflection point that favors data and hyperparameter optimization over naive parameter scaling.

Theoretically, the study calls for a reexamination of scaling laws and pre-training data construction for financial time series, the development of new TSFM architectures specifically designed for economic series (e.g., capturing regime shifts, cross-sectional linkages, and heteroskedastic dynamics), and exploration of joint training with features/factors, order book/signal inputs, and multivariate tasks.

Given TSFMs' inherent capability for probabilistic simulation, future research should rigorously evaluate their utility for risk management under uncertainty, probabilistic scenario analysis, and longer-horizon asset allocation problems.
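As a concrete illustration of that probabilistic use, the sketch below derives one-day Value-at-Risk and Expected Shortfall from sampled forecast paths of the kind a TSFM returns; the Gaussian samples merely stand in for model output.

```python
import numpy as np

def var_es_from_samples(sample_paths: np.ndarray, alpha: float = 0.05):
    """One-day VaR and Expected Shortfall from sampled return paths.
    sample_paths: [num_samples, horizon] array of simulated returns."""
    first_day = sample_paths[:, 0]
    var = -np.quantile(first_day, alpha)       # loss threshold at level alpha
    es = -first_day[first_day <= -var].mean()  # mean loss beyond the VaR
    return float(var), float(es)

# Toy usage: Gaussian paths standing in for a model's sampled forecasts.
paths = np.random.default_rng(1).normal(0.0, 0.02, size=(1000, 5))
print(var_es_from_samples(paths))
```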

Conclusion

This comprehensive empirical analysis establishes the current limitations and conditional potential of TSFMs in financial forecasting. Ensemble tree methods remain the dominant practical baseline, but TSFMs, when scaled with domain-specific pre-training and data augmentation, match or outperform classic approaches for long-horizon, high-dimensional prediction and portfolio construction. For robust, domain-relevant time series forecasting and economic decision-making, cross-domain pre-training cannot substitute for in-domain pre-training: in-domain pre-training and appropriate data scaling are both necessary and sufficient for competitive performance, particularly as market efficiency increases and predictive signals attenuate over time.

The findings inform both researchers building new financial forecasting systems and practitioners seeking robust, scalable, and future-proof predictive infrastructure in asset management, trading, and risk analytics. The release of all models and code as open-source resources will facilitate further research and rapid progress in aligning TSFM advances with the nuanced needs of financial time series prediction.
