Pretrained Time-Series Foundation Models for Financial Return Forecasting

Published 25 Jun 2026 in q-fin.MF | (2606.27100v1)

Abstract: Financial return forecasting is a difficult test case for time-series foundation models (TSFMs) due to low signal-to-noise ratios, structural breaks, heavy tails, and weak persistence. This paper benchmarks pretrained TSFMs against train-from-scratch neural baselines in a deliberately conservative financial setting. We evaluate TimeGPT/TimeGPT-LH, TimesFM-2.5, Moirai-2.0, Chronos, and Chronos-2 against NBEATS, NHITS, PatchTST, iTransformer, and KAN on five liquid U.S. equities (AAPL, AMZN, GOOG, JPM, META) using linear and log returns. Models are compared under an equalized context budget, a rolling-origin protocol, and against random-walk benchmarks. We provide a theoretical framing of pretraining as an inductive prior, linking PAC-Bayes transfer intuition, information-theoretic predictability limits, and attention geometry. This clarifies why strong model rankings need not imply economically meaningful predictability in noisy markets. Pragmatically, pretrained TSFMs dominate the ranking distribution, accounting for 8 of 10 task-level wins. Moirai-2.0 and TimesFM-2.5 achieve the strongest average ranks, leading tasks for AAPL, JPM, GOOG, and AMZN, while Chronos wins the remaining AMZN task. However, the iTransformer baseline wins both META tasks, showing local supervised learning can still outperform generic pretraining for specific assets. Crucially, gains over the random-walk benchmark are small and sparse. A one-sided Diebold-Mariano test rejects equal or inferior predictive accuracy only for Chronos on AMZN and Moirai-2.0 on GOOG. We conclude that TSFMs serve as useful practical priors that reduce model-development costs in low-data financial forecasting, but are not universal engines for statistically reliable alpha generation in realistic empirical deployment.

Authors (2)

Summary

The paper presents a rigorous benchmark comparing pretrained TSFMs with train-from-scratch models using a rolling-origin evaluation on US equities.
The methodology integrates information-theoretic and operator-theoretic analyses, demonstrating that pretraining serves as an effective inductive prior in low-data regimes.
The results indicate that pretrained models like Moirai-2.0 and TimesFM-2.5 generally outperform baselines, although asset-specific dynamics can influence performance.

Pretrained TSFMs for Financial Return Forecasting: An Authoritative Benchmark

Introduction

The paper "Pretrained Time-Series Foundation Models for Financial Return Forecasting" (2606.27100) presents a rigorous benchmark of state-of-the-art pretrained time-series foundation models (TSFMs) for equity return prediction. The evaluation is set in a conservative, single-ticker, low-data regime, emphasizing practical model deployment for liquid U.S. equities (AAPL, AMZN, GOOG, JPM, META). The benchmark compares six pretrained TSFMs—TimeGPT, TimeGPT-LH, TimesFM-2.5, Moirai-2.0, Chronos, Chronos-2—with five train-from-scratch neural baselines—NBEATS, NHITS, PatchTST, iTransformer, KAN—using a standardized context length ( $L=512$ ), a rolling-origin evaluation, and error metrics referenced to random-walk and seasonal-naive benchmarks.

Theoretical Framework and Methodological Design

The work provides an integrative theoretical framing for TSFM evaluation, interpreting pretraining as an inductive prior informed by PAC-Bayes and information-theoretic principles. The argument is that TSFMs pretrained on massive heterogeneous corpora act as practical priors by restricting effective model complexity and variance in small-sample contexts. This shifts the generalization advantage toward zero-shot models, particularly when local finance-specific data are sparse.
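
As a rough illustration of the PAC-Bayes intuition, one widely cited form of the bound (McAllester's bound with Maurer's constant) is reproduced below. It is a textbook statement for i.i.d. samples and losses bounded in $[0,1]$, neither of which holds exactly for daily returns, and the paper's own formulation may differ. Here $P$ is the prior (the pretrained model), $Q$ the posterior after local adaptation, $n$ the local sample size, and $\delta$ the confidence level; all of these symbols are introduced here for illustration only.

```latex
% With probability at least 1 - \delta over an i.i.d. sample of size n,
% simultaneously for all posteriors Q:
\mathbb{E}_{\theta \sim Q}\big[L(\theta)\big]
  \;\le\;
\mathbb{E}_{\theta \sim Q}\big[\hat{L}_n(\theta)\big]
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\!\big(2\sqrt{n}/\delta\big)}{2n}}
```

The complexity term shrinks when $Q$ stays close to $P$, which is the sense in which a well-placed pretrained prior helps most when $n$ is small.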

The paper methodically elaborates:

Information-theoretic bounds on financial return predictability, quantifying the mutual information ( $I(r_{t+h}; \mathcal{F}_t)$ ) between the future return and the current information set, which is empirically minuscule for daily returns on liquid equities.
Operator-theoretic and geometric perspectives on self-attention and patching, relating architectural bias to context mixing and score distribution geometry.
A clear empirical risk minimization protocol, enforcing unified forecast horizon ( $H=20$ ), equal context budget ( $L=512$ ), and stringent reproducibility standards.
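
The rolling-origin protocol in the last item can be sketched as an index generator. This is a minimal illustration, assuming the context length $L=512$ and horizon $H=20$ stated above; the function name, the `step` between origins, and the series length are hypothetical, since the summary does not specify how far the origin advances between evaluations.

```python
from typing import Iterator, Tuple


def rolling_origin_splits(
    n_obs: int, context: int = 512, horizon: int = 20, step: int = 20
) -> Iterator[Tuple[range, range]]:
    """Yield (context_idx, target_idx) index ranges that walk forward in time.

    `step` (how far the origin advances) is an assumption of this sketch.
    """
    origin = context
    while origin + horizon <= n_obs:
        # Context ends exactly where the forecast window begins: no look-ahead.
        yield range(origin - context, origin), range(origin, origin + horizon)
        origin += step


# Hypothetical series of 600 daily observations.
splits = list(rolling_origin_splits(n_obs=600))
first_context, first_target = splits[0]
```

Every model sees the same `context` indices at each origin, which is what an equalized context budget amounts to in practice.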

Model Architectures

Pretrained TSFMs: The evaluated models include decoder-only transformers (TimesFM-2.5, Moirai-2.0), quantized language-of-time models (Chronos, Chronos-2), and transfer-learning frameworks (TimeGPT variants). Moirai-2.0 distinguishes itself through quantile forecasting heads and multi-quantile decoding, while Chronos architectures utilize categorical likelihoods over quantized bins.
Neural Baselines: The baselines cover established MLP-style decomposition models (NBEATS, NHITS), transformer-based PatchTST and iTransformer, and spline-based Kolmogorov-Arnold Networks (KAN), all trained individually per asset.

All models are scored on point forecasts. For probabilistic models the median is extracted as the point forecast, since the conditional median is the optimal point summary under MAE.
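
A minimal sketch of median extraction from a quantile forecast follows. The function name and the linear interpolation between neighbouring quantile levels are assumptions of this example; the summary does not say how each model's output is mapped when 0.5 is not on the quantile grid.

```python
def median_from_quantiles(levels, values):
    """Return the 0.5-quantile forecast for one time step.

    levels: quantile levels in (0, 1); values: forecasts at those levels.
    Falls back to linear interpolation (an assumption of this sketch)
    when 0.5 is not one of the levels.
    """
    pairs = sorted(zip(levels, values))
    for q, v in pairs:
        if abs(q - 0.5) < 1e-12:
            return v
    for (q_lo, v_lo), (q_hi, v_hi) in zip(pairs, pairs[1:]):
        if q_lo < 0.5 < q_hi:
            weight = (0.5 - q_lo) / (q_hi - q_lo)
            return v_lo + weight * (v_hi - v_lo)
    raise ValueError("quantile grid does not bracket 0.5")


# Hypothetical one-step quantile forecast of a daily return.
point = median_from_quantiles([0.1, 0.5, 0.9], [-0.02, 0.001, 0.03])
```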

Experimental Results and Statistical Evaluation

The empirical findings are clear: pretrained TSFMs dominate the ranking, securing 8 of 10 task-level wins. Moirai-2.0 and TimesFM-2.5 achieve the lowest average ranks (2.9 and 3.1, respectively). TimesFM-2.5 leads the AAPL and JPM tasks, Moirai-2.0 leads the GOOG tasks and the AMZN log-return task, and Chronos wins the remaining AMZN task. The exception is iTransformer, a locally trained baseline, which outperforms all TSFMs on both META tasks, underscoring that the TSFM advantage depends on asset and regime.

Numerical Highlights:

Positive skill scores versus the random-walk benchmark are sparse; for example, Moirai-2.0 yields a skill score of $0.2289$ for GOOG linear returns, representing the highest reported improvement.
One-sided Diebold-Mariano tests yield significant skill only for Chronos on AMZN (p=0.0421) and Moirai-2.0 on GOOG (p=0.0421); otherwise, the null hypothesis of equal or inferior accuracy is not rejected.
Most skill scores are near zero or negative, reflecting the tight information-theoretic ceiling on achievable prediction accuracy.
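
To make the skill scores above concrete, here is a minimal sketch assuming the conventional definition, skill = 1 - MAE(model) / MAE(benchmark), with the benchmark forecasting a zero return. The summary does not state the paper's exact formula, so this definition, the function names, and the toy numbers are assumptions.

```python
def mae(forecast, actual):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)


def skill_vs_zero_forecast(forecast, actual):
    """MAE skill score against a benchmark that always predicts a zero return.

    Positive means the model beats the benchmark; zero or negative means it does not.
    """
    baseline_mae = mae([0.0] * len(actual), actual)
    return 1.0 - mae(forecast, actual) / baseline_mae


# Toy returns, purely illustrative (not data from the paper).
actual = [0.01, -0.02, 0.015, -0.005]
no_skill = skill_vs_zero_forecast([0.0, 0.0, 0.0, 0.0], actual)
perfect = skill_vs_zero_forecast(actual, actual)
```

Under this definition, a forecast that overshoots every return by a factor of two has exactly the same MAE as predicting zero, which is one reason small positive skill is hard to achieve on noisy returns.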

Analysis and Implications

Practical Implications:

Pretrained TSFMs provide robust practical priors in low-data settings, minimizing per-ticker development effort.
Out-of-the-box deployment offers competitive benchmarks in the absence of extensive fine-tuning or finance-native pretraining, streamlining pipeline setup for individual asset prediction tasks.

Theoretical Implications:

The empirical evidence and the PAC-Bayes intuition both suggest that pretraining is most valuable when the local sample is tiny relative to model capacity; as local data accumulate or finance-specific pretraining is introduced, the advantage of generic pretraining diminishes.
No model family is universally superior; asset-specific dynamics, volatility regimes, and context mixing influence ranking outcomes.

Economic Interpretation:

Gains over random-walk and seasonal-naive benchmarks are limited, with MAE differences near the noise floor of daily return series. There is no statistical evidence for consistent out-of-sample alpha generation.
The paper cautions against conflating statistical superiority with trading utility; robust investment advantage requires further calibration, backtesting, and domain alignment.

Limitations and Opportunities for Future Work

Four principal limitations are noted:

The benchmark is not a pure architecture comparison; only as-used deployment is evaluated.
Evaluation windows are limited; noise in daily returns constrains statistical significance.
LLM-based time-series adaptation (Time-LLM) is not evaluated empirically, though referenced conceptually.
The evaluation targets forecast error, not trading performance; directional and economic metrics are absent.

The results align with contemporaneous findings on TSFMs: generic pretraining is beneficial in low-data regimes, but finance-specific pretraining is likely necessary for robust alpha extraction and portfolio optimization.

Future Directions:

Incorporate direct finance-native pretraining and cross-sectional portfolio prediction tasks.
Extend evaluation window and include trading utility, turnover constraints, and robust backtests.
Benchmark LLM reprogramming approaches empirically against TSFMs and neural baselines.

Conclusion

The benchmark substantiates pretrained TSFMs as strong practical defaults for low-data financial return forecasting, with Moirai-2.0 and TimesFM-2.5 leading aggregate rankings. Supervised baselines such as iTransformer remain competitive for specific assets, precluding universal TSFM superiority. The evidence supports temporal pretraining as a useful inductive prior, but strict information-theoretic limits prevent reliable alpha generation in realistic deployment. TSFMs efficiently reduce model-development overhead, but the fundamental difficulty of financial return prediction persists.

Explain it Like I'm 14

What is this paper about?

This paper asks a simple, practical question: if you want to predict short‑term stock returns and you don’t have much data or time, is it better to use a big, already‑trained time‑series model “out of the box,” or to train your own model from scratch for each stock?

In plain terms, it compares “pretrained” time‑series foundation models (models that learned general patterns from tons of past time series) to strong build‑it‑yourself models, to see which does better at forecasting daily returns for a few well‑known U.S. stocks over the next 20 trading days.

What questions were the researchers trying to answer?

They focused on two easy‑to‑grasp questions:

If you only have one history per stock, need 20‑day‑ahead forecasts, and don’t want to spend a lot of time tuning models, are large pretrained time‑series models a better choice than training your own?
Even if some models “rank” higher than others, do they actually beat a very simple strategy that assumes tomorrow’s return is zero (the “random walk” guess) by a meaningful amount?

Why this is tricky: daily stock returns are hard to predict. The “signal” (helpful pattern) is tiny compared to the “noise” (randomness), big market changes can happen suddenly (“structural breaks”), and rare big moves (“heavy tails”) make mistakes costly. That means even smart models may not do much better than a very simple guess.

How did they test it? (Methods in simple terms)

They created a fair, simple testing setup and tried different models on the same tasks.

The task: predict the next 20 business days of returns for five popular U.S. stocks: Apple (AAPL), Amazon (AMZN), Alphabet/Google (GOOG), JPMorgan (JPM), and Meta (META). They tried two ways of measuring return: normal “linear” returns and “log” returns (just two standard ways to measure percent change).
The data each model could use: a fixed “look‑back” window of 512 daily observations (about two years of trading days). Everyone got the same amount of history.
The evaluation: a “rolling‑origin” setup, which means they walked forward through time, repeatedly using past data to predict the next 20 days, like how you’d forecast in real life.
The baseline to beat: a “random walk” guess that predicts zero return every day. This is surprisingly hard to beat in daily stock data.
The models:
- Pretrained time‑series foundation models: TimeGPT, TimeGPT‑Long Horizon, TimesFM‑2.5, Moirai‑2.0, Chronos, and Chronos‑2.
- Train‑from‑scratch baselines: NBEATS, NHITS, PatchTST, iTransformer, and KAN.
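
The two return definitions used in the task can be sketched in a few lines. The function names and prices are made up for illustration; the formulas are the standard ones.

```python
import math


def linear_returns(prices):
    """Simple percentage change: P_t / P_{t-1} - 1."""
    return [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]


def log_returns(prices):
    """Log change: ln(P_t / P_{t-1})."""
    return [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]


# Hypothetical closing prices.
prices = [100.0, 101.0, 99.99]
lin = linear_returns(prices)
log = log_returns(prices)
```

For small daily moves the two are nearly identical, because the log return equals `log(1 + linear return)`.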

They measured errors mainly with Mean Absolute Error (MAE), which is like asking, “On average, how far off were your predictions?” They also used a statistical test (called the Diebold–Mariano test) to check whether any model’s advantage over the simple baseline was real and not just luck.
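
A one-sided Diebold-Mariano test on absolute-error losses can be sketched as below. This is an illustrative version, not the paper's implementation: the Bartlett-weighted (Newey-West) long-run variance, the normal approximation, the function name, and the toy losses are all assumptions of this example.

```python
from statistics import NormalDist, fmean


def dm_test_one_sided(loss_model, loss_baseline, lags=0):
    """Test H0: the model is no more accurate than the baseline.

    Inputs are per-forecast losses (e.g. absolute errors). Returns
    (statistic, p_value); a small p-value favours the model.
    `lags` sets the Bartlett window; h - 1 is a common choice for h-step forecasts.
    """
    d = [b - m for m, b in zip(loss_model, loss_baseline)]  # > 0 when model wins
    n = len(d)
    d_bar = fmean(d)

    def autocov(k):
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, n)) / n

    long_run_var = autocov(0) + 2.0 * sum(
        (1.0 - k / (lags + 1.0)) * autocov(k) for k in range(1, lags + 1)
    )
    if long_run_var <= 0.0:
        raise ValueError("non-positive long-run variance estimate")
    stat = d_bar / (long_run_var / n) ** 0.5
    return stat, 1.0 - NormalDist().cdf(stat)


# Toy absolute errors, purely illustrative (not data from the paper).
baseline_losses = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1]
model_losses = [0.9, 1.0, 0.9, 1.0, 0.8, 1.1, 0.8, 1.0]
stat, p_value = dm_test_one_sided(model_losses, baseline_losses)
```

With real return data the loss differences are far noisier than in this toy example, which is why most comparisons fail to reject.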

A few terms explained:

Pretraining: like giving a model lots of practice on many different time series so it learns general patterns before you ask it to forecast a specific stock.
Context window (512): how far back the model is allowed to look.
Rolling‑origin: forecasting step‑by‑step as new days arrive, instead of peeking at the future.
Random walk: predicting no change; for daily stock returns this is tough to beat consistently.

What did they find?

The short version: pretrained models often ranked higher, but actual gains over the simple baseline were small and only rarely “statistically solid.”

Here are the highlights:

Pretrained models did well in the rankings: they won 8 out of the 10 tasks (five stocks × two return types).
- TimesFM‑2.5 and Moirai‑2.0 were the strongest on average.
- TimesFM‑2.5 led on AAPL and JPM tasks.
- Moirai‑2.0 led on GOOG tasks and one AMZN task.
- Chronos won the other AMZN task.
But a locally trained model (iTransformer) beat all others on both META tasks, showing that a tailored, per‑stock model can still be best for a specific case.
Most importantly, the improvements over the simple “predict zero” baseline were tiny and scattered. The stricter statistical test only confirmed clear, reliable gains in two cases:
- Chronos on AMZN.
- Moirai‑2.0 on GOOG.
In other words, while big pretrained models often look good on leaderboards, the actual edge they provide in daily return forecasting is small—and often not reliably better than a very simple guess.

Why this matters: daily stock returns carry very little predictable information from one day to the next. That means there’s just not much “signal” for any model to grab, so even fancy models can only improve a little, and sometimes not in a statistically meaningful way.
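
One way to see how little room there is: under an idealized assumption that returns and predictors are jointly Gaussian, the best achievable R² follows directly from the mutual information $I$ (in nats) as R² = 1 - exp(-2I). The sketch below uses that textbook identity; the function name and the example value of $I$ are hypothetical and not estimates from the paper.

```python
import math


def gaussian_r2_ceiling(mutual_info_nats):
    """Best achievable R^2 when target and predictors are jointly Gaussian.

    Uses R^2 = 1 - exp(-2 I); real returns are heavy-tailed, so treat this
    as a rough guide rather than a strict bound.
    """
    return -math.expm1(-2.0 * mutual_info_nats)


# Hypothetical information content of 0.005 nats.
r2 = gaussian_r2_ceiling(0.005)
```

With only a few thousandths of a nat to work with, even a perfect model would explain roughly one percent of the variance in returns.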

Why does this matter? (Implications)

Good news for practitioners with little data: pretrained time‑series models are useful starting points. They act like a “smart prior,” giving you reasonable forecasts without lots of custom training, which can save time and effort.
Caution for traders: don’t expect miracles. In realistic conditions, these models did not consistently beat the simple baseline by a large or reliable margin. Tiny improvements can disappear after transaction costs and other real‑world frictions.
No one‑size‑fits‑all: sometimes a carefully trained local model (like iTransformer on META) can outperform general‑purpose pretrained models for a specific stock or time period.
Big picture: in noisy markets with little day‑to‑day predictability, strong rankings don’t automatically mean strong profits. Pretrained models help reduce development work, but they are not a universal engine for easy “alpha” (extra returns).

In simple terms: pretrained time‑series models are handy tools that give you a head start, especially when you don’t have much data per stock. But because daily stock moves are mostly noise, even the best models only squeeze out small gains—and those gains aren’t always reliable enough to bank on.

Knowledge Gaps

Unresolved knowledge gaps, limitations, and open questions

The paper provides a careful, conservative benchmark but leaves several concrete issues unaddressed that future research could tackle:

Asset and market coverage: Validate results on a broader universe (small/mid-caps, non-U.S. equities, ETFs, FX, rates, commodities, illiquid names) to assess external validity beyond five large-cap U.S. equities.
Frequency and horizon generalization: Evaluate intraday, weekly, and monthly data; vary forecast horizons (shorter than 20 days and much longer) and report horizon-wise performance profiles.
Regime-conditioned performance: Stratify results by volatility regimes, crisis/calm periods, and structural breaks (e.g., via change-point detection) to quantify regime sensitivity.
Covariates and multivariate settings: Test whether incorporating exogenous signals (e.g., VIX, macro releases, sector indices, volumes) or multivariate/cross-asset inputs improves TSFM transfer and baselines.
Cross-sectional tasks: Compare results on per-asset time-series forecasting with cross-sectional excess-return prediction tasks to reconcile with finance-native pretraining findings.
Zero-shot vs fine-tuning vs finance-pretraining: Empirically map performance vs local sample size for (i) zero-shot, (ii) few-shot fine-tuning, and (iii) finance-specific pretraining to test the PAC-Bayes/inductive-prior predictions.
Context-length sensitivity: Systematically vary context L beyond 512 to probe whether long-context machinery in TSFMs yields material gains on returns.
Hyperparameter fairness and sensitivity: Move beyond “as-used” defaults—jointly tune baselines and TSFMs (patch length, embedding size, layers, learning rates) and quantify sensitivity to these choices.
Alternative naive baselines: Add volatility-scaled random walk, historical mean, ARIMA/ARIMAX, mean-reversion, and GARCH as reference points to contextualize MAE gains.
Return definitions and transformations: Evaluate excess returns (over risk-free), volatility-normalized returns, and residuals after factor models (e.g., Fama–French) to test robustness to representation choices.
Probabilistic evaluation: For models with quantiles/probabilities (Moirai, Chronos, TimeGPT-LH), report CRPS, calibration curves, quantile coverage, and tail quantile loss (e.g., 0.95 VaR) in addition to MAE.
Economic utility and costs: Translate forecast gains into backtested strategies with position-sizing, transaction costs, slippage, and capacity limits; report Sharpe, turnover, drawdowns, and Kelly/log-utility growth to test the causal-capacity implications.
Multiple testing and inference robustness: Apply multiple-comparison corrections across models/assets/horizons, report HAC-robust DM statistics, and include bootstrap/block-bootstrap confidence intervals and effect sizes.
Fine-grained horizon reporting: Show per-h (h = 1…20) performance, and compare direct, recursive, and multi-output forecasting strategies to identify where gains concentrate.
Normalization and leakage checks: Clearly document and test normalization choices (e.g., RevIN vs z-score/robust scaling) for leakage; run ablations to quantify their impact on each model.
Reproducibility and versioning: Release code, seeds, exact data windows, and the specific TSFM checkpoints/API versions used to enable independent replication.
Compute and latency trade-offs: Measure inference time, memory, and cost for each model; quantify the cost–accuracy frontier relevant for practitioners.
Attention/representation diagnostics: Empirically analyze attention patterns, token mixing, and patch representations (e.g., spectral/conductance proxies) to support or refute the operator-theoretic interpretations.
Information-theoretic link to skill: Estimate mutual information I(r_{t+h}; F_t) and its regime variation on the studied assets to empirically connect predictability bounds with observed skill scores.
Pretraining corpus ablations: Test how adding finance data (or excluding it) in pretraining affects transfer; assess domain contamination risks and the necessity of finance-native pretraining for this task.
Ensemble and hybrid methods: Evaluate whether stacking/blending TSFMs with local models (e.g., ARIMA/GARCH, KAN) or volatility models improves robustness and economic outcomes.
Directional metrics: Report sign accuracy, directional DM tests, and utility-weighted accuracy to complement magnitude-based MAE.
Tail robustness: Compare robust losses (Huber/Tukey) and heavy-tail-aware objectives; assess sensitivity to extreme-return days and outliers.
Evaluation-window robustness: Re-run analyses across alternative start/end dates and rolling step sizes to test the stability of conclusions.
Online/adaptive learning: Compare static zero-shot deployment with online adaptation or small adapters/LoRA to handle nonstationarity without full fine-tuning.
Calibration and post-processing: Explore simple post-hoc calibration (e.g., isotonic regression, shrinkage toward zero, volatility-aware rescaling) to convert small statistical gains into more reliable economic signals.
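
The multiple-testing item in the list above can be illustrated with the Holm step-down correction, one standard way to control the family-wise error rate across many model/asset comparisons. This is a sketch with hypothetical p-values; the paper does not report applying such a correction.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: return a reject flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Hypothetical p-values from four model-vs-baseline tests.
flags = holm_reject([0.001, 0.04, 0.03, 0.20])
```

A single p-value of 0.04 passes on its own at the 5% level, but `holm_reject([0.04] + [0.5] * 9)` rejects nothing, because ten simultaneous tests tighten the first threshold to 0.005.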

Practical Applications

Immediate Applications

These applications can be deployed now using off‑the‑shelf pretrained TSFMs (e.g., TimesFM‑2.5, Moirai‑2.0, Chronos/Chronos‑2, TimeGPT/TimeGPT‑LH) and the paper’s conservative evaluation workflow.

Finance (Industry)

Rapid baseline forecasting for single-asset, low-data settings
- Use case: Zero‑shot 20‑business‑day return forecasts for new or thin‑history tickers to triage effort and set realistic expectations.
- Tools/workflows: TSFM API clients; standardized preprocessing (RevIN/normalization, patching); rolling‑origin evaluation with fixed context length L=512; MAE vs zero‑return random‑walk baseline; one‑sided Diebold–Mariano (DM) significance test.
- Assumptions/dependencies: Access to TSFM inference (API or weights); daily frequency; univariate returns; small signal‑to‑noise; forecast horizon H=20; results show small average gains, with significance only in a minority of cases.
Model selection and governance “harness”
- Use case: Compare off‑the‑shelf TSFMs to strong local baselines (e.g., iTransformer, KAN) asset‑by‑asset and regime‑by‑regime; route production to the winner where DM rejects equal or inferior accuracy vs random walk.
- Tools/workflows: Automated harness implementing rolling‑origin, equalized context budgets, MAE skill vs random walk, and DM tests; asset/regime dashboards.
- Assumptions/dependencies: Local baselines tuned within comparable budgets; limited generalizability beyond the tested scope (five U.S. equities, daily).
Uncertainty‑aware risk reporting
- Use case: Produce forecast intervals for scenario analysis, stress tests, and limit setting using TSFMs with built‑in uncertainty (TimeGPT‑LH conformal intervals, Moirai‑2.0 quantiles).
- Tools/workflows: Conformal wrappers; quantile outputs; downstream VaR/ES and P&L distribution feeds; interval coverage monitoring.
- Assumptions/dependencies: Intervals are for statistical uncertainty, not tail‑event guarantees; heavy tails and regime shifts persist; calibration must be monitored.
Cold‑start and monitoring for new assets
- Use case: Deploy TSFMs as practical priors when local data are limited, then hand‑off to local models if/when data accrues and they outperform (e.g., META‑like cases where iTransformer won).
- Tools/workflows: “Prior‑to‑local” routing with re‑tests on a cadence; alerting when DM significance emerges or disappears.
- Assumptions/dependencies: Data accrual and regime variability; inference cost vs benefit trade‑offs.
Guardrails for trading signal deployment
- Use case: Block or down‑weight signals if forecasts do not beat the random walk with statistical significance; cap leverage using “information budget” heuristics (acknowledging small achievable R²).
- Tools/workflows: Signal gating by DM tests; “information budget” dashboards summarizing achievable improvement ceilings; policy documentation for SR 11‑7 style model risk controls.
- Assumptions/dependencies: Low mutual information in daily returns implies small economic edge even for best models; transaction costs can erase small gains.
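
The "conformal wrappers" mentioned under uncertainty-aware risk reporting can be sketched with basic split conformal prediction. This is a generic illustration and not a description of how TimeGPT-LH computes its intervals; the function name and calibration residuals are hypothetical, and the coverage guarantee assumes exchangeable data, which financial time series only approximate.

```python
import math


def split_conformal_interval(point_forecast, calib_abs_residuals, alpha=0.1):
    """Symmetric interval from held-out absolute residuals.

    Targets roughly (1 - alpha) coverage under exchangeability.
    """
    n = len(calib_abs_residuals)
    # Finite-sample rank; the tiny offset guards against float round-up.
    k = math.ceil((n + 1) * (1.0 - alpha) - 1e-9)
    if k > n:
        raise ValueError("too few calibration points for this alpha")
    half_width = sorted(calib_abs_residuals)[k - 1]
    return point_forecast - half_width, point_forecast + half_width


# Hypothetical calibration residuals: 0.01, 0.02, ..., 0.19.
residuals = [i / 100 for i in range(1, 20)]
lower, upper = split_conformal_interval(0.0, residuals, alpha=0.1)
```

Monitoring realized coverage on new data remains necessary, since regime shifts break the exchangeability assumption.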

Software/MLOps (Industry)

Standardized evaluation toolkit for time‑series forecasters
- Use case: Provide an internal library implementing the paper’s conservative benchmark protocol as the default sanity check across business units.
- Tools/workflows: Rolling‑origin splits; equal context budgeting; MAE vs naive baseline; DM tests; reproducible reports.
- Assumptions/dependencies: Organization‑wide adoption; consistent data engineering.
Parameter‑efficient fine‑tuning with KL‑budget guardrails
- Use case: Apply adapters/LoRA with small local datasets while constraining deviation from pretrained weights (inductive‑prior/PAC‑Bayes intuition).
- Tools/workflows: Fine‑tuning that monitors a proxy “distance” from the pretrained prior; early stopping; small learning rates.
- Assumptions/dependencies: Benefits likely marginal in very small samples; may require access to weights rather than black‑box APIs.
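
The idea of constraining deviation from pretrained weights can be illustrated on a toy one-parameter model. The sketch assumes a squared-distance penalty, which corresponds to the KL divergence between two Gaussians with equal isotropic variance; the function name, learning rate, and data are hypothetical and stand in for a real adapter or LoRA setup.

```python
def fit_with_prior_penalty(xs, ys, w_prior, lam, lr=0.01, steps=500):
    """Fit y ~ w * x by gradient descent on MSE + lam * (w - w_prior)^2.

    Larger `lam` keeps w closer to the pretrained value `w_prior`.
    """
    w = w_prior
    n = len(xs)
    for _ in range(steps):
        grad_loss = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        grad_penalty = 2.0 * lam * (w - w_prior)
        w -= lr * (grad_loss + grad_penalty)
    return w


# Toy local data generated by y = 2x, with a "pretrained" weight of 0.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w_free = fit_with_prior_penalty(xs, ys, w_prior=0.0, lam=0.0)
w_tight = fit_with_prior_penalty(xs, ys, w_prior=0.0, lam=10.0)
```

With no penalty the fit recovers the local slope; with `lam=10` it settles between the prior and the local estimate, trading local fit for proximity to the prior.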

Academia

Reproducible benchmarking modules for courses and labs
- Use case: Teach forecasting difficulty in finance and the importance of naive baselines and DM tests.
- Tools/workflows: Course notebooks implementing the protocol; assignments contrasting TSFM zero‑shot vs local baselines.
- Assumptions/dependencies: Access to public TSFMs or substitutes; licensing for instructional use.
Extension studies under controlled budgets
- Use case: Evaluate probabilistic scores (CRPS, quantile coverage) and robustness to heavy tails; expand to other horizons/frequencies.
- Tools/workflows: Add CRPS and calibration diagnostics to the benchmark; stress‑test under volatility regimes.
- Assumptions/dependencies: Curated datasets; consistent preprocessing; transparent hyperparameter budgets.

Policy and Regulation

Model validation guidance referencing naive benchmarks and statistical tests
- Use case: Require comparisons to random‑walk baselines and DM‑style significance for claims about predictive accuracy in liquid markets.
- Tools/workflows: Supervisory templates capturing baseline comparisons, horizon‑specific tests, and uncertainty communication.
- Assumptions/dependencies: Regulator and industry consensus; adaptation for other asset classes/frequencies.

Daily Life / Small and Medium Businesses

Zero‑shot forecasting for non‑financial, low‑data series
- Use case: Quick demand, traffic, or inventory forecasts where local data are scarce (leveraging TSFMs’ multi‑domain pretraining).
- Tools/workflows: Lightweight pipelines with normalization/patching; compare to seasonal‑naive/random‑walk; incorporate conformal intervals.
- Assumptions/dependencies: Domain shift from training corpora; check that naive seasonal baselines are included; finance‑specific limitations (low predictability) do not necessarily apply in other domains.

Long-Term Applications

These applications require additional research, scaling, finance‑native pretraining, or broader organizational and policy development.

Finance (Industry)

Finance‑native TSFMs for cross‑sectional and multivariate forecasting
- Use case: Pretrain on large finance‑specific corpora (billions of observations) to improve portfolio construction and risk forecasts beyond zero‑shot generic TSFMs.
- Tools/workflows: Domain‑aligned pretraining pipelines; inclusion of microstructure, fundamentals, macro covariates; joint point‑and‑distributional objectives (CRPS, pinball).
- Assumptions/dependencies: Significant compute and data licensing; privacy/compliance constraints; expected gains tied to domain alignment.
Automated model routing by asset/regime
- Use case: Meta‑learners that dynamically select TSFM vs local models (e.g., iTransformer) per asset and period based on online significance, stability, and cost.
- Tools/workflows: Bandit‑style or Bayesian model averaging; DM‑based gates; regime classifiers (volatility, structural breaks).
- Assumptions/dependencies: Reliable online testing; change‑point detection; transaction‑cost‑aware integration.
Capital allocation governed by “causal capacity” estimates
- Use case: Use information‑theoretic ceilings to set leverage/risk budgets and avoid overfitting small edges.
- Tools/workflows: Proxies for mutual/directed information; integration with Kelly‑style or risk‑parity sizing under constraints.
- Assumptions/dependencies: Estimating information reliably in non‑stationary, heavy‑tailed settings; robust to misspecification.
Attention diagnostics for regime monitoring
- Use case: Use operator/spectral heuristics from attention to flag slow/fast mixing regimes, potential structural breaks, or over‑compression of context.
- Tools/workflows: Layer/head analytics; spectral-gap proxies; alarms when attention patterns become near block‑diagonal.
- Assumptions/dependencies: Access to attention weights; theory is heuristic and needs empirical validation.

Software/MLOps (Industry)

KL‑budgeted fine‑tuning frameworks
- Use case: Enterprise tooling that constrains fine‑tuning distance from pretrained priors to improve generalization with small local datasets (operationalizing PAC‑Bayes insights).
- Tools/workflows: Regularizers or trust‑region training that penalize divergence from pretrained weights; automated selection of KL budgets.
- Assumptions/dependencies: Requires weight access; calibration of KL proxies to real‑world outcomes.
Open registries and scorecards for time‑series foundation models
- Use case: Curate standardized, audited evaluations across domains with equalized budgets and naive baselines; publish reproducible leaderboards.
- Tools/workflows: Data/model cards; versioned pipelines; governance around data provenance.
- Assumptions/dependencies: Community coordination; legal clarity on data/model distribution.

Academia

Estimation of “effective prior dispersion” and Fisher‑geodesic distances
- Use case: Quantify how close pretrained priors are to finance‑optimal manifolds; predict when fine‑tuning can help.
- Tools/workflows: Fisher information approximations; influence functions; empirical Bayes.
- Assumptions/dependencies: Access to gradients/weights; scalable approximations for large models.
Better predictability and information‑budget measurement
- Use case: Methods to estimate mutual/directed information in heavy‑tailed, non‑stationary series; regime‑specific ceilings on achievable R² and utility.
- Tools/workflows: Nonparametric estimators; bootstrap under dependence; tail‑robust metrics.
- Assumptions/dependencies: Theoretical advances for dependence and heavy tails; large samples.
Distributional evaluation standards beyond point MAE
- Use case: Normalize the use of CRPS, calibration curves, and quantile coverage for financial forecasting research.
- Tools/workflows: Benchmarks that report both point and probabilistic scores; harmonized horizons and contexts.
- Assumptions/dependencies: Agreement on metrics; reproducible data splits.

Policy and Regulation

Standards for foundation‑model use in financial forecasting
- Use case: Guidelines mandating baseline comparisons, statistical significance reporting, uncertainty communication, and limits on marketing claims.
- Tools/workflows: Supervisory checklists; third‑party audits of models and datasets; disclosure of pretraining corpora where feasible.
- Assumptions/dependencies: Industry engagement; balancing transparency with IP concerns.
Risk‑sensitive deployment rules tied to information ceilings
- Use case: Encourage or require risk budgeting proportional to estimated predictability/information content of the task.
- Tools/workflows: Regulatory templates for “information budget” reporting; scenario‑based stress test add‑ons.
- Assumptions/dependencies: Accepted methods for estimating information under non‑stationarity.

Daily Life / SMB and Other Sectors (Healthcare, Energy, Retail, Education)

Universal probabilistic forecasters for low‑data operations
- Use case: Sector‑specific TSFMs providing zero‑shot point and quantile forecasts of patient volumes, energy loads, sales, or learner activity.
- Tools/workflows: Domain‑fine‑tuned TSFMs; conformal wrappers; simple MLOps for small organizations.
- Assumptions/dependencies: Domain alignment improves over generic pretraining; data privacy and compliance in sensitive sectors; baseline comparisons remain essential.
Curriculum and upskilling programs on realistic AI forecasting
- Use case: Teach practitioners and students how to benchmark against naive models, interpret uncertainty, and recognize limits in low‑signal domains.
- Tools/workflows: Short courses; open datasets; reproducible pipelines based on this paper’s protocol.
- Assumptions/dependencies: Institutional support; accessible tooling.
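
The "conformal wrappers" listed above for low-data operations can be illustrated with a split-conformal sketch around any point forecaster. This is an assumption-laden example, not the procedure used by TimeGPT-LH or any other model in the paper: the function names are hypothetical, the intervals are symmetric, and the finite-sample guarantee relies on exchangeability, which time-series residuals generally violate.

```python
import math
from typing import Sequence, Tuple


def conformal_halfwidth(calib_residuals: Sequence[float],
                        alpha: float = 0.1) -> float:
    """Split-conformal half-width from held-out calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest absolute residual.
    Returns infinity when the calibration set is too small for the level.
    """
    n = len(calib_residuals)
    scores = sorted(abs(r) for r in calib_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return scores[k - 1]


def conformal_interval(point_forecast: float,
                       halfwidth: float) -> Tuple[float, float]:
    """Symmetric prediction interval around a point forecast."""
    return point_forecast - halfwidth, point_forecast + halfwidth


# Usage: residuals are (actual - forecast) on a calibration window
# that was not used to fit or tune the forecaster.
width = conformal_halfwidth([-3.0, 1.0, 2.0], alpha=0.5)
lower, upper = conformal_interval(10.0, width)
```

For a small organization, the appeal of this design is that the forecaster stays a black box: only its past errors are needed to attach an interval.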

Notes on feasibility across applications:

The paper’s evidence covers daily U.S. equity returns in a single‑asset setting with forecast horizon H=20 and context length L=512. Extrapolations to other assets, frequencies, or tasks should be validated.
Gains over naive benchmarks are small and often not statistically significant; economic significance after costs is likely smaller still.
Generic TSFMs act as practical priors in low‑data settings but are not universal alpha engines; finance‑native pretraining is likely necessary for material improvements in trading contexts.
Access to model weights (vs black‑box APIs) materially affects the ability to fine‑tune, analyze, and govern models.
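
The caution above about statistical significance can be made concrete with a one-sided Diebold–Mariano comparison against a naive benchmark. The sketch below is an illustration under stated assumptions, not the paper's test code: it uses absolute-error loss, a rectangular-kernel long-run variance with `h - 1` lags, and a normal approximation; the function name `diebold_mariano` and its arguments are hypothetical.

```python
import math
from typing import Sequence, Tuple


def diebold_mariano(e_model: Sequence[float],
                    e_bench: Sequence[float],
                    h: int = 1) -> Tuple[float, float]:
    """One-sided DM test on absolute-error loss differentials.

    H0: the model is no more accurate than the benchmark.
    H1: the model has lower expected absolute error.
    Returns (statistic, p_value) under a normal approximation.
    """
    # Positive differential means the model beat the benchmark at that step.
    d = [abs(b) - abs(m) for m, b in zip(e_model, e_bench)]
    n = len(d)
    d_bar = sum(d) / n

    def autocov(k: int) -> float:
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, n)) / n

    # Long-run variance with h - 1 autocovariance lags (rectangular kernel).
    lrv = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, h))
    if lrv <= 0:
        raise ValueError("non-positive long-run variance estimate")

    stat = d_bar / math.sqrt(lrv / n)
    p_value = 0.5 * math.erfc(stat / math.sqrt(2.0))  # P(Z > stat)
    return stat, p_value


# Usage with made-up forecast errors (illustrative numbers only).
model_err = [0.1, -0.2, 0.1, -0.1, 0.2, -0.1]
bench_err = [0.3, -0.3, 0.2, -0.4, 0.3, -0.2]
stat, p = diebold_mariano(model_err, bench_err, h=1)
```

A small p-value here says only that the loss differential is statistically distinguishable from zero; it says nothing about whether the gain survives trading costs.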


Glossary

Absolute loss: A point-forecast loss whose expected value is minimized by the conditional median of the target variable. "For $\ell(\hat r, r) = |\hat r - r|$ , the Bayes-optimal predictor is the conditional median"
ARMA: Autoregressive moving-average model family for time-series modeling combining AR and MA components. "and taking its modern form with the systematic ARMA framework of \citet{boxjenkins1970timeseriesanalysisforecastingandcontrol}."
Bayes-optimal predictor: The function that minimizes expected loss given the data, e.g., the conditional mean for squared loss. "For $\ell(\hat r, r) = (\hat r - r)^2$ , the Bayes-optimal predictor is the conditional mean"
Cheeger's inequality: A bound relating conductance of a Markov kernel to its spectral gap and mixing behavior. "Under the reversible idealization, Cheeger's inequality \citep{lawlersokal1988bounds} relates the gap to the conductance of the kernel,"
Conductance (of a kernel): A measure of how easily probability mass flows between subsets under a Markov kernel; low conductance implies slow mixing. "relates the gap to the conductance of the kernel,"
Conformal prediction: A distribution-free method to produce prediction intervals with finite-sample validity. "In TimeGPT Long Horizon (TimeGPT-LH) scenarios, uncertainty is quantified through a non-parametric conformal prediction framework."
Continuous ranked probability score (CRPS): A strictly proper scoring rule evaluating full predictive distributions by integrating quantile losses. "is a finite-sample approximation to the continuous ranked probability score (CRPS),"
Cross-modality reprogramming: Adapting models pretrained on one modality (e.g., text) to another (e.g., time series) via learned transformations. "The cross-modality reprogramming approach pioneered by Time-LLM \citep{jin2024timellmtimeseriesforecasting} leverages the high-dimensional latent representations of frozen LLMs"
Decoder-only (transformer): An architecture using only the autoregressive decoder stack to model sequences and generate forecasts. "TimesFM is a decoder-only foundation model for time-series forecasting"
Diebold–Mariano test: A statistical test comparing predictive accuracy between two forecasting models. "A one-sided Diebold--Mariano test rejects equal or inferior predictive accuracy only for Chronos on AMZN and Moirai-2.0 on GOOG."
Directed information: An information-theoretic measure capturing causal information flow from past to future in a sequence. "where the right-hand side is the directed (causal) information from the conditioning sequence to the return sequence in the sense of \citet{kramer1998directed, permuter2011interpretations},"
Doob decomposition: The unique splitting of an adapted process into a martingale and a predictable finite-variation component. "By the Doob decomposition, $R_t = M_t + A_t$ "
Fisher information metric: A Riemannian metric on a statistical model’s parameter manifold capturing local curvature of the log-likelihood. "The parameter manifold of any probabilistic forecaster carries a natural Riemannian metric, the Fisher information metric"
Instance normalization (RevIN): Normalization technique applied per-instance (time series) to stabilize distribution shifts; RevIN is a reversible variant. "These sequences undergo normalization (typically RevIN) and are segmented into overlapping patches"
Kelly–Cover growth-optimal strategy: The log-utility maximizing investment rule that maximizes long-run growth rate under given predictive information. "The Kelly--Cover growth-optimal strategy \citep{kelly1956new, coverthomas2006elements} maximizes the conditional expected log-return,"
Kolmogorov–Arnold parameterization: A representation inspired by the Kolmogorov–Arnold theorem, here implemented via spline-based networks for function approximation. "KAN brings spline-based Kolmogorov--Arnold parameterization to time-series prediction"
Martingale-difference hypothesis: The assumption that future increments have zero conditional mean given the past. "The martingale-difference hypothesis $\mathbb{E}[r_{t+h} \mid \mathcal{F}_t] = 0$ corresponds to $A \equiv 0$ ,"
Mutual information: The reduction in uncertainty about one variable given knowledge of another; here between future returns and past information. "let $I(r_{t+h}; \mathcal{F}_t)$ denote the (Shannon) mutual information between the future return and the past $\sigma$ -field,"
Natural filtration: The increasing sequence of sigma-fields generated by a process up to each time. "adapted to its natural filtration $\mathcal{F}_t = \sigma(r_s : s \le t)$ ."
Natural gradient: An optimization direction preconditioned by the Fisher metric, invariant to smooth reparameterizations. "and induces the natural-gradient direction $g^{-1}\nabla \mathcal{L}$ that is invariant under smooth reparameterization"
PAC-Bayes bound: A generalization bound that controls true risk via empirical risk and a KL divergence between posterior and prior over hypotheses. "A more refined generalization argument tailored to the pretraining setting is the PAC-Bayes bound"
Path signature: An infinite series of iterated integrals summarizing a path; it determines the path up to tree-like equivalence. "the signature of $X$ is the formal series of iterated integrals"
Pinball loss: The loss function used for quantile regression whose minimizer is the conditional quantile. "The model predicts $n_q = 9$ quantile levels $Q = \{0.1, 0.2, \dots, 0.9\}$ by minimizing the pinball loss"
Proper scoring rule (strictly proper): A scoring rule whose expected value is minimized by reporting the true predictive distribution; it is strictly proper when that minimizer is unique. "Pinball loss as a strictly proper scoring rule."
Random-walk benchmark: Using a no-change (or zero-return) forecast as a baseline for evaluating predictive models. "error metrics benchmarked against naive random-walk alternatives."
Rate–distortion bound: An information-theoretic lower bound linking achievable MSE to mutual information via rate–distortion theory. "The classical rate--distortion lower bound on minimum mean-squared error \citep{coverthomas2006elements} gives"
Rolling-origin protocol: An evaluation scheme that rolls the training/validation split forward through time for sequential forecasting. "All neural models are compared under an equalized context budget of $L=512$ observations, a rolling-origin protocol, and error metrics benchmarked against naive random-walk alternatives."
Rough paths: A framework extending calculus to irregular paths; here used to interpret patch representations via path signatures. "admits a rough-paths interpretation that clarifies what information patches do and do not encode."
Row-stochastic matrix: A matrix whose rows sum to one, representing transition probabilities in a Markov kernel. "defines a row-stochastic matrix $K \in \mathbb{R}^{N \times N}$ over tokens."
Semimartingale: A broad class of stochastic processes decomposable into a local martingale and an adapted finite-variation process; a foundation for modern stochastic calculus. "Semimartingale framework."
Spectral gap: The difference between the largest and second-largest eigenvalue moduli of a Markov operator; it governs the mixing speed. "it has at least one stationary distribution $\mu$ , and in the reversible idealization its spectral gap controls the mixing bound"
Wasserstein distance: An optimal-transport metric on probability distributions measuring minimal mass transport cost. "the $p$ -Wasserstein distance between probability measures $F, G$ on $\mathbb{R}$ with finite $p$ -th moments has the one-dimensional quantile form"

