Financial Foundation Models

Updated 4 July 2026

Financial Foundation Models (FFMs) are specialized systems that learn transferable representations from heterogeneous financial data to support diverse tasks like market analysis and compliance.
They are categorized into FinLFMs, FinTSFMs, and FinVLFMs, with each family optimized through techniques like continual pretraining, sequence forecasting, and multimodal alignment.
Empirical evidence shows that while FFMs excel in tasks such as tail-risk forecasting and market microstructure simulation, they often require domain adaptation and rigorous calibration for risk-sensitive applications.

Financial Foundation Models (FFMs) are domain-specialized foundation models intended to learn broadly transferable representations from large, heterogeneous financial data and to reuse those representations across assets, regimes, and downstream tasks with limited task-specific engineering. In the recent literature, the term spans Financial Language Foundation Models (FinLFMs), Financial Time-Series Foundation Models (FinTSFMs), Financial Visual-Language Foundation Models (FinVLFMs), multimodal financial foundation models that ingest interleaved text, numerical, audio, image, video, and tabular inputs, and event-level generative models for market microstructure (Chen et al., 7 Jul 2025, Yanglet et al., 15 May 2025, Kawawa-Beaudan et al., 27 Feb 2026). The central motivation is that finance is simultaneously multimodal, temporally non-stationary, compliance-constrained, and risk-sensitive; accordingly, the practical value of FFMs depends not only on raw predictive accuracy, but also on calibration, robustness, latency, auditability, and deployment cost (Chen et al., 7 Jul 2025, Lakkaraju et al., 17 Feb 2025).

1. Conceptual scope and taxonomy

The recent survey literature defines a foundation model as “a machine learning or deep learning model that is trained on vast datasets so it can be applied across a wide range of downstream tasks,” and defines a FinLLM as “a foundation model for financial applications” (Yanglet et al., 15 May 2025). The broader FFM category emerged because general-purpose models such as GPT-4 and Gemini showed promising performance on finance tasks, yet remained constrained by finance-specific demands including multimodal reasoning, long-horizon temporal dependence, regulatory compliance, privacy, auditability, and risk controls (Chen et al., 7 Jul 2025).

A common taxonomy partitions FFMs into three major families: FinLFMs for textual tasks such as reports, news, contracts, and retrieval-based question answering; FinTSFMs for numerical sequences such as prices, volatility, Value-at-Risk, and order flow; and FinVLFMs for charts, tables, filings, and scanned financial documents (Chen et al., 7 Jul 2025). A related extension is the Multimodal Financial Foundation Model (MFFM), defined as a model that can “digest interleaved multimodal financial data, including fundamental data, market data, data analytics, macroeconomics, and alternative data (e.g., natural language, audio, image and video)” (Yanglet et al., 15 May 2025). TradeFM adds a distinct event-level branch to this landscape by formulating market microstructure as generative autoregressive sequence modeling over discrete trade events rather than over prices or documents (Kawawa-Beaudan et al., 27 Feb 2026).

Family	Primary inputs	Representative examples
FinLFMs	News, filings, reports, contracts	BloombergGPT, FinGPT, FinQwen, FinLLaMA
FinTSFMs	Prices, returns, volatility, order flow	TimesFM, Chronos, MOMENT, TradeFM
FinVLFMs / MFFMs	Charts, tables, images, audio, video, text	FinLLaVA, FinTral, FinVis-GPT

This taxonomy is not merely organizational. It reflects distinct inductive biases and deployment regimes. FinLFMs are often optimized through continual pretraining, instruction tuning, retrieval augmentation, and compliance alignment; FinTSFMs rely on temporal patching, sequence forecasting, or event tokenization; and FinVLFMs typically combine a finance-capable LLM decoder with modality-specific encoders and projection layers (Chen et al., 7 Jul 2025, Yanglet et al., 15 May 2025). A plausible implication is that claims about “FFM performance” are usually task-family specific rather than universal.

2. Representational design across modalities

In textual finance, FFMs inherit standard subword tokenization and decoder- or encoder-based transformer backbones, but specialization depends heavily on domain corpora and retrieval over filings, regulations, and firm knowledge bases (Chen et al., 7 Jul 2025). Multimodal systems extend this design with vision encoders for charts and tables, audio encoders for earnings calls, and alignment modules inspired by BLIP-2- and LLaVA-like architectures. The literature repeatedly emphasizes that real financial decision contexts interleave modalities: earnings conference calls combine textual content with audio tone, filings interleave prose with tables and figures, and monetary policy conferences require audio-text-video synchronization (Yanglet et al., 15 May 2025).

For time series, FinTSFMs adopt several distinct representations. TimesFM is described as a decoder-only transformer optimized for long-horizon forecasting via patching; Chronos discretizes continuous values into tokens and treats forecasting as sequence modeling; and MOMENT is an encoder-only model supporting forecasting, classification, anomaly detection, and imputation (Goel et al., 2024, Lakkaraju et al., 17 Feb 2025). In multimodal forecasting robustness experiments, numeric sequences are paired with their line-plot images, so that a multimodal foundation model for time series (FMTS) receives both the time-indexed series and its visual representation (Lakkaraju et al., 17 Feb 2025).

TradeFM illustrates a more specialized representational strategy for microstructure. Its atomic input is a trade event

$e_t = (\Delta t_t, \delta p_t, v_t, a_t, s_t),$

where the core features are interarrival time, price depth, volume, action type, and side. To enable cross-asset transfer, the model uses scale-invariant features, hybrid quantile/log binning, and a universal tokenization scheme that maps heterogeneous order-flow tuples into a single discrete sequence with predictable vocabulary size $16{,}384$ (Kawawa-Beaudan et al., 27 Feb 2026). The exact pretraining objective is standard autoregressive next-token modeling,

$\mathcal{L} = - \sum_{t=1}^{T} \log p(x_t \mid x_{<t}),$

but the novelty lies in the representation: separate embeddings per feature, concatenation, projection, and closed-loop evaluation in a deterministic simulator (Kawawa-Beaudan et al., 27 Feb 2026).
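The mapping from a binned event tuple to one discrete token can be sketched as follows. This is a minimal illustration, not TradeFM's actual scheme: the per-feature bin counts in `BINS`, the mixed-radix composition in `encode`, and the helper names are all assumptions, chosen only so that the product of the bin counts equals the stated vocabulary size of 16,384.

```python
import numpy as np

# Hypothetical bin counts per feature (NOT taken from the paper).
# They are chosen so that 16 * 16 * 16 * 2 * 2 = 16,384.
BINS = {"dt": 16, "dp": 16, "vol": 16, "action": 2, "side": 2}
ORDER = ["dt", "dp", "vol", "action", "side"]
VOCAB_SIZE = int(np.prod([BINS[name] for name in ORDER]))


def log_bin(x, n_bins, lo, hi):
    """Assign positive values to n_bins log-spaced bins spanning [lo, hi]."""
    edges = np.logspace(np.log10(lo), np.log10(hi), n_bins + 1)[1:-1]
    return np.digitize(x, edges)  # integers in 0 .. n_bins - 1


def quantile_bin(x, n_bins, reference):
    """Assign values to bins whose edges are empirical quantiles of a reference sample."""
    probs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    edges = np.quantile(reference, probs)
    return np.digitize(x, edges)  # integers in 0 .. n_bins - 1


def encode(bins):
    """Mixed-radix encoding: one integer token per (dt, dp, vol, action, side) tuple."""
    token = 0
    for name, b in zip(ORDER, bins):
        if not 0 <= b < BINS[name]:
            raise ValueError(f"bin {b} out of range for feature {name}")
        token = token * BINS[name] + int(b)
    return token


def decode(token):
    """Inverse of encode: recover the per-feature bin indices."""
    out = []
    for name in reversed(ORDER):
        token, b = divmod(token, BINS[name])
        out.append(b)
    return tuple(reversed(out))


token = encode((3, 7, 15, 1, 0))
print(token, decode(token), VOCAB_SIZE)
```

Because every feature has a fixed number of bins, the vocabulary size is known before any data are seen, which is one way to read the "predictable vocabulary size" property described above.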

A different representational challenge arises in purely structured tabular finance. In bankruptcy prediction, Llama-3.3-70B-Instruct is prompted with serialized company-year features, while TabPFN is used as a meta-learned tabular transformer; both are compared against tree ensembles on 131 structured features covering liquidity, profitability, leverage, efficiency, turnover cycles, growth, sector-relative indicators, firmographics, and macro-relative variables (Kostrzewa et al., 20 Nov 2025). The results there underscore that generic language representations do not automatically confer strong tabular inductive bias.

3. Training, continued pretraining, and adaptation

A major specialization route for language-oriented FFMs is domain-adaptive pretraining (DAPT), i.e., continued unsupervised next-token prediction on in-domain corpora rather than full retraining from scratch. In an SEC-filings study, Llama-3.2-1B and Llama-3.2-3B are continued-pretrained on a 400M-token corpus of 10-K, 10-Q, and DEF 14A filings, with validation checkpoints at 50M, 100M, 200M, and 400M tokens (Ponnock, 13 Dec 2025). Both models exhibit monotonic SEC-domain validation-loss reductions, with the largest gains within the first 200M tokens and diminishing returns thereafter; general-domain validation loss on Wikipedia remains effectively unchanged, with fluctuation “on the order of 0.01,” indicating negligible drift and no signs of catastrophic forgetting (Ponnock, 13 Dec 2025). The same work fits the standard power law

$L(N) = L_{\infty} + A N^{-\alpha}$

and interprets the shallow exponent as evidence that SEC language is highly regular and efficiently learnable under continued pretraining (Ponnock, 13 Dec 2025).
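A fit of this power law can be sketched with plain NumPy. The sketch below is illustrative only: the loss values are synthetic (generated from made-up parameters, not the study's measurements), token counts are expressed in millions at the four checkpoint sizes mentioned above, and the grid-search-plus-least-squares procedure in the hypothetical `fit_power_law` is an assumption rather than the paper's fitting method.

```python
import numpy as np


def fit_power_law(tokens, losses, alphas=None):
    """Fit L(N) = L_inf + A * N**(-alpha).

    For a fixed alpha the model is linear in (L_inf, A), so we scan a grid of
    alpha values and solve an ordinary least-squares problem for each one.
    """
    tokens = np.asarray(tokens, dtype=float)
    losses = np.asarray(losses, dtype=float)
    if alphas is None:
        alphas = np.linspace(0.05, 2.0, 391)
    best = None
    for alpha in alphas:
        design = np.column_stack([np.ones_like(tokens), tokens ** (-alpha)])
        coef, *_ = np.linalg.lstsq(design, losses, rcond=None)
        sse = float(np.sum((design @ coef - losses) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(coef[0]), float(coef[1]), float(alpha))
    _, l_inf, a, alpha = best
    return l_inf, a, alpha


# Synthetic example: token budgets in millions, losses from made-up parameters.
n_tokens = np.array([50.0, 100.0, 200.0, 400.0])
true_l_inf, true_a, true_alpha = 2.0, 1.5, 0.5
synthetic_losses = true_l_inf + true_a * n_tokens ** (-true_alpha)

l_inf, a, alpha = fit_power_law(n_tokens, synthetic_losses)
print(f"L_inf={l_inf:.3f}  A={a:.3f}  alpha={alpha:.3f}")
```

With only four checkpoints and three free parameters, such a fit is weakly constrained, so the estimated exponent should be read as indicative rather than precise.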

Multimodal specialization typically adds instruction tuning and alignment to continued pretraining. Open-FinLLMs described in the MFFM position paper include FinLLaMA, which uses continual pretraining on 18B general and 52B financial tokens, and FinLLaVA, a multimodal extension trained with 1.43M image-text pairs; FinTral adds domain pretraining on 20B tokens, instruction tuning, alignment via AI feedback, and then multimodal instruction tuning (Yanglet et al., 15 May 2025). FinTral’s alignment stage uses Direct Preference Optimization on a preference dataset built from GPT-4 positive outputs and weaker-model negatives, specifically to reduce hallucinations (Yanglet et al., 15 May 2025).
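For readers unfamiliar with Direct Preference Optimization, the standard DPO loss can be sketched as below. This is the generic formulation, not FinTral's training code: the function name, the `beta` value, and the example log-probabilities are assumptions for illustration.

```python
import numpy as np


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss averaged over preference pairs.

    Inputs are sequence log-probabilities of the preferred (chosen) and
    dispreferred (rejected) responses under the policy and under a frozen
    reference model. beta is a hypothetical temperature value.
    """
    chosen_gain = np.asarray(logp_chosen) - np.asarray(ref_logp_chosen)
    rejected_gain = np.asarray(logp_rejected) - np.asarray(ref_logp_rejected)
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    return float(np.mean(np.logaddexp(0.0, -margin)))


# Made-up numbers: the policy already prefers the chosen answer slightly more
# than the reference model does, so the loss falls below log(2).
print(dpo_loss(np.array([-0.5]), np.array([-2.0]), np.array([-1.0]), np.array([-1.0])))
```

In the setting described above, the chosen responses would come from the stronger model and the rejected ones from weaker models, so minimizing this loss pushes the policy toward the preferred outputs relative to the reference.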

For time-series FFMs, the literature distinguishes zero-shot transfer from domain adaptation. In Value-at-Risk (VaR) forecasting, TimesFM is evaluated both in zero-shot mode and after fine-tuning its quantile heads at $\alpha \in \{0.01, 0.025, 0.05, 0.10\}$; the study concludes that zero-shot use is not optimal for VaR and that fine-tuning is practically necessary (Goel et al., 2024). In realized-volatility forecasting, TimesFM v2.0 is adapted by incremental learning over sequential out-of-sample blocks, using PatchedDecoderFinetuneModel with linear probing, Adam, cosine learning rate decay from $1\times10^{-3}$ to $1\times10^{-4}$, gradient clipping $100$, EMA decay $0.9999$, and early stopping with patience $5$ (Goel et al., 16 May 2025). The emphasis on temporal ordering reflects a finance-specific requirement: adaptation must improve local fit without introducing look-ahead bias.
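The temporal-ordering constraint can be made concrete with a small block splitter. This is a sketch under assumptions: the expanding-window design, the function name `walk_forward_blocks`, and the block sizes are illustrative and are not taken from the cited study, which may use a different windowing scheme.

```python
def walk_forward_blocks(n_obs, initial_train, block_size):
    """Yield (train_indices, test_indices) for sequential out-of-sample blocks.

    Each test block is forecast using only observations that precede it, and
    is then absorbed into the training window before the next block.
    """
    if not 0 < initial_train < n_obs:
        raise ValueError("initial_train must lie strictly between 0 and n_obs")
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    start = initial_train
    while start < n_obs:
        end = min(start + block_size, n_obs)
        yield range(0, start), range(start, end)
        start = end


for train_idx, test_idx in walk_forward_blocks(n_obs=10, initial_train=4, block_size=3):
    # Fine-tune on train_idx, then forecast test_idx (model calls omitted here).
    print(f"train 0..{train_idx[-1]}  ->  test {test_idx[0]}..{test_idx[-1]}")
```

The key property is that the largest training index is always smaller than the smallest test index, so no future observation can leak into the adaptation step.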

Parameter-efficient adaptation remains important, but recent work also argues that merely preserving a TSFM’s original architecture and objective can be insufficient in finance. RefineBridge therefore leaves the base TSFM frozen and learns a separate Schrödinger-Bridge refinement module that treats the TSFM forecast as a generative prior and transports it toward the observed target through a context-conditioned stochastic map (Bolton et al., 25 Dec 2025). Its complementary denoising objective,

[…]

is explicitly designed to correct coarse, mean-reverting TSFM outputs under non-stationarity, heavy tails, and regime shifts (Bolton et al., 25 Dec 2025). This suggests that, in finance, adaptation is often most effective when it changes not just parameters but also the inference geometry.

4. Empirical evidence across financial tasks

The empirical record for FFMs is heterogeneous. In purely structured, large-scale bankruptcy prediction, the evidence is unfavorable. On 1,106,879 firm-year records from the Visegrád Group and five default horizons […], XGBoost and CatBoost consistently outperform both Llama-3.3-70B-Instruct and TabPFN across AUROC and F1 (Kostrzewa et al., 20 Nov 2025). Representative test results include XGBoost at […] with AUROC/F1 […], CatBoost at […] with […], TabPFN-DT at […] with […], and Llama-3.3 zero-shot at […] with […] (Kostrzewa et al., 20 Nov 2025). The study’s direct conclusion is that “models such as XGBoost and CatBoost consistently outperform foundation models across all prediction horizons,” and that LLM-based approaches suffer from unreliable probability estimates (Kostrzewa et al., 20 Nov 2025). This directly contradicts the common misconception that broader pretraining alone is sufficient for high-stakes tabular risk prediction.

For tail-risk forecasting, the picture is different. In VaR forecasting on the S&P 100 index and 91 constituents, fine-tuned TimesFM ranks as the best or among the top performers across […], especially in actual-over-expected calibration and unconditional/conditional coverage tests (Goel et al., 2024). Aggregated […] results show, for example, FT21 mean […] and FT63 […] at […] VaR versus GAS […], and FT1 […] at […] VaR versus GAS […] and Historical […] (Goel et al., 2024). The study also states that fine-tuned TimesFM is comparable to GAS on quantile score while significantly outperforming zero-shot TimesFM, reinforcing that domain adaptation rather than naïve transfer drives the gains (Goel et al., 2024).
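The backtesting quantities named above can be sketched in a few lines. The code below is a generic illustration, not the study's evaluation pipeline: it assumes VaR forecasts are expressed as (typically negative) return quantiles, uses the textbook Kupiec likelihood-ratio statistic for unconditional coverage, and runs on made-up data. The function names are hypothetical.

```python
import math
import numpy as np


def pinball_loss(returns, quantile_forecasts, alpha):
    """Average quantile (pinball) score at level alpha; lower is better."""
    diff = np.asarray(returns, dtype=float) - np.asarray(quantile_forecasts, dtype=float)
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1.0) * diff)))


def var_backtest(returns, var_forecasts, alpha):
    """Violation count, actual-over-expected ratio, and Kupiec LR statistic."""
    returns = np.asarray(returns, dtype=float)
    var_forecasts = np.asarray(var_forecasts, dtype=float)
    n = returns.size
    x = int(np.sum(returns < var_forecasts))  # days on which the loss exceeded VaR

    def log_likelihood(p):
        # Binomial log-likelihood, skipping terms with zero counts to avoid log(0).
        ll = 0.0
        if n - x > 0:
            ll += (n - x) * math.log(1.0 - p)
        if x > 0:
            ll += x * math.log(p)
        return ll

    lr_uc = -2.0 * (log_likelihood(alpha) - log_likelihood(x / n))
    return {"violations": x, "ae": x / (alpha * n), "lr_uc": lr_uc}


# Made-up example: 100 days, 5 of which breach a constant 5% VaR of -0.5.
demo_returns = np.array([-1.0] * 5 + [0.0] * 95)
demo_var = np.full(100, -0.5)
print(var_backtest(demo_returns, demo_var, alpha=0.05))
```

An actual-over-expected ratio near 1 indicates that violations occur about as often as the nominal level implies; under the null of correct coverage, the Kupiec statistic is asymptotically chi-squared with one degree of freedom.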

Realized-volatility forecasting yields a similarly conditional verdict. Across 21 equity indices from the Oxford-Man Realized Library, the best cross-metric configuration is the incrementally fine-tuned log-RV model […], with MSE […], MAD […], MAPE […], MDA […], and sMAPE […] (Goel et al., 16 May 2025). The best QLIKE is instead achieved by zero-shot […] at […], improving over HAR […] and CHAR […] (Goel et al., 16 May 2025). Diebold-Mariano and Giacomini-White results indicate that CHAR and HAR are statistically worse than […] under QLIKE, while […] is reported as never statistically worse than alternatives in MSE, MAPE, and sMAPE (Goel et al., 16 May 2025).
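Why QLIKE can rank models differently from MSE is easier to see from the loss itself. The sketch below uses one common form of QLIKE, ratio minus log-ratio minus one; whether the cited study uses exactly this normalization is an assumption, and the numbers are made up.

```python
import numpy as np


def qlike(realized, forecast):
    """QLIKE loss in the form ratio - log(ratio) - 1, with ratio = realized / forecast.

    It is zero for a perfect forecast and penalizes under-prediction of
    variance more heavily than over-prediction by the same factor.
    """
    ratio = np.asarray(realized, dtype=float) / np.asarray(forecast, dtype=float)
    return float(np.mean(ratio - np.log(ratio) - 1.0))


def mse(realized, forecast):
    """Mean squared error, symmetric in the sign of the forecast error."""
    err = np.asarray(realized, dtype=float) - np.asarray(forecast, dtype=float)
    return float(np.mean(err ** 2))


rv = np.array([1.0])
print("forecast too high by 2x:", qlike(rv, np.array([2.0])))
print("forecast too low by 2x: ", qlike(rv, np.array([0.5])))
```

Because QLIKE depends on the ratio of realized to forecast variance rather than on their difference, a model that rarely under-predicts volatility can lead on QLIKE while trailing on squared-error metrics.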

Operational forecasting studies on exchange rates further complicate the assessment. In strict zero-shot forecasting of 8 daily FX series with horizon […], TimesFM 2.5 achieves the best MASE at […], while XGBoost remains best on sMAPE at […] and RMSE at […] (Soni et al., 23 May 2026). Chronos reaches MASE […], close to PatchTST […] and DLinear […], whereas TimesFM 2.0 performs poorly with MASE […] (Soni et al., 23 May 2026). The authors interpret this as evidence that newer FFMs are rapidly closing the gap with supervised specialists in stochastic financial markets, but metric sensitivity remains material: MASE, sMAPE, and RMSE do not induce the same ranking (Soni et al., 23 May 2026).
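The three metrics scale errors differently, which is why they need not agree. The following sketch uses common textbook definitions; the exact variants used in the cited study (for example the seasonal period in MASE or the sMAPE normalization) are assumptions, and the series is made up.

```python
import numpy as np


def mase(y_true, y_pred, y_train, season=1):
    """Mean absolute error scaled by the in-sample seasonal-naive MAE."""
    y_true, y_pred, y_train = (np.asarray(a, dtype=float) for a in (y_true, y_pred, y_train))
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return float(np.mean(np.abs(y_true - y_pred)) / naive_mae)


def smape(y_true, y_pred):
    """Symmetric MAPE in percent: mean of 2|y - f| / (|y| + |f|)."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(100.0 * np.mean(2.0 * np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred))))


def rmse(y_true, y_pred):
    """Root mean squared error; dominated by the largest errors."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


history = [1.0, 2.0, 3.0, 4.0]
actual = [5.0, 6.0]
forecast = [5.0, 7.0]
print(mase(actual, forecast, history), smape(actual, forecast), rmse(actual, forecast))
```

MASE is relative to a naive benchmark on the same series, sMAPE is relative to the level of the series, and RMSE is in the series' own units and sensitive to outliers, so a model can lead on one and trail on another.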

Market microstructure provides the clearest case for a genuinely finance-native FFM. TradeFM is a 524M-parameter decoder-only Transformer pretrained on over 10 billion tokens from more than 9,000 US equities and evaluated in a deterministic market simulator (Kawawa-Beaudan et al., 27 Feb 2026). Its generated rollouts reproduce heavy tails, volatility clustering, and near-zero return autocorrelation; at a […]-second return horizon, mean K–S distance is […] for TradeFM versus […] for Hawkes and […] for Zero-Intelligence, with corresponding […] distances […], […], and […] (Kawawa-Beaudan et al., 27 Feb 2026). The model also shows zero-shot transfer to APAC markets with moderate perplexity degradation. Here the evidence supports a stronger claim: large-scale generative pretraining on event-level financial data can capture transferable microstructure structure that classical parametric baselines do not recover (Kawawa-Beaudan et al., 27 Feb 2026).

Finally, RefineBridge indicates that FFMs can be improved without altering the backbone. Across S&P 500, WTI, and EUR/USD, it improves state-of-the-art TSFMs in […] configurations, with aggregate wins in […] Chronos cases, […] Moirai cases, and […] Time-MoE cases (Bolton et al., 25 Dec 2025). Illustrative reductions include Chronos on the S&P 500 at horizon […]: MSE/MAE […], and Moirai on EUR/USD at horizon […]: […] (Bolton et al., 25 Dec 2025). This suggests that the empirical limits of TSFMs in finance may depend as much on post-hoc refinement and adaptation design as on backbone scale.

5. Evaluation, robustness, and operational risk

Because finance is risk-sensitive, FFM evaluation is broader than average predictive error. The bankruptcy study explicitly distinguishes discrimination metrics such as AUROC from calibration-sensitive criteria such as Brier score, log loss, and Expected Calibration Error, and stresses that in extreme imbalance AUROC can overstate performance while AUPRC, calibration, and cost-sensitive loss may be more decision-relevant (Kostrzewa et al., 20 Nov 2025). Its qualitative calibration analysis shows LLM self-reported probabilities clustering at fixed values such as […], […], […], and […], which the paper interprets as degenerate calibration rather than smooth risk scoring (Kostrzewa et al., 20 Nov 2025). This is a central controversy in finance-oriented LLM use: high-level reasoning or acceptable classification accuracy does not imply usable probability outputs.
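The calibration criteria mentioned above can be computed directly from predicted probabilities and outcomes. The sketch below uses the standard Brier score and an equal-width-bin Expected Calibration Error; the bin count, function names, and the toy data (a 10% default rate with every prediction pinned at one value) are assumptions for illustration, not values from the study.

```python
import numpy as np


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    y_true, p_pred = np.asarray(y_true, dtype=float), np.asarray(p_pred, dtype=float)
    return float(np.mean((p_pred - y_true) ** 2))


def expected_calibration_error(y_true, p_pred, n_bins=10):
    """ECE with equal-width bins: weighted gap between confidence and frequency."""
    y_true, p_pred = np.asarray(y_true, dtype=float), np.asarray(p_pred, dtype=float)
    bin_ids = np.minimum((p_pred * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - p_pred[mask].mean())
            ece += mask.mean() * gap
    return float(ece)


# Toy data: 10% of firms default; a model that always answers 0.5 is badly
# calibrated, while one that always answers 0.1 matches the base rate.
outcomes = np.array([1.0] * 10 + [0.0] * 90)
for fixed_p in (0.5, 0.1):
    preds = np.full(outcomes.size, fixed_p)
    print(fixed_p, brier_score(outcomes, preds), expected_calibration_error(outcomes, preds))
```

A constant prediction carries no ranking information at all, which illustrates why calibration and discrimination have to be assessed separately.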

Robustness under perturbations has been formalized more explicitly in work on FMTS. A causally grounded rating framework evaluates six models on stock forecasting using company or industry as confounders, perturbations […] (drop-to-zero), […] (value-halved), and […] (missing values), and residual maxima over a […]-step horizon (Lakkaraju et al., 17 Feb 2025). The framework defines the Weighted Rejection Score

[…]

the Average Perturbation Effect

[…]

and the Propensity-score Impact Estimation percentage

[…]

Within that setup, multi-modal FMTS are more robust and often more accurate than their uni-modal counterparts, and TS-pretrained models such as Chronos and MOMENT are more robust than general-purpose multimodal models adapted zero-shot by prompting (Lakkaraju et al., 17 Feb 2025). The user study in the same paper reports that the ratings reduce perceived difficulty in robustness comparison and align with participant rankings under several conditions (Lakkaraju et al., 17 Feb 2025).

Operational viability adds another layer. On exchange rates, measured P95 latencies are […] ms for TimesFM 2.0, […] ms for TimesFM 2.5, and […] ms for Chronos, compared with […] ms for XGBoost and […] ms for DLinear (Soni et al., 23 May 2026). The same paper proposes a Complexity Router that uses spectral entropy, coefficient of variation, seasonal autocorrelation, and trend strength to decide whether a series should be handled by an FFM or a specialist. At the Pareto knee […], routing achieves MASE […] at […] normalized cost, versus universal FFM deployment at MASE […] and […] cost, and all-specialist deployment at MASE […] (Soni et al., 23 May 2026). This suggests that production deployment may favor selective routing rather than monolithic FFM use.
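A routing rule of this kind can be sketched with two of the named features. Everything specific in the code below is an assumption: the feature definitions follow common conventions rather than the paper's, the threshold is a placeholder, and the routing direction (low-complexity series to the specialist, the rest to the FFM) is a guess made only so that the example is complete.

```python
import numpy as np


def spectral_entropy(series):
    """Normalized Shannon entropy of the periodogram (0 = one frequency, 1 = flat)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2
    power = power[1:]  # drop the zero-frequency (mean) component
    total = power.sum()
    if total == 0.0:
        return 0.0
    p = power / total
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(power.size))


def coefficient_of_variation(series):
    """Standard deviation divided by the absolute mean."""
    x = np.asarray(series, dtype=float)
    return float(x.std() / abs(x.mean()))


def route(series, entropy_threshold=0.5):
    """Hypothetical rule: spectrally simple series go to the cheap specialist."""
    if spectral_entropy(series) < entropy_threshold:
        return "specialist"
    return "ffm"


t = np.arange(256)
periodic = 10.0 + np.sin(2 * np.pi * 8 * t / 256)
noisy = 10.0 + np.random.default_rng(0).normal(size=256)
print(route(periodic), route(noisy))
```

In a real deployment the threshold would be tuned on the cost-accuracy frontier, and the remaining features (seasonal autocorrelation, trend strength) would enter the same decision.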

Risk-return analysis has also been extended to strategies derived from shared pretrained backbones. “Trading with the Devil” introduces an FM-CAPM view in which a family of strategies

[…]

shares a common backbone […], and the corresponding Pretrained Market Line (PML) is

[…]

The paper maps systematic FM-related risk to epistemic uncertainty and idiosyncratic adaptation risk to aleatory uncertainty, estimating the epistemic component with Monte Carlo dropout (Zhang, 20 Oct 2025). Empirically, FM-based strategies cluster tightly in the risk-return plane, and estimated PML slopes are […] for TimesFM-v2, […] for Moirai, and […] for Chronos, with markedly lower […] for non-foundation baselines (Zhang, 20 Oct 2025). A plausible implication is that widespread adoption of the same backbone can create a shared, crowding-like risk factor.
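Monte Carlo dropout estimates epistemic uncertainty by keeping dropout active at inference time and measuring how much the prediction varies across random masks. The toy sketch below applies the idea to a single linear layer; the model, the dropout rate, the sample count, and the function name are all assumptions for illustration and do not reproduce the paper's setup.

```python
import numpy as np


def mc_dropout_predict(x, weights, drop_prob=0.1, n_samples=200, seed=0):
    """Mean and standard deviation of predictions over random dropout masks.

    Each pass zeroes a random subset of weights and rescales the rest by
    1 / (1 - drop_prob) (inverted dropout), so the mean stays comparable to
    the deterministic prediction. The spread across passes is the
    uncertainty estimate.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    rng = np.random.default_rng(seed)
    preds = np.empty(n_samples)
    for i in range(n_samples):
        keep = rng.random(weights.shape) >= drop_prob
        preds[i] = x @ (weights * keep) / (1.0 - drop_prob)
    return float(preds.mean()), float(preds.std())


features = np.ones(50)
toy_weights = np.ones(50)
print(mc_dropout_predict(features, toy_weights, drop_prob=0.0))
print(mc_dropout_predict(features, toy_weights, drop_prob=0.5))
```

If many strategies share one backbone, this model-level spread is common to all of them, which is the intuition behind treating it as a systematic rather than idiosyncratic component.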

6. Applications, governance, and open directions

The application surface of FFMs is already broad. In financial engineering surveys, reported use cases include report summarization, information extraction, sentiment analysis, risk management, portfolio construction, chart and table understanding, investment advisory, compliance analysis, alpha factor mining, and agentic research automation (Chen et al., 7 Jul 2025, Yanglet et al., 15 May 2025). The SecureFinAI Lab’s FinAgents connect assistants to local files, web APIs, databases, and social platforms through the Model Context Protocol and Agent2Agent protocol, enabling search agents, tutor agents, robo-advisors, auditing agents, compliance agents, report-generation agents, and trading agents (Yanglet et al., 15 May 2025). TradeFM adds synthetic order-flow generation, stress testing, and reinforcement-learning environments for execution and multi-agent studies (Kawawa-Beaudan et al., 27 Feb 2026).

Governance and reproducibility remain central constraints. The multimodal finance literature highlights privacy and ethics, “model cannibalism,” “openwashing,” multimodal data synchronization failures, and the challenge of building trustworthy retrieval and tool-augmented systems on regulated data (Yanglet et al., 15 May 2025). The same paper proposes a guardrail framework combining zero-knowledge proofs and permissioned blockchain, citing zkLLM verification of a 13B LLM in under 15 minutes with proof size under 200KB, and arguing that ZKPs can certify adherence to approved inference schemes without exposing sensitive data (Yanglet et al., 15 May 2025). Survey work on FFMs also emphasizes retrieval logs, document provenance, prompt/version control, federated learning, differential privacy, and temporally correct evaluation as prerequisites for deployment in regulated settings (Chen et al., 7 Jul 2025).

Several forward directions recur across the literature. One is multimodal fusion: combining text, charts, tables, audio, video, market data, and alternative data so that FFMs are not forced into unimodal approximations of inherently multimodal financial processes (Yanglet et al., 15 May 2025, Chen et al., 7 Jul 2025). A second is scalable domain adaptation: DAPT results on SEC filings suggest that 7B–70B financial LLMs may be adaptable with low-billions token budgets, with heuristic ranges of approximately […] tokens for 7B, […] for 13B, […] for 32B, and […] for 70B (Ponnock, 13 Dec 2025). A third is better calibration, cost-aware thresholding, and proper scoring for financial decision support, particularly where current LLM interfaces expose only verbalized probabilities or where zero-shot forecasters remain rigid under drift (Kostrzewa et al., 20 Nov 2025, Soni et al., 23 May 2026).

The strongest general conclusion in the present literature is therefore non-universal. FFMs are not uniformly superior to specialized financial models. In large, imbalanced, purely tabular risk prediction, specialized tree ensembles remain clearly preferred (Kostrzewa et al., 20 Nov 2025). In tail-risk and volatility forecasting, fine-tuned or incrementally adapted TSFMs can match or surpass classical econometric baselines (Goel et al., 2024, Goel et al., 16 May 2025). In event-level market simulation, finance-native generative pretraining has already produced capabilities not well matched by classical Hawkes or zero-intelligence baselines (Kawawa-Beaudan et al., 27 Feb 2026). The field is moving toward hybrid systems in which pretraining, retrieval, multimodal alignment, causal robustness assessment, and governance mechanisms are treated as co-equal components of financial model design rather than as optional add-ons to a single universal backbone.