
FinCast: Billion-Parameter Financial Forecasting

Updated 21 February 2026
  • FinCast is a billion-parameter decoder-only Transformer designed for financial time-series forecasting, using patch tokenization and frequency embeddings.
  • It employs a composite Point-Quantile loss with MoE regularization to address non-stationarity and improve prediction accuracy across diverse financial domains.
  • The model achieves robust zero-shot performance with minimal fine-tuning, reducing MSE by approximately 20% over state-of-the-art baselines on varied datasets.

FinCast is a 1-billion-parameter decoder-only Transformer foundation model designed for financial time-series forecasting, specifically addressing the challenges of temporal non-stationarity, multi-domain diversity, and varying temporal resolutions. Distinguished by robust zero-shot generalization and minimal fine-tuning requirements, FinCast integrates architectural and training innovations to exceed state-of-the-art accuracy on diverse financial datasets spanning cryptocurrencies, forex, stocks, futures, and macroeconomic indicators (Zhu et al., 27 Aug 2025).

1. Architectural Components

FinCast's architecture centers on a decoder-only Transformer backbone, augmented by targeted mechanisms for financial time-series complexity:

  • Input Tokenization Block: Raw multivariate time series $X\in\mathbb{R}^{B\times L}$ are segmented into non-overlapping patches of length $P$, yielding $N=\lfloor L/P\rfloor$ input tokens. Each patch undergoes per-instance normalization:

$$\tilde X_{n,p} = \frac{X_{n,p} - \mu_n}{\sigma_n},\qquad \mu_n = \frac{1}{P}\sum_{p=1}^P X_{n,p},\qquad \sigma_n = \sqrt{\frac{1}{P}\sum_{p=1}^P (X_{n,p} - \mu_n)^2}$$

During training, 15% of patch positions are masked, and a residual MLP projects each normalized patch to a $D_\text{model}$-dimensional token. Frequency embeddings $\mathrm{Emb}_{\mathrm{freq}}(f)\in\mathbb{R}^{D_\text{model}}$, indexed by temporal resolution (minute, hour, day, etc.), are added to every token to enable cross-frequency generalization.

  • Decoder-MoE Backbone: The core comprises $L_{\text{blocks}}$ stacked Transformer decoder blocks, each featuring:

    • RMSNorm:

      $$\mathrm{RMSNorm}(h) = \gamma\,\frac{h}{\sqrt{\tfrac{1}{N}\sum_i h_i^2 + \epsilon}}$$

      with learned $\gamma$.

    • Causal Self-Attention: Linear projections for $Q/K/V$, adaptive per-dimension query scaling, causal masking to enforce autoregression, and residual connections. The projections and scaled queries are computed as:

      $$[Q, K, V] = h_{\mathrm{norm}}\,W_{qkv},\qquad Q' = Q \odot \frac{\log_2 e}{\sqrt{d}\,\mathrm{softplus}(\alpha)}$$

    • Token-level Sparse Mixture-of-Experts (MoE): Each token is routed to its top-$k$ of $E$ experts via a softmax gate, with only the selected experts participating in computation:

      $$s_{i,n} = \mathrm{softmax}_i(W_{\text{gate}} h_n),\qquad \mathrm{MoE}(h_n) = \sum_{i=1}^E g_{i,n}\,\mathrm{MLP}_i(h_n)$$

      The sparse MoE structure achieves specialization across domains (e.g., stocks, crypto) and scalability.

  • Output Block: Each final token $h_n'$ is decoded by a residual two-layer MLP to reconstruct a forecast patch. Patch-level inverse normalization restores the output to the original scale.
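The tokenization and output steps above can be sketched in a few lines. A minimal NumPy illustration (not the FinCast implementation; masking and the MLP projection are omitted) of per-patch normalization and its inverse:

```python
import numpy as np

def patchify_normalize(x, patch_len):
    """Split a 1-D series into non-overlapping patches of length P and
    z-normalize each patch independently (per-instance normalization).
    Returns the normalized patches plus the (mu, sigma) needed to invert
    the normalization on the model's output patches."""
    n = len(x) // patch_len                      # N = floor(L / P)
    patches = x[: n * patch_len].reshape(n, patch_len)
    mu = patches.mean(axis=1, keepdims=True)
    sigma = patches.std(axis=1, keepdims=True) + 1e-8  # avoid div-by-zero
    return (patches - mu) / sigma, mu, sigma

# Toy usage: a 32-step series with patch length P = 8 -> N = 4 tokens.
series = np.sin(np.linspace(0.0, 6.0, 32)) * 5.0 + 100.0
tokens, mu, sigma = patchify_normalize(series, patch_len=8)
print(tokens.shape)             # (4, 8)
restored = tokens * sigma + mu  # inverse normalization, as in the output block
print(np.allclose(restored, series.reshape(4, 8)))  # True
```

Per-patch (rather than global) statistics are what let the model cope with the level and scale shifts typical of non-stationary financial series.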

2. Training Objectives and Optimization

FinCast employs a composite “Point-Quantile” loss for robust distributional forecasting under non-stationarity, augmented by MoE-specific regularization:

  • Huber Point Loss addresses the mean forecast, balancing sensitivity to outliers and stable optimization:

$$\mathcal{L}_{\text{point}} = \frac{1}{H}\sum_{t=1}^{H} \begin{cases} \tfrac{1}{2}(\hat{y}_t - y_t)^2, & |\hat{y}_t - y_t| \leq \delta \\ \delta\bigl(|\hat{y}_t - y_t| - \tfrac{1}{2}\delta\bigr), & \text{otherwise} \end{cases}$$

  • Trend Consistency Loss on first differences penalizes deviations from true directional movement:

$$\mathcal{L}_{\text{trend}} = \lambda_{\text{trend}}\,\frac{1}{H-1}\sum_{t=2}^{H} \bigl[(\hat{y}_t-\hat{y}_{t-1})-(y_t-y_{t-1})\bigr]^2$$

  • Quantile Loss trains for multiple quantiles (e.g., deciles), allowing calibrated uncertainty estimates:

$$\mathcal{L}_{\text{quantile}} = \lambda_{\text{quantile}} \sum_{q\in\mathcal{Q}}\frac{1}{H}\sum_{t=1}^{H} \begin{cases} q\,(y_t-\hat{y}_t^q), & y_t\geq\hat{y}_t^q \\ (1-q)\,(\hat{y}_t^q-y_t), & y_t<\hat{y}_t^q \end{cases}$$

  • MoE Regularization ($\mathcal{L}_{\text{MoE}}$) combines a load-balance penalty and a router z-loss to prevent expert collapse:

$$\mathcal{L}_{\text{MoE}} = \lambda_{\text{MoE}}\bigl(\mathcal{L}_{\text{balance}} + \mathcal{L}_{\text{router-z}}\bigr)$$

The Point-Quantile objective components mitigate regression to the mean under non-stationarity; MoE regularization maintains diversity and specialization among routed experts.
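The composite objective can be illustrated with a small NumPy sketch. The loss weights `lam_trend` and `lam_q` below are illustrative placeholders, not the paper's values:

```python
import numpy as np

def huber(y_hat, y, delta=1.0):
    """Point (Huber) loss: quadratic near zero, linear for large errors."""
    err = np.abs(y_hat - y)
    quad = 0.5 * err**2
    lin = delta * (err - 0.5 * delta)
    return np.where(err <= delta, quad, lin).mean()

def trend(y_hat, y):
    """Trend-consistency loss on first differences."""
    return ((np.diff(y_hat) - np.diff(y)) ** 2).mean()

def pinball(y_hat_q, y, q):
    """Quantile (pinball) loss for a single quantile level q."""
    diff = y - y_hat_q
    return np.where(diff >= 0, q * diff, (q - 1) * diff).mean()

def point_quantile_loss(y_hat, y, quantile_preds, lam_trend=0.1, lam_q=0.5):
    """Composite objective: Huber + trend + summed pinball losses."""
    l_q = sum(pinball(pred, y, q) for q, pred in quantile_preds.items())
    return huber(y_hat, y) + lam_trend * trend(y_hat, y) + lam_q * l_q

# Toy usage with two quantile heads (q = 0.1 and q = 0.9).
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.9, 3.2])
qp = {0.1: y_hat - 0.5, 0.9: y_hat + 0.5}
loss = point_quantile_loss(y_hat, y, qp)
print(round(float(loss), 4))  # 0.0665
```

The quantile heads give calibrated prediction intervals at inference time, while the Huber and trend terms keep the point forecast stable under outliers and directional shifts.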

3. Pretraining Data and Computational Protocol

FinCast is pretrained on an extensive and heterogeneous dataset encompassing 2.4 million financial series and over 20 billion time points, distributed across:

| Domain | Time Points (B = billion) |
| --- | --- |
| Crypto | 1.78 B |
| Forex | 3.27 B |
| Futures | 1.71 B |
| Stocks | 9.10 B |
| Macroeconomic | 0.0041 B |
| Other | 4.61 B |

All datasets are standardized: invalid timestamps and outliers are filtered, and observation grids are aligned. Pretraining uses variable context lengths (up to 1024 for granular, 256 for coarse time series), AdamW optimization ($2\times10^{-4}$ learning rate; 0.05 weight decay), a global batch of 8192, and a staged learning rate schedule (5% warmup, 30% plateau, cosine decay to 10%). Training is performed for 147,152 steps on 8 NVIDIA H200 GPUs, all weights in FP32, with high-throughput TF32 computation.
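The staged schedule can be sketched as a plain function of the step index. This is a reconstruction from the fractions stated above (5% warmup, 30% plateau, cosine decay to 10% of peak), not the authors' code:

```python
import math

def staged_lr(step, total_steps, peak_lr=2e-4,
              warmup_frac=0.05, plateau_frac=0.30, final_frac=0.10):
    """Staged LR schedule: linear warmup to the peak over the first 5% of
    steps, constant plateau for the next 30%, then cosine decay down to
    10% of the peak for the remainder."""
    warmup_end = warmup_frac * total_steps
    plateau_end = (warmup_frac + plateau_frac) * total_steps
    if step < warmup_end:
        return peak_lr * step / warmup_end
    if step < plateau_end:
        return peak_lr
    # Cosine decay from peak_lr down to final_frac * peak_lr.
    progress = (step - plateau_end) / (total_steps - plateau_end)
    floor = final_frac * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))

total = 147_152  # pretraining steps reported above
print(staged_lr(0, total))            # 0.0
print(staged_lr(total // 10, total))  # plateau: 0.0002
print(staged_lr(total, total))        # ~2e-05 (10% of peak)
```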

Each MoE layer implements $E=4$ experts with token-level top-$k=2$ routing, and a 15% patch-wise input masking ratio is used.
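Token-level top-$k$ routing with these settings can be sketched as follows; the exact gating and renormalization details are an assumption for illustration:

```python
import numpy as np

def top_k_route(h, w_gate, k=2):
    """Token-level top-k gating: softmax over E expert logits per token,
    keep the top-k experts, and renormalize their gates to sum to 1.
    Only the selected experts would then run on each token."""
    logits = h @ w_gate                                # (tokens, E)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)         # softmax gate
    idx = np.argsort(probs, axis=-1)[:, ::-1][:, :k]   # top-k expert ids
    gates = np.take_along_axis(probs, idx, axis=-1)
    gates /= gates.sum(axis=-1, keepdims=True)         # renormalize
    return idx, gates

rng = np.random.default_rng(0)
h = rng.normal(size=(3, 16))       # 3 tokens, model dim 16
w_gate = rng.normal(size=(16, 4))  # E = 4 experts, as in FinCast
idx, gates = top_k_route(h, w_gate, k=2)
print(idx.shape, gates.shape)      # (3, 2) (3, 2)
```

Because only 2 of 4 expert MLPs run per token, compute per token stays roughly constant while total parameter capacity scales with the number of experts.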

4. Zero-Shot Generalization and Fine-Tuning Results

FinCast establishes state-of-the-art generalization in both zero-shot and minimally fine-tuned scenarios.

  • Zero-Shot Evaluation: On a held-out benchmark of 3,632 series (4.38M points) spanning crypto, forex, stocks, and futures at granular-to-weekly frequencies (horizons $h\in\{10,30,60\}$; context $L=128$), average error metrics were:
| Model | MSE | MAE |
| --- | --- | --- |
| FinCast | 0.1644 | 0.2397 |
| TimesFM-200M | 0.2537 | 0.2888 |
| TimesFM-500M | 0.2411 | 0.2836 |
| Chronos (variants) | 0.1860–0.1911 | 0.2537–0.2570 |
| TimesMOE-Large | 0.1858 | 0.2571 |

FinCast achieves an average 20% reduction in MSE relative to the next best baseline.

  • Supervised Fine-Tuning: On US_71 and US_14L stock benchmarks (2016–2023 and 2005–2023, respectively), FinCast undergoes one epoch of task-specific tuning, updating only the output block and the last 10% of MoE layers, with the following outcomes:
| Scenario | MSE | MAE |
| --- | --- | --- |
| FinCast (zero-shot) | 0.3092 | 0.3630 |
| Best task-specific (PCIE) | 0.3261 | 0.3736 |
| FinCast (fine-tuned) | 0.2971 | 0.3505 |

FinCast consistently outperforms all specialized supervised baselines, both in zero-shot and fine-tuned configurations. Fine-tuning yields an additional 6–8% reduction in predictive error.

5. Ablation and Component Analysis

Comprehensive ablations underline the contribution of each architectural and objective component:

| Component Removed | MSE | MAE | Δ MSE (%) |
| --- | --- | --- | --- |
| Sparse MoE | 0.1802 | 0.2617 | –9.32 |
| Point-Quantile loss | 0.1767 | 0.2582 | –7.62 |
| Frequency embeddings | 0.1713 | 0.2505 | –4.38 |

Sparse token-level MoE is essential for cross-domain specialization and capacity scaling. The combined point and quantile objectives alleviate mean regression collapse and enhance stability under non-stationarity. Learnable frequency embeddings enable coherent performance across temporal resolutions.

6. Deployment and Practical Recommendations

Recommended deployment strategies for FinCast include:

  • Model Configuration: 1B parameters; MoE layers with $E=4$ experts and top-$k=2$ routing.
  • Input Parameters: Patch length $P$ chosen so that $L \times P$ covers the desired temporal horizon (e.g., $L=128$ for minute-level prediction).
  • Training Protocol: 15% patch-mask ratio; AdamW optimizer with $2\times10^{-4}$ initial learning rate and 0.05 weight decay; staged schedule with 5% warmup, 30% constant, cosine decay to 0.1×.
  • Fine-Tuning: One epoch on new domain; update only output head and the final 10% of MoE experts; freeze input and most MoE blocks for stability.
  • Inference: Supports efficient patch-wise auto-regressive decoding. Can run on a single modern 8 GB GPU (e.g., RTX 4060), enabled by MoE sparsity and input tokenization.
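The patch-wise autoregressive decoding loop can be sketched with a stub in place of the trained model; `stub_forecast_patch`, `PATCH`, and the naive repeat-last-patch behavior are all hypothetical placeholders for illustration:

```python
import numpy as np

PATCH = 8  # illustrative patch length

def stub_forecast_patch(context):
    """Stand-in for the FinCast forward pass: naively repeats the last
    observed patch. A real deployment would call the trained model."""
    return context[-PATCH:].copy()

def autoregressive_decode(context, horizon):
    """Patch-wise autoregressive decoding: forecast one patch at a time,
    append it to the context, and repeat until the horizon is covered."""
    ctx = context.copy()
    out = []
    while len(out) * PATCH < horizon:
        patch = stub_forecast_patch(ctx)
        out.append(patch)
        ctx = np.concatenate([ctx, patch])
    return np.concatenate(out)[:horizon]

history = np.arange(128, dtype=float)  # context L = 128
forecast = autoregressive_decode(history, horizon=30)
print(forecast.shape)                  # (30,)
```

Decoding a patch per step (rather than a point per step) is what keeps long-horizon inference cheap enough for a single consumer GPU.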

A plausible implication is that this design enables transfer across frequency, domain, and horizon with minimal domain-specific adjustments.

7. Significance and Relation to Prior Work

FinCast represents the first foundation model tailored for financial time-series forecasting, integrating explicit mechanisms for non-stationarity, domain diversity, and cross-frequency consistency. The empirical superiority over TimesFM, Chronos, and TimesMOE-Large underscores the impact of its architectural and training advances. Its ability to forecast across diverse high- and low-frequency financial domains without fine-tuning, and to further improve with modest supervised updates, demonstrates generalization exceeding bespoke models and prior large-scale time-series Transformers (Zhu et al., 27 Aug 2025).
