Unified Forecasting Model
- Unified Forecasting Model is a single deep learning architecture that leverages multimodal inputs (numerical, textual, and visual) to address heterogeneous forecasting tasks.
- It employs cross-modal attention and parameter sharing to fuse standardized embeddings from different modalities, enhancing prediction accuracy across domains.
- Empirical evaluations show improvements in key metrics like accuracy and Sharpe ratio, while maintaining scalability and performance even with reduced training data.
Unified Forecasting Model refers to the architecture and methodology, particularly in deep learning and transformer-based frameworks, for constructing a single predictive engine capable of addressing heterogeneous time series forecasting tasks—including variations in domain, data modality, forecasting horizon, and granularity—without the need for dedicated, task-specific models or retraining for each case. Such models integrate cross-domain, cross-modality, and cross-frequency capabilities through unified data encodings, architectural innovations (e.g., cross-modal attention), parameter sharing, and prompt-driven adaptation.
1. Core Design Principles and Input Modalities
Unified forecasting models are built on two foundational premises: (a) the existence of generic temporal and contextual structures across domains that can be exploited by sufficiently flexible neural architectures, and (b) the need for cross-modal fusion to leverage complementary information (e.g., numerical, textual, visual) within one pipeline.
Inputs are typically structured to support multimodality:
- Numerical indicators: e.g., OHLCV (Open, High, Low, Close, Volume), market returns, volatility. These are standardized and often lagged so that only information available before the prediction step is used, maintaining causal alignment, as in STONK (Khanna et al., 18 Aug 2025).
- Textual embeddings: News or descriptive text represented via transformer-derived embeddings (e.g., DeBERTa, MiniLM, FinBERT), producing fixed-dimensional sentence-level vectors.
- Other modalities: Vision (e.g., plot images tokenized into patch embeddings), categorical marks, etc. UniCast (Park et al., 16 Aug 2025) extends foundation models to vision and text via additional encoder branches.
Unified models may preprocess each modality independently before fusion. Preprocessing includes domain-adaptive sentiment scoring for text, standardization via StandardScaler for numeric, and patch extraction/tokenization for sequential inputs.
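As a concrete illustration of this per-modality preprocessing, the following minimal sketch standardizes lagged numeric indicators with scikit-learn and mean-pools MiniLM token states into fixed-dimensional text embeddings. The helper names, the one-step lag, and the MiniLM checkpoint are illustrative assumptions, not the exact STONK pipeline:

```python
# Minimal per-modality preprocessing sketch; the helper names, the one-step
# lag, and the MiniLM checkpoint are illustrative, not STONK's exact pipeline.
import numpy as np
import torch
from sklearn.preprocessing import StandardScaler
from transformers import AutoModel, AutoTokenizer

def lag_and_scale(ohlcv: np.ndarray, lag: int = 1) -> np.ndarray:
    """Shift numeric indicators back by `lag` steps, then standardize.

    In a strict walk-forward setup the scaler would be fit on training
    folds only, to avoid leaking future statistics.
    """
    lagged = ohlcv[:-lag]  # features at t-1 align with targets at t
    return StandardScaler().fit_transform(lagged)

def embed_texts(texts, name="sentence-transformers/all-MiniLM-L6-v2"):
    """Encode headlines into fixed-dimensional embeddings via mean pooling."""
    tok = AutoTokenizer.from_pretrained(name)
    enc = AutoModel.from_pretrained(name)
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state           # (B, L, 384)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # zero out padding
    return (hidden * mask).sum(1) / mask.sum(1)           # (B, 384)
```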
2. Fusion Mechanisms and Cross-Modal Attention
Fusion of features from disparate modalities is achieved either by direct concatenation or hierarchical cross-modal attention. STONK demonstrates both:
- Simple concatenation: $h_{\text{fuse}} = [\,h_{\text{num}} \,\|\, h_{\text{txt}}\,]$, with independent linear projections $h_{\text{num}} = W_{\text{num}}\, x_{\text{num}}$ and $h_{\text{txt}} = W_{\text{txt}}\, e_{\text{txt}}$.
- Cross-modal attention: the numeric projection $h_{\text{num}}$ (queries) and textual embedding $h_{\text{txt}}$ (keys/values) interact via multi-head dot-product attention,
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \qquad Q = W_Q h_{\text{num}},\; K = W_K h_{\text{txt}},\; V = W_V h_{\text{txt}}.$$
Final fusion vector: $h_{\text{fuse}} = [\,h_{\text{num}} \,\|\, \mathrm{Attn}(Q, K, V)\,]$, with LayerNorm and dropout applied pre-fusion.
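A compact PyTorch sketch of both strategies follows; the model dimension, head count, and dropout rate are placeholders rather than STONK's reported settings, and the exact placement of LayerNorm and dropout is a design choice:

```python
# Both fusion strategies in PyTorch; d_model, head count, and dropout rate
# are placeholders, not STONK's reported settings.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_num: int, d_txt: int, d_model: int = 128, heads: int = 4):
        super().__init__()
        self.proj_num = nn.Linear(d_num, d_model)  # W_num
        self.proj_txt = nn.Linear(d_txt, d_model)  # W_txt
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.norm = nn.LayerNorm(2 * d_model)
        self.drop = nn.Dropout(0.1)

    def forward(self, x_num, x_txt, mode: str = "xma"):
        h_num = self.proj_num(x_num)               # (B, d_model)
        h_txt = self.proj_txt(x_txt)               # (B, d_model)
        if mode == "concat":                       # simple concatenation
            return torch.cat([h_num, h_txt], dim=-1)
        # cross-modal attention: numeric queries attend over text keys/values
        q, kv = h_num.unsqueeze(1), h_txt.unsqueeze(1)
        attended, _ = self.attn(q, kv, kv)
        fused = torch.cat([h_num, attended.squeeze(1)], dim=-1)
        return self.drop(self.norm(fused))         # LayerNorm + dropout

fusion = CrossModalFusion(d_num=10, d_txt=384)
out = fusion(torch.randn(8, 10), torch.randn(8, 384))  # (8, 256)
```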
Parameter-efficient multimodal fusion is presented in UniCast, where visual and textual embeddings from frozen foundation model encoders are mapped into a joint sequence via learnable linear projections and soft prompt injection, enabling transfer and interaction within a shared Transformer backbone.
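The sketch below illustrates the general soft-prompt pattern: a learnable projection maps frozen-encoder tokens into the backbone's embedding space and prepends trainable prompt vectors. The class name, prompt count, and initialization are assumptions for illustration, not UniCast's published configuration:

```python
# Soft-prompt adapter sketch: only the projection and prompt vectors train,
# while the foundation encoder and Transformer backbone stay frozen. Names,
# prompt count, and initialization are illustrative assumptions.
import torch
import torch.nn as nn

class SoftPromptAdapter(nn.Module):
    def __init__(self, d_modal: int, d_model: int, n_prompts: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_modal, d_model)  # learnable linear projection
        self.prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)

    def forward(self, modal_emb: torch.Tensor) -> torch.Tensor:
        # modal_emb: (B, L, d_modal) token states from a frozen encoder
        tokens = self.proj(modal_emb)            # map into backbone space
        prompts = self.prompts.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return torch.cat([prompts, tokens], dim=1)  # prepend soft prompts

# Freeze everything except the adapter before training:
# for p in backbone.parameters():
#     p.requires_grad_(False)
```

Because only the projection and prompt parameters carry gradients, the trainable-parameter fraction stays small, consistent with the <7% update overhead cited below.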
3. Prediction Head, Training Objective, and Regularization
Predictions are commonly formulated as classification (e.g., binary Up/Down stock movement in STONK) or regression (point/horizon forecasting):
- Logistic regression (classification): $\hat{p}_t = \sigma(w^{\top} h_t + b)$, where $h_t$ is the fused feature vector and $\sigma$ the logistic sigmoid.
- Forecasting loss (mean-squared error, regression): $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{T} \sum_{t=1}^{T} (\hat{y}_t - y_t)^2$
- Binary cross-entropy (classification): $\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \big[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \big]$
Regularization is typically imposed via $\ell_2$-type weight penalties on the output head, dropout after projection, and explicit balancing techniques for skewed label distributions (e.g., SMOTE oversampling for class imbalance, as in STONK).
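A minimal sketch of this objective wiring in PyTorch: `BCEWithLogitsLoss` for the Up/Down head, AdamW weight decay as the output-weight penalty, and dropout after the projection. Dimensions and hyperparameters are placeholders; SMOTE-style rebalancing would be applied upstream to the training features:

```python
# Objective wiring sketch: BCE-with-logits for Up/Down classification,
# AdamW weight decay as the output-weight penalty, dropout after projection.
# Dimensions and hyperparameters are placeholders, not values from the paper.
import torch
import torch.nn as nn

head = nn.Sequential(nn.Dropout(0.1), nn.Linear(256, 1))
opt = torch.optim.AdamW(head.parameters(), weight_decay=1e-2)

fused = torch.randn(32, 256)                   # placeholder fused features
y = torch.randint(0, 2, (32, 1)).float()       # placeholder Up/Down labels

loss = nn.BCEWithLogitsLoss()(head(fused), y)  # binary cross-entropy
loss.backward()
opt.step()
```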
4. Implementation Strategies and Scalability
Architectural choices directly affect computational requirements and generalization. Typical strategies:
- Embedding dimensions: modality-specific numeric and text embeddings are linearly projected to a shared fusion dimension in STONK; the exact sizes depend on the chosen encoders.
- Attention heads: multi-head attention splits the model dimension evenly across heads, so each head operates on $d_k = d_{\text{model}}/h$ dimensions.
- Optimizer: AdamW for the fusion pipeline; lower learning rates for fine-tuning foundation text encoders.
- Batch size and epochs: batch size 16 and 3 epochs for text-encoder fine-tuning (FiQA + Financial PhraseBank).
- Validation: 5-fold TimeSeriesSplit (see the walk-forward sketch after this list).
- Generalization: Freezing foundation encoders and tuning only soft prompts (UniCast) enables efficient adaptation with <7% parameter update overhead and rapid convergence (3–4 epochs).
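For the validation bullet above, a minimal walk-forward sketch with scikit-learn's `TimeSeriesSplit`; the synthetic features, labels, and logistic-regression probe are placeholders:

```python
# Walk-forward validation sketch with scikit-learn's TimeSeriesSplit; the
# synthetic features, labels, and logistic-regression probe are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 16))             # placeholder fused features
y = (rng.random(500) > 0.5).astype(int)        # placeholder Up/Down labels

scores = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[val_idx], clf.predict(X[val_idx])))
print(f"mean fold accuracy: {np.mean(scores):.3f}")
```

Unlike shuffled K-fold, each validation fold lies strictly after its training fold, preserving the causal alignment discussed in Section 1.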
Unified models scale across both modalities and domains, with empirical evaluations demonstrating performance retention even under severely reduced training data (e.g., UniCast retains its advantage when trained on only 25% of the data).
5. Backtesting, Ablation, and Comparative Performance
Performance evaluation emphasizes both statistical metrics (accuracy, F1, MCC, MSE, MAE) and financial backtesting (Profit Factor, Sharpe ratio). In the table below, PF denotes Profit Factor and XMA denotes cross-modal attention:
| Fusion Strategy | Acc | F1 | PF | Sharpe | MCC |
|---|---|---|---|---|---|
| Numeric only (LR) | 0.55 | 0.45 | 1.22 | 0.79 | - |
| Numeric + Sentiment Score | 0.62 | 0.70 | 1.57 | 2.14 | - |
| Concat (MiniLM, pre-tune) | 0.65 | 0.72 | 2.03 | 3.15 | 0.27 |
| Concat (MiniLM, finetune) | 0.67 | 0.75 | 1.72 | 2.57 | 0.31 |
| XMA (DeBERTa, pre-tune) | 0.68 | 0.73 | 1.75 | 2.24 | 0.32 |
| XMA (DeBERTa, finetune) | 0.67 | 0.70 | 1.88 | 2.55 | 0.33 |
Ablation results confirm that textual sentiment embeddings substantially boost both classification metrics (+0.07–0.13 accuracy) and profitability (Sharpe +1.35 over numeric-only).
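For reference, a hedged sketch of the two backtest metrics in the table: Profit Factor as gross gains over gross losses, and a Sharpe ratio annualized from daily returns. The zero risk-free rate and 252-trading-day convention are assumptions, not reported settings:

```python
# Backtest metric sketch: Profit Factor as gross gains over gross losses,
# and a Sharpe ratio annualized from daily returns assuming a zero risk-free
# rate and 252 trading days (conventions, not values from the paper).
import numpy as np

def profit_factor(returns: np.ndarray) -> float:
    gains = returns[returns > 0].sum()
    losses = -returns[returns < 0].sum()
    return gains / losses if losses > 0 else float("inf")

def sharpe_ratio(returns: np.ndarray, periods: int = 252) -> float:
    return returns.mean() / returns.std(ddof=1) * np.sqrt(periods)

rets = np.random.default_rng(1).normal(5e-4, 1e-2, 252)  # placeholder daily P&L
print(f"PF={profit_factor(rets):.2f}  Sharpe={sharpe_ratio(rets):.2f}")
```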
6. Limitations and Future Directions
Limitations in current unified models include:
- Error accumulation in autoregressive architectures (iterative prediction may degrade long-horizon accuracy).
- Computational complexity: Transformer-based attention exhibits quadratic $O(L^2)$ scaling in sequence length $L$, impacting feasibility on high-dimensional or ultra-long time series.
- Modality bottleneck: Fusion performance and capacity may depend heavily on choice and pretraining quality of foundation encoders.
- Balancing and overfitting: Imbalanced financial datasets necessitate explicit balancing; ablation reveals vulnerabilities when sentiment or cross-modal features are omitted.
Future research may explore:
- Non-autoregressive decoders and segmentwise parallel forecasting (see KAIROS (Ding et al., 2 Oct 2025)).
- Extension of unified pipelines to richer multimodal contexts, including audio and spatiotemporal signals (see UniCast (Park et al., 16 Aug 2025)).
- Advanced regularization (prompt location, modality balancing).
- Incorporation of domain instructions and natural-language metadata for cross-domain adaptation.
- Improved efficiency through sparse attention or continual learning approaches.
Unified forecasting thus defines the state-of-the-art paradigm for building scalable, generalizable, and multimodal prediction models, aligning technical advances in cross-modal fusion, self-attention architectures, and foundation model transfer toward domain-agnostic time series prediction.