Temporal Data Meets LLM -- Explainable Financial Time Series Forecasting (2306.11025v1)

Published 19 Jun 2023 in cs.LG, cs.AI, cs.CL, and q-fin.ST

Abstract: This paper presents a novel study on harnessing Large Language Models' (LLMs) outstanding knowledge and reasoning abilities for explainable financial time series forecasting. The application of machine learning models to financial time series comes with several challenges, including the difficulty of cross-sequence reasoning and inference, the hurdle of incorporating multi-modal signals from historical news, financial knowledge graphs, etc., and the issue of interpreting and explaining model results. In this paper, we focus on NASDAQ-100 stocks, making use of publicly accessible historical stock price data, company metadata, and historical economic/financial news. We conduct experiments to illustrate the potential of LLMs in offering a unified solution to the aforementioned challenges. Our experiments include zero-shot/few-shot inference with GPT-4 and instruction-based fine-tuning of a public LLM, OpenLLaMA. We demonstrate that our approach outperforms several baselines, including the widely applied classic ARMA-GARCH model and a gradient-boosting tree model. Through the performance comparison results and a few examples, we find that LLMs can make well-thought-out decisions by reasoning over information from both textual news and price time series, extracting insights, leveraging cross-sequence information, and utilizing the inherent knowledge embedded within the LLM. Additionally, we show that a publicly available LLM such as OpenLLaMA, after fine-tuning, can comprehend the instruction to generate explainable forecasts and achieve reasonable performance, albeit inferior in comparison to GPT-4.

This paper explores the integration of LLMs with temporal financial data to generate explainable forecasts for stock price movements, specifically focusing on NASDAQ-100 constituents (Yu et al., 2023). The core motivation stems from the limitations of traditional financial forecasting models, which often struggle with cross-sequence reasoning (understanding relationships between different stocks or assets), incorporating diverse data types like news and structured metadata, and providing transparent, interpretable outputs. LLMs, with their capacity for natural language understanding, reasoning, and generation, are posited as a potential unified solution to these challenges.

Methodology: Integrating Temporal and Textual Data with LLMs

The paper employs two primary strategies to leverage LLMs for financial forecasting: zero-shot/few-shot inference using a proprietary model (GPT-4) and instruction-based fine-tuning of an open-source model (OpenLLaMA).

  1. Data Sources: The foundation of the experiments rests on publicly available data for NASDAQ-100 stocks. This includes:
    • Historical Stock Prices: Numerical time series data representing daily or intra-day price movements.
    • Company Metadata: Structured information about the companies (e.g., sector, market capitalization).
    • Historical Economic/Financial News: Unstructured textual data relevant to market conditions or specific companies.
  2. Prompt Engineering for Zero/Few-Shot Inference (GPT-4): For GPT-4, the approach relies on carefully crafted prompts that encapsulate the necessary information for forecasting. The key challenge lies in representing both the numerical time series and the unstructured text within the LLM's context window (a prompt-assembly sketch follows this list). This likely involves:
    • Serialization of Time Series: Converting recent historical price data (e.g., closing prices, volume over a specific lookback window) into a textual format. This could involve simple comma-separated values, descriptive statistics, or natural language descriptions of trends (e.g., "The stock price increased by 5% over the last week").
    • Integration of News and Metadata: Concatenating relevant news headlines/summaries and company metadata within the prompt. Temporal alignment of news articles with the corresponding price data points is crucial.
    • Instruction Formulation: The prompt explicitly instructs the LLM to predict the future stock price movement (e.g., "predict the direction of stock XYZ for the next trading day") and, critically, to provide a step-by-step explanation for its prediction, citing evidence from the provided price data and news.
  3. Instruction Fine-tuning (OpenLLaMA): To adapt a smaller, open-source model like OpenLLaMA for this task, instruction fine-tuning is employed. This requires creating a dataset of input-output pairs:
    • Input: A prompt similar in structure to the one used for GPT-4, containing serialized price history, relevant news snippets, and metadata for a specific stock at a specific point in time.
    • Output: The target forecast (e.g., "UP", "DOWN", or a specific price range) coupled with a detailed natural language explanation justifying the forecast based on the input data.
    • Fine-tuning Process: The OpenLLaMA model is then fine-tuned on this dataset. This process adjusts the model's parameters to better understand the specific instruction format and the relationships between financial time series, news events, and subsequent price movements, while learning to generate coherent explanations. Techniques like Low-Rank Adaptation (LoRA) might be employed for parameter-efficient fine-tuning, although this is not explicitly mentioned in the abstract. A LoRA-based fine-tuning sketch appears after the prompt example below.
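
To make items 1 and 2 above concrete, the following is a minimal sketch of how such a forecast prompt might be assembled. The serialization format, lookback window, and field names are illustrative assumptions, not the paper's exact template.

```python
# Minimal prompt-assembly sketch (illustrative format, not the paper's template).
from typing import List

def serialize_prices(closes: List[float]) -> str:
    """Render a closing-price window as text, with a simple trend summary."""
    pct = 100 * (closes[-1] - closes[0]) / closes[0]
    series = ", ".join(f"{p:.2f}" for p in closes)
    return f"Closing prices (oldest to newest): {series}. Net change: {pct:+.1f}%."

def build_prompt(ticker: str, sector: str,
                 closes: List[float], headlines: List[str]) -> str:
    news = "\n".join(f"- {h}" for h in headlines)
    return (
        f"Company: {ticker} (sector: {sector}).\n"
        f"{serialize_prices(closes)}\n"
        f"Recent news:\n{news}\n"
        "Instruction: Predict whether the stock closes UP or DOWN on the next "
        "trading day, then explain step by step, citing the price data and "
        "news above."
    )

print(build_prompt("XYZ", "Technology",
                   [101.20, 102.80, 104.00, 103.50, 106.30],
                   ["XYZ beats quarterly earnings estimates",
                    "Analysts raise price targets across the sector"]))
```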

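Item 3 could be implemented roughly as follows with Hugging Face transformers and peft. The paper does not state its training stack, so the checkpoint name, hyperparameters, and JSONL data format below are assumptions, and LoRA is used here purely as one plausible parameter-efficient option.

```python
# LoRA instruction-tuning sketch for OpenLLaMA (assumed stack: transformers + peft).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "openlm-research/open_llama_7b"  # public OpenLLaMA checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Attach low-rank adapters to the attention projections (hyperparameters assumed).
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# Each JSONL record: {"prompt": "...", "response": "UP. Explanation: ..."}
data = load_dataset("json", data_files="finetune_pairs.jsonl")["train"]

def tokenize(example):
    return tokenizer(example["prompt"] + "\n" + example["response"],
                     truncation=True, max_length=1024)

data = data.map(tokenize, remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="openllama-ft", num_train_epochs=3,
                           per_device_train_batch_size=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```
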
The core hypothesis is that the LLM, through either its pre-trained knowledge and reasoning capabilities (GPT-4) or fine-tuned adaptation (OpenLLaMA), can synthesize information from these disparate sources. It is expected to identify correlations between news sentiment/events and price trends, potentially recognize patterns across different stocks (cross-sequence reasoning implicitly learned), and leverage its embedded world knowledge about economics and finance to inform its predictions and explanations.

Experimental Setup and Evaluation

The experiments were designed to evaluate the forecasting performance and explainability of the LLM-based approaches compared to established baseline models.

  • Task: Forecasting future price movements for NASDAQ-100 stocks. The exact prediction target (e.g., directional accuracy, price range prediction) is not specified in the abstract but is implied to be a standard forecasting task.
  • Baselines:
    • ARMA-GARCH: A classic econometric model combining Autoregressive Moving Average (ARMA) for the conditional mean and Generalized Autoregressive Conditional Heteroskedasticity (GARCH) for the conditional variance, capturing time series dynamics and volatility clustering. This represents traditional statistical time series methods (a fitting sketch follows this list).
    • Gradient Boosting Tree Model: Likely implementations such as XGBoost or LightGBM, trained on features derived from historical prices and potentially other technical indicators. This represents common machine learning approaches used in finance (a feature-construction sketch also follows below).
  • Evaluation: Performance was compared using standard forecasting metrics (potentially accuracy, F1-score for classification, or RMSE/MAE for regression). Crucially, the qualitative aspect of the generated explanations was also assessed, likely through case studies and examples, evaluating their coherence, relevance, and ability to cite evidence from the input data.
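
For readers reproducing the comparison, the statistical baseline can be approximated with the Python arch package. A minimal sketch on synthetic returns follows; note that arch supports autoregressive (not full ARMA) conditional means, so this fits AR(1)-GARCH(1,1).

```python
# AR(1)-GARCH(1,1) baseline sketch using the `arch` package; a full ARMA mean
# would need e.g. statsmodels. Returns here are synthetic placeholders.
import numpy as np
from arch import arch_model

rng = np.random.default_rng(0)
returns = rng.standard_normal(500)  # placeholder daily returns, in percent

am = arch_model(returns, mean="AR", lags=1, vol="GARCH", p=1, q=1)
res = am.fit(disp="off")

fc = res.forecast(horizon=1)
print("next-day mean:", float(fc.mean.iloc[-1, 0]))
print("next-day variance:", float(fc.variance.iloc[-1, 0]))
```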

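The tree-based baseline might look like the following sketch: lagged returns as features, next-day direction as the target. The lag depth, split, and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
# Gradient-boosting baseline sketch: lagged returns -> next-day direction.
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)
returns = rng.standard_normal(1000) * 0.01  # placeholder daily returns

lags = 5
# Row j holds returns[j .. j+4]; the label is the sign of returns[j+5].
X = np.column_stack([returns[i:len(returns) - lags + i] for i in range(lags)])
y = (returns[lags:] > 0).astype(int)

split = int(0.8 * len(y))
clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
clf.fit(X[:split], y[:split])
print("directional accuracy:", (clf.predict(X[split:]) == y[split:]).mean())
```
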
Results and Analysis

The paper reports that the LLM-based approaches, particularly GPT-4, outperformed the ARMA-GARCH and gradient boosting baselines on the forecasting task.

  • Performance: GPT-4 demonstrated superior forecasting capability in the zero-shot/few-shot setting. The fine-tuned OpenLLaMA model also achieved "reasonable performance," surpassing the baselines, although it was quantitatively inferior to GPT-4. This suggests that while instruction fine-tuning can adapt open-source models for this complex task, larger, more capable proprietary models currently hold an advantage in leveraging their extensive pre-trained knowledge for zero-shot financial reasoning.
  • Reasoning Capabilities: The results indicate that LLMs can effectively reason over combined numerical (price series) and textual (news) inputs. Examples highlighted in the paper (though not detailed in the abstract) purportedly show the LLM making well-reasoned decisions by:
    • Extracting relevant insights from news articles and correlating them with price movements.
    • Leveraging cross-sequence information (e.g., potentially considering sector trends or movements of correlated stocks, although the mechanism for this isn't detailed).
    • Utilizing inherent financial and economic knowledge embedded within the LLM's parameters.
  • Explainability: Both GPT-4 and the fine-tuned OpenLLaMA were capable of generating natural language explanations for their forecasts, fulfilling the instruction prompt. The quality of explanations from GPT-4 was likely higher, mirroring its superior forecasting performance. The fine-tuned OpenLLaMA demonstrated that even smaller models can be trained to produce justifications, though potentially less nuanced.

Practical Implementation Considerations

Implementing such an LLM-based financial forecasting system involves several practical challenges:

  • Data Formatting: Designing robust methods to serialize numerical time series data into a format digestible by LLMs is non-trivial. Choices include raw sequences, statistical summaries, trend descriptions, or even visualizations converted to text (e.g., using multimodal models or textual descriptions of charts). The optimal representation may vary depending on the LLM and the specific forecasting horizon. Integrating time-aligned news data requires robust NLP pipelines for filtering, summarizing, and relevance scoring.
  • Prompt Engineering/Fine-tuning Data: Crafting effective prompts for zero/few-shot inference is an iterative process requiring domain expertise. For fine-tuning, creating a high-quality dataset of (prompt, forecast, explanation) triples is labor-intensive and critical for model performance. The quality and style of the target explanations in the fine-tuning data will directly influence the model's output.
  • Computational Costs: Inference with large models like GPT-4 via APIs incurs costs per token and potential latency issues. Fine-tuning and hosting models like OpenLLaMA require significant computational resources (GPUs, memory) and MLOps infrastructure.
  • Latency: For real-time or high-frequency trading applications, the inference latency of large LLMs could be prohibitive. Even fine-tuned smaller models might struggle to meet stringent low-latency requirements compared to traditional models.
  • Hallucination and Reliability: LLMs can generate plausible but factually incorrect explanations (hallucinations). Ensuring the explanations reliably reflect the actual data and reasoning process is challenging. Evaluating the factual consistency and logical soundness of explanations requires careful human oversight or automated checks (a simple numeric-consistency check is sketched after this list).
  • Evaluation of Explanations: Quantifying the "quality" of an explanation is inherently difficult. Metrics might include relevance (does it use the provided data?), coherence (is it logically sound?), and insightfulness (does it provide non-obvious connections?). This often requires subjective human evaluation.
  • Model Staleness: Financial markets evolve. Both the pre-trained knowledge of LLMs and the patterns learned during fine-tuning can become outdated, requiring periodic model updates or retraining with fresh data.
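
As one concrete mitigation for the hallucination concern above, an inexpensive automated check is to verify that every number cited in a generated explanation actually occurs in the prompt. A minimal sketch follows; the regex heuristic is an assumption, not a method from the paper.

```python
# Faithfulness heuristic: flag numbers cited in an explanation that never
# appeared in the prompt (illustrative only; ignores derived quantities).
import re

def extract_numbers(text: str) -> set:
    return {round(float(m), 2) for m in re.findall(r"-?\d+(?:\.\d+)?", text)}

def unsupported_numbers(prompt: str, explanation: str) -> set:
    return extract_numbers(explanation) - extract_numbers(prompt)

prompt = "Closing prices: 101.20, 102.80, 104.00. Net change: +2.8%."
explanation = "The stock rose 2.8% over the window, from 101.20 to 104.00."
print(unsupported_numbers(prompt, explanation))  # set() -> every figure grounded
```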

Conclusion

The paper "Temporal Data Meets LLM -- Explainable Financial Time Series Forecasting" (Yu et al., 2023 ) demonstrates the potential of using LLMs to address key challenges in financial forecasting, namely multi-modal data integration, cross-sequence reasoning, and explainability. By processing both numerical time series and textual news data, models like GPT-4 (zero-shot) and fine-tuned OpenLLaMA were shown to outperform traditional baselines, offering not just predictions but also natural language justifications. While the approach shows promise, practical implementation faces hurdles related to data representation, prompt engineering, computational cost, reliability, and the evaluation of explanation quality. The results suggest a trade-off between the performance of large proprietary models and the adaptability of fine-tuned open-source alternatives.

Authors
  1. Xinli Yu
  2. Zheng Chen
  3. Yuan Ling
  4. Shujing Dong
  5. Zongyi Liu
  6. Yanbin Lu