FNSPID: A Comprehensive Financial News Dataset in Time Series (2402.06698v1)
Abstract: Financial market predictions utilize historical data to anticipate future stock prices and market trends. Traditionally, these predictions have focused on the statistical analysis of quantitative factors, such as stock prices, trading volumes, inflation rates, and changes in industrial production. Recent advancements in LLMs motivate the integrated financial analysis of both sentiment data, particularly market news, and numerical factors. Nonetheless, this methodology frequently encounters constraints due to the paucity of extensive datasets that amalgamate both quantitative and qualitative sentiment analyses. To address this challenge, we introduce a large-scale financial dataset, namely, Financial News and Stock Price Integration Dataset (FNSPID). It comprises 29.7 million stock prices and 15.7 million time-aligned financial news records for 4,775 S&P500 companies, covering the period from 1999 to 2023, sourced from 4 stock market news websites. We demonstrate that FNSPID surpasses existing stock market datasets in scale and diversity while uniquely incorporating sentiment information. Through financial analysis experiments on FNSPID, we propose: (1) the dataset's size and quality significantly boost market prediction accuracy; (2) adding sentiment scores modestly enhances performance on the transformer-based model; (3) a reproducible procedure that can update the dataset. Completed work, code, documentation, and examples are available at github.com/Zdong104/FNSPID. FNSPID offers unprecedented opportunities for the financial research community to advance predictive modeling and analysis.
Practical Applications
Immediate Applications
Below are practical applications that can be deployed now, based directly on the dataset, findings, and methods presented in the paper.
- Financial modeling benchmarks and baselines (Academia, Finance)
  - Use FNSPID to benchmark time-series and multimodal models (LSTM, GRU, Transformer, TimesNet) against numerical-only baselines and sentiment-enhanced inputs.
  - Workflow: ingest aligned price-news sequences; train/validate models using the paper’s splits and feature sets; compare MAE/MSE/R² across architectures.
  - Assumptions/dependencies: access to the 30+GB dataset; standard DL toolchain (PyTorch/TensorFlow); reproducible scripts from the provided GitHub; awareness that transformer models benefit more from larger data.
- Sentiment-augmented trading signals and risk dashboards (Finance)
  - Add the 1–5 sentiment factor and the exponential sentiment decay to daily models for short-horizon signals, risk alerts, and volatility forecasting.
  - Workflow: compute daily sentiment features (with decay and averaging for multiple articles); integrate with price/volume features; backtest on 1999–2023 coverage.
  - Assumptions/dependencies: sentiment labels exist only for a subset (402K items); labeling quality matters (ChatGPT > TextBlob in this study); model hyperparameters require tuning; moderate to high compute resources.
- Event studies and anomaly detection (Finance, Academia)
  - Identify abnormal returns around high-sentiment or high-volume news events, spot pre-crisis news patterns (e.g., 2008–2009), and detect sentiment-price divergences.
  - Workflow: tag events by sentiment spikes; run abnormal return/event-window analyses; design anomaly detectors for news-pattern precursors.
  - Assumptions/dependencies: accurate time alignment; controls for confounders; publication delays may dilute immediate impact.
- Factor research and portfolio construction with sentiment (Finance)
  - Construct a “sentiment factor” to augment standard factors (e.g., Fama-French) for cross-sectional return models or portfolio tilts.
  - Workflow: build daily firm-level sentiment scores; test factor exposures; incorporate into portfolio optimizer; monitor Sharpe/turnover.
  - Assumptions/dependencies: stationarity challenges; sentiment may be more impactful in certain sectors; transaction costs may erode improvements.
- Model selection heuristics for practitioners (Finance, Academia)
  - Apply empirical guidance: transformers outperform others at scale; LSTM/GRU are competitive on smaller samples; sentiment helps transformers modestly.
  - Workflow: choose architectures based on data availability; add sentiment for transformer pipelines; for smaller datasets, prefer LSTM/GRU.
  - Assumptions/dependencies: performance contingent on hyperparameters; sentiment preprocessing quality; coverage bias.
- Multilingual sentiment and coverage analysis (Academia)
  - Use English/Russian splits to study cross-lingual market impact and global news coverage patterns.
  - Workflow: perform language-stratified analyses; test model robustness across languages; build language-specific sentiment features.
  - Assumptions/dependencies: varying article density by language; translation quality if needed; domain-specific lexicon differences.
- RL strategy prototyping with aligned multimodal data (Finance, Software)
  - Train reinforcement learning agents with synchronized price-volume-sentiment states for daily decision-making.
  - Workflow: adapt FinRL/FinGPT pipelines; include sentiment as state feature; evaluate reward structures (PnL, risk-adjusted returns).
  - Assumptions/dependencies: RL stability and overfitting risks; careful reward design; realistic transaction/slippage modeling.
- Retail investor tools: sentiment overlays and digest (Daily life, Fintech)
  - Build watchlist apps that summarize and score daily news per ticker using provided summarization (LSA/LexRank/TextRank) and sentiment scales.
  - Workflow: daily fetch from FNSPID; display sentiment trends and summaries; alert on significant sentiment shifts.
  - Assumptions/dependencies: coverage throttled by licensing and scraping constraints; avoid overreliance on short-term sentiment for long-term decisions.
- Corporate communications and IR monitoring (Industry)
  - Monitor how external news sentiment correlates with stock movements for internal PR/IR strategy and crisis response.
  - Workflow: sentiment trend dashboards; correlation heatmaps; post-event attribution.
  - Assumptions/dependencies: sentiment scores reflect public narratives but not internal fundamentals; lag between news release and market reaction.
- Market integrity surveillance (Policy, Regulators)
  - Detect potential manipulation by analyzing abnormal sentiment clusters followed by unusual price-volume patterns.
  - Workflow: rules/ML for sentiment-price anomalies; flag investigations; link to enforcement workflows.
  - Assumptions/dependencies: need domain thresholds; separating legitimate narratives from manipulation; governance over automated flags.
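The daily sentiment feature described in the trading-signals workflow (averaging multiple articles per day, then letting the score decay exponentially) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the neutral value of 3 (the midpoint of the 1–5 scale), the decay rate of 0.5 per day, the use of calendar days rather than trading days, and the function name `daily_sentiment` are all hypothetical choices.

```python
from collections import defaultdict
from datetime import date, timedelta
from math import exp

NEUTRAL = 3.0     # assumption: midpoint of the 1-5 scale is treated as neutral
DECAY_RATE = 0.5  # assumption: hypothetical per-day decay constant


def daily_sentiment(articles, start, end, neutral=NEUTRAL, decay_rate=DECAY_RATE):
    """Return {day: score} for every calendar day in [start, end].

    articles: iterable of (day, score) pairs with score on the 1-5 scale.
    Days with articles get the mean of that day's scores. Days without
    articles take the last observed daily mean, decayed exponentially
    toward `neutral`. Days before any article are set to `neutral`.
    """
    by_day = defaultdict(list)
    for day, score in articles:
        by_day[day].append(score)

    series = {}
    last_score, last_day = None, None
    day = start
    while day <= end:
        if day in by_day:
            # Average all articles published on the same day.
            last_score = sum(by_day[day]) / len(by_day[day])
            last_day = day
            series[day] = last_score
        elif last_score is None:
            series[day] = neutral
        else:
            # Shrink the distance from neutral by exp(-rate * days elapsed).
            gap = (day - last_day).days
            series[day] = neutral + (last_score - neutral) * exp(-decay_rate * gap)
        day += timedelta(days=1)
    return series


# Toy input: two articles on Jan 2 (scores 5 and 3), one on Jan 4 (score 1).
articles = [(date(2023, 1, 2), 5), (date(2023, 1, 2), 3), (date(2023, 1, 4), 1)]
series = daily_sentiment(articles, date(2023, 1, 1), date(2023, 1, 5))
for day, score in series.items():
    print(day, round(score, 3))
```

In this toy run, Jan 2 averages to 4.0, Jan 3 drifts back toward 3, and Jan 5 moves from 1.0 back up toward 3. In practice the decay rate would need tuning, and the resulting column would be joined to price/volume features by date.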
Long-Term Applications
These opportunities likely require further research, scaling, engineering, or policy development before broad deployment.
- Real-time, scalable multimodal pipelines (Finance, Software)
  - Build streaming ingestion that continuously aligns live news with tick-level/daily prices and produces on-the-fly sentiment features.
  - Potential products: “Live Sentiment Factor” feeds; real-time dashboards for traders.
  - Assumptions/dependencies: robust, licensed news feeds; API reliability; low-latency summarization/sentiment models; cost control.
- Domain-specific financial LLMs that natively handle numbers (Finance, AI)
  - Fine-tune LLMs (e.g., FinGPT variants) on FNSPID to jointly reason over text and time series for forecasting, risk commentary, and decision support.
  - Potential tools: multimodal transformer architectures; retrieval-augmented generation for financial contexts.
  - Assumptions/dependencies: high compute; careful training to avoid hallucinations; numeric reasoning fidelity; updated datasets.
- Advanced RL for portfolio and execution using market news (Finance)
  - Integrate market microstructure, risk constraints, and sentiment-aware state spaces to optimize dynamic allocation and trade execution.
  - Potential workflows: hierarchical RL; multi-agent simulation; risk-aware reward shaping.
  - Assumptions/dependencies: synthetic market environments; robust generalization; compliance with trading regulations.
- Cross-lingual and cross-market expansion (Global finance, Academia)
  - Extend to more languages, asset classes (bonds, commodities, crypto), and regions; enable comparative studies of news regimes.
  - Potential products: global sentiment indices; sectoral impact maps (e.g., healthcare, energy, tech).
  - Assumptions/dependencies: localized news sources; diverse compliance requirements; translation/normalization accuracy.
- Better sentiment labeling via domain-tuned models and human-in-the-loop (Academia, Industry)
  - Replace generic labeling with finance-specific sentiment models; add human review for edge cases and high-stakes events.
  - Potential tools: weak supervision, active learning, confidence scoring; event-type classifiers (earnings, M&A, litigation, regulation).
  - Assumptions/dependencies: labeling budget; improved ontologies; avoiding bias; maintaining temporal consistency.
- Policy analytics: stress testing and macroprudential early warning (Policy)
  - Use aggregated sentiment indices to stress test sectors/markets and inform macroprudential decisions (e.g., circuit breakers, disclosure timing).
  - Potential workflows: sentiment shock scenarios; systemic risk dashboards; supervisory analytics.
  - Assumptions/dependencies: validated link from sentiment to macro outcomes; careful communication to avoid procyclicality; governance for public use.
- Compliance, ESG, and reputational risk measurement (Policy, Finance)
  - Track ESG-related news sentiment per issuer; quantify reputational risk and assess materiality for disclosure and compliance.
  - Potential tools: ESG sentiment scoring; topic attribution; regulatory reporting integrations.
  - Assumptions/dependencies: reliable topic classification; evolving ESG taxonomies; heterogeneous regulatory standards.
- Education and curriculum integration (Education)
  - Develop hands-on courses and competitions around FNSPID for students to learn time-series ML, sentiment analysis, and financial modeling.
  - Potential products: course modules, Kaggle-style challenges, model labs.
  - Assumptions/dependencies: managed compute for classrooms; simplified subsets of data; ethical training on scraping and data use.
- Investor decision support assistants (Daily life, Fintech)
  - Create consumer-grade AI advisors that explain news impact on portfolios, provide scenario analysis, and synthesize multi-source narratives.
  - Potential tools: explainable AI components; guardrails; preference modeling.
  - Assumptions/dependencies: strict compliance (suitability, disclaimers); reliable personalization; prevention of overconfidence.
- Market manipulation detection standards (Policy, Industry)
  - Collaborate to define standards for sentiment-price anomaly detection, audit trails, and transparent explainability in financial AI.
  - Potential products: shared benchmarks; regulatory sandboxes; audit toolkits.
  - Assumptions/dependencies: multi-stakeholder cooperation; privacy safeguards; periodic model audits.
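The human-in-the-loop labeling idea above (confidence scoring plus human review for edge cases) could look roughly like the sketch below. Everything here is illustrative: the `LabeledArticle` class, the per-score probability dictionary, the `route_for_review` function, and the 0.7 threshold are hypothetical and not part of FNSPID or its code.

```python
from dataclasses import dataclass


@dataclass
class LabeledArticle:
    article_id: str
    probs: dict  # hypothetical model output: {score (1-5): probability}

    @property
    def label(self):
        # Predicted sentiment score is the most probable class.
        return max(self.probs, key=self.probs.get)

    @property
    def confidence(self):
        # Confidence is the probability assigned to the predicted class.
        return self.probs[self.label]


def route_for_review(articles, threshold=0.7):
    """Split articles into (auto_accepted, needs_human_review) by confidence."""
    auto, review = [], []
    for article in articles:
        (auto if article.confidence >= threshold else review).append(article)
    return auto, review


# Toy batch: one confident prediction and one ambiguous prediction.
batch = [
    LabeledArticle("a1", {1: 0.02, 2: 0.03, 3: 0.05, 4: 0.10, 5: 0.80}),
    LabeledArticle("a2", {1: 0.05, 2: 0.30, 3: 0.35, 4: 0.25, 5: 0.05}),
]
auto, review = route_for_review(batch)
print("auto-accepted:", [(a.article_id, a.label) for a in auto])
print("human review:", [(a.article_id, a.label) for a in review])
```

Max-probability is only one possible confidence measure. Entropy or the margin between the top two classes are common alternatives, and the threshold would be set against the available labeling budget.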
General Assumptions and Dependencies Across Applications
- Data licensing and ethics: scraping must respect robots.txt, copyrights, and regional policies; using third-party APIs (e.g., ChatGPT) entails cost and compliance.
- Coverage and bias: dataset drawn from selected sources (e.g., NASDAQ, Reuters, Bloomberg, Benzinga, Lenta) may reflect source biases; survivorship bias and symbol coverage should be checked.
- Sentiment quality: summarization (Sumy LSA/LexRank/TextRank/Luhn) reduces tokens but may lose nuance; ChatGPT-labeled scores outperform TextBlob here but still have variability; decay model assumes diminishing impact toward neutral.
- Model sensitivity: transformers gain only modestly from sentiment and benefit more from larger training sets; hyperparameter tuning materially affects outcomes; high accuracy ceilings limit incremental gains.
- Compute and MLOps: training at scale needs GPU resources and disciplined pipelines; real-time deployments need robust engineering.
- Temporal alignment: news timing vs. market reaction lag must be modeled; batching multiple articles per day via averaging can mask asymmetric impacts.
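One simple way to handle the temporal-alignment point is to map each article to the first trading day whose close could reflect it. The sketch below is illustrative only: it assumes timestamps are already in the exchange's local time, assumes a 16:00 close, and ignores market holidays. The names `MARKET_CLOSE` and `effective_trading_day` are hypothetical.

```python
from datetime import date, datetime, time, timedelta

MARKET_CLOSE = time(16, 0)  # assumption: 16:00 close in exchange local time


def next_weekday(d):
    """Roll a date forward until it lands on Monday-Friday (holidays ignored)."""
    while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        d += timedelta(days=1)
    return d


def effective_trading_day(published):
    """Map a publication timestamp to the first trading day that can react to it.

    News published before the close on a weekday counts for that day.
    News published at or after the close, or on a weekend, counts for the
    next weekday.
    """
    d = published.date()
    if d.weekday() < 5 and published.time() < MARKET_CLOSE:
        return d
    return next_weekday(d + timedelta(days=1))


# Wednesday morning, Friday after the close, and Saturday.
intraday = effective_trading_day(datetime(2023, 1, 4, 10, 0))
after_close = effective_trading_day(datetime(2023, 1, 6, 17, 30))
weekend = effective_trading_day(datetime(2023, 1, 7, 10, 0))
print(intraday, after_close, weekend)
```

A production pipeline would swap the weekday check for a real exchange calendar and handle time zones explicitly.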