FNSPID Financial News & Price Dataset
- FNSPID is a large-scale dataset integrating 29.7M stock price records with 15.7M sentiment-analyzed news articles covering 4,775 S&P 500 companies from 1999–2023.
- It uses extractive summarization algorithms and ChatGPT sentiment scoring to align qualitative news with quantitative market data on a per-day basis.
- Benchmarking shows transformer models achieve an R² of 0.988 with sentiment augmentation, underscoring the dataset’s impact on predictive accuracy.
The Financial News and Stock Price Integration Dataset (FNSPID) is a large-scale, time-aligned resource designed to facilitate integrated analysis of quantitative and qualitative financial information. Developed to support advanced predictive modeling in financial markets, FNSPID uniquely combines historical stock price data with systematically processed sentiment information from financial news. Its distinctive scale, coverage, and methodological rigor position it as a central resource for research in financial forecasting, sentiment analysis, risk modeling, and reinforcement learning within the finance domain.
1. Structure, Composition, and Sources
FNSPID consists of two principal components: numerical stock market records and sentiment-annotated financial news. Specifically, it includes 29.7 million stock price entries and 15.7 million time-aligned news articles, each mapped to corresponding market events. The dataset covers 4,775 S&P 500 companies over an extensive period (1999–2023), enabling longitudinal studies on market behavior and sentiment impact.
Numerical data are acquired via the Yahoo Finance API, supplying prices, trading volumes, and other financial metrics. Textual news records are sourced from multiple reputable platforms, including NASDAQ, Bloomberg, Reuters, Benzinga, and Lenta. This multi-source aggregation enhances both diversity and representativeness across major events and routine market fluctuations.
2. Sentiment Integration and Temporal Alignment
A distinctive feature of FNSPID is its systematic integration of sentiment data with quantitative records. To minimize token overhead and maximize informativeness, all articles are distilled using established summarization algorithms (LexRank, Luhn, Latent Semantic Analysis, and TextRank) via the Sumy Python package. Each summary is constrained to approximately three sentences, optimizing for both relevance and compute efficiency.
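To make the three-sentence extractive step concrete, the sketch below scores sentences by average term frequency and keeps the top three in their original order. It is a simplified stand-in in the spirit of Luhn's frequency-based method, not the Sumy implementation and not the dataset authors' code; the function names, the naive sentence splitter, and the tiny stop-word list are all assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; real summarizers ship larger ones).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "on", "for",
              "is", "its", "was", "by", "with", "as", "at", "that", "it"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lowercased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=3):
    """Keep the highest-scoring sentences, returned in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average term frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

Texts with three or fewer sentences pass through unchanged, which mirrors the idea that the cap only matters for longer articles.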
Summarized texts are then annotated using ChatGPT, which assigns a sentiment score on a discrete scale from 1 (negative) to 5 (positive). On trading days with multiple articles, scores are averaged to yield a daily sentiment index. For days without news, sentiment is imputed by exponential decay toward the neutral value:
$$S_t = 3 + (S_0 - 3)\, e^{-\lambda t}$$
where $S_t$ denotes the sentiment score at time $t$ (measured from the most recent day with news), $S_0$ is the last available score, $\lambda$ governs the decay rate, and $3$ is the neutral baseline. This approach maintains temporal continuity and mitigates data gaps, supporting robust time-series correlation analysis.
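The imputation rule can be sketched in a few lines, assuming the decay takes the form "neutral baseline plus the last score's offset from neutral, shrunk exponentially with the number of days elapsed". The function names, the default decay rate of 0.5, and the choice to use the neutral value before any news has been observed are illustrative assumptions, not values taken from the dataset's code.

```python
import math

NEUTRAL = 3.0  # neutral baseline on the 1-5 sentiment scale


def decayed_sentiment(last_score, days_since, decay_rate):
    """Sentiment pulled from last_score toward NEUTRAL after days_since days."""
    return NEUTRAL + (last_score - NEUTRAL) * math.exp(-decay_rate * days_since)


def impute_gaps(daily_scores, decay_rate=0.5):
    """Fill None entries in a day-ordered list by decaying the last observed score.

    Days before the first observation are set to NEUTRAL (an assumption).
    The default decay_rate is a placeholder, not a published value.
    """
    filled = []
    last_score, gap = None, 0
    for score in daily_scores:
        if score is not None:
            last_score, gap = score, 0
            filled.append(float(score))
        elif last_score is None:
            filled.append(NEUTRAL)
        else:
            gap += 1
            filled.append(decayed_sentiment(last_score, gap, decay_rate))
    return filled
```

With a decay rate of ln 2, each news-free day halves the distance to neutral, so a score of 5 becomes 4, then 3.5, and so on until the next article resets it.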
3. Predictive Modeling and Experimental Results
FNSPID has been subject to comprehensive benchmarking with both classical and modern prediction frameworks, including LSTM, RNN, GRU, TimesNet, and transformer-based architectures. Experimental results demonstrate:
- Transformer models achieve the highest prediction accuracy on integrated tasks, obtaining an $R^2$ of $0.988$ with sentiment augmentation.
- LSTM and GRU remain competitive but trail transformers, especially when incorporating sentiment features.
- Expanding the training set yields substantial improvements; transformer accuracy increases by $0.13$ in $R^2$ when more data are included.
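For readers who want to reproduce the metric behind these figures, the following is a minimal sketch of the coefficient of determination under its standard definition, one minus the ratio of residual to total sum of squares. The source does not spell out which variant the benchmark uses, so treating it as the standard definition is an assumption, and the function name is illustrative.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: squared error of the predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: squared deviation of the targets from their mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when actual values are constant")
    return 1.0 - ss_res / ss_tot
```

A model that always predicts the mean of the targets scores 0 under this definition, and a perfect model scores 1, which gives a reference frame for reading a value such as 0.988.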
Sentiment scores supplied by ChatGPT outperform simpler alternatives (such as TextBlob), providing modest yet measurable gains in predictive performance. The integration of high-quality qualitative information with numerical data produces superior short-term market forecasts.
4. Construction Workflow, Reproducibility, and Tooling
FNSPID is constructed through a fully reproducible workflow:
- News articles are scraped and parsed from designated platforms.
- Price data are extracted via API.
- Preprocessing includes summarization, with additional weight assigned to sentences containing stock symbols.
- ChatGPT produces sentiment labels.
- Alignment merges news and prices on a per-day basis, using decay-imputed sentiment where applicable.
The entire procedure is documented and implemented in open-source code (github.com/Zdong104/FNSPID), enabling researchers to update or extend the dataset as new market data and news flows become available.
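The per-day alignment step can be illustrated with a small sketch that averages article scores by date and left-joins them onto price rows. This is not the repository's code: the function names, the tuple layout, and the `date`, `close`, and `sentiment` field names are hypothetical, and dates are assumed to be ISO-formatted strings so that lexical order matches chronological order.

```python
from collections import defaultdict


def daily_sentiment(articles):
    """Average article scores per calendar date.

    articles: iterable of (date, score) pairs, several of which may share a date.
    """
    buckets = defaultdict(list)
    for date, score in articles:
        buckets[date].append(score)
    return {date: sum(scores) / len(scores) for date, scores in buckets.items()}


def align(prices, articles):
    """Left-join daily sentiment onto price rows; days without news get None.

    prices: iterable of (date, close) pairs, in any order.
    """
    sentiment = daily_sentiment(articles)
    return [
        {"date": date, "close": close, "sentiment": sentiment.get(date)}
        for date, close in sorted(prices)
    ]
```

Rows whose `sentiment` is `None` mark the news-free trading days, which is where the decay-based imputation would be applied in a subsequent pass.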
5. Applications Across Financial Research
FNSPID enables diverse research applications:
- Advanced machine learning for quantitative trading and portfolio optimization.
- Multimodal approaches fusing time-series with textual sentiment data.
- Analysis of sentiment-price relationships for market structure inference.
- Risk modeling, anomaly detection, and event-driven trading.
Empirical results illustrate that larger training datasets and sentiment augmentation together improve prediction accuracy, particularly for transformer-based models. This suggests ongoing opportunities for studies in market microstructure, behavioral finance, and event-driven risk management.
6. Cross-Paper Impact and Downstream Utility
FNSPID underpins several subsequent studies. In reinforcement learning contexts (e.g., Zha et al., 9 May 2025), pre-extracted LLM-based sentiment and risk metrics enhance the reward signal, which accelerates training and improves trading agent performance without additional resource overhead. In financial sentiment modeling (e.g., Zhang et al., 16 Sep 2025), FNSPID serves as the empirical foundation for constructing daily sentiment indices. Using econometric models (DCC-GARCH, Johansen cointegration), these analyses confirm that sentiment scores derived from FNSPID exhibit statistically significant long-run comovement with major stock indices (reported values typically $0.35$–$0.45$), substantiating the economic relevance of sentiment signals.
7. Significance and Research Opportunities
FNSPID establishes a scalable paradigm for integrated financial analysis, enabling large-scale multimodal learning, empirical validation of sentiment effects, and reproducible experimentation. Its construction methodology and impact on predictive accuracy highlight both the necessity and feasibility of combining qualitative and quantitative data for real-world financial forecasting and risk assessment. The resource provides a firm basis for ongoing research in high-frequency market prediction, algorithmic trading, and econometric analysis of sentiment-driven market phenomena.
A plausible implication is that the modular design and comprehensive documentation of FNSPID will facilitate future development of adaptive predictive models and expansion into related domains, including intraday analytics and regime-sensitive reinforcement learning.