Financial News and Stock Price Integration Dataset

Updated 20 December 2025

FNSPID is a comprehensive, multi-modal dataset that synchronizes financial news sentiment with stock price data across two decades.
It employs rigorous acquisition, cleaning, and temporal alignment protocols alongside lexicon and transformer-based sentiment scoring methods.
Empirical analyses demonstrate that integrating qualitative news cues with quantitative price signals enhances both short-term forecasting and long-run market trend analyses.

The Financial News and Stock Price Integration Dataset (FNSPID) is a large-scale, multi-modal resource designed for integrated analysis of financial news sentiment and quantitative stock market data. FNSPID synthesizes millions of time-aligned news records and stock price series from major public markets over more than two decades. It is engineered to enable rigorous research on the dynamic interplay between qualitative information (news, sentiment) and quantitative factors (prices, volumes, financials), serving as a foundational benchmark for financial forecasting using machine learning, econometrics, and LLMs.

1. Dataset Composition and Coverage

FNSPID aggregates both structured and unstructured financial data, including historical stock prices, full-text news articles, and derived sentiment scores. The dataset encompasses:

Stock price records: 29.7 million daily records for 4,775 S&P 500 companies, spanning January 1, 1999 to December 31, 2023.
News articles: 15.7 million time-stamped articles, sourced from NASDAQ (scraped real-time), Bloomberg, Reuters, Benzinga, and Lenta. Coverage includes English and Russian records.
Market indices: Auxiliary analyses include ETF/sector-level prices (VOO, ACWI, VTI, EFA, IWM, XLF), all aligned at daily granularity for periods such as 2018–2019.
Expanded modalities (for selected subcorpora): Quarter-level fundamental reports (income statements, balance sheets, cash flows), company metadata (sector, principal products/services), and press content from ≈30 additional business news sources (e.g., CNBC, Forbes, CNN) (Dong et al., 2024, Zhang et al., 16 Sep 2025, Elahi et al., 2024).

The resulting corpus provides both article-level and fully integrated time series for equities, with ≥ 3.7 years of dense coverage for Indian NIFTY50 stocks and 25 years for S&P 500 constituents (Srinivas et al., 2023, Dong et al., 2024).

2. Data Schema and Storage Formats

FNSPID maintains separate but joinable tables for prices and news, consistently aligned by ticker and trading date. Key schemas include:

Stock Price Table

date (YYYY-MM-DD)
ticker (e.g., AAPL)
open, high, low, close, adjusted_close (USD)
volume (integer)

News Table

timestamp (ISO 8601)
ticker(s) (list of strings)
source (identifier: NASDAQ, Bloomberg, etc.)
URL (unique key)
headline (string)
body_text (string)
language (EN or RU)
summaries (outputs of LexRank, Luhn, TextRank, LSA, each 3 sentences)
sentiment_score (integer 1–5)

Linkage Table (Integrated View)

Per-day, per-ticker, storing OHLCV and multiple section-level sentiment scores (from VADER, Harvard IV-4, Loughran-McDonald) (Srinivas et al., 2023).
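
A minimal sketch of how such a per-day, per-ticker view could be assembled from the two tables. It assumes in-memory rows as dictionaries; the helper name `build_integrated_view`, the `n_articles` field, and all sample values are hypothetical, and only the integer `sentiment_score` is joined here rather than the section-level lexicon scores.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def build_integrated_view(price_rows, news_rows):
    """Join daily prices with the mean same-day news sentiment per ticker."""
    # Group article sentiment scores by (ticker, calendar date of the timestamp).
    scores = defaultdict(list)
    for article in news_rows:
        day = datetime.fromisoformat(article["timestamp"]).date().isoformat()
        for ticker in article["tickers"]:
            scores[(ticker, day)].append(article["sentiment_score"])

    view = []
    for row in price_rows:
        key = (row["ticker"], row["date"])
        merged = dict(row)
        # None marks a day without news; imputation is a separate step.
        merged["sentiment_score"] = mean(scores[key]) if key in scores else None
        merged["n_articles"] = len(scores.get(key, []))
        view.append(merged)
    return view


# Hypothetical sample rows, for illustration only.
prices = [
    {"date": "2023-03-01", "ticker": "AAPL", "close": 145.31},
    {"date": "2023-03-02", "ticker": "AAPL", "close": 145.91},
]
news = [
    {"timestamp": "2023-03-01T09:15:00+00:00", "tickers": ["AAPL"], "sentiment_score": 4},
    {"timestamp": "2023-03-01T16:40:00+00:00", "tickers": ["AAPL", "MSFT"], "sentiment_score": 2},
]
view = build_integrated_view(prices, news)
```

The sketch takes the calendar date directly from each timestamp; handling of time zones and of articles published after the market close is left out.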

File formats:

Raw news: JSON per article.
Processed time series: Parquet (columnar), partitioned by ticker, with fallback CSVs for maximal interoperability.
Derived multi-modal prompt instances: JSONL (containing aligned news, financials, and outcome labels for LLM experimentation) (Elahi et al., 2024).

3. Data Acquisition, Cleaning, and Alignment Protocols

Acquisition:

Prices: Fetched via Yahoo! Finance API (pandas-datareader/yfinance), daily granularity, forward-filling missing values for weekends and holidays.
News:
- NASDAQ: Real-time scraping via Selenium.
- Bloomberg, Reuters, Benzinga, Lenta: Historical dumps, de-duplicated by URL and timestamp.
- Filtered for target tickers in headline/full text.
Financials: SEC EDGAR and company websites (for 10-K filings), joined on reporting dates (Elahi et al., 2024).
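
The forward-filling step for non-trading days can be illustrated as follows. This is a sketch under stated assumptions, not the dataset's pipeline: the function name `forward_fill_daily` and the sample closes are hypothetical, and only a single close series is filled.

```python
from datetime import date, timedelta


def forward_fill_daily(closes, start, end):
    """Return one close per calendar day, carrying the last observed value forward."""
    filled = {}
    last = None
    day = start
    while day <= end:
        if day in closes:
            last = closes[day]
        if last is not None:  # skip leading days before the first observation
            filled[day] = last
        day += timedelta(days=1)
    return filled


# Hypothetical closes for a Friday and the following Monday.
closes = {date(2023, 3, 3): 151.03, date(2023, 3, 6): 153.83}
filled = forward_fill_daily(closes, date(2023, 3, 3), date(2023, 3, 6))
```

With pandas, a comparable result could be obtained by reindexing a series onto a full calendar index and calling `ffill()`.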

Preprocessing:

News cleaning: Strip HTML, remove control/non-printable characters, normalize Unicode, filter by language.
Deduplication: Remove exact duplicates by hash/timestamp; select earliest crawl time for conflicts.
Sentence segmentation: Sumy or rule-based for summary extraction and weighting; tokenize as required for downstream sentiment libraries.
Temporal alignment: News for date $t$ is matched to returns or financial targets for $t$ (or, in some settings, for the interval from $t-1$ to $t$).
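
The cleaning and deduplication steps listed above can be sketched with the Python standard library. The function names, the regular expression for tags, and the `crawled_at` field are assumptions made for illustration; the hash is taken over the cleaned body text, which is one possible reading of "exact duplicates by hash".

```python
import hashlib
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw):
    """Strip HTML tags, normalize Unicode, and drop non-printable characters."""
    text = html.unescape(TAG_RE.sub(" ", raw))
    text = unicodedata.normalize("NFKC", text)
    # Replace control/format characters (Unicode categories starting with "C").
    text = "".join(ch if unicodedata.category(ch)[0] != "C" else " " for ch in text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def deduplicate(articles):
    """Keep one record per content hash, preferring the earliest crawl time."""
    earliest = {}
    for art in articles:
        digest = hashlib.sha256(clean_text(art["body_text"]).encode("utf-8")).hexdigest()
        # ISO 8601 strings in a common format compare correctly as plain strings.
        if digest not in earliest or art["crawled_at"] < earliest[digest]["crawled_at"]:
            earliest[digest] = art
    return list(earliest.values())
```

Language filtering and sentence segmentation would follow these steps and are not shown.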

Aggregation:

For multi-ticker articles, sentiment assignments are split or duplicated as needed.
Missing-news extrapolation for days with no news: $S_t = 3 + (S_0 - 3) e^{-0.03 t}$ (Dong et al., 2024).
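
The decay rule above pulls a score back toward the neutral value 3 as time passes without news. A small sketch, assuming $S_0$ is the last observed score and $t$ counts days since that observation (the source formula does not spell out these definitions); the function and constant names are hypothetical.

```python
import math

NEUTRAL = 3.0      # midpoint of the 1-5 sentiment scale
DECAY_RATE = 0.03  # rate constant taken from the formula above


def decayed_sentiment(last_score, days_since_news):
    """Evaluate S_t = 3 + (S_0 - 3) * exp(-0.03 * t)."""
    return NEUTRAL + (last_score - NEUTRAL) * math.exp(-DECAY_RATE * days_since_news)
```

Under this reading, a score of 5 decays to about 4.48 after ten news-free days, and a neutral score of 3 stays at 3.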

4. Sentiment Representation and Scoring Methodologies

Sentiment scoring in FNSPID leverages domain-specific lexica, transformer-based models, and aggregation protocols:

Lexicon Approaches (NIFTY50 edition; Srinivas et al., 2023):

Section-level scores (headline, synopsis, full text) via:
- VADER: $S_{\text{vader}}(a,sec) = v / \sqrt{v^2 + \alpha}$ , with $v$ sum of valence scores, $\alpha=15$ .
- Harvard IV-4 and Loughran–McDonald: $S(a,sec) = (pos - neg) / (pos + neg)$ , default 0 if denominator vanishes.
Daily aggregation per ticker:

$S^L_{t,sec}(d) = \begin{cases} (1/N_t(d)) \sum_{a \in A_t(d)} S_L(a,sec), & N_t(d) > 0 \\ 0, & \text{otherwise} \end{cases}$
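
The three formulas above translate directly into code. The sketch below assumes the valence sum and the positive/negative word counts have already been produced by the respective lexicon; the function names are hypothetical.

```python
import math


def vader_normalize(valence_sum, alpha=15.0):
    """Squash a raw valence sum v into (-1, 1) via v / sqrt(v^2 + alpha)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)


def polarity_ratio(pos, neg):
    """(pos - neg) / (pos + neg), defaulting to 0 when no lexicon words match."""
    total = pos + neg
    return (pos - neg) / total if total else 0.0


def daily_mean(scores):
    """Average the article-level scores for one ticker and day; 0 if there are none."""
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a valence sum of 1 gives `vader_normalize(1)` equal to 0.25, since the denominator is the square root of 16.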

Transformer/LLM Approaches (Dong et al., 2024, Zhang et al., 16 Sep 2025):

Summarization: Four extractive methods applied to body text; sentences mentioning the ticker weighted more heavily.
LLM-based scoring: ChatGPT (temperature=0) converts summaries to integer sentiment scores (1–5); the zero temperature is intended to make outputs as reproducible as possible.
FinBERT-based time series (Zhang et al., 16 Sep 2025):
- Sentence-level class posteriors aggregated for each day.
- Daily sentiment index: built from the aggregated posteriors and z-scored over the window.
Normalization: Scores are mapped onto a bounded continuous range or kept as integers, depending on the model.
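
A sketch of the FinBERT-style daily aggregation and standardization. The exact index formula is not given here, so the polarity form P(positive) minus P(negative) is an assumption, as are the function names and the use of the population standard deviation.

```python
from statistics import mean, pstdev


def daily_index(posteriors):
    """Mean of P(positive) - P(negative) over one day's sentences (assumed form)."""
    if not posteriors:
        return 0.0
    return mean(p["positive"] - p["negative"] for p in posteriors)


def zscore(series):
    """Standardize a window of daily index values to zero mean and unit variance."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]  # a constant window carries no signal
    return [(x - mu) / sigma for x in series]
```

In a forecasting setting the mean and standard deviation would normally be estimated on past data only, to avoid leaking future information into the standardized series.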

5. Integration with Stock Price and Financial Data

Time-series construction:

For each ticker and date:
- Assemble OHLCV + daily aggregate sentiment vector + financial fundamentals (if available).
- For LLM-based fusion, create prompt instance with:
- Most recent four quarters of financials.
- Top-$k$ news chunks (retrieved using OpenAI embeddings and cosine similarity within a 60-day window).
- Explicit target labels (e.g., forward 3/6-month returns: binary UP/DOWN) (Elahi et al., 2024).
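
The retrieval step can be sketched as a cosine-similarity ranking restricted to a look-back window. Embeddings are assumed to be precomputed (toy two-dimensional vectors stand in for real embedding vectors), the window is assumed to end at the prediction date, and the function and field names are hypothetical.

```python
import math
from datetime import date, timedelta


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunks, as_of, k=3, window_days=60):
    """Rank chunks dated within the look-back window by similarity to the query."""
    start = as_of - timedelta(days=window_days)
    in_window = [c for c in chunks if start <= c["date"] <= as_of]
    ranked = sorted(in_window, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return ranked[:k]


# Toy chunks; "c" matches the query perfectly but falls outside the window.
chunks = [
    {"id": "a", "date": date(2023, 5, 20), "embedding": [1.0, 0.0]},
    {"id": "b", "date": date(2023, 5, 1), "embedding": [0.6, 0.8]},
    {"id": "c", "date": date(2023, 1, 1), "embedding": [1.0, 0.0]},
    {"id": "d", "date": date(2023, 4, 15), "embedding": [0.0, 1.0]},
]
selected = top_k_chunks([1.0, 0.0], chunks, as_of=date(2023, 6, 1), k=2)
```

Filtering by date before ranking keeps stale articles out of the prompt even when they are semantically close to the query.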

Integration table example (simplified):

date	ticker	open	close	vad_head	vad_syn	vad_full	hiv4_head	hiv4_syn	lm_head	...	sentiment_score
2019-01-01	RELIANCE	1220.50	1230.75	0.00	0.00	0.00	0.00	0.00	0.00	...	0
2019-01-02	RELIANCE	1230.80	1240.55	0.1254	-0.0310	0.0758	0.1000	-0.0200	0.0800	...	1

For LLM prompt instances, multi-modal JSONs encode all financial and news context for downstream classification (Elahi et al., 2024).
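
One possible shape for such a prompt instance, serialized as a JSONL line. The key names (`as_of`, `financials`, `news`, `label`) and both helper functions are hypothetical, and the quarterly records are assumed to be ordered from oldest to newest.

```python
import json


def make_prompt_instance(ticker, as_of, quarterly_financials, news_chunks, label):
    """Bundle financial and news context with the outcome label for one example."""
    return {
        "ticker": ticker,
        "as_of": as_of,
        "financials": quarterly_financials[-4:],  # most recent four quarters
        "news": news_chunks,
        "label": label,  # e.g. "UP" or "DOWN"
    }


def to_jsonl(instances):
    """Serialize instances as one JSON object per line."""
    return "\n".join(json.dumps(inst, ensure_ascii=False) for inst in instances)
```

Keeping one self-contained object per line lets a training or evaluation script stream examples without loading the whole file.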

6. Benchmarking, Econometric Linkage, and Empirical Findings

Forecasting experiments (Dong et al., 2024, Srinivas et al., 2023, Elahi et al., 2024):

Models: CNN, RNN, LSTM, GRU, Transformer, TimesNet, and LLMs (GPT-3/4, LLaMA-2/3).
Features: Purely quantitative inputs, optionally augmented with ChatGPT-, FinBERT-, or VADER-based sentiment series.
Tasks: 3/6-month price movement classification (UP/DOWN), short-horizon forecasting (3-day ahead regression).
Findings:
- Transformer models achieve $t$ 4 on 50-stock subsets.
- Adding high-quality ChatGPT/FinBERT sentiment delivers consistent, though modest, boost in predictive $t$ 5 or F1 (Dong et al., 2024, Zhang et al., 16 Sep 2025).
- Simple sentiment from off-the-shelf tools such as TextBlob can degrade performance; the quality of the model or lexicon is critical.
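
The two task types listed above differ mainly in how targets are built from the close series. A sketch, assuming horizons are counted in trading days (a 3-month horizon would then be roughly 63 rows, which is an assumption rather than a documented setting); the function names are hypothetical.

```python
def direction_label(closes, i, horizon):
    """UP if the close `horizon` rows ahead exceeds the close at row i, else DOWN."""
    j = i + horizon
    if j >= len(closes):
        return None  # not enough future data to label this row
    return "UP" if closes[j] > closes[i] else "DOWN"


def regression_target(closes, i, horizon=3):
    """Close price `horizon` rows ahead, used as a short-horizon regression target."""
    j = i + horizon
    return closes[j] if j < len(closes) else None
```

Rows near the end of the series return `None` and would be dropped before training, so that no label depends on prices that do not exist yet.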

Market linkage analysis (Zhang et al., 16 Sep 2025):

DCC-GARCH: Joint dynamics of daily returns and standardized sentiment, showing highly persistent and economically significant ( $t$ 6) correlations.
Johansen cointegration: Cumulative sentiment and log-price exhibit one significant cointegrating relationship, confirming a stable long-run equilibrium.
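
For reference, both tools have standard textbook forms. The specifications below are those standard forms (a DCC(1,1) model and a vector error-correction model for the Johansen test); whether the cited analysis uses exactly these lag orders and parameterizations is an assumption.

$H_t = D_t R_t D_t, \quad Q_t = (1 - a - b)\,\bar{Q} + a\, z_{t-1} z_{t-1}^{\top} + b\, Q_{t-1}, \quad R_t = \operatorname{diag}(Q_t)^{-1/2}\, Q_t\, \operatorname{diag}(Q_t)^{-1/2}$

Here $z_t$ stacks the GARCH-standardized residuals of returns and sentiment, $D_t$ is the diagonal matrix of their conditional standard deviations, $\bar{Q}$ is the unconditional covariance of $z_t$, and $a, b \ge 0$ with $a + b < 1$. A value of $a + b$ close to 1 corresponds to highly persistent conditional correlation $R_t$.

$\Delta y_t = \alpha \beta^{\top} y_{t-1} + \sum_{i=1}^{p-1} \Gamma_i\, \Delta y_{t-i} + \varepsilon_t$

With $y_t$ containing cumulative sentiment and log-price, one cointegrating relationship means that $\alpha \beta^{\top}$ has rank 1, so the combination $\beta^{\top} y_t$ is stationary even though each component is not.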

Empirical implication: News-driven sentiment as measured by FNSPID provides predictive and explanatory signals for both short-run market activity and long-run equity trends.

7. Access, Licensing, and Reproducibility

Public code and data: GitHub repositories provide complete workflows, documentation, and sample notebooks (Dong et al., 2024: https://github.com/Zdong104/FNSPID; Elahi et al., 2024: https://github.com/XXXX/FNSPID).
Data files: Article-level CSV/JSON, daily ticker-aligned time series in Parquet.
Access patterns: Efficient querying enabled by partitioning (e.g., by ticker/date for Parquet), compatible with pandas, pyarrow, or Spark environments.
Licensing: CC BY-SA 4.0 for major datasets (Elahi et al., 2024).
Reproducibility: All preprocessing, scraping, alignment, sentiment calculation, and modeling steps detailed in code, with versioned update pipelines and fixed random seeds (Dong et al., 2024).

FNSPID thus constitutes a comprehensive, extensible, and empirically validated resource for financial machine learning, econometric analysis, and multi-modal LLM research (Dong et al., 2024, Elahi et al., 2024, Zhang et al., 16 Sep 2025, Srinivas et al., 2023).