FinMultiTime Dataset Overview

Updated 28 October 2025

FinMultiTime is a comprehensive, four-modal dataset integrating bilingual financial news, structured tables, technical chart images, and price time series for market analysis.
Its methodology features precise temporal alignment, multimodal fusion, and a reproducible preprocessing pipeline to ensure synchronized data across modalities.
The dataset supports both high-frequency and low-frequency analysis with extensive coverage from 2009 to 2025, enhancing predictive accuracy and cross-market studies.

FinMultiTime is a large-scale, four-modal, bilingual dataset tailored for financial time-series analysis, distinguished by its temporal alignment of financial news, structured financial tables, K-line technical chart images, and stock price time series across both the S&P 500 (United States) and HS300 (China) equity universes from 2009 to 2025. The dataset covers 5,586 stocks with a total size of 112.6 GB, supporting minute-level, daily, and quarterly resolutions. Its design enables comprehensive multimodal fusion for financial prediction, supporting advancements in modeling, pipeline reproducibility, and cross-market analysis (Xu et al., 5 Jun 2025).

1. Modality Structure and Temporal Alignment

FinMultiTime integrates four data modalities, each temporally synchronized per stock and interval:

Financial News (Text):
- Comprises raw and summarized articles in both English and Chinese.
- Articles are first summarized with Latent Semantic Analysis (LSA); GPT-4.1 then labels each summary with a sentiment score on a 1–5 scale, per article per day.
- For each stock and trading day, only the article with the highest ticker frequency is retained.
Structured Financial Tables:
- Includes balance sheets, cash flow statements, and income/equity statements formatted in JSON/JSONL, sourced from regulatory filings (S&P 500: SEC 10-K/10-Q; HS300: analogous reports).
- Financial variables (e.g., net profit, operating cash flow) are aligned to reporting dates and forward-filled to match daily price observations.
K-line Technical Charts (Images):
- Candlestick chart images derived from daily open-high-low-close-volume (OHLCV) segments over semi-annual windows.
- Each image is converted to 8-bit grayscale; GPT-4.1 computes a long-term trend label (scale 1–5) encapsulating multimonth market direction.
Stock Price Time Series (Numerical):
- Stores normalized daily OHLCV data. Across the dataset as a whole, timestamps are minute-level for news, daily for price series, and quarterly for fundamentals.
- These series are aligned such that the modal data (news, charts, tables) can be mapped to corresponding numerical movements.

Temporal alignment ensures that all modalities referring to a particular stock and date (or reporting period) are mutually coherent, supporting joint modeling for multimodal learning and forecasting.
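
The forward-filling of table values onto daily price observations can be illustrated with a minimal sketch. It assumes report dates are sorted in ascending order; the function name `forward_fill` and the sample figures are hypothetical and are not taken from the dataset or its released code.

```python
from bisect import bisect_right
from datetime import date

def forward_fill(report_dates, report_values, trading_days):
    """For each trading day, return the value from the latest report dated
    on or before that day, or None if no report exists yet.
    Assumes report_dates is sorted in ascending order."""
    filled = []
    for day in trading_days:
        idx = bisect_right(report_dates, day)  # number of reports dated <= day
        filled.append(report_values[idx - 1] if idx > 0 else None)
    return filled

# Hypothetical quarterly net-profit figures (illustrative values only).
report_dates = [date(2024, 3, 31), date(2024, 6, 30)]
net_profit = [1.2, 1.5]

trading_days = [
    date(2024, 3, 28),
    date(2024, 4, 1),
    date(2024, 6, 28),
    date(2024, 7, 1),
]
aligned = forward_fill(report_dates, net_profit, trading_days)
print(aligned)
```

Days before the first report receive `None` rather than a back-filled value, which avoids leaking future fundamentals into earlier price observations. Whether the dataset handles that edge case the same way is not stated in the source.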

2. Coverage, Scale, and Data Resolution

Market scope: 4,694 S&P 500 constituents (English), 892 HS300 constituents (Chinese); total stock coverage is 5,586.
Time window: 2009–2025, facilitating analysis over multiple market cycles and economic regimes.
Volume: 112.6 GB total; S&P 500 subset includes 14.1 GB of text and 8.67 GB of images; HS300 subset comprises 652.53 MB of news with proportionally extensive tables and chart images.
Resolution: Minute-level (for intraday signals/news), daily (prices), quarterly/annual (structured tables).

This scale enables modeling of both high-frequency (minute) and low-frequency (quarterly/annual) market signals, uniquely supporting fusion-based methods and thorough empirical evaluation.

3. Preprocessing Pipeline and Annotation

FinMultiTime is built via a fully reproducible, multi-stage pipeline:

Data acquisition: Utilizes APIs and web scraping for both equity universes, collecting filings, stock prices, news, and chart images.
Preprocessing: Includes normalization of price series, LSA-based news preprocessing, aggregation by ticker frequency, and temporal forward-filling for tables.
Annotation: Sentiment labeling (news), trend extraction (K-line images) via GPT-4.1.
Synchronization: All data modalities are temporally mapped to daily timestamps and reporting periods, allowing extended multimodal experiments.

This pipeline ensures seamless dataset updates, adaptability to new filings and price ticks, and supports extensions for future research.
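
The ticker-frequency aggregation step, which keeps one article per stock and trading day, might look roughly like the sketch below. The `Article` record, the whole-word matching rule, and the tie-breaking behavior are assumptions made for illustration; the source does not specify them.

```python
import re
from collections import namedtuple

# Hypothetical record type; the field names are illustrative.
Article = namedtuple("Article", ["ticker", "day", "text"])

def ticker_frequency(text, ticker):
    """Count whole-word, case-sensitive occurrences of the ticker (assumed rule)."""
    return len(re.findall(r"\b" + re.escape(ticker) + r"\b", text))

def select_daily_articles(articles):
    """Keep, for each (ticker, day), the article that mentions the ticker most often.
    Ties go to the article seen first (an assumption)."""
    best = {}
    for art in articles:
        key = (art.ticker, art.day)
        freq = ticker_frequency(art.text, art.ticker)
        if key not in best or freq > best[key][0]:
            best[key] = (freq, art)
    return {key: art for key, (freq, art) in best.items()}

articles = [
    Article("AAPL", "2024-04-01", "AAPL rallies as markets open."),
    Article("AAPL", "2024-04-01", "AAPL beats estimates; analysts raise AAPL targets."),
    Article("MSFT", "2024-04-01", "MSFT announces a new data center."),
]
selected = select_daily_articles(articles)
for (ticker, day), art in selected.items():
    print(ticker, day, art.text)
```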

4. Experimental Insights and Modeling Implications

Experiments conducted on the dataset point to several findings:

Scale effect and data quality: Expanded, cleaned, and synchronized multimodal data substantially decreases prediction error, with large datasets outperforming small or unimodal analogues.
Multimodal fusion: Combining signals (numerical prices, news sentiment, chart trends, and fundamental tables) yields moderate accuracy improvements in Transformer and GRU models; for example, GRU architectures demonstrate marked improvement in mean squared error when chart-based trend features are included.
Pipeline reproducibility: Regular dataset updates and coherent preprocessing steps foster ongoing research relevance and robust benchmarking.

A plausible implication is that further advances in multimodal fusion, potentially incorporating seasonal or cross-market patterns, could continue to improve financial model accuracy, especially when large, temporally aligned datasets are available.

5. Technical Details and LaTeX Formulas

The paper describes a sentence-weighting scheme used for news summarization and sentence selection:

Stock relevance weight: $W_p(S, s) = \begin{cases} k, & \text{if } S \text{ contains } s \\ 0, & \text{otherwise} \end{cases}$ where $S$ is a sentence, $s$ a stock symbol, and $k$ a weight constant.
Sentence length and importance weight: $W_\phi(S_\text{sum}, S_\text{long}) = \begin{cases} t, & \text{if } S_\text{sum} \text{ is in longer sentences} \\ 0, & \text{otherwise} \end{cases}$
Aggregate weight calculation: $W_z = W_p + W_\phi$

This weighting mechanism allows prioritization of sentences explicitly mentioning the stock and carrying ample context, thus improving sentiment analysis and downstream predictive accuracy.
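
As a rough illustration of the aggregate weight $W_z = W_p + W_\phi$, the sketch below scores each sentence of an article. The values `k = 1.0` and `t = 1.0`, the substring test for "contains", and the rule that a sentence is "longer" when its word count exceeds the article mean are all assumptions; the source does not define these details.

```python
def relevance_weight(sentence, symbol, k=1.0):
    """W_p: k if the sentence contains the stock symbol, else 0."""
    return k if symbol in sentence else 0.0

def longer_sentences(sentences):
    """ASSUMPTION: a sentence is 'longer' if its word count exceeds the mean."""
    if not sentences:
        return set()
    mean_len = sum(len(s.split()) for s in sentences) / len(sentences)
    return {s for s in sentences if len(s.split()) > mean_len}

def length_weight(sentence, long_set, t=1.0):
    """W_phi: t if the sentence is among the longer sentences, else 0."""
    return t if sentence in long_set else 0.0

def aggregate_weights(sentences, symbol, k=1.0, t=1.0):
    """W_z = W_p + W_phi, computed for every sentence."""
    long_set = longer_sentences(sentences)
    return [
        relevance_weight(s, symbol, k) + length_weight(s, long_set, t)
        for s in sentences
    ]

sentences = [
    "Markets were mixed.",
    "AAPL rose.",
    "AAPL shares climbed after the company reported stronger quarterly revenue.",
]
weights = aggregate_weights(sentences, "AAPL")
print(weights)
```

With these assumed constants, a sentence that both names the stock and is longer than average receives the highest weight, which matches the prioritization described above.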

6. Applications and Research Utility

FinMultiTime is positioned as a comprehensive resource for:

Financial forecasting and asset allocation: Supports models that integrate sentiments, technical trends, fundamentals, and price action.
Risk management: Enables anomaly detection by correlating sentiment shifts, technical reversals, and financial metrics with price volatility.
Cross-market and bilingual analysis: Facilitates studies involving both U.S. and Chinese equities, including cross-market correlation and translation-aware sentiment effects.
Large multimodal LLM fine-tuning: Its scale, multimodality, and temporal synchronization make it suitable for developing and fine-tuning financial foundation models designed for automated advisory, reporting, or reinforcement learning trading agents.

7. Significance and Comparative Perspective

FinMultiTime extends beyond traditional, unimodal financial datasets by introducing synchronized, multi-resolution, bilingual, multimodal signals over two major global markets. This comprehensive design enables robust fusion-based prediction, supports the evaluation of architectures requiring joint numeric, textual, and image inputs, and provides an extensible platform for ongoing research in financial modeling, cross-lingual finance, and temporal multimodal learning. The reproducible pipeline and granularity of alignment set a precedent for future benchmark development in financial machine learning.

In summary, FinMultiTime constitutes a substantial advance in dataset construction for multimodal financial time series analysis, providing aligned, large-scale, and richly annotated data streams to support state-of-the-art research and practical financial decision making (Xu et al., 5 Jun 2025).

References

Xu et al. (5 Jun 2025). FinMultiTime: A Four-Modal Bilingual Dataset for Financial Time-Series Analysis.
