MarketCalls Dataset: Finance & Marketing

Updated 21 November 2025

The MarketCalls dataset is a dual-modality resource comprising a financial text corpus for numerical claim detection and a Mandarin audio benchmark for outbound marketing analysis.
The financial component, MarketCalls v1.0, is extracted from analyst reports and earnings call transcripts using custom preprocessing and a two-annotator protocol with expert adjudication, which yields high inter-annotator agreement.
The marketing audio component uses round-level segmentation, audio and text augmentation, and multi-modal feature extraction to classify purchase propensity in outbound sales calls.

The term “MarketCalls dataset” refers to two distinct resources for financial and behavioral modeling: (1) a publicly released, sentence-level corpus for numerical claim detection in finance ("MarketCalls v1.0") and (2) a proprietary Mandarin speech benchmark for purchase propensity classification in outbound marketing calls. The first supports quantitative NLP on financial text; the second supports multi-modal audio-text fusion. The following sections document their construction, schema, evaluation protocols, and access terms, drawing on Shah et al. (2024) and Liu et al. (14 Nov 2025), respectively.

1. Dataset Overview and Variants

MarketCalls, as described in Shah et al. (2024) and Liu et al. (14 Nov 2025), denotes two principal datasets:

MarketCalls v1.0 (Finance, Public Release): A CC BY 4.0 licensed corpus for numerical claim detection, assembled from analyst reports and NASDAQ-100 earnings-call transcripts.
MarketCalls (Mandarin Marketing Calls, Proprietary): A Mandarin audio dataset of 877 outbound sales calls, each annotated with purchase propensity and conversational segments, available by request for academic research.

These datasets differ in modality (text vs. audio-text), domain (financial reporting vs. customer engagement), and accessibility (public vs. controlled access).

2. Data Sources and Collection Methodology

MarketCalls v1.0 (Finance)

Analyst Reports: Sourced from Zacks Equity Research under Nexis Uni, spanning 1,530 public firms (Q1 2017–Q4 2020), 87,536 documents, ~167 million tokens.
Earnings-Call Transcripts: Collected from NASDAQ-100 investor-relations pages (Q1 2017–Q1 2023), 1,085 transcripts, ~11.6 million tokens.
Market Metadata: Incorporates daily stock prices (Polygon.io), EPS forecasts (I/B/E/S), and sector codes (Compustat GSECTOR).

MarketCalls (Mandarin, Marketing Audio)

Phone Calls: 877 automated outbound calls in Mandarin, spanning three domain contexts: dental care (490), beauty/cosmetics (229), and paid courses (158).
Annotation: Each call is labeled by senior sales managers according to a five-tier purchase propensity taxonomy: Very Positive, Neutral/Receptive, Impatient/Polite Negative, Explicit Refusal, and Not Relevant.

3. Preprocessing, Annotation Schemas, and File Organization

Preprocessing—MarketCalls v1.0

Sentence Splitting: Custom regex to resolve abbreviations and numeric contexts.
Numeric Filtering: Retain sentences that contain digits together with a currency or percent symbol.
Financial-Term Filtering: Apply an 8,200-term financial dictionary; drop sentences without dictionary hits.
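The two filtering steps above can be sketched as a pair of predicates. This is a minimal illustration, not the authors' implementation: the symbol set, the tokenization, and the five-word `FINANCIAL_TERMS` set (a stand-in for the 8,200-term dictionary) are all assumptions.

```python
import re

# Hypothetical stand-in for the 8,200-term financial dictionary.
FINANCIAL_TERMS = {"revenue", "earnings", "margin", "eps", "dividend"}

HAS_DIGIT = re.compile(r"\d")
HAS_SYMBOL = re.compile(r"[$€£¥%]")  # assumed set of currency/percent symbols
WORD = re.compile(r"[a-z]+")


def is_numeric(sentence: str) -> bool:
    """True if the sentence has at least one digit and one currency/percent symbol."""
    return bool(HAS_DIGIT.search(sentence)) and bool(HAS_SYMBOL.search(sentence))


def has_financial_term(sentence: str) -> bool:
    """True if any lowercase word of the sentence is in the dictionary."""
    words = set(WORD.findall(sentence.lower()))
    return not words.isdisjoint(FINANCIAL_TERMS)


def keep(sentence: str) -> bool:
    """Apply the numeric filter first, then the financial-term filter."""
    return is_numeric(sentence) and has_financial_term(sentence)


for s in ["Revenue is expected to reach $5 billion.",
          "Traffic grew 12% last week.",
          "The meeting starts at 9."]:
    print(keep(s), "|", s)
```

Single-word matching is a simplification; a real dictionary of this size would likely need multi-word term matching as well.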

Corpus	Raw Sentences	Numeric Sentences	Numeric+Financial	In-Claim Subset
Analyst Reports	8,583,093	2,857,567	2,364,977	336,252
Earnings Calls	48,686	41,013	1,233	1,233

Annotation Schema:

Definition: Numerical financial sentences must include a number, a currency/percent symbol, and a financial term.
Labels: “In-claim” (speculative forecast) vs. “Out-of-claim” (established fact).
Annotators: Two financial experts per sentence; disagreements resolved by a PhD-level adjudicator.
Agreement: Analyst reports 99.21%, earnings calls 95.78%.
File Format: JSON per sentence, fields include: doc_id, sector, date, sentence, numeric_flag, financial_flag, label.
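A reader for the per-sentence JSON records might look like the sketch below. Only the field names come from the description above; the field types, the example values, and the `parse_record` helper are assumptions made for illustration.

```python
import json
from dataclasses import dataclass

FIELDS = ("doc_id", "sector", "date", "sentence",
          "numeric_flag", "financial_flag", "label")


@dataclass
class SentenceRecord:
    # Field names follow the listed schema; the types are assumed.
    doc_id: str
    sector: str
    date: str
    sentence: str
    numeric_flag: bool
    financial_flag: bool
    label: str


def parse_record(line: str) -> SentenceRecord:
    """Parse one JSON object and fail loudly if a listed field is missing."""
    obj = json.loads(line)
    missing = [name for name in FIELDS if name not in obj]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return SentenceRecord(**{name: obj[name] for name in FIELDS})


# Hypothetical record; none of these values come from the released data.
example = json.dumps({
    "doc_id": "doc-0001", "sector": "45", "date": "2019-05-02",
    "sentence": "Revenue is expected to reach $5 billion.",
    "numeric_flag": True, "financial_flag": True, "label": "in_claim",
})
print(parse_record(example))
```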

Preprocessing and Segmentation—MarketCalls (Audio)

Audio Processing: Mono WAV, 8 kHz, 16-bit PCM; segmentation into salesperson/customer “rounds” (max 40 s per segment).
Feature Extraction:
- Audio: Wav2Vec conv frontend, HuBERT transformer encoder ($L_a = 999$ frames $\times$ $d_a = 768$).
- Text: iFlytek ASR transcript, RoBERTa encoder ($L_t = 199$, $d_t = 768$).
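The audio length of 999 frames matches what a 40 s segment at 8 kHz would produce under the standard wav2vec 2.0 / HuBERT base convolutional feature encoder. The sketch below checks that arithmetic. The kernel and stride values are those of the published base configuration; that the source uses exactly this configuration, and that samples are fed without resampling, are assumptions.

```python
# Kernel widths and strides of the standard wav2vec 2.0 / HuBERT base
# feature encoder (assumed to be the "Wav2Vec conv frontend" meant above).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def conv_output_frames(num_samples: int) -> int:
    """Number of frames left after the unpadded 1-D conv stack."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


MAX_SEGMENT_SECONDS = 40   # maximum segment length stated for the dataset
SAMPLE_RATE_HZ = 8_000     # stated sample rate
max_samples = MAX_SEGMENT_SECONDS * SAMPLE_RATE_HZ  # 320,000 samples

print(conv_output_frames(max_samples))  # 999
```

Under these assumptions, shorter segments yield fewer frames and would need padding up to $L_a$.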
Data Augmentation:
- Audio: Gaussian noise, speed perturbation, SpecAugment masking.
- Text: Homophone substitutions (10% probability per character).
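The text augmentation can be illustrated with a per-character substitution loop. This is a sketch under stated assumptions: the `HOMOPHONES` table is a tiny hypothetical example, and the 10% probability is applied only to characters that have an entry in the table, which the source does not specify.

```python
import random

# Hypothetical homophone table; a real one would be built from a
# pinyin lexicon and cover far more characters.
HOMOPHONES = {
    "他": ["她", "它"],
    "在": ["再"],
    "做": ["作"],
    "的": ["得", "地"],
}


def homophone_augment(text: str, p: float = 0.10, rng=None) -> str:
    """Replace each character that has homophones with probability p."""
    if rng is None:
        rng = random.Random()
    out = []
    for ch in text:
        candidates = HOMOPHONES.get(ch)
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(ch)
    return "".join(out)


rng = random.Random(0)  # fixed seed so repeated runs give the same variant
print(homophone_augment("他在做的事情", p=0.10, rng=rng))
```

Because substitutions are one-for-one, the augmented transcript keeps its character length, so segment boundaries stay aligned.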

Split	Calls	Segments	Audio-Augmented	Text-Augmented
Train	701	2,293	7,041	4,586
Validation	88	321	—	—
Test	88	328	—	—

4. Labeling Taxonomies and Classification Tasks

MarketCalls v1.0 (Finance)

Claim Labels: “In-claim” (speculative forecast, e.g., "Revenue is expected to reach $5 billion"), “Out-of-claim” (established fact, e.g., "Revenue was $4.39 billion").
Annotation:
- Sector-balanced sampling: 2 analyst reports per sector per year; 2 earnings calls per year.
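Sector-balanced sampling of this kind amounts to drawing a fixed number of documents from each (sector, year) group. The sketch below shows one way to do that. The `sector` and `year` keys, the seed, and the toy corpus are hypothetical, and the source does not say how documents were drawn within a group.

```python
import random
from collections import defaultdict


def sample_balanced(docs, per_group=2, seed=0):
    """Draw up to `per_group` documents from every (sector, year) group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for doc in docs:
        groups[(doc["sector"], doc["year"])].append(doc)
    sample = []
    for key in sorted(groups):  # sorted keys keep the draw reproducible
        members = groups[key]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Toy corpus: 2 sectors x 2 years x 5 documents each.
docs = [{"doc_id": f"{sector}-{year}-{i}", "sector": sector, "year": year}
        for sector in ("10", "45") for year in (2017, 2018) for i in range(5)]
picked = sample_balanced(docs)
print(len(picked))  # 8
```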

MarketCalls (Mandarin Marketing Calls)

Purchase Propensity Classes:
- Very Positive (“A”)
- Neutral/Receptive (“B”)
- Impatient/Polite Negative (“C”)
- Explicit Refusal (“D”)
- Not Relevant (“E”) — not present in final release.

Label Class	Calls	Proportion (%)
A	137	15.6
B	271	30.9
C	379	43.2
D	90	10.3
E	0	—
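The proportions in the table follow directly from the call counts, as the short check below shows. The majority-class share it prints is only a rough reference point for the four-class task, since the class balance of the test split is not given here.

```python
# Call counts per purchase-propensity class, as listed in the table.
counts = {"A": 137, "B": 271, "C": 379, "D": 90}

total = sum(counts.values())
proportions = {label: round(100 * n / total, 1) for label, n in counts.items()}
majority_label = max(counts, key=counts.get)

print(total)           # 877
print(proportions)     # {'A': 15.6, 'B': 30.9, 'C': 43.2, 'D': 10.3}
print(majority_label)  # C
```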

5. Evaluation Protocols and Metrics

MarketCalls v1.0 (Finance)

Weak-Supervision Aggregation Function:
- $K$ labeling functions $\lambda_j$ with values in $\{-1, 0, 1, 2\}$.
- Logic: If any $\lambda_j(x) = -1$, the label is “out_of_claim”; else if $\max_j \lambda_j(x) = 2$, the label is “in_claim”; otherwise, a majority vote is taken over the non-abstaining labels.
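The aggregation logic can be sketched as follows. The description does not say which value marks an abstention or how 0 and 1 map to classes, so this sketch assumes, purely for illustration, that 0 is a weak out-of-claim vote, 1 is a weak in-claim vote, and an abstaining function returns `None`. Returning `None` on a tie or when every function abstains is also an assumption.

```python
from collections import Counter

OUT_OVERRIDE = -1  # any such vote forces "out_of_claim"
IN_OVERRIDE = 2    # otherwise, any such vote forces "in_claim"
WEAK_VOTES = {0: "out_of_claim", 1: "in_claim"}  # assumed encoding


def aggregate(votes):
    """Combine labeling-function outputs for one sentence into one label."""
    if OUT_OVERRIDE in votes:
        return "out_of_claim"
    if IN_OVERRIDE in votes:
        return "in_claim"
    # Majority vote over the remaining, non-abstaining outputs.
    counts = Counter(WEAK_VOTES[v] for v in votes if v is not None)
    if not counts:
        return None  # every function abstained
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two classes
    return ranked[0][0]


print(aggregate([-1, 2, 1]))    # out_of_claim
print(aggregate([2, 0, 0]))     # in_claim
print(aggregate([1, 1, 0]))     # in_claim
print(aggregate([None, None]))  # None
```

The order of the two override checks matters: a single $-1$ wins even when another function returns 2.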
Optimism Measure (per doc $i$ ):

$\mathrm{Optimism}_i = 100 \times \frac{\mathrm{Pos.InClaim}_i - \mathrm{Neg.InClaim}_i}{\mathrm{TotalSentences}_i}$

(Sentiment computed via fine-tuned FinBERT.)
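As a worked example of the optimism measure, the sketch below computes the score from per-sentence (claim label, sentiment) pairs. The label strings and the toy document are hypothetical; in the source, sentiment comes from a fine-tuned FinBERT model, which is not reproduced here.

```python
def optimism(pos_in_claim: int, neg_in_claim: int, total_sentences: int) -> float:
    """100 * (positive in-claim - negative in-claim) / total sentences."""
    if total_sentences <= 0:
        raise ValueError("total_sentences must be positive")
    return 100.0 * (pos_in_claim - neg_in_claim) / total_sentences


def optimism_from_sentences(sentences) -> float:
    """`sentences` is a list of (claim_label, sentiment) pairs for one document."""
    pos = sum(1 for label, sentiment in sentences
              if label == "in_claim" and sentiment == "positive")
    neg = sum(1 for label, sentiment in sentences
              if label == "in_claim" and sentiment == "negative")
    return optimism(pos, neg, len(sentences))


# Toy document: 3 positive in-claim, 1 negative in-claim, 10 sentences total.
doc = ([("in_claim", "positive")] * 3
       + [("in_claim", "negative")]
       + [("out_of_claim", "neutral")] * 6)
print(optimism_from_sentences(doc))  # 20.0
```

Because the denominator counts all sentences in the document, a document with few in-claim sentences stays near zero regardless of their sentiment.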

Metrics:
- Precision $P = \mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$
- Recall $R = \mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$
- $F_1 = 2PR/(P+R)$ per class; macro-$F_1$ is the unweighted mean of the per-class $F_1$ scores
- Accuracy $= (\mathrm{TP}+\mathrm{TN})/(\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN})$
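The metrics above can be computed from confusion counts as in the sketch below. Macro-F1 is taken here as the unweighted mean of the F1 scores of the two classes, which is the usual definition. The counts in the example are invented for illustration.

```python
def f1_from_counts(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) for one class, with 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def binary_report(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall and F1 of the positive class, plus accuracy and macro-F1."""
    precision, recall, f1_pos = f1_from_counts(tp, fp, fn)
    # For the negative class the roles swap: TN are its hits,
    # FN its false alarms, FP its misses.
    _, _, f1_neg = f1_from_counts(tn, fn, fp)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_pos,
        "macro_f1": (f1_pos + f1_neg) / 2,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Invented counts: 40 TP, 10 FP, 20 FN, 30 TN.
report = binary_report(tp=40, fp=10, fn=20, tn=30)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```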

MarketCalls (Marketing Audio)

Classification Tasks:
- $k$-way accuracy (2, 3, 4, 5 classes): $\mathrm{ACC}_k = \frac{1}{N} \sum_{i=1}^N \mathbf{1}(\hat y_i = y_i)$
- Additional metrics in related benchmarks: precision, recall, $F_1$, mean absolute error, Pearson correlation, unweighted average recall (UAR).
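The accuracy formula and UAR can be sketched in a few lines. UAR is computed here as the mean of per-class recall over the classes present in the reference labels, which is its standard definition. The label sequences are invented, and how the five tiers are merged for the 2-, 3-, and 4-way settings is not specified above, so no merging is shown.

```python
from collections import defaultdict


def accuracy(y_true, y_pred) -> float:
    """Fraction of items whose predicted label equals the reference label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_average_recall(y_true, y_pred) -> float:
    """Mean of per-class recall, so every class counts equally."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


y_true = ["A", "A", "B", "C"]
y_pred = ["A", "B", "B", "C"]
print(accuracy(y_true, y_pred))                   # 0.75
print(unweighted_average_recall(y_true, y_pred))
```

With imbalanced classes such as those in this dataset, UAR penalizes a model that ignores the rarer classes, which plain accuracy does not.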

6. Access, Licensing, and Reproducibility

MarketCalls v1.0: Publicly released under CC BY 4.0, with code, filtered JSON/CSV, and reproducibility scripts to be made available on GitHub/Hugging Face, as per Shah et al. (2024).
MarketCalls (Mandarin Audio): Access by request for non-commercial research; files include mono-WAV audio, UTF-8 transcripts, and segment-level metadata. Licensing details are confirmed by South China Normal University; contacts: hongyuliu@…, ruijiewan@… (Liu et al., 14 Nov 2025). The accompanying codebase is available at https://github.com/david188888/MSMT-FN.

7. Benchmarking and Downstream Modeling

MarketCalls v1.0: Used to benchmark claim detection models and sentiment-based predictive indicators for market returns.
MarketCalls (Marketing Audio): Empirical evaluations in MSMT-FN (Liu et al., 14 Nov 2025) compare state-of-the-art multimodal models:
- MSMT-FN achieves $\mathrm{ACC}_3 = 63.83\%$, $\mathrm{ACC}_4 = 61.70\%$, and $\mathrm{ACC}_5 = 60.28\%$, substantially outperforming the MMML baseline in fine-grained classification.
- MSMT-FN employs multi-segment fusion and multi-task learning across all granularities of customer intent.

In summary, the MarketCalls datasets represent domain-specific benchmarks designed for robust modeling of financial reporting claims and real-world marketing dialogues. They provide structured splits, rigorous annotation, and evaluation protocols suited to both NLP and multi-modal audio-text research settings.

References

Shah et al. (2024). Numerical Claim Detection in Finance: A New Financial Dataset, Weak-Supervision Model, and Market Analysis.

Liu et al. (2025). MSMT-FN: Multi-segment Multi-task Fusion Network for Marketing Audio Classification.
