
MSP-Podcast Dataset Overview

Updated 18 August 2025
  • The MSP-Podcast Dataset is a multimodal collection of over 100,000 English podcast episodes featuring complete audio, word-level timestamped transcripts, and rich metadata.
  • It supports diverse research tasks such as segment retrieval, abstractive summarization, speech emotion recognition, and expressive voice conversion using advanced deep learning methods.
  • Its structured segmentation, precise annotations, and comprehensive metadata enable innovative evaluation strategies and significantly enhance podcast content analysis.

The MSP-Podcast Dataset is a large-scale, multimodal corpus of over 100,000 English-language podcast episodes, released in early 2020 in support of the TREC Podcast Track. It is central to research in spoken content retrieval, summarization, natural language processing, voice conversion, and speech emotion recognition. Each episode is accompanied by automatic transcripts, rich metadata, and full audio files, making the dataset suitable for a range of deep learning and information retrieval tasks. The dataset’s structure and contents have driven the development of new methodologies in segment retrieval, abstractive summarization, emotion recognition, and expressive speech generation.

1. Composition and Structure

The MSP-Podcast dataset contains just over 100,000 podcast episodes with the following multimodal data for each episode (Jones et al., 2021):

  • Audio files: Complete recordings, representative of spontaneous conversational speech with diverse backgrounds and audio qualities.
  • Automatic transcripts: Word-level transcriptions generated with Google’s Speech-to-Text API in early 2020, with temporal granularity of 0.1 seconds.
  • Metadata: Episode titles, creator descriptions, and extended RSS feed information (e.g., publisher and links).

This composition allows researchers to leverage text, audio, and metadata jointly or independently, depending on the target task. The segmentation strategy for information retrieval tasks involves splitting episodes into overlapping two-minute segments, resulting in approximately 3.4 million segments with an average word count of 340 ± 70 per segment.
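The windowing itself is straightforward; the sketch below produces overlapping two-minute windows over word-level timestamps, assuming a one-minute stride between window starts (the track's exact stride is an assumption here):

```python
def segment_transcript(words, window_s=120.0, stride_s=60.0):
    """Split a word-level timestamped transcript into overlapping windows.

    words: list of (token, start_time_seconds) tuples, in temporal order.
    Returns a list of (window_start_time, token_list) segments.
    """
    if not words:
        return []
    end_time = words[-1][1]
    segments = []
    start = 0.0
    while start <= end_time:
        # collect tokens whose start time falls inside this window
        tokens = [w for w, t in words if start <= t < start + window_s]
        if tokens:
            segments.append((start, tokens))
        start += stride_s
    return segments
```

Applied per episode, this is the kind of procedure that turns ~100,000 episodes into the roughly 3.4 million retrieval segments described above.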

2. Shared Tasks and Evaluation Paradigms

The MSP-Podcast dataset serves as the foundation for two shared tasks in the TREC Podcast Track:

A. Segment Retrieval Task:

Participants retrieve two-minute overlapping segments in response to traditional TREC-formatted queries. Topics include type labels (topical, refinding, known item) and detailed descriptions. Evaluation employs the PEGFB graded relevance scale (Perfect–Excellent–Good–Fair–Bad), mean nDCG, nDCG@30, and precision at rank 10.
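Under an illustrative mapping of the PEGFB grades to integer gains (Perfect = 4 down to Bad = 0; the track's official gain mapping may differ), nDCG@k can be computed as:

```python
import math

def dcg(gains):
    # discounted cumulative gain: gain at rank i discounted by log2(i + 1)
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_gains, k):
    """nDCG@k for one topic; ranked_gains are graded relevance values
    in the order the system returned the segments."""
    ideal = sorted(ranked_gains, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(ranked_gains[:k]) / denom if denom > 0 else 0.0
```

Per-topic scores are then averaged to obtain the mean nDCG reported for a run.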

B. Summarization Task:

The objective is to generate concise, grammatical summaries suitable for rapid smartphone consumption. Rather than standardized gold summaries, the “Brass Set”—a filtered pool of creator-provided descriptions—is used as the reference. Summaries are evaluated manually on the EGFB scale (Excellent–Good–Fair–Bad, weighted 4–2–1–0) and automatically via ROUGE-L.
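ROUGE-L is the longest-common-subsequence F-measure between a candidate summary and the reference description. A minimal stdlib version (the `beta` weighting below is an assumed setting, not the track's official configuration):

```python
def lcs_len(a, b):
    # classic dynamic-programming longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f(candidate, reference, beta=1.2):
    """LCS-based F-measure; beta > 1 weights recall more heavily."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)
```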

In both tasks, deep learning was the dominant methodology: transformer models (BERT, XLNet, BART, T5, SpanBERT), transfer learning leveraging verbose topic descriptions, and neural architectures integrating GANs and LSTMs were prominent. Most systems focused on the transcript and metadata; usage of audio features was rare but encouraged for future tracks.

3. Deep Learning Methodologies and Key Results

Deep neural approaches were central across all tasks and are detailed as follows:

  • Segment Retrieval:

Top systems integrated BM25 or query likelihood (traditional IR) with transformer-based reranking (BERT, XLNet). The use of full transcripts, as opposed to titles/descriptions, yielded superior retrieval effectiveness (nDCG up to ~0.67). One team used alternative transcripts generated directly from audio, but this remained the exception.
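A minimal sketch of the first-stage scoring these pipelines build on, using the standard Okapi BM25 formula (the parameter values k1 = 1.2 and b = 0.75 are common defaults, not track-specific settings):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one document (token list) against a query.

    corpus: list of token lists, used for document frequencies and
    average document length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for q in set(query_terms):
        df = sum(1 for d in corpus if q in d)
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[q]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score
```

In the reported systems, the top segments from this kind of first stage are then reranked by a transformer cross-encoder such as BERT.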

  • Summarization:

Abstractive approaches prevailed, utilizing BART (pretrained on the CNN/DailyMail news corpus and fine-tuned on podcast data), T5 with dialogue action tokens, and SpanBERT for hybrid extractive–abstractive summarization. Audio features were not used explicitly, and preprocessing mostly involved sentence segmentation (e.g., with spaCy).

The results revealed that high-quality retrieval and summarization correlated strongly with the inclusion of main topics and named entities. However, substantial variance in manual judgments (“Fair” scores) highlighted limitations of evaluation using only creator descriptions. The dataset itself posed unique challenges due to length, noisy ASR transcripts, and highly variable content.

4. Applications Beyond Information Retrieval

The richness of MSP-Podcast underpins further research domains:

  • Segmentation of Spoken Content:

Collection and annotation of over 400 ASR-generated transcripts have fostered sequence labeling pipelines for identifying introduction segments via fine-tuned BERT models. Advanced data augmentation (TF–IDF word replacement and random token alteration) boosts generalization to noisy, loosely structured speech, enabling broader structure-based segmentation (Jing et al., 2021).
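One plausible form of TF-IDF-based word replacement (the selection rule and the replacement vocabulary below are assumptions for illustration, not the cited paper's exact recipe) swaps out a document's least informative tokens:

```python
import math
import random
from collections import Counter

def tfidf_weights(doc, corpus):
    # per-token TF-IDF weight within one document
    tf = Counter(doc)
    n = len(corpus)
    return {w: (tf[w] / len(doc)) * math.log(n / (1 + sum(1 for d in corpus if w in d)))
            for w in tf}

def augment(doc, corpus, vocab, p=0.15, seed=0):
    """Replace the lowest-TF-IDF tokens with random words from vocab."""
    rng = random.Random(seed)
    weights = tfidf_weights(doc, corpus)
    ranked = sorted(doc, key=lambda t: weights[t])  # least informative first
    n_replace = max(1, int(p * len(doc)))
    targets = set(ranked[:n_replace])
    return [rng.choice(vocab) if t in targets else t for t in doc]
```

The intuition is that perturbing uninformative tokens creates noisy-but-label-preserving training variants, which helps a fine-tuned BERT generalize to loosely structured ASR transcripts.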

  • Topic Modeling on Metadata:

Short-form metadata fields (titles, descriptions) are leveraged for unsupervised topic discovery, addressing data sparsity and annotation noise. Methods include NMF-based approaches (SeaNMF, CluWords) and the NEiCE strategy—an NE-informed document reweighting pipeline integrating named entity cues with word embeddings, yielding up to 15.7% improvement in topic coherence (Valero et al., 2022).
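NEiCE itself integrates entity cues with word embeddings inside a CluWords-style pipeline; the sketch below shows only the core reweighting idea, with a hypothetical boost factor `alpha`:

```python
def ne_reweight(term_weights, entities, alpha=2.0):
    """Boost the weights of named-entity terms before topic modeling.

    term_weights: dict of term -> TF-IDF-style weight for one document.
    entities: set of terms recognized as named entities.
    alpha: hypothetical boost factor (the real pipeline derives its
    reweighting from entity-aware embeddings, not a fixed scalar).
    """
    return {t: w * (alpha if t in entities else 1.0)
            for t, w in term_weights.items()}
```

Upweighted entity terms then dominate the factorization, which is the mechanism behind the reported topic-coherence gains on sparse metadata fields.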

  • Expressive Speech and Voice Conversion:

The NaturalVoices dataset, derived from the MSP-Podcast audio, contains over 3,800 hours of emotional, spontaneous speech, segmented into 4–6 second utterances. A robust pipeline applies Faster Whisper, CTranslate2, Montreal Forced Aligner, WavLM-based emotion detection, and WADA-SNR analysis for SNR labeling. Evaluations (speaker similarity, MOS, WER, CER) confirm high fidelity and naturalness for VC and expressive speech tasks (Salman et al., 6 Jun 2024).

  • Speech Emotion Recognition (SER):

SER challenges make use of the MSP-Podcast segment annotations across eight categorical emotions (Anger, Happiness, Sadness, Fear, Surprise, Contempt, Disgust, Neutral). Multimodal, self-supervised models (WavLM, HuBERT, RoBERTa) are fused at the score level with SVM classification. Data balancing and consensus annotation refinements address class imbalances and label noise (Duret et al., 8 Jul 2024).
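Score-level fusion of this kind can be sketched as a weighted average of per-model class scores (the uniform default weights below are an assumption; challenge systems typically tune them on a development set):

```python
def fuse_scores(model_scores, weights=None):
    """Weighted score-level fusion across models.

    model_scores: one score list per model, each over the same ordered
    set of emotion classes (e.g., the eight MSP-Podcast categories).
    Returns the index of the winning class.
    """
    n_models = len(model_scores)
    n_classes = len(model_scores[0])
    if weights is None:
        weights = [1.0 / n_models] * n_models  # uniform fallback
    fused = [sum(w * scores[c] for w, scores in zip(weights, model_scores))
             for c in range(n_classes)]
    return max(range(n_classes), key=fused.__getitem__)
```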

5. Assessment and Quality Control of Summaries

Unique to podcast data, the summary assessment corpus contains 179 podcast documents, each paired with 20 candidate summaries (extractive and abstractive) and a creator-provided description. Human experts apply the EGFB scale and annotate binary attributes such as topic coverage and named entity inclusion.

Two broad systems for summary assessment are defined (Manakul et al., 2022):

  • Unsupervised:

Reference-based (ROUGE-L, TripleMatching) and document-based (QA-noun phrase matching, entailment via MNLI) strategies. Reference-based approaches yield high correlation with human judgments; document-based methods display lower or inconsistent alignment due to extractive bias.

  • Supervised:

CNN on similarity grids (Sentence-BERT + ResNet18) and transformer models (BERT, Longformer) predict absolute scores, achieving system-level Spearman correlations up to 0.909.
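System-level Spearman correlation compares the rank orderings of predicted and human scores; a self-contained version (using average ranks to handle ties) looks like:

```python
def rankdata(xs):
    # assign average ranks, handling ties
    order = sorted(range(len(xs)), key=xs.__getitem__)
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rankdata(x), rankdata(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```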

Significantly, up to half of the reference summaries are rated “Fair” or “Bad”, motivating summary assessment for data selection prior to summarization model training.

6. Planned Evolutions and Dataset Extensions

Feedback and outcome analyses have led to notable future modifications in TREC Podcast Tracks:

  • Segment Retrieval:

Participants will be allowed to select “jump-in” points rather than fixed two-minute segments, enabling finer granularity and precision, and new topic types suited to multimodal (audio + text) architectures will be added.

  • Summarization:

The task will be reframed as “audio trailer” or highlight generation, focusing on enticing, informative segments linked to both transcript and audio, with the aim of improving efficiency and relevance for end users.

  • Assessment Filter Integration:

Supervised assessment scores are used to filter training sets with the aim of improving generation quality; these techniques systematically bias output toward higher-quality summaries, although direct ROUGE improvements remain inconclusive.
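Such filtering reduces to thresholding predicted quality scores; in the sketch below, `threshold` is a hypothetical cut-off on an EGFB-style 0–4 scale, not a value reported by the track:

```python
def filter_by_quality(examples, scores, threshold=2.0):
    """Keep (document, summary) training pairs whose predicted quality
    score clears the cut-off; pairs and scores are aligned by index."""
    return [ex for ex, s in zip(examples, scores) if s >= threshold]
```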

A plausible implication is that the dataset’s scale, multimodal richness, and continuous methodology advancement position it centrally in ongoing NLP, information retrieval, speech processing, and expressive speech synthesis research.

7. Context Within the Podcast Ecosystem

Although distinct from newer resources like the Structured Podcast Research Corpus (SPoRC) (Litterer et al., 12 Nov 2024), MSP-Podcast has defined standards for large-scale, open podcast research. Relative to SPoRC’s collection (1.1M episodes; multimodal features; detailed network structures and content analyses), MSP-Podcast focuses on fewer episodes but provides word-level timestamped transcripts, extensive metadata, and full audio—balancing scale with depth of annotation.

The dataset’s multi-year usage across TREC and related challenges continues to drive innovation in podcast retrieval, summarization, expressive speech, and emotion modeling, with ongoing community feedback shaping both methodology and future extensions.
