Structured Podcast Research Corpus (SPoRC)
- Structured Podcast Research Corpus (SPoRC) is a comprehensive dataset featuring full transcripts, audio metrics, and speaker roles for over 1.1M English podcast episodes.
- It employs advanced methodologies including Whisper ASR for transcription, openSMILE for audio feature extraction, and pyannote with RoBERTa for diarization and role inference.
- SPoRC facilitates diverse research applications such as discourse segmentation, emotion detection, and community structure analysis within the podcast ecosystem.
The Structured Podcast Research Corpus (SPoRC) is a large-scale dataset designed to facilitate computational research on spoken-word media, particularly podcasts, by providing transcriptions, aligned audio features, speaker annotations, and content-derived metadata across more than 1.1 million English-language podcast episodes. Spanning both highly curated subsets for segment-structure modeling and broad, ecosystem-wide coverage, SPoRC integrates text, audio, diarization, and speaker-role inferences, enabling foundational studies of language, discourse structure, emotional framing, and social dynamics within the podcast medium (Jing et al., 2021, Litterer et al., 2024, Moldovan et al., 16 Sep 2025).
1. Corpus Architecture and Data Schema
SPoRC episodes are represented as tuples (M, T, A, S, R), where:
- M denotes episode-level metadata (show ID, episode ID, publication date, duration, category, hosting platform, etc.).
- T encapsulates the transcript, with each token aligned to a timestamp.
- A provides prosodic and acoustic features (F₀, F₁, MFCC₁–₄), computed per token via openSMILE from 10 Hz audio frames.
- S lists speaker turns (for 370,000 diarized episodes) as time-stamped spans with speaker labels.
- R contains inferred speaker-role assignments (Host, Guest, Neither) based on episode text and metadata.
All 1.1 million+ episodes are accompanied by full transcripts generated via OpenAI’s “whisper-base” ASR model. Speaker diarization and acoustic features exist for a significant subset (approximately 370,000 episodes), while role inferences—using an annotated and model-fitted approach—cover the entire corpus (Litterer et al., 2024).
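The tuple structure above can be mirrored in a lightweight schema. The field names below are illustrative, not the corpus's actual column names, which the released data defines:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical field names; the released corpus uses its own naming.
@dataclass
class Token:
    text: str                       # transcript token (Whisper output)
    start: float                    # aligned timestamp, in seconds
    f0: Optional[float] = None      # fundamental frequency, diarized subset only
    mfcc: Optional[list] = None     # MFCC 1-4, diarized subset only

@dataclass
class SpeakerTurn:
    speaker_id: str
    start: float
    end: float

@dataclass
class Episode:
    metadata: dict                  # M: show ID, episode ID, date, duration, ...
    tokens: list                    # T/A: time-aligned tokens with features
    turns: list = field(default_factory=list)   # S: diarized turns (subset only)
    roles: dict = field(default_factory=dict)   # R: speaker -> Host/Guest/Neither
```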
2. Corpus Construction and Preprocessing
Large-Scale Section
SPoRC assembled over 1.1 million episodes published between May 1 and June 30, 2020, harvesting audio and metadata via the Podcast Index, which catalogs more than 4 million shows. Audio was processed into transcripts using the Whisper ASR engine (74 M parameters), and transcripts were filtered to remove those with spurious repetition (e.g., if a single 4-gram makes up more than 5% of the transcript’s 4-grams, indicative of ASR hallucination).
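The repetition filter is simple to state. A minimal sketch of the 4-gram heuristic, with tokenization details assumed:

```python
from collections import Counter

def has_spurious_repetition(tokens, n=4, threshold=0.05):
    """Flag transcripts where one n-gram exceeds `threshold` of all n-grams,
    the SPoRC heuristic for ASR hallucination (n=4, 5% per the paper)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return False
    _, top_count = Counter(ngrams).most_common(1)[0]
    return top_count / len(ngrams) > threshold
```

A transcript that loops on a single phrase trips the filter, while varied text passes.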
Audio features were extracted using openSMILE’s eGeMAPS configuration, while speaker diarization was performed using pyannote. Diarization output was subject to a minimum speaking-time threshold (≥5% of episode duration) to eliminate ephemeral speakers and potential artifacts.
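The ≥5% speaking-time threshold amounts to a post-processing pass over diarized turns; the data layout below is an assumption:

```python
def prune_speakers(turns, episode_duration, min_fraction=0.05):
    """Drop speakers whose total speaking time falls below 5% of the
    episode, mirroring SPoRC's diarization post-filter.
    `turns` is a list of (speaker_id, start_s, end_s) tuples."""
    totals = {}
    for spk, start, end in turns:
        totals[spk] = totals.get(spk, 0.0) + (end - start)
    keep = {s for s, t in totals.items() if t / episode_duration >= min_fraction}
    return [t for t in turns if t[0] in keep]
```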
Speaker-role assignment leverages named entity recognition (spaCy on metadata and transcript heads), crowd-sourced annotation (Krippendorff’s α = 0.77), and a fine-tuned RoBERTa classifier achieving 0.88 test accuracy in label prediction (Litterer et al., 2024).
Human-Labeled Subset
A focused subcorpus of 417 complete episodes was curated from Pandora/SiriusXM, representing 20 topical categories (technology, sports, true crime, etc.). Transcripts, automatically generated by Google Cloud Speech-to-Text, were left unedited except for manual annotation of episode-introduction boundaries. Annotators marked the start and end tokens of the segment where hosts or guests present the episode’s topic or participants, excluding recurring program-level intros. Inter-annotator agreement was measured on 117 episodes, yielding majority agreement on start offsets in >96% of cases within ±2 seconds (Jing et al., 2021).
3. Modalities, Annotation, and Derived Features
The main data modalities include:
- Transcripts: Word-level text with timestamps for all episodes.
- Audio Features: Per-token measures of F₀, F₁, and MFCC₁–₄.
- Speaker Turns/Diarization: Detailed for a sizable subset; speaker labels assigned from audio-based segmentation.
- Speaker Roles: Host/Guest/Neither inferences at bundle or episode level.
- Manual Episode Structure: In the labeled subset, precise intro-segment offsets anchored to transcript tokens.
Table 1: Modalities and Coverage
| Modality | Size/Coverage | Extraction Method |
|---|---|---|
| Transcripts | 1.1M episodes (6.6B words) | Whisper-base |
| Audio features | 370K episodes (diarized subset) | openSMILE (eGeMAPS) |
| Speaker diarization | 370K episodes | pyannote |
| Speaker roles | 1.1M (all) | Annotated, RoBERTa |
| Manual intro boundaries | 417 episodes | Human annotation |
The full corpus permits increasingly nuanced segmentation and labeling: sentence-level splits, non-verbal markers, and sentence-level emotion detection (GoEmotions) are incorporated for downstream modeling (Moldovan et al., 16 Sep 2025).
4. Segment Detection, Modeling, and Evaluation
The identification of podcast structure—especially episode introductions—is cast as a token-wise, two-class sequence-labeling problem: “Is-intro” vs. “Not-intro.” Transformer-based models (bert-base-uncased, 12 layers) outperformed static embedding methods, particularly when transcript augmentations were applied. Two data augmentation schemes are used:
- TF–IDF–based word replacement (tfidfwr): Replaces less informative tokens with others of similar TF–IDF.
- Random edit augmentation (randomaug): Either swaps, deletes, or crops random tokens. These procedures, implemented via nlpaug, yield five synthetic variants per episode (Jing et al., 2021).
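A toy version of the randomaug scheme; the paper uses nlpaug, and the edit rate and operation mix below are illustrative:

```python
import random

def random_edit(tokens, op=None, rate=0.1, rng=random):
    """Illustrative 'randomaug'-style edit: swap, delete, or crop random
    tokens (SPoRC's implementation uses the nlpaug library)."""
    op = op or rng.choice(["swap", "delete", "crop"])
    toks = list(tokens)
    if op == "swap":
        for _ in range(max(1, int(rate * len(toks)))):
            i, j = rng.randrange(len(toks)), rng.randrange(len(toks))
            toks[i], toks[j] = toks[j], toks[i]
    elif op == "delete":
        toks = [t for t in toks if rng.random() > rate] or toks[:1]
    else:  # crop: remove one contiguous chunk
        span = max(1, int(rate * len(toks)))
        start = rng.randrange(len(toks) - span + 1)
        toks = toks[:start] + toks[start + span:]
    return toks
```

Applying such edits five times per episode would yield the five synthetic variants described above.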
During training, overlapping input spans (512 tokens, with 128-token overlaps) are processed, and token-wise sigmoid probabilities indicate the likelihood of “Is-intro.” Final boundary detection applies a “maximum-difference” algorithm over smoothed probability series. The cross-entropy loss over binary token labels is minimized via AdamW for up to 300 epochs.
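The windowing and boundary steps can be sketched as follows. The exact form of the maximum-difference rule is not fully specified here, so the drop-based variant below is an assumption:

```python
def sliding_windows(n_tokens, window=512, overlap=128):
    """Yield (start, end) spans covering the transcript, matching the
    paper's 512-token inputs with 128-token overlaps."""
    step = window - overlap
    start = 0
    while start < n_tokens:
        yield (start, min(start + window, n_tokens))
        if start + window >= n_tokens:
            break
        start += step

def max_difference_boundary(probs):
    """Assumed form of the 'maximum-difference' rule: pick the index
    where the smoothed is-intro probability drops the most."""
    diffs = [probs[i] - probs[i + 1] for i in range(len(probs) - 1)]
    return max(range(len(diffs)), key=lambda i: diffs[i]) + 1
```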
Evaluations employ:
- Boundary-offset accuracy (correct within a fixed token-offset tolerance).
- Jaccard overlap between predicted and gold intro spans, J = |P ∩ G| / |P ∪ G|.
- Precision, recall, and F1 (not reported, but standard for this task).
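The Jaccard overlap metric reduces to a few lines over token offsets:

```python
def jaccard_overlap(pred, gold):
    """Jaccard overlap J = |P ∩ G| / |P ∪ G| between predicted and gold
    intro spans, given as (start, end) token offsets (end exclusive)."""
    p = set(range(*pred))
    g = set(range(*gold))
    union = p | g
    return len(p & g) / len(union) if union else 1.0
```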
Results indicate that data augmentation (particularly randomaug) improves robustness to unseen podcast formats, with overlap scores for BERT models approaching 0.4–0.5 on test splits of seen programs. It is notably more difficult for both models and human annotators to localize introduction ends than starts; conversational and music-interleaved episodes challenge structure recognition (Jing et al., 2021).
5. Downstream Analyses and Applications
SPoRC has enabled foundational investigations including:
- Large-scale topic modeling (LDA) for content diversity measurement and visualization of topical concentration and spillover (e.g., COVID-19, social movements spanning multiple podcast categories).
- Community structure analysis via guest co-appearance bipartite graphs, revealing highly insular guest-sharing in categories like Sports and Business, and diffuseness in Religion/Society (Litterer et al., 2024).
- Temporal responsiveness: Quantification of collective attention during major events, e.g., 21% of episodes mentioning “George Floyd” during the peak of racial justice protests.
- Emotion and collective action modeling: For the Black Lives Matter case study, SPoRC supports layered classification of participatory speech acts and associated emotion detection at the sentence level (eight emotion categories; e.g., joy, optimism, anger, sadness). The analytical framework combines RoBERTa-based binary detection (participation vs. non-participation) with fine-tuned Llama 3 multi-category assignment (problem–solution, call-to-action, intention, execution), then computes emotion odds ratios to characterize stage–emotion association (e.g., OR(Intention, optimism) ≈ 3.84). Negative emotions are negatively associated with collective action, contrary to pre-existing sociological theory (Moldovan et al., 16 Sep 2025).
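The stage–emotion association reduces to an odds ratio over a 2×2 contingency table; the counts in the sketch below are illustrative, not the paper's:

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 contingency table of sentence counts:
    a = stage & emotion present, b = stage & emotion absent,
    c = other stages & emotion present, d = other stages & emotion absent.
    OR > 1 means the emotion is over-represented in that stage."""
    return (a / b) / (c / d)
```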
6. Access, Licensing, Limitations, and Usage
Large-scale SPoRC resources are made available upon request under research-only licensing terms; no commercial redistribution is permitted. The Pandora/SiriusXM subset likewise requires researcher application (Jing et al., 2021, Litterer et al., 2024).
Key limitations include:
- Coverage gaps: Exclusion of private/non-RSS podcasts (e.g., The Joe Rogan Experience) and incomplete podcast-index coverage.
- ASR error propagation: Despite n-gram-based transcript filtering, Whisper may hallucinate or misconstrue low-resourced dialects or noisy audio.
- Diarization/role inference limitations: Threshold-based speaker pruning and reliance on self-introduction for host/guest determination.
- Temporal restriction: The two-month selection period represents a “thick slice” suitable for high-density temporal analyses but may not generalize across podcasting history.
- Non-mechanistic inference: Patterns of topic or attention inferred from descriptive statistics; underlying causal structures are not modeled.
Intended downstream use cases include spoken-word segmentation (trailers, highlight extraction), summarization, recommendation (using intros as metadata), discourse analysis on noisy ASR data, and social-scientific investigations of mobilization and emotional framing in spoken digital media (Jing et al., 2021, Litterer et al., 2024, Moldovan et al., 16 Sep 2025).
7. Significance and Prospective Impact
SPoRC constitutes the first large-scale, multi-modal corpus integrating full-transcript, speaker, audio, and inferred-structure data for podcasts. Its multimodal breadth and support for nuanced annotation—spanning event-structure, sentiment, and social participation framing—enable computational studies of linguistic, rhetorical, and affective patterns in spoken media at an unprecedented scale.
A plausible implication is that SPoRC can advance methods for segment detection in noisy ASR outputs, facilitate cross-modal retrieval, and provide empirical footing for studies of digital discourse, collective attention, and affective mobilization outside text-centric social media. The dataset has already supported analyses demonstrating distinctive emotional dynamics in spoken activism discourse as compared to written content, and enabled the quantification of community structure and responsiveness in the podcast ecosystem (Litterer et al., 2024, Moldovan et al., 16 Sep 2025).