RadioTalk: a large-scale corpus of talk radio transcripts

Published 16 Jul 2019 in cs.CL | (1907.07073v1)

Abstract: We introduce RadioTalk, a corpus of speech recognition transcripts sampled from talk radio broadcasts in the United States between October of 2018 and March of 2019. The corpus is intended for use by researchers in the fields of natural language processing, conversational analysis, and the social sciences. The corpus encompasses approximately 2.8 billion words of automatically transcribed speech from 284,000 hours of radio, together with metadata about the speech, such as geographical location, speaker turn boundaries, gender, and radio program information. In this paper we summarize why and how we prepared the corpus, give some descriptive statistics on stations, shows and speakers, and carry out a few high-level analyses.



Summary

The paper introduces RadioTalk, a large-scale corpus of approximately 2.8 billion words transcribed from 284,000 hours of U.S. talk radio.
It uses a speech-to-text pipeline built on a TDNN model with a word error rate of approximately 13.1%, and attaches metadata to the transcripts to support detailed analysis.
Analyses reveal regional, gender, and syndication trends relevant to research on media influence, conversational NLP, and sociopolitical dynamics.

RadioTalk: A Large-Scale Corpus of Talk Radio Transcripts

The paper presents RadioTalk, a substantial corpus sourced from the transcription of talk radio broadcasts in the United States, spanning the period from October 2018 to March 2019. This corpus is distinguished by its scale, incorporating roughly 2.8 billion words derived from 284,000 hours of content. The primary aim of this work is to furnish researchers in NLP, conversational analysis, and social sciences with a rich dataset that encapsulates the dynamics and nuances of talk radio as a medium.

Methodology and Corpus Preparation

The RadioTalk corpus was prepared with a pipeline consisting of ingestion, transcription, and post-processing stages. The ingestion phase captures audio streams from publicly available online sources for a diverse set of radio stations across the United States. Transcription uses a speech-to-text system built on a time-delay neural network (TDNN) architecture that is suited to reverberant speech environments. Transcription accuracy is supported by a lexicon and language models trained on several human-transcribed radio corpora, resulting in a word error rate of approximately 13.1%.
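Word error rate is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words, so a rate of 13.1% corresponds to roughly 13 errors per 100 reference words. The following is a minimal sketch of that calculation, not the evaluation code used for the corpus; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Real evaluations usually normalize case and punctuation before scoring, which this sketch leaves out.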

The post-processing stage enriches the corpus with metadata including speaker identifiers, speaker gender, and classification of speech as studio or telephone call-in, alongside program identifiers. This nuanced metadata enhances the utility of the corpus for tasks requiring detailed conversational context and demographic analysis.

Key Statistical Insights

The dataset contains 31.1 million speaker turns, with statistics broken down by syndication status, studio versus telephone origin of speech, and gender.
Non-syndicated content accounts for approximately 32.7% of the speaker turns, so about a third of the turns come from local rather than syndicated programming.
A gender imbalance is noted, with female speakers constituting only 27.8% of the turns and male speakers contributing the majority. This discrepancy is slightly smaller in telephone call-ins.
Recognizer confidence scores suggest that telephone speech and female speakers are harder to recognize.
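Breakdowns like the ones above amount to counting speaker turns per category and dividing by the total. The sketch below shows this on a toy list of turn records; the field names `gender`, `syndicated`, and `channel` and the sample values are hypothetical and are not taken from the corpus schema.

```python
from collections import Counter

def share_by(turns, field):
    """Return the fraction of turns that fall under each value of `field`."""
    counts = Counter(turn[field] for turn in turns)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Hypothetical toy records, one dict per speaker turn.
sample_turns = [
    {"gender": "M", "syndicated": True, "channel": "studio"},
    {"gender": "M", "syndicated": True, "channel": "studio"},
    {"gender": "F", "syndicated": False, "channel": "telephone"},
    {"gender": "M", "syndicated": False, "channel": "studio"},
]

gender_share = share_by(sample_turns, "gender")
syndication_share = share_by(sample_turns, "syndicated")
```

On the full corpus the same counting could be done in a single streaming pass, since only the per-category counters need to stay in memory.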

Analysis of Radio Content

The analyses examine the topical content of the broadcasts and how discussed themes are distributed geographically. The corpus captures the diversity of radio discourse, with coverage of themes such as immigration policy, the opioid crisis, and climate change varying over time and across regions. This offers a way to study regional socio-political dynamics and the media's role in shaping public discourse.
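One simple way to look at regional variation in a theme is to count how often a keyword appears per thousand transcribed words in each state. The sketch below illustrates the idea and is not the analysis from the paper; the record fields `state` and `content`, the function name, and the sample snippets are all hypothetical.

```python
from collections import defaultdict

def mentions_per_thousand_words(snippets, term):
    """Rate of a single-word `term` per 1,000 words, grouped by state."""
    hits = defaultdict(int)
    words = defaultdict(int)
    for snippet in snippets:
        tokens = snippet["content"].lower().split()
        words[snippet["state"]] += len(tokens)
        hits[snippet["state"]] += tokens.count(term.lower())
    # Skip states with no words to avoid dividing by zero.
    return {s: 1000 * hits[s] / words[s] for s in words if words[s]}


# Hypothetical toy snippets.
sample_snippets = [
    {"state": "TX", "content": "immigration policy and immigration courts"},
    {"state": "VT", "content": "the climate and the maple season"},
]

rates = mentions_per_thousand_words(sample_snippets, "immigration")
```

Normalizing by word count rather than reporting raw hits matters here, because states with more ingested stations would otherwise dominate any comparison.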

Exploring syndication networks reveals cohesive yet ideologically distinct communities within the talk radio landscape, a finding supported by network community detection algorithms. The paper identifies two main station communities, indicative of prevailing ideological divides, which underscores the corpus's potential for examining media influence and polarization.
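To make the idea concrete, the sketch below links stations that carry the same syndicated shows and then groups them with a simple deterministic label propagation. This is a stand-in chosen for brevity, not necessarily the algorithm used in the paper, and the station names, show names, and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def station_graph(carriage):
    """Weight each station pair by the number of shows both stations carry."""
    stations_by_show = defaultdict(set)
    for station, shows in carriage.items():
        for show in shows:
            stations_by_show[show].add(station)
    weights = defaultdict(lambda: defaultdict(int))
    for stations in stations_by_show.values():
        for a, b in combinations(sorted(stations), 2):
            weights[a][b] += 1
            weights[b][a] += 1
    return weights

def label_propagation(nodes, weights, max_rounds=100):
    """Each node repeatedly adopts the label with the most weighted votes."""
    labels = {node: node for node in nodes}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(nodes):
            votes = defaultdict(int)
            for neighbor, weight in weights.get(node, {}).items():
                votes[labels[neighbor]] += weight
            if not votes:
                continue  # isolated station keeps its own label
            best = max(votes.values())
            candidates = sorted(l for l, v in votes.items() if v == best)
            # Ties: keep the current label if it is tied, else take the smallest.
            new = labels[node] if labels[node] in candidates else candidates[0]
            if new != labels[node]:
                labels[node] = new
                changed = True
        if not changed:
            break
    groups = defaultdict(set)
    for node, label in labels.items():
        groups[label].add(node)
    return sorted(groups.values(), key=sorted)


# Hypothetical carriage data: which shows each station airs.
carriage = {
    "station_a": {"show_x", "show_y"},
    "station_b": {"show_x", "show_y"},
    "station_c": {"show_x", "show_y", "show_v"},
    "station_d": {"show_z", "show_w", "show_v"},
    "station_e": {"show_z", "show_w"},
    "station_f": {"show_z", "show_w"},
}

communities = label_propagation(carriage.keys(), station_graph(carriage))
```

In this toy input, `station_c` and `station_d` share one show, but each shares two shows with the other stations in its own group, so the single cross-group link is outvoted and two communities remain.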

Implications and Future Directions

The RadioTalk corpus holds substantial promise for advancing research in conversational NLP and the social sciences. It supports analyses of language and interaction patterns on talk radio, providing insights into media effects, regional dialects, and conversational turn-taking. While transcription errors exist, their impact is partly offset by the scale of the corpus and its detailed metadata.

Future iterations of RadioTalk aim to expand this foundational work, potentially adding more audio data, refined transcriptions, and richer descriptive metadata fields. As a resource, RadioTalk can support a range of research efforts, with potential influence on both theoretical frameworks and practical applications for understanding conversational media.

In sum, the corpus is a valuable asset for researchers exploring radio broadcast content and its broader societal implications.
