
TREC Podcast Track for IR & Summarization

Updated 6 October 2025
  • TREC Podcast Track is a benchmark that drives research in podcast information retrieval and summarization using long, conversational audio data.
  • Leading models employ neural ranking, transformer-based summarizers, and genre-aware conditioning to overcome challenges like transcript length and noise, achieving notable ROUGE and nDCG improvements.
  • Robust extraneous content detection using sentence-level classifiers and document-level smoothing significantly reduces non-topical output, enhancing both retrieval precision and summary quality.

The TREC Podcast Track is a shared research benchmark introduced at the Text REtrieval Conference (TREC) in 2020 to advance state-of-the-art methods in information retrieval and summarization for podcasts—a domain distinguished by long-form, conversational audio, highly variable transcription quality, and heterogeneous metadata. The track leverages large-scale data resources to define tasks that challenge deep learning, retrieval, and NLP systems on realistic podcast content, with an emphasis on both segment retrieval and abstractive summarization. Leading submissions exploit neural ranking, autoregressive transformer summarizers, genre-aware conditioning, and approaches for filtering extraneous content, collectively establishing reference methods and evaluation schemes for the emerging area of podcast IR.

1. Scope, Motivation, and Data Resources

The Podcast Track arises from the recognition that podcasts—unlike broadcast news or meetings—present a diverse, often informal linguistic substrate, featuring multiple speakers, conversational style, advertisements, and variable structure (Jones et al., 2021). Its design aims to stimulate research around retrieval and summarization specific to spoken audio, rather than text, pushing the community to address the distinct challenges in processing, segmenting, and describing podcast episodes.

The released dataset encompasses over 100,000 English-language episodes, each with audio, an automatic transcript (generated with Google Speech-to-Text), and rich episode-level metadata, including creator-provided descriptions, publisher details, and RSS feeds. Episodes range widely in length and complexity, averaging ~75 minutes and yielding transcripts upwards of 5,000 tokens—well beyond standard transformer input limits, which constrains model performance.

Substantial preprocessing includes filtering creator descriptions to form a “Brass Set” (66,245 episodes) and further selection for downstream training and evaluation (Zheng et al., 2020). The scale and diversity of the collection present both opportunities and obstacles, with many podcast episodes featuring noisy, colloquial, or ad-dominated content.

2. Track Structure and Tasks

The Podcast Track comprises two primary shared tasks:

A. Segment Retrieval:

Participants retrieve two-minute segments from the podcast collection that best satisfy information needs expressed by traditional TREC topics (topical, known-item, or refinding). Segments are generated through overlapping windows (fixed on the minute), resulting in approximately 3.4 million candidate segments per collection (Jones et al., 2021). Evaluation is performed using mean nDCG, with human assessors assigning graded relevance labels (PEGFB) to ranked segment outputs.
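The overlapping-window segmentation above can be sketched as follows, assuming a hypothetical input of time-aligned `(start_seconds, token)` pairs derived from ASR word timings; the exact segmenter used by the track may differ:

```python
def make_segments(words, step_s=60, window_s=120):
    """Split a time-aligned transcript into overlapping two-minute segments.

    `words` is a list of (start_time_seconds, token) pairs (an assumed
    input format). Windows start on every minute boundary, so consecutive
    two-minute segments overlap by 50%.
    """
    if not words:
        return []
    end = words[-1][0]
    segments = []
    start = 0
    while start <= end:
        # Collect tokens whose start time falls inside this window.
        seg = [tok for t, tok in words if start <= t < start + window_s]
        if seg:
            segments.append((start, " ".join(seg)))
        start += step_s
    return segments
```

Applied to a whole collection, this minute-aligned windowing is what yields the roughly 3.4 million candidate segments described above.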

B. Summarization:

The goal is to generate concise, coherent episode summaries—“audio trailers”—using full transcripts and metadata as input. Summaries are compared against filtered creator descriptions, with evaluation based on manual EGFB scoring and automatic ROUGE-L metrics. The absence of true ground-truth summaries, significant linguistic variance, and description noise drive the need for hybrid assessment protocols.

Submissions in both tracks are judged on effectiveness, coherence, coverage of main topics and key participants, and, where relevant, inclusion of names and format information.

3. Methodological Approaches

A broad array of approaches has been adopted by participating teams:

Segment Retrieval:

  • Traditional IR methods (BM25, query likelihood, Indri) serve as baselines.
  • Neural ranking dominates, with BERT, XLNet, and other transformer encoders used for segment-level passage re-ranking.
  • Hybrid pipelines combine BM25 retrieval with neural re-ranking, sometimes using dense vector search (e.g., Faiss).
  • Overlapping segment windows (50% overlap) address boundary effects.
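The first stage of the hybrid pipelines above can be illustrated with a minimal BM25 scorer; this is a generic sketch (parameter values `k1` and `b` are conventional defaults, not values reported by any team), and a neural re-ranker would then re-score the top-scoring segments:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against the query with BM25.

    Serves as the first-stage retriever; in the hybrid pipelines described
    above, a transformer re-ranker would re-score the top hits.
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for q in query_terms:
            if q not in tf:
                continue
            idf = math.log(1 + (N - df[q] + 0.5) / (df[q] + 0.5))
            s += idf * tf[q] * (k1 + 1) / (
                tf[q] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores
```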

Summarization:

  • Abstractive transformers (BART, T5, ProphetNet), typically pretrained on news data (CNN/DailyMail), are fine-tuned on podcast data.
  • Filtering and selection heuristics mitigate transcript length, using position-based extraction (first k tokens preferred over last) and filtering based on TextRank or hierarchical attention (Manakul et al., 2020).
  • Advanced summary systems apply genre-aware conditioning (special category tokens prepended to input) and named entity extraction for content weighting (Rezapour et al., 2021).
  • Sequence-level reinforcement learning objectives combine ROUGE-based reward with cross-entropy loss to align generation targets with evaluation metrics (Manakul et al., 2020).
  • Ensemble methods combine outputs from multiple model instances (3 or 9), averaging token-level probabilities to enhance stability and performance.
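Two of the preprocessing ideas above—genre-aware conditioning via a prepended category token, and first-k truncation—can be combined in a small sketch. The token format and `max_len` value are illustrative assumptions, not the exact conventions of any submitted system:

```python
def build_summarizer_input(transcript_tokens, genre, max_len=1024):
    """Prepare a transformer summarizer input with genre conditioning.

    A special category token (e.g. "<true_crime>") is prepended, and the
    transcript is truncated to the first `max_len` tokens overall, since
    early positions tend to carry the most summary-relevant content.
    """
    genre_token = f"<{genre.lower().replace(' ', '_')}>"
    return [genre_token] + transcript_tokens[: max_len - 1]
```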

Performance varies: vanilla summarizers trained on news data underperform on podcasts, with ROUGE-1 F1 dropping from ~44 (news) to ~26 (podcasts) for ProphetNet (Zheng et al., 2020). Genre-aware conditioning yields improvements in both ROUGE and human aggregate scores, with the best models achieving a ~9% improvement over unconditioned baselines (Rezapour et al., 2021).

4. Extraneous Content Detection

A significant technical advance in the Podcast Track is robust extraneous content (EC) detection—removing advertisements, promos, and other non-topical material that confound retrieval and summarization systems (Reddy et al., 2021). This is executed via:

  • Sentence-level classification:

BERT (pretrained and domain-adapted) and logistic regression/SVMs with TF-IDF features, applied to both descriptions and transcripts, classify sentences as “extraneous” or not.

  • Document-level smoothing:

Change point detection (maximum log-likelihood ratio) segments extraneous blocks in descriptions; kernel regression smooths transcript EC probabilities for contiguous detection.

  • User behavior integration:

Listener retention data identifies dips (local minima), which, combined with secant slope analysis, suggest candidate ad regions for annotation.

  • Sequence tagging:

BiLSTM-CRF models using BERT embeddings address EC detection across contiguous spans.
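The document-level smoothing step can be sketched with Nadaraya-Watson kernel regression over sentence positions, so that isolated high-probability sentences merge into contiguous extraneous blocks. The bandwidth and threshold are illustrative assumptions:

```python
import math

def smooth_ec_probs(probs, bandwidth=2.0):
    """Kernel-regression smoothing of per-sentence EC probabilities.

    Each smoothed value is a Gaussian-weighted average over neighboring
    sentence positions, encouraging contiguous extraneous spans.
    """
    n = len(probs)
    smoothed = []
    for i in range(n):
        num = den = 0.0
        for j in range(n):
            w = math.exp(-((i - j) ** 2) / (2 * bandwidth ** 2))
            num += w * probs[j]
            den += w
        smoothed.append(num / den)
    return smoothed

def mark_extraneous(probs, threshold=0.5, bandwidth=2.0):
    """Flag sentences whose smoothed EC probability exceeds the threshold."""
    return [p > threshold for p in smooth_ec_probs(probs, bandwidth)]
```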

Applying EC detectors as prefilters for summarization models yields substantial quality improvements. For example, BART-Podcasts models originally produced summaries containing up to 73.2% extraneous content, reduced to 2.0% with EC removal, along with corresponding ROUGE-L gains (Reddy et al., 2021).

5. Recommendation Systems and User Modeling

Podcast recommendation presents unique challenges in capturing user preference and sequential behavior, distinct from music consumption. Track-adjacent research establishes trajectory-based models, where user history is modeled as a sequential walk across podcast embeddings, strongly outperforming collaborative filtering baselines (Benton et al., 2020).

  • User trajectories:

RNNs (stacked LSTM with 512 hidden units, followed by dense layers) ingest embedding sequences derived from knowledge graphs or CBOW to predict future podcast selections.

  • Performance:

Topic-constrained, short-term sequences and age-stratified recommendations increase success@20 and MRR, with success@20 rising from 0.1186 to 0.4040 over baselines—a roughly 3.4× improvement.

  • Implications:

Emphasizes the necessity of incorporating knowledge graph information and temporal/local listening patterns for effective podcast recommendation.
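As a simplified, dependency-free stand-in for the trajectory model above (the actual system uses a stacked LSTM over embedding sequences), one can score catalog podcasts by cosine similarity to a recency-weighted average of the user's listened embeddings. The decay factor and data shapes are illustrative assumptions:

```python
def predict_next(trajectory, catalog, decay=0.8):
    """Predict the next podcast from a user's embedding trajectory.

    `trajectory` is a list of embedding vectors (most recent last);
    `catalog` maps podcast id -> embedding. Recent listens are weighted
    more heavily via geometric decay, crudely mimicking the sequential
    modeling an RNN would learn.
    """
    dim = len(trajectory[0])
    profile = [0.0] * dim
    weight = 1.0
    for emb in reversed(trajectory):  # walk backward from most recent
        profile = [p + weight * e for p, e in zip(profile, emb)]
        weight *= decay

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    return max(catalog, key=lambda pid: cosine(profile, catalog[pid]))
```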

6. Evaluation Protocols and Results

Assessment strategies in the Podcast Track integrate both automatic and manual evaluations to account for subjective and objective measures:

  • Segment Retrieval:

Mean nDCG, nDCG@30, and precision@10 are computed, with top systems reaching nDCG ~0.65–0.67. Re-ranking and hybrid approaches outperform pure IR baselines.

  • Summarization:

Human EGFB scores (Excellent = 4, Good = 2, Fair = 1, Bad = 0) and ROUGE-L (sentence-boundary aware and full-sequence) determine performance. Top ensemble models achieve EGFB ~1.777 vs creator description baseline ~1.291 (Manakul et al., 2020). ROUGE variants are sensitive to filter quality; summary content alignment with ground truth remains challenging due to description noise and diversity.

  • Extraneous Content Impact:

Document-level and sentence-level EC filtering demonstrably reduce irrelevant output and increase alignment with main episode topics (Reddy et al., 2021).
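The graded-relevance ranking metric above can be computed as follows; this is the standard nDCG@k formula over graded labels (e.g., PEGFB mapped to integers), in rank order, not track-specific code:

```python
import math

def ndcg(relevances, k=30):
    """Compute nDCG@k for a ranked list of graded relevance labels.

    DCG discounts each grade by log2 of its rank position; dividing by
    the DCG of the ideal (descending) ordering normalizes to [0, 1].
    """
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```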

7. Technical Challenges and Future Directions

Several persistent challenges and directions are prioritized:

  • Input length and transcript variability:

Podcasts’ long, multi-speaker transcripts exceed transformer capacity; advanced filtering or chunked processing and long-context architectures (e.g., Longformer, Reformer) are needed (Zheng et al., 2020).

  • Genre and stylistic diversity:

Summarization benefits from genre conditioning; models still struggle with coherence when extracting non-contiguous segments (Rezapour et al., 2021).

  • Multimodal integration:

Future research is likely to combine audio features, not just transcripts, for segment retrieval and summarization (Jones et al., 2021).

  • Evaluation refinement:

Manual assessments currently complement noisy automatic metrics; more robust ground truth generation is needed.

  • Semi-supervised learning and reward shaping:

Incorporation of automatic grading and carefully tuned reward functions may close the gap between human and model performance (Manakul et al., 2020).
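The chunked-processing workaround for over-length transcripts mentioned above can be sketched as overlapping token windows, each small enough to encode or summarize independently before merging; the window and overlap sizes are illustrative assumptions:

```python
def chunk_transcript(tokens, max_len=1024, overlap=128):
    """Split a long token sequence into overlapping transformer-sized chunks.

    Consecutive chunks share `overlap` tokens so that content near chunk
    boundaries is seen with context at least once.
    """
    step = max_len - overlap
    chunks = []
    for start in range(0, max(1, len(tokens)), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last window already covers the tail
    return chunks
```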

These areas define the trajectory for subsequent Podcast Track editions, with feedback from participants prompting greater specification of tasks, audio-inclusive methodologies, and data subset releases for new entrants.


In sum, the TREC Podcast Track has created foundational resources and challenging tasks for retrieval and summarization on podcast data. By benchmarking advanced neural models, introducing genre and named entity-aware techniques, and emphasizing EC detection, it both catalogs current capabilities and stimulates future research directions across IR, NLP, and spoken document understanding.
