PodBench: Podcast Script Generation Benchmark

Updated 28 January 2026
  • PodBench is a benchmark designed to rigorously evaluate LLM performance on instruction-aware, context-grounded podcast script generation using a dual-stage evaluation methodology.
  • It separates instruction following from intrinsic script quality by employing a two-stage process with detailed scoring rubrics for content substance and narrative engagement.
  • The benchmark utilizes 800 high-quality samples across English and Chinese, highlighting challenges in long-context inputs, multi-speaker coordination, and content depth.

PodBench is a benchmark specifically designed to rigorously evaluate instruction-aware, context-grounded podcast script generation by LLMs. Addressing the paucity of systematic resources for long-form, multi-input, audio-centric text generation, PodBench introduces a reproducible, multi-faceted testbed and scoring machinery that isolates both instruction compliance and intrinsic script quality. By spanning 800 high-quality samples across English and Chinese and integrating evaluation protocols rooted in both quantitative and LLM-based rubrics, PodBench provides detailed insight into LLM performance on the demanding task of audio-oriented dialogue synthesis (Xu et al., 21 Jan 2026).

1. Dataset Composition and Input Organization

PodBench encompasses 800 test samples, divided evenly between Chinese and English (400 each). The dataset sources originate from five open-licensed corpora—three for Chinese (LongWanjuan, OpenNewsArchive, WanJuan 1.0, all CC BY 4.0) and two for English (WanJuan-CC, CC BY 4.0; Pile-ArXiv, MIT). Input documents are clustered into single- and multi-document aggregates (mean 2.7 documents; range 1–10) via Gemini-2.5-pro topic querying in conjunction with Qwen3-Embedding-4B clustering.

Each sample is matched with a synthesized user instruction, generated to span up to eight “requirement dimensions”: (1) focused content, (2) core intent (e.g., summary vs. discussion), (3) script structure, (4) specific expression (e.g., domain terms), (5) podcast language, (6) speaker profile, (7) number of speakers (1–4), and (8) desired podcast length. The initial candidate set of 2.6K samples is filtered both automatically and by five expert annotators to yield the final, high-quality 800-pair corpus.

Key input statistics:

  • Instruction length: mean 40.7 tokens (range 4–158)
  • Speaker count: mean 1.9 (range 1–4)
  • Input prompt length (Qwen3-8B tokenizer): mean 4,922 tokens (min 1,331; max 21,649)
  • Length bucket distribution: 0–2K (139), 2–4K (360), 4–8K (153), 8–16K (134), 16–21K (13)
  • Supports explicit multi-speaker roles and role descriptions.
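
Concretely, a single benchmark sample can be pictured as a record along these lines. This is a hypothetical schema assembled from the statistics and requirement dimensions above; the field names and types are assumptions, not the benchmark's released data format:

```python
from dataclasses import dataclass

@dataclass
class PodBenchSample:
    """Hypothetical sketch of one PodBench test sample."""
    documents: list[str]        # 1-10 source documents per sample (mean 2.7)
    instruction: str            # synthesized user instruction (4-158 tokens)
    language: str               # podcast language: "en" or "zh"
    focused_content: str        # dimension 1: focused content
    core_intent: str            # dimension 2: e.g. "summary" vs. "discussion"
    script_structure: str       # dimension 3: requested structure
    specific_expression: str    # dimension 4: e.g. required domain terms
    speaker_profiles: list[str] # dimension 6: role descriptions
    num_speakers: int           # dimension 7: 1-4 speakers (mean 1.9)
    target_minutes: float       # dimension 8: desired podcast length
```

A record like this makes the eight requirement dimensions explicit targets that the Stage 1 judge can later check one by one.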

2. Evaluation Framework

PodBench evaluation is two-staged, systematically decoupling adherence to user requirements from the intrinsic quality of generated podcast scripts.

Stage 1: Instruction Following (IF)

An LLM judge (Claude-4.5-Opus, high effort, temperature 1.0, top-p 0.95) infers a criterion checklist from the user instruction, considering context (language, speakers, topic, length, depth/audience). Each criterion i is scored:

  • s_i ∈ {0 (fail), 0.5 (partial), 1 (pass)}
  • Aggregated as: IF = 100 × (1/N) · Σ_{i=1}^{N} s_i
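
The aggregation above reduces to a few lines; this is a minimal illustration of the formula, not the benchmark's actual evaluation harness:

```python
def if_score(criterion_scores):
    """Aggregate per-criterion judge scores (0 = fail, 0.5 = partial,
    1 = pass) into the 0-100 Instruction Following (IF) score."""
    assert all(s in (0, 0.5, 1) for s in criterion_scores)
    return 100 * sum(criterion_scores) / len(criterion_scores)

# Three criteria judged pass, partial, fail:
print(if_score([1, 0.5, 0]))  # 50.0
```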

Stage 2: Podcast Script Quality (Q)

A 100-point rubric reflects audio-oriented considerations:

  • Content Substance (45 pts): Depth of insight (12), argument credibility (10), information richness (8), perspective diversity (8), emotional resonance (7)
  • Narrative Engagement (30 pts): Opening hook (6), structure (10), rhythm (8), ending completeness (6)
  • Conversational Naturalness (25 pts): Colloquialness/listenability (15), vividness/imagery (10)
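
The rubric weights can be encoded directly. The dictionary below mirrors the point allocations above (the key names are shorthand of our own), and the scorer is a sketch that simply caps each sub-score at its rubric maximum:

```python
# Weights from the 100-point quality rubric (45 + 30 + 25).
RUBRIC = {
    "content_substance": {
        "depth_of_insight": 12, "argument_credibility": 10,
        "information_richness": 8, "perspective_diversity": 8,
        "emotional_resonance": 7,
    },
    "narrative_engagement": {
        "opening_hook": 6, "structure": 10,
        "rhythm": 8, "ending_completeness": 6,
    },
    "conversational_naturalness": {
        "colloquialness_listenability": 15, "vividness_imagery": 10,
    },
}

def quality_score(sub_scores):
    """Sum per-item scores, capping each at its rubric weight."""
    return sum(
        min(sub_scores.get(item, 0), max_pts)
        for items in RUBRIC.values()
        for item, max_pts in items.items()
    )
```

Summing the weights recovers the three category totals (45, 30, 25) and the overall 100-point scale.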

Length Auditing: For duration compliance, the heuristic 1 min ≈ 300 Chinese characters (±5% tolerance) is applied.
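
The length audit is a plain tolerance check on the heuristic above; the function name and signature here are our own:

```python
def audit_length(script_chars, target_minutes,
                 chars_per_minute=300, tolerance=0.05):
    """Check duration compliance: 1 minute ~ 300 Chinese characters,
    with a +/-5% tolerance band around the expected length."""
    expected = target_minutes * chars_per_minute
    return expected * (1 - tolerance) <= script_chars <= expected * (1 + tolerance)

# A 10-minute target expects 3,000 characters, so 2,850-3,150 passes:
print(audit_length(3000, 10))  # True
print(audit_length(2800, 10))  # False
```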

Final Score

The composite PodBench score is: PB = (IF + Q) / 2
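
The composite is an equal-weight mean of the two stage scores, as a one-line sketch:

```python
def pb_score(if_score, quality):
    """Composite PodBench score: equal-weight mean of Instruction
    Following (IF) and script Quality (Q), both on 0-100 scales."""
    return (if_score + quality) / 2

# GPT-5.1 row from Table 2: (95.52 + 69.64) / 2 = 82.58
pb_score(95.52, 69.64)
```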

This split explicitly highlights the divergence between superficial compliance and substantive generative competence.

3. Baseline Models and Experimental Protocol

The benchmark includes 34 models grouped as:

  • Proprietary LLMs: GPT-5.1-instant, GPT-4o, Claude-4.5-Sonnet, Gemini-3-pro-preview
  • Open-Source LLMs (instruct mode): InternLM3-8B, Llama3.1-8B/70B, Llama-4-Scout-17B, Qwen2.5-7B/32B/72B, Qwen3-1.7B/4B/8B/14B/32B/30B-A3B/235B-A22B, DeepSeek-V3-0324, DeepSeek-R1-distill
  • Open-Source LLMs (thinking mode): Qwen3 and DeepSeek-R1 variants with explicit reasoning
  • Writing-Enhanced LLMs: LongWriter-Llama3.1-8B, LongWriter-GLM4-9B, LongWriter-Zero-32B

All models are evaluated using their recommended inference hyperparameters (max new tokens 16,384; prompt templates enforce source grounding, speaker format, and user constraints).

Key outcomes (Table 2, average of IF and Overall Content Quality):

| Model | Type | IF | Content Quality | PB Score |
|---|---|---|---|---|
| GPT-5.1 | proprietary | 95.52 | 69.64 | 82.58 |
| Claude-4.5-Sonnet | proprietary | — | — | 79.91 |
| Gemini-3-pro | proprietary | — | — | 79.97 |
| GPT-4o | proprietary | — | — | 73.15 |
| Qwen3-235B | os-instruct | 93.64 | 58.39 | 76.02 |
| DeepSeek-V3 | os-instruct | — | — | 76.16 |
| Qwen3-32B | os-instruct | — | — | 73.54 |
| Qwen3-235B | os-thinking | — | — | 77.37 |
| Qwen3-32B | os-thinking | — | — | 76.85 |
| LongWriter-Zero-32B | writing-enhanced | 84.91 | — | 72.10 |

Other open-source instruct models cluster in the PB 43–66 range, depending primarily on scale and variant.

On a 200-query held-out set, PodBench’s protocol achieves 86.80% agreement with human majority preference, outperforming both uniform rubrics (83.22%) and generic checklist baselines (79.60%).

4. Principal Findings and Diagnostic Insights

  1. Decoupling Instruction Following and Script Quality: Models, including cutting-edge proprietary and open-source LLMs, frequently attain IF scores exceeding 90 yet underperform on Content Substance (45 possible points), with script richness and depth remaining under-developed.
  2. Content Substance Bottlenecks: The lowest sub-scores across all families occur in the “Depth of Insight” and “Argument Credibility” categories, highlighting difficulty in producing sustained, grounded, and analytical dialogue across long-form, multi-speaker settings.
  3. Long-Context and Robustness Trends: Instruction Following (IF) degrades as prompt length increases beyond 8K tokens, especially for open instruct models, whereas high-performing models can slightly improve script quality when leveraging longer contexts. Explicit reasoning (“thinking mode”) boosts IF stability across segment lengths (0–21K), particularly for medium-sized models; script quality gains are smaller but reliable.
  4. Multi-Speaker Coordination Costs: Open instruct models exhibit marked declines (−10 pts IF) in 3–4 speaker scenarios, in contrast to proprietary models, which remain robust or marginally improve with dyads. “Thinking mode” mitigates sensitivity to increased speaker count, reflecting the benefit of explicit planning for turn-taking and role tracking within conversational generation settings.
  5. Implications for Audio-Centric Generation: Explicit planning and reasoning improve structural constraint management (length, role assignment) but do not alone resolve deficits in content depth and substance. Advancement requires training objectives explicitly aligned with sustained analytical synthesis and potentially the incorporation of agentic pipelines or multi-agent coordination mechanisms for enhanced multi-speaker naturalness. Downstream evaluation should be expanded to account for prosody and TTS quality.

5. Significance and Future Directions

PodBench establishes a reproducible, human-aligned testbed for long-form, instruction-aware podcast script generation with explicit dual scoring targeting both user requirement compliance and domain-specific script quality. Its architecture surfaces key challenges in LLM-based audio content generation—notably persistent shortfalls in content substance and multi-speaker coordination under increasingly complex requirements and long-context inputs.

A plausible implication is that addressing these challenges will necessitate architectural and objective innovations in LLM development, including enhanced planning, sustained synthesis capabilities, and more representative evaluation incorporating audio realization metrics. Prospective agentic and multi-agent modeling, as well as targeted training for depth of analysis, are posited as promising avenues for future research within this domain (Xu et al., 21 Jan 2026).
