PodBench: Podcast Script Generation Benchmark
- PodBench is a benchmark designed to rigorously evaluate LLM performance on instruction-aware, context-grounded podcast script generation using a dual-stage evaluation methodology.
- It separates instruction following from intrinsic script quality by employing a two-stage process with detailed scoring rubrics for content substance and narrative engagement.
- The benchmark utilizes 800 high-quality samples across English and Chinese, highlighting challenges in long-context inputs, multi-speaker coordination, and content depth.
PodBench is a benchmark specifically designed to rigorously evaluate instruction-aware, context-grounded podcast script generation by LLMs. Addressing the paucity of systematic resources for long-form, multi-input, audio-centric text generation, PodBench introduces a reproducible, multi-faceted testbed and scoring machinery that separately measures instruction compliance and intrinsic script quality. Spanning 800 high-quality samples across English and Chinese and integrating evaluation protocols rooted in both quantitative heuristics and LLM-based rubrics, PodBench provides detailed insight into LLM performance on the demanding task of audio-oriented dialogue synthesis (Xu et al., 21 Jan 2026).
1. Dataset Composition and Input Organization
PodBench encompasses 800 test samples, proportionally divided between Chinese and English (400 each). The dataset sources originate from five open-licensed corpora—three for Chinese (LongWanjuan, OpenNewsArchive, WanJuan 1.0, all CC BY 4.0) and two for English (WanJuan-CC, CC BY 4.0; Pile-ArXiv, MIT). Input documents are clustered into single- and multi-document aggregates (mean 2.7 documents, ranging 1–10) via Gemini-2.5-pro topic querying in conjunction with Qwen3-Embedding-4B clustering.
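The multi-document aggregation step can be illustrated with a minimal sketch. The actual pipeline uses Gemini-2.5-pro topic querying combined with Qwen3-Embedding-4B clustering; the toy embeddings, cosine threshold, and greedy grouping strategy below are illustrative assumptions, not the published method:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def greedy_cluster(embeddings, threshold=0.8, max_size=10):
    """Group documents whose embeddings are mutually similar.

    Stands in for the benchmark's embedding-based clustering step;
    the threshold and greedy strategy are illustrative assumptions.
    max_size=10 mirrors the reported 1-10 documents per aggregate.
    Returns clusters as lists of document indices.
    """
    clusters = []  # each cluster: list of (index, embedding)
    for i, emb in enumerate(embeddings):
        placed = False
        for cluster in clusters:
            if len(cluster) < max_size and all(
                cosine(emb, e) >= threshold for _, e in cluster
            ):
                cluster.append((i, emb))
                placed = True
                break
        if not placed:
            clusters.append([(i, emb)])
    return [[i for i, _ in c] for c in clusters]
```

With unit vectors for two near-duplicate topics and one unrelated topic, the first two documents merge into a single aggregate while the third stays alone.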
Each sample is matched with a synthesized user instruction, generated to span up to eight “requirement dimensions”: (1) focused content, (2) core intent (e.g., summary vs. discussion), (3) script structure, (4) specific expression (e.g., domain terms), (5) podcast language, (6) speaker profile, (7) number of speakers (1–4), and (8) desired podcast length. The initial candidate set of 2.6K samples is filtered both automatically and by five expert annotators to yield the final, high-quality 800-pair corpus.
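The eight requirement dimensions can be pictured as an instruction schema. The benchmark defines the dimensions but publishes no schema; the field names and the 1–4 speaker validation below are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PodcastInstruction:
    """The eight requirement dimensions a PodBench instruction may span.

    Field names are illustrative; any subset may be specified in a
    given instruction, hence every field is Optional.
    """
    focused_content: Optional[str] = None
    core_intent: Optional[str] = None          # e.g. "summary" vs. "discussion"
    script_structure: Optional[str] = None
    specific_expression: Optional[str] = None  # e.g. required domain terms
    language: Optional[str] = None             # "en" or "zh"
    speaker_profile: Optional[str] = None
    num_speakers: Optional[int] = None         # 1-4 per the benchmark
    target_length: Optional[str] = None

    def __post_init__(self):
        # PodBench instructions specify between 1 and 4 speakers.
        if self.num_speakers is not None and not 1 <= self.num_speakers <= 4:
            raise ValueError("PodBench allows 1-4 speakers")
```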
Key input statistics:
- Instruction length: mean 40.7 tokens (range 4–158)
- Speaker count: mean 1.9 (range 1–4)
- Input prompt length (Qwen3-8B tokenizer): mean 4,922 tokens (min 1,331; max 21,649)
- Length bucket distribution: 0–2K (139), 2–4K (360), 4–8K (153), 8–16K (134), 16–21K (13)
- Supports explicit multi-speaker roles and role descriptions.
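The length-bucket distribution above can be reproduced with a simple binning function; the bucket boundaries follow the published distribution, while the half-open edge handling is an assumption:

```python
def length_bucket(n_tokens):
    """Assign a prompt (token count, Qwen3-8B tokenizer) to the length
    buckets reported by PodBench: 0-2K, 2-4K, 4-8K, 8-16K, 16-21K.
    Half-open intervals [lower, upper) are an assumed convention.
    """
    edges = [(2_000, "0-2K"), (4_000, "2-4K"), (8_000, "4-8K"),
             (16_000, "8-16K"), (22_000, "16-21K")]
    for upper, label in edges:
        if n_tokens < upper:
            return label
    raise ValueError(f"prompt length {n_tokens} exceeds benchmark range")
```

The reported minimum (1,331 tokens) and maximum (21,649 tokens) land in the first and last buckets respectively.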
2. Evaluation Framework
PodBench evaluation is two-staged, systematically decoupling adherence to user requirements from the intrinsic quality of generated podcast scripts.
Stage 1: Instruction Following (IF)
An LLM judge (Claude-4.5-Opus, high effort, temperature 1.0, top-p 0.95) infers a criterion checklist from the user instruction, considering context (language, speakers, topic, length, depth/audience). Each criterion is judged as satisfied or violated, and the per-sample IF score aggregates these judgments as the percentage of satisfied criteria on a 0–100 scale.
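A minimal sketch of this checklist aggregation, assuming each criterion receives a pass/fail judgment and IF is the fraction satisfied scaled to 100 (consistent with the 0–100 IF values reported later):

```python
def instruction_following_score(criteria_results):
    """Aggregate per-criterion judgments into a 0-100 IF score.

    criteria_results: list of booleans, one per checklist item the
    LLM judge inferred from the instruction. The percentage-satisfied
    aggregation is an assumption consistent with the 0-100 IF scale.
    """
    if not criteria_results:
        raise ValueError("checklist must contain at least one criterion")
    return 100.0 * sum(criteria_results) / len(criteria_results)
```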
Stage 2: Podcast Script Quality (Q)
A 100-point rubric reflects audio-oriented considerations:
- Content Substance (45 pts): Depth of insight (12), argument credibility (10), information richness (8), perspective diversity (8), emotional resonance (7)
- Narrative Engagement (30 pts): Opening hook (6), structure (10), rhythm (8), ending completeness (6)
- Conversational Naturalness (25 pts): Colloquialness/listenability (15), vividness/imagery (10)
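The rubric's point allocations sum exactly to 100 (45 + 30 + 25), so the quality score Q can be computed by summing validated sub-scores; the nested dictionary layout below is an implementation choice, while the maxima come directly from the published rubric:

```python
# Maximum points per sub-criterion, as published in the rubric.
RUBRIC = {
    "content_substance": {
        "depth_of_insight": 12, "argument_credibility": 10,
        "information_richness": 8, "perspective_diversity": 8,
        "emotional_resonance": 7,
    },
    "narrative_engagement": {
        "opening_hook": 6, "structure": 10, "rhythm": 8,
        "ending_completeness": 6,
    },
    "conversational_naturalness": {
        "colloquialness_listenability": 15, "vividness_imagery": 10,
    },
}

def quality_score(scores):
    """Sum sub-scores into the 100-point quality score Q, validating
    each value against its rubric maximum. `scores` mirrors RUBRIC's
    nesting: {axis: {sub_criterion: awarded_points}}.
    """
    total = 0.0
    for axis, subs in RUBRIC.items():
        for name, max_pts in subs.items():
            s = scores[axis][name]
            if not 0 <= s <= max_pts:
                raise ValueError(f"{name} must be in [0, {max_pts}]")
            total += s
    return total
```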
Length Auditing: For duration compliance, realized script length (e.g., Chinese character count for Chinese scripts) is checked against the requested target with a ±5% tolerance.
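The ±5% tolerance check itself is straightforward; the unit of measurement (character count vs. estimated minutes) follows the heuristic described above, and the symmetric interval is a direct reading of the stated tolerance:

```python
def passes_length_audit(script_len, target_len, tolerance=0.05):
    """Check duration compliance: the realized length (e.g., Chinese
    character count) must fall within +/-5% of the requested target.
    """
    lo = target_len * (1 - tolerance)
    hi = target_len * (1 + tolerance)
    return lo <= script_len <= hi
```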
Final Score
The composite PodBench score is the unweighted average of the two stages, PB = (IF + Q) / 2, consistent with the reported results (e.g., GPT-5.1: (95.52 + 69.64) / 2 = 82.58).
This split explicitly highlights the divergence between superficial compliance and substantive generative competence.
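The reported score rows are consistent with an unweighted mean of the two stage scores, which makes the composite trivial to compute:

```python
def podbench_score(if_score, quality_score):
    """Composite PodBench score as the unweighted mean of instruction
    following (IF) and script quality (Q), both on 0-100 scales.
    Consistent with the published rows, e.g. GPT-5.1:
    (95.52 + 69.64) / 2 = 82.58.
    """
    return (if_score + quality_score) / 2
```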
3. Baseline Models and Experimental Protocol
The benchmark includes 34 models grouped as:
- Proprietary LLMs: GPT-5.1-instant, GPT-4o, Claude-4.5-Sonnet, Gemini-3-pro-preview
- Open-Source LLMs (instruct mode): InternLM3-8B, Llama3.1-8B/70B, Llama-4-Scout-17B, Qwen2.5-7B/32B/72B, Qwen3-1.7B/4B/8B/14B/32B/30B-A3B/235B-A22B, DeepSeek-V3-0324, DeepSeek-R1-distill
- Open-Source LLMs (thinking mode): Qwen3 and DeepSeek-R1 variants with explicit reasoning
- Writing-Enhanced LLMs: LongWriter-Llama3.1-8B, LongWriter-GLM4-9B, LongWriter-Zero-32B
All models are evaluated using their recommended inference hyperparameters (max new tokens 16,384; prompt templates enforce source grounding, speaker format, and user constraints).
Key outcomes (Table 2; the PB Score column is the average of IF and overall Content Quality):
| Model Type | IF | Content Quality | PB Score |
|---|---|---|---|
| GPT-5.1 (proprietary) | 95.52 | 69.64 | 82.58 |
| Claude-4.5-Sonnet | — | — | 79.91 |
| Gemini-3-pro | — | — | 79.97 |
| GPT-4o | — | — | 73.15 |
| Qwen3-235B (os-instruct) | 93.64 | 58.39 | 76.02 |
| DeepSeek-V3 (os-instruct) | — | — | 76.16 |
| Qwen3-32B (os-instruct) | — | — | 73.54 |
| Qwen3-235B (os-thinking) | — | — | 77.37 |
| Qwen3-32B (os-thinking) | — | — | 76.85 |
| LongWriter-Zero-32B | 84.91 | — | 72.10 |
Other open-source instruct models cluster in the PB 43–66 range, depending primarily on scale and variant.
On a 200-query held-out set, PodBench’s protocol achieves 86.80% agreement with human majority preference, outperforming both uniform rubrics (83.22%) and generic checklist baselines (79.60%).
4. Principal Findings and Diagnostic Insights
- Decoupling Instruction Following and Script Quality: Models, including cutting-edge proprietary and open-source LLMs, frequently attain IF scores above 90 yet underperform on Content Substance (the 45-point axis), leaving script richness and depth under-developed.
- Content Substance Bottlenecks: The lowest sub-scores across all families occur in the “Depth of Insight” and “Argument Credibility” categories, highlighting difficulty in producing sustained, grounded, and analytical dialogue across long-form, multi-speaker settings.
- Long-Context and Robustness Trends: Instruction Following (IF) degrades as prompt length increases beyond 8K tokens, especially for open instruct models, whereas high-performing models can slightly improve script quality when leveraging longer contexts. Explicit reasoning (“thinking mode”) boosts IF stability across segment lengths (0–21K), particularly for medium-sized models; script quality gains are smaller but reliable.
- Multi-Speaker Coordination Costs: Open instruct models exhibit marked declines (−10 pts IF) in 3–4 speaker scenarios, in contrast to proprietary models, which remain robust or marginally improve with dyads. “Thinking mode” mitigates sensitivity to increased speaker count, reflecting the benefit of explicit planning for turn-taking and role tracking within conversational generation settings.
- Implications for Audio-Centric Generation: Explicit planning and reasoning improve structural constraint management (length, role assignment) but do not alone resolve deficits in content depth and substance. Advancement requires training objectives explicitly aligned with sustained analytical synthesis and potentially the incorporation of agentic pipelines or multi-agent coordination mechanisms for enhanced multi-speaker naturalness. Downstream evaluation should be expanded to account for prosody and TTS quality.
5. Significance and Future Directions
PodBench establishes a reproducible, human-aligned testbed for long-form, instruction-aware podcast script generation with explicit dual scoring targeting both user requirement compliance and domain-specific script quality. Its architecture surfaces key challenges in LLM-based audio content generation—notably persistent shortfalls in content substance and multi-speaker coordination under increasingly complex requirements and long-context inputs.
A plausible implication is that addressing these challenges will necessitate architectural and objective innovations in LLM development, including enhanced planning, sustained synthesis capabilities, and more representative evaluation incorporating audio realization metrics. Prospective agentic and multi-agent modeling, as well as targeted training for depth of analysis, are posited as promising avenues for future research within this domain (Xu et al., 21 Jan 2026).