
MovieSum Dataset Overview

Updated 17 December 2025
  • The MovieSum dataset pairs 2,200 screenplays with human-authored Wikipedia plot summaries, providing a comprehensive resource for long-document summarization research.
  • It features rigorous manual formatting with XML tags that delineate screenplay components such as dialogues, scene headings, and action blocks.
  • The dataset integrates external metadata via IMDb IDs to support advanced retrieval and knowledge-augmented narrative analysis.

MovieSum is a large-scale dataset designed for abstractive summarization of movie screenplays, addressing the unique challenges posed by extremely long documents and the structural complexity of the screenplay format. Each instance in MovieSum comprises a professionally formatted screenplay and its associated human-authored Wikipedia plot summary, offering both fine-grained structural tags and rich metadata. This resource facilitates research on long-context summarization, narrative understanding, and the exploitation of external knowledge for enhanced movie comprehension (Saxena et al., 2024).

1. Dataset Composition and Scale

MovieSum contains 2,200 English-language movie screenplays, each paired with a Wikipedia plot summary. The dataset was curated from an initial pool of 5,639 raw screenplay files, with extensive filtering conducted to remove duplicates and incomplete scripts. The average screenplay length is 29,000 words (with the bulk ranging from 25,000 to 35,000, and some outliers from 15,000 up to over 40,000 words), far exceeding typical TV episodes or existing movie script corpora. Plot summary lengths average 717 words, with most between 400 and 1,000 words, and distribution tails extending from approximately 200 to 1,800 words.

Abstractiveness is a salient property of MovieSum: 31.7% of summary 1-grams, 68.9% of 2-grams, 93.1% of 3-grams, and 98.6% of 4-grams are novel (i.e., do not appear in the associated screenplay). This high level of n-gram novelty indicates that the summaries require both significant abstraction and compression.
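The novelty statistic above can be reproduced in a few lines. The sketch below uses simple whitespace tokenization and no stemming, which are illustrative assumptions rather than the paper's exact preprocessing:

```python
def ngrams(tokens, n):
    """All contiguous n-grams in a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_pct(summary, source, n):
    """Percentage of summary n-grams that never appear in the source."""
    summ = ngrams(summary.lower().split(), n)
    src = ngrams(source.lower().split(), n)
    if not summ:
        return 0.0
    return 100.0 * len(summ - src) / len(summ)

# Toy example: 2 of the 5 summary bigrams are absent from the source.
source = "the hero plans a daring escape from the prison at dawn"
summary = "the hero escapes prison at dawn"
print(round(novel_ngram_pct(summary, source, 2), 1))  # 40.0
```

Applied over the whole corpus with n = 1..4, this yields the novelty percentages reported above.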

2. Structural Formatting and Annotation

Screenplays in MovieSum were manually corrected and re-formatted using the Celtx professional screenplay tool, yielding fine-grained XML exports in which scene headings, action/description blocks, dialogue blocks, character names, parentheticals, transitions, and shots are all explicitly tagged. This rigorous manual process minimizes the brittleness typical of regular expression-based extraction and ensures that each structural element is captured unambiguously and consistently.

Structural element distinctions are maintained throughout the dataset, supporting in-depth modeling of screenplay-specific features. These tags facilitate both research into the roles of narrative subcomponents and the development of summarization architectures that leverage structure, such as separate encoders or selectors for dialogues versus scene descriptions.
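As a sketch of how these structural tags can be exploited, the snippet below groups screenplay text by element type with the standard library's XML parser. The tag names used here (`scene`, `heading`, `action`, `dialog`, `character`) are illustrative assumptions and may differ from the actual Celtx export schema:

```python
import xml.etree.ElementTree as ET

# Tag names are illustrative assumptions, not the released MovieSum schema.
SCRIPT = """
<screenplay>
  <scene>
    <heading>INT. APARTMENT - NIGHT</heading>
    <action>MARLA paces by the window.</action>
    <dialog><character>MARLA</character>They're late again.</dialog>
  </scene>
</screenplay>
"""

def split_by_element(xml_text):
    """Group screenplay text into buckets by structural element type."""
    root = ET.fromstring(xml_text)
    buckets = {"heading": [], "action": [], "dialog": []}
    for scene in root.iter("scene"):
        for child in scene:
            if child.tag in buckets:
                # itertext() also collects nested tags such as <character>.
                text = " ".join(t.strip() for t in child.itertext() if t.strip())
                buckets[child.tag].append(text)
    return buckets

parts = split_by_element(SCRIPT)
print(parts["dialog"])  # ["MARLA They're late again."]
```

Separating elements this way is the natural preprocessing step for structure-aware summarizers, e.g. feeding dialogues and scene descriptions to separate encoders.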

3. Metadata and Linkage to External Knowledge

Each record in MovieSum includes a movie title, IMDb identifier, and release year. Via the IMDb ID, downstream users can programmatically retrieve extended metadata—genre, director, cast, ratings, runtime, budget, box office, user reviews—from the IMDb API. Release years and genres further enable cross-referencing with contemporary reviews, award information, and news articles. This alignment potential supports research in knowledge-augmented summarization, information retrieval, and cross-modal narrative studies.
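A minimal sketch of the linkage step is validating a record's IMDb ID and building its canonical title page URL, from which a client library or keyed API service would then fetch the extended metadata. The record field names below are assumptions about the released format, and the ID value is hypothetical:

```python
import re

def imdb_title_url(imdb_id):
    """Build the canonical IMDb title page URL from a record's ID.

    IMDb title IDs are 'tt' followed by 7-8 digits. Actual metadata
    retrieval would go through a client library or API key-based
    service; this sketch only constructs the lookup target.
    """
    if not re.fullmatch(r"tt\d{7,8}", imdb_id):
        raise ValueError(f"not a valid IMDb title ID: {imdb_id!r}")
    return f"https://www.imdb.com/title/{imdb_id}/"

# Hypothetical record; field names assume a simple per-movie dict.
record = {"title": "Example Film", "imdb_id": "tt0012345", "year": 1994}
print(imdb_title_url(record["imdb_id"]))  # https://www.imdb.com/title/tt0012345/
```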

A plausible implication is that MovieSum provides an experimental bridge for combining natural language understanding with structured external world knowledge, advancing techniques at the intersection of text summarization and multi-source information integration.

4. Dataset Splits and Usage Protocol

MovieSum is randomly divided (seeded for reproducibility) into three standard splits: 1,800 for training, 200 for validation, and 200 for testing. The splits are balanced for consistent distribution of genres and production years. All baseline experiments and evaluations are conducted on these partitions, supporting direct comparison and reproducibility.
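A seeded split of this shape can be sketched as follows; the seed value, and the omission of the genre/year balancing step, are illustrative simplifications rather than the released protocol:

```python
import random

def split_dataset(ids, seed=42, n_val=200, n_test=200):
    """Shuffle with a fixed seed, then carve off validation and test sets.

    Mirrors the 1,800/200/200 protocol described above; the seed and the
    lack of genre/year balancing are assumptions, not the official split.
    """
    ids = sorted(ids)            # canonical order before shuffling
    rng = random.Random(seed)    # seeded for reproducibility
    rng.shuffle(ids)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(range(2200))
print(len(train), len(val), len(test))  # 1800 200 200
```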

5. Baseline Systems and Benchmark Evaluation

MovieSum supports evaluation by both extractive and abstractive summarization models, as well as recent long-context LLMs. The table below summarizes main baseline results (test set) using ROUGE-1 (R-1), ROUGE-2 (R-2), ROUGE-L (R-L), and BERTScore F1 (B_F1):

System  R-1  R-2  R-L  B_F1
Lead-512 10.35 1.27 9.84 46.23
TextRank 33.32 5.27 32.10 51.85
FLAN-UL2 8K (zero-shot) 23.62 4.29 22.01 50.87
Vicuna 13B 16K (zero-shot) 16.35 3.55 15.44 47.07
Moving-window Vicuna 16K 19.56 3.32 18.57 51.53
PEGASUS-X 16K (fine-tuned) 42.42 8.16 40.63 54.36
LongT5 16K (fine-tuned) 41.49 8.54 39.78 55.68
LED 16K (fine-tuned) 44.85 9.83 43.12 58.73

Supervised, fine-tuned long-input models (LED 16K, PEGASUS-X 16K, LongT5 16K) outperform extractive and zero-shot LLM baselines by a considerable margin in ROUGE and BERTScore, but an R-2 ceiling of ≈10 suggests persistent room for improvement, likely due to the extreme input length and required abstraction. Structure-only ablations (dialogues only, descriptions only, heuristic selection within LED) yield highly similar results, confirming that current models do not yet leverage screenplay structure to its full potential.
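For orientation, ROUGE-1 F1 reduces to clipped unigram-overlap precision and recall. The toy implementation below omits the stemming and other preprocessing that standard toolkits (and the reported benchmark numbers) use:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Clipped unigram-overlap F1 between candidate and reference.

    A bare-bones illustration of ROUGE-1; reported results come from
    standard toolkits with stemming and additional preprocessing.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # per-token counts clipped
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    return 2 * p * r / (p + r)

print(round(rouge1_f1("the cat sat on the mat",
                      "the cat lay on the mat"), 3))  # 0.833
```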

6. Comparative Analysis and Domain-Specific Challenges

MovieSum differs markedly from prior summarization corpora:

  • Scale and Recency: 2,200 screenplays, over twice the size of ScriptBase-j (917 scripts), with coverage up to 2023.
  • Input Length: Average input length is ≈29,000 words, far exceeding both TV episode corpora (e.g., SummScreenFD: 7,605 words) and narrative plot-based datasets (e.g., NarraSum: ≈786 tokens per document) (Zhao et al., 2022).
  • Structural Annotation: True screenplay format, including scene/audio-visual grammar, versus raw transcripts or informal plot synopses.
  • Metadata: Inclusion of IMDb linkage and release data enables external enrichment, unique among large-scale movie script datasets.

Principal challenges include the need to maintain context across very long documents where salient events are temporally and structurally dispersed, nontrivial blending of dialogue-driven and description-driven narrative content, and high degrees of abstractive re-writing between source and reference. Zero-shot LLMs perform suboptimally, even with extended context windows, indicating incomplete utilization of input due to attention locality or truncation.
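The moving-window strategy used by one of the baselines above can be approximated by overlapping chunking, with each chunk summarized independently (or conditioned on a running summary) and the partial outputs merged. The window and stride sizes below are illustrative, not the paper's configuration:

```python
def sliding_windows(tokens, window=16000, stride=12000):
    """Split a long token sequence into overlapping fixed-size windows.

    Sketch of a moving-window preprocessing step for long screenplays;
    sizes are illustrative assumptions, not the baseline's exact setup.
    """
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # last window already covers the tail
    return windows

# A 29,000-token screenplay yields three overlapping windows,
# the last one partial.
chunks = sliding_windows(list(range(29000)))
print([len(c) for c in chunks])  # [16000, 16000, 5000]
```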

7. Research Value and Outlook

MovieSum enables empirical investigation of long-document summarization, abstractive rewriting, and structural understanding within the rich, professionally authored screenplay domain. Key research directions supported include:

  • Development of scalable architectures for very long input processing (beyond 16K tokens)
  • Leveraging structural screenplay tags for more effective segmentation, importance modeling, and cross-scene reasoning
  • Combining script-internal cues with metadata-driven retrieval and external knowledge augmentation (via IMDb IDs)
  • Exploring the limits of current instruction-tuned and fine-tuned LLMs, and the need for hierarchical or memory-based models

The high abstraction requirements and complex document structure render MovieSum a uniquely valuable resource for advancing both methodological and theoretical research at the intersection of long-text summarization, narrative understanding, and information integration in the film domain (Saxena et al., 2024).
