SCROLLS Benchmark Overview
- SCROLLS Benchmark is a comprehensive suite designed to evaluate long-context reasoning in natural language processing by aggregating tasks from diverse domains.
- It recasts various tasks, including summarization, question answering, and natural language inference, into a unified text-to-text format for consistent evaluation.
- The benchmark employs robust evaluation metrics and baseline comparisons to highlight performance gaps between state-of-the-art models and humans when reasoning over extended texts.
SCROLLS (Standardized CompaRison Over Long Language Sequences) is a benchmark suite for NLP designed to rigorously evaluate models' abilities to reason over naturally long texts. It addresses the gap in existing benchmarks, which predominantly focus on short context lengths, by aggregating tasks that require synthesizing information spread across extensive narrative, technical, and dialog contexts. SCROLLS recasts its diverse tasks into a unified text-to-text format and enables standardized evaluation and comparison of model architectures, particularly those capable of handling long input sequences.
1. Motivation and Benchmark Scope
SCROLLS was developed in response to limitations in prevalent NLP benchmarks, which emphasize short sequences, despite the prevalence of lengthy documents in practical application domains (e.g., books, scientific articles, meeting minutes, and legal contracts). Advances in long-sequence transformers such as Longformer and its encoder-decoder variant (LED) motivated the need for a comprehensive framework that evaluates long-range reasoning on authentic, real-world data without artificial padding or domain constraints. SCROLLS fills this role by providing a multi-domain, multi-task platform focused on the synthesis of evidence and information distributed across large input spans.
2. Task Types and Domains
The SCROLLS suite encompasses a range of reasoning challenges distributed over several domains. The tasks fall into three core types:
- Summarization: Whole-document summarization and query-based summarization.
- Question Answering (QA): Both open-ended and multiple-choice formats.
- Natural Language Inference (NLI): Entailment classification over long documents.
The selected domains include government reports, television and film scripts, meeting transcripts, scientific papers, literature, and legal documents. The following table summarizes the included datasets, representative task types, and domains:
| Dataset | Task | Domain |
|---|---|---|
| GovReport | Summarization | Government |
| SummScreenFD | Summarization | TV Shows |
| QMSum | Query-Based Summarization | Meetings |
| Qasper | QA | Science Papers |
| NarrativeQA | QA | Literature/Film |
| QuALITY | Multiple-Choice QA | Literature/Misc |
| ContractNLI | NLI | Legal |
This multi-domain selection ensures that models are assessed on authentic long-context reasoning rather than mere capacity to process large sequences.
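These datasets are distributed in the unified format described below; the sketch here shows one way to load them with the Hugging Face `datasets` library. The `tau/scrolls` repository name, the configuration names, and the `input`/`output` field names are assumptions about the public release rather than details stated in this section.

```python
from datasets import load_dataset

# Assumed configuration names for the seven SCROLLS datasets.
CONFIGS = [
    "gov_report", "summ_screen_fd", "qmsum",   # summarization
    "qasper", "narrative_qa", "quality",       # question answering
    "contract_nli",                            # natural language inference
]

for config in CONFIGS:
    ds = load_dataset("tau/scrolls", config, split="validation")
    example = ds[0]
    # Each example carries one long "input" string and one target "output" string.
    print(f"{config}: ~{len(example['input'].split())} whitespace-delimited tokens in the input")
```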
3. Dataset Selection and Preprocessing
SCROLLS curators systematically handpicked datasets according to the following principal criteria:
- Natural Context Length: Input sequences are innately long, reflecting real-world scenarios.
- Long-range Reasoning Requirement: Tasks require evidence synthesis across substantial spans of text.
- Genre and Reasoning Diversity: Inclusion of diverse genres and multiple reasoning types.
- Quality Control: Rigorous cleaning eliminates trivial cases (e.g., where summaries appear verbatim in the input) and enforces minimum length and non-overlap constraints for input-output pairs; a minimal filter along these lines is sketched after this list.
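The sketch below illustrates the kind of filter such cleaning implies; the `input`/`output` field names and the length threshold are hypothetical, not the curators' actual rules.

```python
def is_trivial(example: dict, min_input_words: int = 1000) -> bool:
    # Flag pairs a SCROLLS-style cleaning pass might discard (illustrative).
    source, target = example["input"], example["output"]
    if target.strip() and target.strip() in source:
        return True                         # output appears verbatim in the input
    if len(source.split()) < min_input_words:
        return True                         # input is not naturally long
    return False

# cleaned = [ex for ex in raw_examples if not is_trivial(ex)]
```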
All tasks are converted into a uniform sequence-to-sequence format. For QA and NLI tasks, the query or hypothesis is prepended to the long context, separated with double newlines; answer options for multiple-choice QA are included in the input string. This approach enables uniform modeling and evaluation protocols across all datasets.
Example Format:

    Input:  [Query/Prompt]
            [Long Context/Document]
    Output: [Answer/Summary/Label]
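A hypothetical helper that produces inputs of this shape, assuming the query or hypothesis is prepended to the document and separated by two newlines as described above; the option formatting for multiple-choice QA is illustrative.

```python
def build_input(context: str, query: str | None = None,
                options: list[str] | None = None) -> str:
    # Summarization: the document alone is the input.
    if query is None:
        return context
    # Multiple-choice QA: fold the answer options into the query block.
    if options:
        letters = "ABCDEFGH"
        query = query + "\n" + "\n".join(
            f"({letters[i]}) {option}" for i, option in enumerate(options)
        )
    # QA/NLI: query or hypothesis first, then two newlines, then the long document.
    return f"{query}\n\n{context}"

# Example (open-ended QA over a paper):
# build_input(paper_text, query="Which baselines do the authors compare against?")
```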
4. Evaluation Metrics and Aggregate Scoring
Task-specific and standard automatic evaluation metrics are utilized:
- ROUGE (geometric mean of ROUGE-1/2/L): Applied to summarization tasks.
- F1 Score (unigram overlap): Applied to open-ended QA.
- Exact Match (EM): Applied to multiple-choice QA and NLI.
The aggregate SCROLLS score is computed as the average of scores across all datasets, providing a single comprehensive metric for model performance across the entire suite.
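The sketch below illustrates these metrics and the aggregate average, using Hugging Face's `evaluate` package for ROUGE; the official scoring scripts may differ in details such as answer normalization, stemming, and multi-reference handling.

```python
from collections import Counter
import evaluate  # Hugging Face evaluation library (assumed available)

rouge = evaluate.load("rouge")

def summarization_score(predictions: list[str], references: list[str]) -> float:
    # Geometric mean of the ROUGE-1, ROUGE-2 and ROUGE-L F-measures.
    scores = rouge.compute(predictions=predictions, references=references)
    return (scores["rouge1"] * scores["rouge2"] * scores["rougeL"]) ** (1.0 / 3.0)

def unigram_f1(prediction: str, reference: str) -> float:
    # Token-overlap F1 for open-ended QA (official scripts also normalize
    # case/punctuation and take the maximum over multiple references).
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_tokens), overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, reference: str) -> float:
    # Exact match for multiple-choice QA answers and NLI labels.
    return float(prediction.strip() == reference.strip())

def scrolls_score(per_dataset_scores: list[float]) -> float:
    # Aggregate SCROLLS score: unweighted mean over all datasets.
    return sum(per_dataset_scores) / len(per_dataset_scores)
```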
5. Baseline Models, Results, and Human Comparison
SCROLLS establishes three baseline categories:
- Heuristic Baselines: Naive strategies, such as returning a prefix of the input or predicting the majority class, that establish a lower bound on expected performance.
- BART-base: A standard encoder-decoder transformer model pre-trained on sequences up to 1024 tokens. Evaluated with truncation at fixed context lengths (256, 512, 1024 tokens) due to architectural constraints.
- Longformer Encoder-Decoder (LED-base): Extends Longformer’s efficient attention mechanism to encoder-decoder settings, supporting inputs up to 16,384 tokens. Initialized from BART weights without further pre-training on long sequences.
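A minimal sketch of running these two transformer baselines with the `transformers` library, assuming the public `facebook/bart-base` and `allenai/led-base-16384` checkpoints; the truncation lengths and generation settings are illustrative rather than the paper's exact fine-tuning setup.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def make_generator(checkpoint: str, max_input_tokens: int):
    # Load a sequence-to-sequence baseline and return a function that truncates
    # the unified input string to a fixed context length before generating.
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    def generate(text: str) -> str:
        encoded = tokenizer(text, truncation=True,
                            max_length=max_input_tokens, return_tensors="pt")
        output_ids = model.generate(**encoded, max_new_tokens=256, num_beams=4)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return generate

bart_1024 = make_generator("facebook/bart-base", 1024)       # truncated context
led_16k = make_generator("allenai/led-base-16384", 16384)    # full long context
```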
Key Observations:
- Context Length Correlation: Models with access to longer contexts produce better results, underscoring that access to long contexts is critical for these tasks.
- Relative Model Performance: LED achieves the highest scores on the largest datasets; however, its improvements over context-truncated BART are modest, even when utilizing the full context window.
- Human-Model Disparity: Substantial performance gaps exist; for example, on Qasper, human F1 is approximately 61%, whereas the highest model F1 is 26.6%; on QuALITY, human EM is ~93.5%, compared to the best model EM of ~25.8%.
A plausible implication is that effective aggregation and synthesis over extensive narrative spans remain unsolved problems for current model architectures.
6. Platform Infrastructure and Ongoing Benchmarking
SCROLLS hosts a live leaderboard utilizing private test sets. Participants must submit model outputs for all datasets, after which scores are calculated automatically. The public leaderboard provides not only per-dataset evaluations but also an aggregate performance score, incentivizing holistic model development and facilitating transparent, continuous community benchmarking and comparison.
7. Significance, Limitations, and Future Directions
SCROLLS advances NLP evaluation by supplying the first comprehensive, multi-domain benchmark that focuses on tasks naturally requiring long-context reasoning. The unified text-to-text paradigm allows for general-purpose modeling and research across pretraining, fine-tuning, and inference regimes, bypassing dataset-specific engineering. The benchmark also exposes clear limitations in prevailing architectures: the gap in both aggregate and per-task performance relative to humans points to the need for further innovation in model architectures and training regimes targeting long-sequence comprehension.
Current limitations include dependence on English-language datasets and reliance on standard automatic metrics (e.g., ROUGE), which may undervalue legitimate paraphrases in summarization. Extending SCROLLS to additional languages and improving evaluative methodologies are recognized as future directions.
In summary, SCROLLS standardizes, challenges, and motivates research in the domain of long-context modeling, providing unified infrastructure and rigorous tasks that expose the essential difficulties in reasoning over long language sequences.