LongBench-v2 & ZeroScrolls Benchmarks
- LongBench-v2 is a benchmark suite that systematically evaluates long-context reasoning across diverse domains such as code, dialogue, and structured data using accuracy, EM, and F1 metrics.
- ZeroScrolls is a zero-shot benchmark focusing on natural language understanding of extensive, naturally-authored documents through tasks like summarization, QA, and aggregation.
- Both benchmarks expose key challenges such as score dilution in static self-attention and serve as testbeds for inference-time strategies such as qTTT that enhance LLM inference over extended contexts.
LongBench-v2 and ZeroScrolls are two leading benchmarks designed to evaluate the capabilities of LLMs over extremely long textual contexts, typically spanning several thousand to tens of thousands of tokens. They address distinct methodological needs: LongBench-v2 primarily focuses on systematic evaluation of long-context reasoning across multiple domains and task formats, while ZeroScrolls emphasizes zero-shot natural-language understanding over naturally occurring, extensive documents and introduces unique aggregation and sorting tasks. Both benchmarks are central tools in the ongoing assessment and development of LLM architectures and inference-time strategies for handling long-range dependencies.
1. Benchmark Design and Coverage
LongBench-v2
LongBench-v2 comprises six task domains, each presenting either multiple-choice or extractive-answer queries over lengthy documents:
- Code Repositories: Tasks require localization of function call arguments or bug spans across multi-file project trees.
- Long Dialogue History: Models must answer questions whose evidence is embedded in chat transcripts exceeding 20K tokens.
- Long Structured Data: Queries target specific facts within tabular or JSON-like content spanning thousands of tokens.
- Long In-Context (Synthetic): "Needle-in-haystack" QA over incrementally growing synthetic text.
- Multi-Document QA: Synthesis of information from multiple documents, each typically 2–5K tokens.
- Single-Document QA: Questions reference targeted information within single 8–20K token documents.
Context lengths for individual tasks typically range from 5K to 50K tokens, with each domain designed to probe either fine-grained retrieval, multi-hop reasoning, or compositional understanding. Evaluation uses accuracy for multiple-choice/QA tasks; when both Exact Match (EM) and token-level F1 are reported for span questions, EM serves as the primary score. For a predicted answer span $p$ and reference span $r$:

$$\mathrm{EM}(p, r) = \mathbb{1}[\,p = r\,], \qquad \mathrm{F1}(p, r) = \frac{2\,P\,R}{P + R},$$

where $P$ and $R$ denote the token-level precision and recall of $p$ against $r$. Final scores are averaged across the official development sets (Bansal et al., 15 Dec 2025).
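To make the span metrics concrete, the following is a minimal sketch assuming SQuAD-style normalization (lowercasing plus whitespace tokenization); the exact normalization and aggregation in the official LongBench-v2 scripts may differ:

```python
from collections import Counter

def _normalize(text: str) -> list[str]:
    # Assumed normalization: lowercase and whitespace-tokenize.
    return text.lower().split()

def exact_match(prediction: str, reference: str) -> float:
    """EM(p, r) = 1 if the normalized spans are identical, else 0."""
    return float(_normalize(prediction) == _normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference span."""
    pred, ref = _normalize(prediction), _normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the cache size", "The cache size"))           # 1.0
print(round(token_f1("the cache size limit", "cache size"), 3))  # 0.667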
ZeroScrolls
ZeroScrolls extends the SCROLLS design, removing any requirement for supervised fine-tuning, and includes ten tasks (four summarization, four question answering, two aggregation/sorting), all in the zero-shot regime. Domains include:
- Summarization: U.S. Congressional reports (GovReport), TV scripts (SummScreenFD), meeting transcripts (QMSum), and query-focused story summarization (SQuALITY).
- Question Answering: Full research papers (Qasper), multi-hop Wikipedia QA (MuSiQue), narrative comprehension (NarrativeQA), and long-form multiple-choice (QuALITY).
- Aggregation and Sorting: Sentiment aggregation over 50 hotel reviews (SpaceDigest), and chapter ordering in narrative texts (BookSumSort).
Inputs typically range from 3K to 50K tokens per example. Each task uses a metric aligned with its output type: ROUGE geometric mean, F1, accuracy, exponential similarity (ES) for aggregation, and concordance index (Cidx) for sorting (Shaham et al., 2023).
2. Evaluation Protocols and Metrics
LongBench-v2 evaluates models independently per example, averaging scores over domains. For extractive QA, EM and F1 are both measured but EM is primary. For multiple-choice, accuracy is calculated as $\text{Accuracy} = \#\text{correct}/\#\text{questions}$.
ZeroScrolls enforces a canonical zero-shot format: each example comprises natural-language instructions, the long context, and an output to be generated without any in-context exemplars. The gold references for test and small validation splits are private, supporting a live leaderboard. Metrics for evaluation include:
- ROUGE geometric mean (over ROUGE-1, ROUGE-2, and ROUGE-L) for summarization,
- F1 (maximum over the multi-reference set) for extractive and classification QA,
- Exponential similarity for sentiment aggregation,
- Concordance index for ordering tasks (a minimal sketch of this metric appears after the list).
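As an illustration of the ordering metric, the sketch below computes a plain pairwise concordance index (fraction of chapter pairs whose predicted relative order agrees with the gold order). This is an assumed simplification; the official ZeroScrolls scoring script may handle missing or duplicated chapters differently.

```python
from itertools import combinations

def concordance_index(gold_order: list[str], predicted_order: list[str]) -> float:
    """Fraction of item pairs whose relative order matches the gold ordering."""
    gold_rank = {item: i for i, item in enumerate(gold_order)}
    pred_rank = {item: i for i, item in enumerate(predicted_order)}
    pairs = list(combinations(gold_order, 2))
    concordant = sum(
        (gold_rank[a] < gold_rank[b]) == (pred_rank[a] < pred_rank[b])
        for a, b in pairs
    )
    return concordant / len(pairs)

# One adjacent swap among four chapters leaves 5 of 6 pairs concordant.
gold = ["ch1", "ch2", "ch3", "ch4"]
pred = ["ch1", "ch3", "ch2", "ch4"]
print(round(concordance_index(gold, pred), 3))  # 0.833
```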
A key design distinction is that ZeroScrolls excludes any training set, enforcing rigorous zero-shot evaluation and compelling LLMs to operate via prompt-based transfer alone.
3. Empirical Performance and Model Comparison
Empirical results from "Let's (not) just put things in Context …" demonstrate acute score dilution in static self-attention as context grows, with baseline in-context accuracy on LongBench-v2 collapsing for longer input lengths (Bansal et al., 15 Dec 2025). Three inference-time strategies are compared for Qwen3-4B:
- In-Context Only: No additional inference computation beyond initial forward pass.
- Thinking (Chain-of-Thought): Generation of intermediate tokens for reasoning.
- Query-Only Test-Time Training (qTTT): 32 steps of adaptation (span size ) of query-projection matrices using cross-entropy loss over next-token prediction, leaving other parameters fixed.
Under compute-matched conditions, the following average scores are observed:
| Regime | LongBench-v2 (%) | ZeroScrolls (%) |
|---|---|---|
| In-Context | 27.0 | 18.4 |
| Thinking | 33.5 | 23.9 |
| qTTT | 39.6 | 32.5 |
qTTT delivers +12.6 and +14.1 percentage-point gains, respectively, relative to baseline in-context inference. On domains requiring pinpoint retrieval (Code Repos, Multi-Document QA, MuSiQue, QuALITY), qTTT's margin exceeds 20 pp under matched computational budgets.
ZeroScrolls scores from the original benchmark paper (Shaham et al., 2023) show GPT-4 (41.7) and Claude (39.1) leading among zero-shot LLMs, with naive baselines trailing far behind on aggregation and long-range ordering.
4. Theoretical Underpinnings and Score Dilution
Score dilution arises as an intrinsic limitation of static self-attention at large context lengths. Given $N$ distractor positions whose attention logits lie within a margin $\Delta$ of the target's logit, the softmax attention mass on the target satisfies

$$\alpha_{\text{target}} \;\le\; \frac{1}{1 + N\, e^{-\Delta}}.$$

Thus, for a constant gap $\Delta$ and a growing number of distractors, the attention on the correct token approaches zero as $N \to \infty$ ("Score Dilution Lemma"). To guarantee a fixed minimum mass, the logit margin must scale as $\Omega(\log N)$. "Thinking" tokens (intermediate autoregressive outputs) cannot overcome dilution because their own attention mass on the "needle" position becomes vanishingly small in long contexts.
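A small numerical illustration of this bound (a sketch based on the softmax inequality above, not code from the cited paper): with a fixed logit margin the target's maximum attention mass collapses as the number of distractors grows, while a margin scaling like $\log N$ keeps it bounded away from zero.

```python
import math

def target_attention_bound(n_distractors: int, margin: float) -> float:
    # Upper bound from the score-dilution argument: with N distractors whose
    # logits sit at most `margin` below the target's, the softmax mass on the
    # target is at most 1 / (1 + N * exp(-margin)).
    return 1.0 / (1.0 + n_distractors * math.exp(-margin))

for n in (1_000, 10_000, 100_000):
    constant = target_attention_bound(n, margin=5.0)            # fixed margin: mass decays
    log_scaled = target_attention_bound(n, margin=math.log(n))  # margin ~ log N: mass stays ~0.5
    print(f"N={n:>6}: fixed-margin bound {constant:.4f}, log-N-margin bound {log_scaled:.4f}")
```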
The qTTT method addresses this by directly updating the query-projection matrices $W_Q$ via a next-token cross-entropy loss over the context,

$$\mathcal{L}(W_Q) = -\sum_{t=1}^{T-1} \log p_{\theta}\!\left(x_{t+1} \mid x_{\le t};\, W_Q\right),$$

with gradients taken only with respect to $W_Q$. Gradient descent on this loss increases the target-distractor logit margin toward the $\Omega(\log N)$ separation required by the lemma, thus provably restoring attention on the correct positions (Bansal et al., 15 Dec 2025).
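A minimal PyTorch/Transformers sketch of query-only test-time adaptation, under stated assumptions: the model exposes its query projections under parameter names containing "q_proj" (typical of Qwen/Llama-style checkpoints), plain SGD is used, and the loss is next-token cross-entropy over the raw context. The actual qTTT procedure of Bansal et al. may differ in optimizer, span selection, and hyperparameters.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def query_only_ttt(model_name: str, context: str, steps: int = 32, lr: float = 1e-4):
    """Adapt only the query-projection weights on a long context via
    next-token cross-entropy; every other parameter stays frozen."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.train()

    # Freeze everything except the query projections (name pattern is an
    # assumption about the checkpoint's module naming).
    query_params = []
    for name, param in model.named_parameters():
        param.requires_grad = "q_proj" in name
        if param.requires_grad:
            query_params.append(param)

    optimizer = torch.optim.SGD(query_params, lr=lr)
    input_ids = tokenizer(context, return_tensors="pt").input_ids

    for _ in range(steps):
        # Causal-LM loss: cross-entropy of next-token prediction over the context.
        loss = model(input_ids=input_ids, labels=input_ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    model.eval()
    return model
```

After adaptation, the model answers the query with the usual in-context forward pass; only the query projections differ from the pretrained checkpoint.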
5. Benchmark Comparison, Limitations, and Open Challenges
LongBench-v2 and ZeroScrolls complement each other in evaluation focus. LongBench-v2 emphasizes systematic, often synthetic, extraction and retrieval challenges over long input, with a context range extending to 50K tokens and beyond. ZeroScrolls uniquely targets zero-shot transfer to naturally-authored, multi-domain documents, introducing information-aggregation and ordering tasks not present in other benchmarks.
A summary of ZeroScrolls tasks and performance (normalized scores; Shaham et al., 2023):
| Model | QA/MC (QuALITY) | Summarization (GovReport) | Aggregation (SpaceDigest) | Avg Score |
|---|---|---|---|---|
| GPT-4 | 89.2 | 26.3 | 62.8 | 41.7 |
| Claude | 84.8 | 24.2 | 61.6 | 39.1 |
| Naive baseline | 26.6 | 22.6 | 45.0 | 19.6 |
Persistent open challenges highlighted by ZeroScrolls include:
- Aggregation (classification, counting, arithmetic over long inputs): most models fail to surpass naive baselines on these tasks.
- Zero-shot summarization: substantial performance gap with respect to fine-tuned state-of-the-art.
- Strict automatic metrics versus output formatting drift: n-gram metrics often penalize semantically valid but format-inconsistent outputs; e.g., GPT-4's low F1 on some tasks despite outputs judged superior in human evaluation.
- Robust semantic evaluation metrics and dynamic prompt engineering remain active research areas.
6. Significance and Research Directions
Recent advances, as evidenced by Bansal et al. (Bansal et al., 15 Dec 2025), reveal that limited test-time parameter adaptation is substantially more effective than expanded autoregressive output ("thinking tokens") under fixed computational budgets for long-context LLMs. The reported gains of +12.6 and +14.1 percentage points on LongBench-v2 and ZeroScrolls, respectively, represent the largest known improvements on these benchmarks without changing model architecture or pretraining data. Notably, these improvements are robust across model sizes, and ablations confirm that the primary effect is mitigation of score dilution, not rectification of positional encoding failures.
ZeroScrolls further establishes a high bar for zero-shot long-document reasoning, with tasks measuring not only classic QA and summarization but also aggregation, sorting, and multi-hop reasoning under conditions of minimal task-specific adaptation. This combination of benchmarks now constitutes the de facto standard for evaluating long-context LLMs and for guiding future architectural and inference-time innovations in the field.
7. Related Benchmarks and Distinctions
Comparison to other benchmarks such as SCROLLS, HELM, and BigBench highlights the unique contributions of LongBench-v2 and ZeroScrolls:
- SCROLLS allows fine-tuning per task; ZeroScrolls compels true zero-shot evaluation.
- HELM/BigBench cover diverse tasks but are skewed toward short-text inputs (on the order of 1,000 tokens), rendering them insufficient for long-range reasoning.
- LongBench-v2 supports long-context evaluation in both synthetic and natural domains and operates primarily in a supervised evaluation regime.
- ZeroScrolls remains the only benchmark enforcing private, leaderboard-based evaluation for zero-shot, naturally-authored, extra-long documents (Shaham et al., 2023).
Together, these benchmarks serve as key drivers of progress in long-context language modeling, evaluation methodology, and inference-time adaptation strategies.