LongBench-v2 & ZeroScrolls Benchmarks
- LongBench-v2 is a benchmark suite that systematically evaluates long-context reasoning across diverse domains such as code, dialogue, and structured data using accuracy, EM, and F1 metrics.
- ZeroScrolls is a zero-shot benchmark focusing on natural language understanding of extensive, naturally-authored documents through tasks like summarization, QA, and aggregation.
- Both benchmarks expose key challenges such as score dilution in static self-attention and serve as testbeds for inference-time strategies such as qTTT that enhance LLM inference over extended contexts.
LongBench-v2 and ZeroScrolls are two leading benchmarks designed to evaluate the capabilities of LLMs over extremely long textual contexts, typically spanning several thousand to tens of thousands of tokens. They address distinct methodological needs: LongBench-v2 primarily focuses on systematic evaluation of long-context reasoning across multiple domains and task formats, while ZeroScrolls emphasizes zero-shot natural-language understanding over naturally occurring, extensive documents and introduces unique aggregation and sorting tasks. Both benchmarks are central tools in the ongoing assessment and development of LLM architectures and inference-time strategies for handling long-range dependencies.
1. Benchmark Design and Coverage
LongBench-v2
LongBench-v2 comprises six task domains, each presenting either multiple-choice or extractive-answer queries over lengthy documents:
- Code Repositories: Tasks require localization of function call arguments or bug spans across multi-file project trees.
- Long Dialogue History: Models must answer questions whose evidence is embedded in chat transcripts exceeding 20K tokens.
- Long Structured Data: Queries target specific facts within tabular or JSON-like content spanning thousands of tokens.
- Long In-Context (Synthetic): "Needle-in-haystack" QA over incrementally growing synthetic text.
- Multi-Document QA: Synthesis of information from multiple documents, each typically 2–5K tokens.
- Single-Document QA: Questions reference targeted information within single 8–20K token documents.
Context lengths for individual tasks typically range from 5K to 50K tokens, with each domain designed to probe either fine-grained retrieval, multi-hop reasoning, or compositional understanding. Evaluation uses accuracy for multiple-choice/QA tasks; when both Exact Match (EM) and token-level F1 are reported for span questions, EM serves as the primary score. For a predicted answer span $p$ and reference span $r$:

$$\mathrm{EM}(p, r) = \mathbb{1}[\,p = r\,], \qquad \mathrm{F1}(p, r) = \frac{2\,P\,R}{P + R},$$

where $P$ and $R$ denote the token-level precision and recall of $p$ against $r$. Final scores are averaged across the official development sets (Bansal et al., 15 Dec 2025).
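To make the span metrics concrete, the following is a minimal sketch assuming SQuAD-style normalization (lowercasing plus whitespace tokenization); the exact normalization and aggregation in the official LongBench-v2 scripts may differ:

```python
from collections import Counter

def _normalize(text: str) -> list[str]:
    # Assumed normalization: lowercase and whitespace-tokenize.
    return text.lower().split()

def exact_match(prediction: str, reference: str) -> float:
    """EM(p, r) = 1 if the normalized spans are identical, else 0."""
    return float(_normalize(prediction) == _normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference span."""
    pred, ref = _normalize(prediction), _normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the cache size", "The cache size"))           # 1.0
print(round(token_f1("the cache size limit", "cache size"), 3))  # 0.667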
ZeroScrolls
ZeroScrolls extends the SCROLLS design, removing any requirement for supervised fine-tuning, and includes ten tasks (four summarization, four question answering, two aggregation/sorting), all in the zero-shot regime. Domains include:
- Summarization: U.S. Congressional reports (GovReport), TV scripts (SummScreenFD), meeting transcripts (QMSum), and query-focused story summarization (SQuALITY).
- Question Answering: Full research papers (Qasper), multi-hop Wikipedia QA (MuSiQue), narrative comprehension (NarrativeQA), and long-form multiple-choice (QuALITY).
- Aggregation and Sorting: Sentiment aggregation over 50 hotel reviews (SpaceDigest), and chapter ordering in narrative texts (BookSumSort).
Inputs typically range from 3K to 50K tokens per example. Each task uses a metric aligned with its output type: ROUGE geometric mean, F1, accuracy, exponential similarity (ES) for aggregation, and concordance index (Cidx) for sorting (Shaham et al., 2023).
2. Evaluation Protocols and Metrics
LongBench-v2 evaluates models independently per example, averaging scores over domains. For extractive QA, EM and F1 are both measured but EM is primary. For multiple-choice, accuracy is calculated as $\text{Accuracy} = \#\text{correct}/\#\text{questions}$.
ZeroScrolls enforces a canonical zero-shot format: each example comprises natural-language instructions, the long context, and an output to be generated without any in-context exemplars. The gold references for test and small validation splits are private, supporting a live leaderboard. Metrics for evaluation include:
- ROUGE geometric mean (over ROUGE-1, ROUGE-2, and ROUGE-L) for summarization,
- F1 (maximum over the multi-reference set) for extractive and classification QA,
- Exponential similarity for sentiment aggregation,
- Concordance index for ordering tasks (a minimal sketch of this metric appears after the list).
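As an illustration of the ordering metric, the sketch below computes a plain pairwise concordance index (fraction of chapter pairs whose predicted relative order agrees with the gold order). This is an assumed simplification; the official ZeroScrolls scoring script may handle missing or duplicated chapters differently.

```python
from itertools import combinations

def concordance_index(gold_order: list[str], predicted_order: list[str]) -> float:
    """Fraction of item pairs whose relative order matches the gold ordering."""
    gold_rank = {item: i for i, item in enumerate(gold_order)}
    pred_rank = {item: i for i, item in enumerate(predicted_order)}
    pairs = list(combinations(gold_order, 2))
    concordant = sum(
        (gold_rank[a] < gold_rank[b]) == (pred_rank[a] < pred_rank[b])
        for a, b in pairs
    )
    return concordant / len(pairs)

# One adjacent swap among four chapters leaves 5 of 6 pairs concordant.
gold = ["ch1", "ch2", "ch3", "ch4"]
pred = ["ch1", "ch3", "ch2", "ch4"]
print(round(concordance_index(gold, pred), 3))  # 0.833
```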
A key design distinction is that ZeroScrolls excludes any training set, enforcing rigorous zero-shot evaluation and compelling LLMs to operate via prompt-based transfer alone.
3. Empirical Performance and Model Comparison
Empirical results from "Let's (not) just put things in Context …" demonstrate acute score dilution in static self-attention as context grows, with baseline in-context accuracy on LongBench-v2 collapsing for longer input lengths (Bansal et al., 15 Dec 2025). Three inference-time strategies are compared for Qwen3-4B:
- In-Context Only: No additional inference computation beyond initial forward pass.
- Thinking (Chain-of-Thought): Generation of intermediate tokens for reasoning.
- Query-Only Test-Time Training (qTTT): 32 steps of adaptation (span size ) of query-projection matrices using cross-entropy loss over next-token prediction, leaving other parameters fixed.
Under compute-matched conditions, the following average scores are observed:
| Regime | LongBench-v2 (%) | ZeroScrolls (%) |
|---|---|---|
| In-Context | 27.0 | 18.4 |
| Thinking | 33.5 | 23.9 |
| qTTT | 39.6 | 32.5 |
qTTT delivers +12.6 and +14.1 percentage-point gains, respectively, relative to baseline in-context inference. On domains requiring pinpoint retrieval (Code Repos, Multi-Document QA, MuSiQue, QuALITY), qTTT's margin exceeds 20 pp under matched computational budgets.
ZeroScrolls scores from the original benchmark paper (Shaham et al., 2023) show GPT-4 (41.7) and Claude (39.1) leading among zero-shot LLMs, with naive baselines trailing far behind on aggregation and long-range ordering.
4. Theoretical Underpinnings and Score Dilution
Score dilution arises as an intrinsic limitation of static self-attention at large context lengths. Given $N$ distractor positions whose attention logits lie within a margin $\Delta$ of the target's logit, the softmax attention mass on the target satisfies

$$\alpha_{\text{target}} \;\le\; \frac{1}{1 + N\, e^{-\Delta}}.$$

Thus, for a constant gap $\Delta$ and a growing number of distractors, the attention on the correct token approaches zero as $N \to \infty$ ("Score Dilution Lemma"). To guarantee a fixed minimum mass, the logit margin must scale as $\Omega(\log N)$. "Thinking" tokens (intermediate autoregressive outputs) cannot overcome dilution because their own attention mass on the "needle" position becomes vanishingly small in long contexts.
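A small numerical illustration of this bound (a sketch based on the softmax inequality above, not code from the cited paper): with a fixed logit margin the target's maximum attention mass collapses as the number of distractors grows, while a margin scaling like $\log N$ keeps it bounded away from zero.

```python
import math

def target_attention_bound(n_distractors: int, margin: float) -> float:
    # Upper bound from the score-dilution argument: with N distractors whose
    # logits sit at most `margin` below the target's, the softmax mass on the
    # target is at most 1 / (1 + N * exp(-margin)).
    return 1.0 / (1.0 + n_distractors * math.exp(-margin))

for n in (1_000, 10_000, 100_000):
    constant = target_attention_bound(n, margin=5.0)            # fixed margin: mass decays
    log_scaled = target_attention_bound(n, margin=math.log(n))  # margin ~ log N: mass stays ~0.5
    print(f"N={n:>6}: fixed-margin bound {constant:.4f}, log-N-margin bound {log_scaled:.4f}")
```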
The qTTT method addresses this by directly updating the query-projection matrices $W_Q$ via a next-token cross-entropy loss over the context,

$$\mathcal{L}(W_Q) = -\sum_{t=1}^{T-1} \log p_{\theta}\!\left(x_{t+1} \mid x_{\le t};\, W_Q\right),$$

with gradients taken only with respect to $W_Q$. Gradient descent on this loss increases the target-distractor logit margin toward the $\Omega(\log N)$ separation required by the lemma, thus provably restoring attention on the correct positions (Bansal et al., 15 Dec 2025).
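A minimal PyTorch/Transformers sketch of query-only test-time adaptation, under stated assumptions: the model exposes its query projections under parameter names containing "q_proj" (typical of Qwen/Llama-style checkpoints), plain SGD is used, and the loss is next-token cross-entropy over the raw context. The actual qTTT procedure of Bansal et al. may differ in optimizer, span selection, and hyperparameters.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def query_only_ttt(model_name: str, context: str, steps: int = 32, lr: float = 1e-4):
    """Adapt only the query-projection weights on a long context via
    next-token cross-entropy; every other parameter stays frozen."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.train()

    # Freeze everything except the query projections (name pattern is an
    # assumption about the checkpoint's module naming).
    query_params = []
    for name, param in model.named_parameters():
        param.requires_grad = "q_proj" in name
        if param.requires_grad:
            query_params.append(param)

    optimizer = torch.optim.SGD(query_params, lr=lr)
    input_ids = tokenizer(context, return_tensors="pt").input_ids

    for _ in range(steps):
        # Causal-LM loss: cross-entropy of next-token prediction over the context.
        loss = model(input_ids=input_ids, labels=input_ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    model.eval()
    return model
```

After adaptation, the model answers the query with the usual in-context forward pass; only the query projections differ from the pretrained checkpoint.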
5. Benchmark Comparison, Limitations, and Open Challenges
LongBench-v2 and ZeroScrolls complement each other in evaluation focus. LongBench-v2 emphasizes systematic, often synthetic, extraction and retrieval challenges over long input, with a context range extending to 50K tokens and beyond. ZeroScrolls uniquely targets zero-shot transfer to naturally-authored, multi-domain documents, introducing information-aggregation and ordering tasks not present in other benchmarks.
A summary of ZeroScrolls tasks and performance (normalized scores; Shaham et al., 2023):
| Model | QA/MC (QuALITY) | Summarization (GovReport) | Aggregation (SpaceDigest) | Avg Score |
|---|---|---|---|---|
| GPT-4 | 89.2 | 26.3 | 62.8 | 41.7 |
| Claude | 84.8 | 24.2 | 61.6 | 39.1 |
| Naive baseline | 26.6 | 22.6 | 45.0 | 19.6 |
Persistent open challenges highlighted by ZeroScrolls include:
- Aggregation (classification, counting, arithmetic over long inputs): most models fail to surpass naive baselines on these tasks.
- Zero-shot summarization: substantial performance gap with respect to fine-tuned state-of-the-art.
- Strict automatic metrics versus output formatting drift: n-gram metrics often penalize semantically valid but format-inconsistent outputs; e.g., GPT-4's low F1 on some tasks despite outputs judged superior in human evaluation.
- Robust semantic evaluation metrics and dynamic prompt engineering remain active research areas.
6. Significance and Research Directions
Recent advances, as evidenced by Bansal et al. (Bansal et al., 15 Dec 2025), reveal that limited test-time parameter adaptation is substantially more effective than expanded autoregressive output ("thinking tokens") under fixed computational budgets for long-context LLMs. The reported gains of +12.6 and +14.1 percentage points on LongBench-v2 and ZeroScrolls, respectively, represent the largest known improvements on these benchmarks without changing model architecture or pretraining data. Notably, these improvements are robust across model sizes, and ablations confirm that the primary effect is mitigation of score dilution, not rectification of positional encoding failures.
ZeroScrolls further establishes a high bar for zero-shot long-document reasoning, with tasks measuring not only classic QA and summarization but also aggregation, sorting, and multi-hop reasoning under conditions of minimal task-specific adaptation. This combination of benchmarks now constitutes the de facto standard for evaluating long-context LLMs and for guiding future architectural and inference-time innovations in the field.
7. Related Benchmarks and Distinctions
Comparison to other benchmarks such as SCROLLS, HELM, and BigBench highlights the unique contributions of LongBench-v2 and ZeroScrolls:
- SCROLLS allows fine-tuning per task; ZeroScrolls compels true zero-shot evaluation.
- HELM/BigBench cover diverse tasks but are skewed toward short-text inputs (on the order of 1,000 tokens), rendering them insufficient for long-range reasoning.
- LongBench-v2 supports long-context evaluation in both synthetic and natural domains and operates primarily in a supervised evaluation regime.
- ZeroScrolls remains the only benchmark enforcing private, leaderboard-based evaluation for zero-shot, naturally-authored, extra-long documents (Shaham et al., 2023).
Together, these benchmarks serve as key drivers of progress in long-context language modeling, evaluation methodology, and inference-time adaptation strategies.