
LongBench-v2 Benchmark Suite

Updated 27 February 2026
  • LongBench-v2 is a benchmark suite that rigorously tests LLMs on long-context tasks and multi-step reasoning using diverse real-world data.
  • It employs challenging four-choice MCQs from varied domains, ensuring deep comprehension beyond mere extraction.
  • The evaluation protocol benchmarks both human and LLM performance, advancing insights in model scaling and reasoning strategies.

LongBench-v2 is a benchmark suite designed to rigorously evaluate the deep understanding and reasoning abilities of LLMs over realistic, long-context sequences and multitask settings. It introduces a paradigm shift in long-context benchmarking by emphasizing questions that require multi-step, symbolic, and analytic reasoning across heterogeneous data at real-world scale. The benchmark spans six primary task families, incorporates highly challenging human-generated multiple-choice questions (MCQs) with contexts of up to 2 million words, and establishes both expert human and LLM performance baselines. LongBench-v2 uniquely tests the limits of long-context attention, inference-time compute scaling, and model generalization to tasks reflective of complex professional and academic workflows (Bai et al., 2024).

1. Motivation, Design Principles, and Scope

LongBench-v2 addresses critical deficiencies in prior long-context LLM benchmarks, which primarily probe extractive or synthetic tasks, utilize unreliable overlap metrics (e.g., ROUGE, F₁), and cannot distinguish deep reasoning from surface-level retrieval. The benchmark specifically aims to determine whether modern LLMs with extended context windows (8k to over 1M tokens) genuinely exercise deep comprehension and reasoning, rather than simply exploiting memorization or context retrieval.

Key design principles include:

  • Stressing real-world complexity by constructing tasks on actual academic, literary, legal, financial, and technical documents.
  • Structuring all evaluation via high-difficulty, four-choice MCQs that are non-trivial for both humans and leading LLMs, eliminating shortcut patterns and shallow pattern-matching.
  • Focusing on six heterogeneous, high-level task families: single-document QA, multi-document QA, long in-context learning, long-dialogue history, code repository understanding, and structured data reasoning.
  • Ensuring contexts span the full range of "long-context," from 8,000 to 2,000,000 words (median ≈ 54k; mean ≈ 104k).
  • Establishing both adversarial and calibrated difficulty through a dual-stage process involving LLM and human expert curation.

2. Benchmark Structure and Task Distribution

LongBench-v2 comprises 503 MCQs with substantial context diversity and length, distributed as follows:

| Task Family | # Items | Median Context Size | Subdomains/Modalities |
|---|---|---|---|
| Single-Document QA | 175 | 51k words | Academic, literary, legal, financial, governmental, detective, plot ordering |
| Multi-Document QA | 125 | 34k words | Academic, legal, financial, governmental, multi-source news |
| Long In-Context Learning | 81 | 71k words | Device/software manuals, rare-language vocabularies, many-shot classification |
| Long-Dialogue History Understanding | 39 | 25k words | Multi-agent game transcripts, user–assistant chat logs |
| Code Repository Understanding | 50 | 167k words | Multi-file codebase QA |
| Long Structured-Data Understanding | 33 | 49k words | Large tables, knowledge graphs |

Each question is authored with the constraint that answering cannot rely on shallow retrieval or trivial information extraction; instead, reasoning across large evidence bases, synthesizing information, and resolving ambiguities are typically required.
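Assuming the suite is distributed as JSON records in the style typical of such benchmarks (the field names below are illustrative and may not match the official release), per-family statistics like those in the table above can be recomputed with a short script:

```python
from collections import defaultdict
from statistics import median

# Illustrative record layout; field names in the official
# LongBench-v2 release may differ.
items = [
    {"domain": "Single-Document QA", "context": "w " * 51_000,
     "question": "...?", "choices": ["A", "B", "C", "D"], "answer": "A"},
    {"domain": "Code Repository Understanding", "context": "w " * 167_000,
     "question": "...?", "choices": ["A", "B", "C", "D"], "answer": "C"},
]

def family_stats(items):
    """Group items by task family; report item count and median context
    length in words for each family."""
    lengths = defaultdict(list)
    for it in items:
        lengths[it["domain"]].append(len(it["context"].split()))
    return {fam: (len(ls), median(ls)) for fam, ls in lengths.items()}

stats = family_stats(items)
print(stats["Single-Document QA"])  # (1, 51000)
```

With the full 503-item release, the same grouping would reproduce the counts and median lengths in the table.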

3. Data Collection, Annotation, and Quality Control

Ninety-seven annotators, each with advanced academic or professional backgrounds, contributed original source documents spanning disciplines. The data pipeline enforces stringent quality:

  1. Document submission, restricted to texts exceeding 8k words and with low redundancy against the existing corpus.
  2. MCQ authoring, with annotators providing not only the question and answer options but also supporting evidence.
  3. Automated triage, in which three advanced LLMs (GPT-4o-mini, GLM-4-Air, and GLM-4-Flash, each with ≥128k-token windows) answer each question; items that the models consistently answer correctly are sent back for revision to avoid triviality.
  4. Manual curation by 24 human experts, each given a 15-minute time budget per question and access to search tools; items answered incorrectly or solved too quickly are again revised.
  5. Up to five annotation/revision cycles until items achieve both human and model-adversarial hardness.

The compensation scheme rewards not only pass rate and context length but also substantive question difficulty, with bonuses when ≥2 of the 3 triage models fail and expert review exceeds 10 minutes.
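The triage-and-bonus logic of this pipeline can be sketched as follows; the functions and the quick-solve threshold are illustrative, not the authors' code:

```python
TRIAGE_MODELS = ("GPT-4o-mini", "GLM-4-Air", "GLM-4-Flash")

def needs_revision(correct, triage_answers, expert_correct, expert_minutes,
                   time_budget=15):
    """An item goes back for revision if it is trivial for the automated
    triage (all models answer correctly), or if the expert check fails
    (a wrong expert answer signals ambiguity) or finishes suspiciously
    fast.  The quick-solve threshold here is an illustrative choice."""
    all_models_correct = all(a == correct for a in triage_answers)
    too_quick = expert_minutes < time_budget / 3
    return all_models_correct or (not expert_correct) or too_quick

def difficulty_bonus(correct, triage_answers, review_minutes):
    """Bonus when at least 2 of the 3 triage models fail and expert
    review exceeds 10 minutes, per the compensation scheme."""
    failures = sum(a != correct for a in triage_answers)
    return failures >= 2 and review_minutes > 10

# A hard item: two triage models fail, the expert needed 12 minutes
# and answered correctly, so it is kept and earns a bonus.
print(needs_revision("B", ["A", "C", "B"], True, 12))   # False
print(difficulty_bonus("B", ["A", "C", "B"], 12))       # True
```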

4. Evaluation Protocol and Performance Metrics

Evaluation is performed under zero-shot prompting and zero-shot prompting with chain-of-thought (CoT) reasoning; overlength documents are truncated to fit each model's context window. All results use the following:

  • Accuracy: Accuracy = (# correctly answered questions / # total questions) × 100%.
  • Compensated accuracy: For invalid/no-answer outputs, credit is capped at the random baseline (25% accuracy).
  • Human baseline: Established by 24 professional reviewers (max 15 min/question), who achieve 53.7% accuracy overall.
  • Model baseline: Includes 10 open-source LLMs and 6 proprietary LLMs (all with ≥128k context windows).

This protocol ensures statistical robustness, with ±4.4% confidence intervals at the 95% level for a 50% score.
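Under one plausible reading of the protocol (invalid or missing outputs receive exactly the 25% random-baseline credit), the compensated-accuracy metric and the quoted confidence interval can be computed directly:

```python
import math

RANDOM_BASELINE = 0.25  # four-choice MCQ

def compensated_accuracy(predictions, answers):
    """Accuracy (%) where invalid/no-answer outputs receive the
    random-baseline credit of 0.25 instead of zero."""
    score = 0.0
    for pred, gold in zip(predictions, answers):
        if pred not in {"A", "B", "C", "D"}:   # invalid or missing output
            score += RANDOM_BASELINE
        elif pred == gold:
            score += 1.0
    return 100.0 * score / len(answers)

def ci95_halfwidth(p, n):
    """95% normal-approximation half-width (in percentage points)
    for an accuracy p measured on n items."""
    return 100.0 * 1.96 * math.sqrt(p * (1 - p) / n)

# One correct answer and one invalid output over two items.
print(compensated_accuracy(["A", "invalid"], ["A", "B"]))  # 62.5
# Half-width at a 50% score on the 503-item benchmark.
print(round(ci95_halfwidth(0.5, 503), 1))                  # 4.4
```

The second call reproduces the ±4.4% interval quoted above for a 50% score on 503 questions.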

5. Experimental Results and Comparative Analysis

Performance on LongBench-v2 reveals a significant gap between extractive and reasoning-focused long-context evaluation:

  • Best direct-answering system (GPT-4o, zero-shot+CoT): 51.2% accuracy.
  • Best overall, with extensive inference-time reasoning and longer CoT rollouts (o1-preview): 57.7% accuracy.
  • Relative gains: o1-preview surpasses the human expert baseline by 4.0 percentage points and improves on random guessing (25%) by 32.7 percentage points.

Prior long-context QA datasets (e.g., Needle-in-a-Haystack, LongBench v1) principally test extraction, on which top models already achieve near-ceiling performance (≈100% recall). On LongBench-v2, deep-reasoning requirements lower peak model accuracy to ~58%, well below the "saturation" regime.

Experiments on retrieval-augmented generation (RAG) show that merely feeding more context without structured reasoning yields diminishing returns beyond ~32k tokens, underscoring the necessity of deliberate multi-step analytic reasoning.
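When a document exceeds a model's window, a common long-context evaluation convention (assumed here, not confirmed for every model tested) is middle truncation, which preserves the document's opening and its question-adjacent tail:

```python
def truncate_middle(tokens, max_len):
    """Keep the first and last halves of the token budget, dropping the
    middle, so both the document opening and the tail survive."""
    if len(tokens) <= max_len:
        return tokens
    half = max_len // 2
    return tokens[:half] + tokens[len(tokens) - (max_len - half):]

doc = list(range(100))          # stand-in for a tokenized document
print(truncate_middle(doc, 10)) # [0, 1, 2, 3, 4, 95, 96, 97, 98, 99]
```

Middle truncation is a heuristic: it works when key evidence clusters near the ends, but it is exactly the kind of unstructured context pruning that the RAG results above show to be insufficient on its own.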

6. Key Insights, Challenges, and Methodological Implications

LongBench-v2 establishes that inference-time compute scaling (via CoT and iterative reasoning) yields substantial accuracy gains of 7–8 percentage points, more so than nominal increases in model size or context window alone. The top-performing systems benefit from explicit orchestration of multi-step symbolic reasoning, reflecting a departure from black-box end-to-end inference.

Challenges identified by the creators include:

  • Sustaining coherent reasoning chains over input contexts >100k tokens.
  • Handling information compression/memory trade-offs, i.e., retaining critical evidence while pruning irrelevant bulk data at scale.
  • Integrating heterogeneous data types (text, source code, tables, knowledge graphs) within a unified reasoning pipeline.
  • Developing robust training curricula and objectives to foster actual deep reasoning, instead of superficial pattern-matching.

7. Impact, Applications, and Future Directions

LongBench-v2 functions not simply as a leaderboard tool but as a diagnostic benchmark pointing towards future LLM model, system, and training innovations:

  • Emphasizes the importance of scaling inference-time reasoning dimensions, not just parameter count or context width.
  • Encourages exploration of hybrid approaches, such as RAG plus targeted CoT, wherein salient context chunks are reliably retrieved and reasoned over at depth.
  • Suggests that architecture-level advances in attention mechanisms, hierarchical memory, and dynamic context selection are necessary for continued progress.
  • Highlights the gap between state-of-the-art LLMs and true generalist reasoning capacity, with current models only marginally outperforming experts on these highly complex multitasks.
  • Serves as a paradigm for constructing future benchmarks that mirror the breadth and intricacy of complex professional tasks.

External work leveraging LongBench-v2, such as unsupervised document reconstruction for reinforcement learning with verifiable rewards (RLVR), demonstrates that pretraining on document coherence tasks can yield up to +3% absolute accuracy gains across 128k-token windows on LongBench-v2, even in the absence of curated QA supervision (Xiao et al., 9 Feb 2026). This suggests that general-purpose long-context representation and reasoning capacity can be enhanced via auxiliary objectives specifically targeting global document structure.
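As an illustration only, and not the actual objective of Xiao et al., a verifiable document-coherence reward for RLVR might score how well a model restores the original order of shuffled paragraphs:

```python
def coherence_reward(predicted_order, n_paragraphs):
    """Verifiable reward for a paragraph-reordering task: the fraction of
    paragraphs placed back in their original position (1.0 when the model
    fully restores the shuffled document).  Illustrative sketch only."""
    correct = sum(i == p for i, p in enumerate(predicted_order))
    return correct / n_paragraphs

# A model restores positions 0 and 3 but swaps positions 1 and 2.
print(coherence_reward([0, 2, 1, 3], 4))  # 0.5
print(coherence_reward([0, 1, 2, 3], 4))  # 1.0
```

Because the target order comes from the document itself, such a reward requires no curated QA supervision, matching the unsupervised framing described above.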

LongBench-v2 sets a new standard for evaluating and developing LLMs equipped for real-world, high-stakes tasks that demand deep, scalable, and contextually disciplined reasoning (Bai et al., 2024).
