LongBench: Long-Context LLM Benchmark Suite

Updated 4 June 2026

LongBench is a bilingual, multitask benchmark suite that evaluates LLMs’ ability to process and reason over extended texts in real-world scenarios.
It employs a unified JSON schema and diverse metrics for tasks such as single- and multi-document QA, summarization, few-shot learning, synthetic reasoning, and code completion.
The benchmark offers actionable insights on optimizing long-context performance, highlighting challenges like window scaling and memory-efficient inference techniques.

LongBench is a bilingual, multitask benchmark suite designed to evaluate long-context understanding in LLMs, with a focus on real-world scenarios that require reasoning across extended input sequences. It was created to address gaps in the coverage and granularity of prior benchmarks, which typically emphasized short contexts, lacked complex reasoning requirements, or failed to separate superficial from deep comprehension. It has since become a central reference point for both the evaluation and the development of long-context LLMs (Bai et al., 2023).

1. Motivation and Design Principles

LongBench was introduced to assess LLMs’ ability to process and reason over texts orders of magnitude longer than what prior task suites (e.g., GLUE, MMLU, SuperGLUE) demanded. It responds to two critical observations:

Real-world tasks (e.g., book-scale comprehension, cross-document QA, repository-level code analysis) require efficient and robust handling of contexts well beyond several thousand tokens.
Existing benchmarks were either synthetic and lacked realism or evaluated primarily extractive or lookup-based tasks, offering little insight into reasoning depth or context integration.

The benchmark defines explicit requirements:

Long, naturalistic source texts and documents (average English context: 6,711 words; Chinese: 13,386 characters).
Bilingual coverage (English and Chinese) for each task.
Multitask scope: single- and multi-document QA, summarization, few-shot learning, synthetic reasoning challenges, and code completion.
Unified input-output schemas that allow systematic, automated evaluation across heterogeneous tasks.

2. Benchmark Composition and Data Schema

LongBench encompasses 21 datasets across six categories (see Table 1):

Category	Example Datasets	Language(s)	Metric	Avg. Input Length
Single-Doc QA	NarrativeQA, Qasper	EN, ZH	F1	up to 18,409 words
Multi-Doc QA	HotpotQA, MuSiQue	EN, ZH	F1, ROUGE-L	up to 15,768 chars
Summarization	GovReport, QMSum	EN, ZH	ROUGE-L	up to 15,380 chars
Few-Shot Learning	TREC, TriviaQA	EN, ZH	Accuracy, F1	up to 91,410 tokens
Synthetic	PassageCount, PassageRetrieval	EN, ZH	EM	up to 11,141 words
Code Completion	LCC, RepoBench-P	EN	Edit-Similarity	up to 4,206 tokens

All datasets are recast into a standardized JSON structure:

{
  "id": "<unique sample ID>",
  "instruction": "<task-specific instruction template>",
  "input": "<I>",
  "context": "<C>",
  "output": "<reference answer A>"
}

Each prompt concatenates the instruction, input, and context fields; the model's output is then compared directly with the gold answer by automatic scripts (Bai et al., 2023).
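As an illustration of how a record in this schema might be turned into a prompt, the following minimal sketch joins the three text fields in the order given above. The function name `build_prompt`, the blank-line separator, and the sample record are assumptions made for this example; actual prompt templates vary by dataset.

```python
import json

def build_prompt(sample: dict) -> str:
    """Join instruction, input, and context into a single prompt string.

    The field order follows the description in the text; the blank-line
    separator is an assumption of this sketch, not a documented format.
    """
    parts = [
        sample.get("instruction", ""),
        sample.get("input", ""),
        sample.get("context", ""),
    ]
    # Skip empty fields so that no stray separators appear in the prompt.
    return "\n\n".join(part for part in parts if part)

# A made-up record that follows the standardized JSON structure.
record = json.loads("""
{
  "id": "demo-1",
  "instruction": "Answer the question using the passage.",
  "input": "Who wrote the report?",
  "context": "The report was written by the audit office.",
  "output": "the audit office"
}
""")

prompt = build_prompt(record)
reference = record["output"]  # gold answer used later for scoring
```

Keeping the reference answer outside the prompt, as here, lets the same record drive both generation and automatic scoring.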

3. Evaluation Metrics and Protocols

Each task uses domain-appropriate, objective metrics, always computed automatically:

QA: F1 (token overlap), EM (Exact Match) for extractive spans.
Summarization: ROUGE-L (longest common subsequence F1).
Classification: Accuracy (first output line matches reference).
Code: Edit-Similarity (normalized Levenshtein).
Synthetic: Task-specific EM or count accuracy.
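The metrics listed above can be sketched in a few lines of plain Python. The version below is a simplified illustration, not the benchmark's official scoring code: it assumes lowercase whitespace tokenization for F1, a case-insensitive string comparison for EM, and edit similarity defined as one minus the Levenshtein distance divided by the longer string's length. All function names are hypothetical.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 using lowercase whitespace tokens (an assumption)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def edit_similarity(prediction: str, reference: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]; 1.0 means identical."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / longest
```

For example, scoring the prediction "the audit office" against the reference "audit office" gives a precision of 2/3 and a recall of 1, so the token F1 is 0.8 while EM is 0.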

Evaluation proceeds under zero-shot or few-shot regimes (the latter embedding in-context exemplars in the “input” slot). For contexts exceeding the model’s maximum context length of M tokens, inputs are truncated from the middle: the first ⌊M/2⌋ and last ⌊M/2⌋ tokens are kept and the tokens in between are dropped. All evaluation is fully automated, with no human intervention needed per run.
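The truncation rule can be sketched as follows. The function operates on an already tokenized sequence, and the name `truncate_middle` and the parameter `max_len` (standing in for M) are choices made for this example; which tokenizer produces the tokens depends on the model under evaluation.

```python
def truncate_middle(tokens: list, max_len: int) -> list:
    """Keep the first and last floor(max_len / 2) tokens of an over-long input.

    Inputs that already fit within max_len are returned unchanged.
    """
    if len(tokens) <= max_len:
        return list(tokens)
    half = max_len // 2
    if half == 0:
        # Guard: tokens[-0:] would return the whole sequence.
        return []
    return list(tokens[:half]) + list(tokens[-half:])

# Illustrative call on integer stand-ins for token IDs.
kept = truncate_middle(list(range(10)), 4)
```

Dropping the middle rather than the tail preserves both the start of the prompt and its end, which is typically where the instruction and the question sit.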

4. Empirical Findings and Impact on Model Development

Comprehensive results across eight LLM baselines reveal:

Superiority of long-context optimization: Models with explicit RoPE scaling or long-sequence fine-tuning substantially outperform vanilla parameter scaling (e.g., ChatGLM2-6B-32k achieves +62% relative gains over the base model at the same parameter count).
Retrieval and summarization-based context compression: Token chunking with semantic retrieval aids undertrained models but never matches true long-context optimization.
Context window scaling limitations: Even strong commercial models (e.g., GPT-3.5-Turbo-16k) exhibit sharp performance drops as context length exceeds 8k tokens (e.g., –17 points from 0–4k to >8k input lengths on LongBench-E) (Bai et al., 2023).

Key technical insight: length-dependent robustness cannot be inferred from short-input scores; window scaling must be benchmarked on splits whose input lengths are evenly distributed (LongBench-E). The unified benchmark design exposes degradation patterns that single-length averages obscure.
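A per-length breakdown of this kind can be computed with a small grouping routine. In the sketch below, the 0–4k and 8k+ buckets follow the ranges quoted above, while the intermediate 4–8k bucket, the function names, and the sample scores are assumptions made for illustration only.

```python
from collections import defaultdict

def length_bucket(num_tokens: int) -> str:
    """Assign an input length to one of three buckets (edges assumed)."""
    if num_tokens < 4000:
        return "0-4k"
    if num_tokens < 8000:
        return "4-8k"
    return "8k+"

def scores_by_bucket(results):
    """Average scores per length bucket.

    `results` is an iterable of (input_length_in_tokens, score) pairs.
    """
    grouped = defaultdict(list)
    for length, score in results:
        grouped[length_bucket(length)].append(score)
    return {bucket: sum(vals) / len(vals) for bucket, vals in grouped.items()}

# Made-up scores, used only to show the shape of the output.
demo_results = [(1200, 0.6), (3500, 0.5), (6000, 0.4), (12000, 0.2)]
per_bucket = scores_by_bucket(demo_results)
overall = sum(score for _, score in demo_results) / len(demo_results)
```

Reporting `per_bucket` alongside `overall` makes a drop at longer inputs visible even when the global average looks acceptable.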

5. Extensions and Critiques

Length-Controllable and Baseline-Adjusted Benchmarks:

Subsequent research identified two major limitations in LongBench’s original structure:

Short/long-context conflation: Raw scores aggregate base-language ability and true long-context reasoning, occluding the differential impact of extended context (Yang et al., 25 May 2025).
Fixed input lengths: Static corpus lengths risk obsolescence as usable context windows grow.

To address these limitations, 100-LongBench and later benchmarks advocate reporting both a “base short-context” score and a normalized long-context “LongScore”: $\text{LongScore}_\ell = \frac{S_\ell - \text{BaseAbility}}{\text{BaseAbility}}$, where $S_\ell$ is the average task score at length $\ell$ and BaseAbility is the mean of the scores at the 2k, 4k, and 6k lengths. Empirically, this reveals collapse points (commonly at 64k–128k) and enables robust model ranking at length extremes (Yang et al., 25 May 2025).
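The LongScore formula translates directly into code. The sketch below assumes scores are stored in a dictionary keyed by context length in tokens, with 2k, 4k, and 6k represented as 2000, 4000, and 6000; the function names and the sample scores are hypothetical.

```python
def base_ability(scores_by_length: dict, base_lengths=(2000, 4000, 6000)) -> float:
    """Mean score over the short reference lengths (2k, 4k, 6k)."""
    return sum(scores_by_length[length] for length in base_lengths) / len(base_lengths)

def long_score(scores_by_length: dict, length: int,
               base_lengths=(2000, 4000, 6000)) -> float:
    """Relative change of the score at `length` with respect to BaseAbility."""
    base = base_ability(scores_by_length, base_lengths)
    if base == 0:
        raise ValueError("BaseAbility is zero; LongScore is undefined.")
    return (scores_by_length[length] - base) / base

# Made-up scores for one model, keyed by context length in tokens.
demo_scores = {2000: 0.5, 4000: 0.5, 6000: 0.5, 64000: 0.25}
score_64k = long_score(demo_scores, 64000)
```

With these illustrative numbers BaseAbility is 0.5, so the LongScore at 64k is (0.25 − 0.5) / 0.5 = −0.5. A negative value indicates degradation relative to the model's own short-context performance, which is what allows models with different base abilities to be compared.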

Bilingual and Multi-Difficulty Scaling:

LongBench Pro (Chen et al., 6 Jan 2026) extends scope to 1,500 in-the-wild samples (750 EN, 750 ZH, 11+ domain tasks) with rigorous human-model collaborative construction, six length bins (8k–256k), and four explicit difficulty levels. Metrics include task-specific F1, ROUGE-L, NDCG@k, semantic similarity, and pairwise accuracy, with cross-lingual gaps ($\Delta_{\mathrm{CL}}$) surfaced and effective context length directly measured. Key finding: long-context method optimization generally outweighs raw parameter scaling in determining robust performance at ultra-long sequences.

6. Adaptations for Specialized Contexts

Caching, Memory, and Chunking:

LongBench underpins recent advances in context caching and memory-efficient inference. Techniques such as Lookahead Q-Cache and NestedKV use LongBench as a stress test for cache eviction and compression under tight memory budgets (Wang et al., 24 May 2025, Chen et al., 26 May 2026). Results report +1–4 point gains from lookahead query-based eviction (Qwen2.5-7B, Llama3.1-8B; budgets $B$ from 128 to 512 tokens), and up to 19.3 points improvement over classic KeyDiff at aggressive cache retention rates ($r = 0.75, 0.95$) (Chen et al., 26 May 2026).

Chunk-Based and Recap Learning:

Chunk-based inference (OPRM) for recurrent LLMs shows 14–51% relative improvement on overall LongBench scores by focusing computation on the single most relevant context chunk and avoiding recurrent memory overflow (Ben-Kish et al., 12 May 2025). Active recap learning (ARL) achieves 9.44% relative improvement by distilling critical context into recursively generated summaries, supporting scalable memory without changes to the model architecture (Hui, 20 Jan 2026).

Robotics and Multimodal Extensions:

LongBench variants have been adapted for real-world robotic evaluation (over 1,000 long-horizon manipulation episodes; dual regime: context-independent vs. context-dependent) (Chen et al., 18 Apr 2026), and for text-to-image evaluation (LongBench-T2I), which introduces compositional visual benchmarks with multidimensional scoring across object, layout, text, and special-effects fidelity (Zhou et al., 30 May 2025, Lin et al., 17 Aug 2025).

7. Recommendations for Future Long-Context Evaluation

Based on both the strengths and critiques of LongBench and its extensions:

Disentangle short-context base ability from long-context extension using metrics like LongScore.
Use length-controllable task instances to expose per-model breakdown thresholds.
Incorporate bilingual, multi-task, and multi-difficulty scaling, and include domain-specific metrics aligned with application needs.
Adopt human-model collaborative pipelines for scalable, high-quality sample construction.
Track effective context length (not only nominal token window), and analyze performance degradation across explicit length buckets.
When reporting performance, detail both global average and per-length, per-task curve statistics to ensure robust, actionable comparisons (Yang et al., 25 May 2025, Chen et al., 6 Jan 2026).

The open-source nature, methodological rigor, and extensibility of LongBench have made it a foundation for advances in long-context LLM training, evaluation, and deployment. Emerging lines of research continue to build on and refine LongBench variants for improved coverage, diagnosis, and benchmarking in this rapidly evolving domain (Bai et al., 2023, Yang et al., 25 May 2025, Chen et al., 6 Jan 2026).