LongBench: Bilingual Long-Context Benchmark

Updated 2 August 2025
  • LongBench is a bilingual, multitask benchmark designed to evaluate long-context understanding in large language models using 21 diverse datasets.
  • It standardizes evaluation across six categories (single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion), with data in both English and Chinese.
  • Its unified evaluation protocol and context truncation methods reveal significant performance drops in LLMs as input lengths increase, guiding strategies for improved long-sequence processing.

LongBench is a bilingual, multitask benchmark specifically constructed for the rigorous evaluation of long context understanding in LLMs. Designed to address the limitations of prior benchmarks—in particular, their inability to test model performance on extended sequences at realistic context lengths—LongBench standardizes 21 datasets into a unified evaluation framework, covering six principal categories and spanning both English and Chinese. This benchmark provides comprehensive coverage across domain, language, and task type, enabling nuanced assessment of LLM capabilities and limitations as context lengths increase (Bai et al., 2023).

1. Dataset Composition and Organization

LongBench comprises 21 datasets, partitioned across six main task categories: single-document question answering (QA), multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion. Each dataset contributes to the evaluation of distinct long-context skills, with average example lengths of 6,711 words (English) and 13,386 characters (Chinese), thereby deliberately challenging models well beyond traditional short-context settings.

| Category | Example Datasets | Language(s) |
| --- | --- | --- |
| Single-Document QA | NarrativeQA, Qasper, MultiFieldQA | EN, ZH |
| Multi-Document QA | HotpotQA, 2WikiMultihopQA, DuReader | EN, ZH |
| Summarization | GovReport, QMSum, MultiNews, VCSUM | EN, ZH |
| Few-Shot Learning | TREC, TriviaQA, SAMSum, LSHT | EN, ZH |
| Synthetic Tasks | PassageCount, PassageRetrieval | EN, ZH |
| Code Completion | LCC, RepoBench-P | EN |

Key differentiators include:

  • Bilingual coverage: Each functional category contains representative tasks in both English and Chinese.
  • Domain diversity: Sources span narrative fiction, scientific articles, legal and governmental texts, meeting transcripts, open-domain dialogues, and software repositories.
  • Synthetic design: Synthetic and code completion tasks are explicitly crafted to probe model handling of long-range dependencies and extended codebases.

2. Task Categories and Problem Structures

Each of the six categories addresses unique challenges linked to long input sequences:

  • Single-Document QA: Models must locate and extract concise answers from long, information-dense narratives or articles. For example, NarrativeQA requires processing entire novels or film scripts against complex queries.
  • Multi-Document QA: These tasks (like HotpotQA or 2WikiMultihopQA) necessitate multi-hop reasoning across concatenated passages or distractors, with correct answers only derivable via global context integration.
  • Summarization: Tasks draw on meeting transcripts and governmental reports (e.g., QMSum, GovReport), demanding outputs that synthesize major themes under strong length and semantic constraints.
  • Few-Shot Learning (In-Context Learning): The context comprises multiple short exemplars concatenated before the evaluation query, testing in-context learning over nontrivial context sizes.
  • Synthetic Tasks: PassageCount evaluates exact sequence traversal and counting in the presence of duplicates and shuffling; PassageRetrieval requires pinpointing the correct segment from a large set.
  • Code Completion: LCC (file-level) and RepoBench-P (repository-level) tasks measure the model’s ability to predict code lines given thousands of tokens of code, including integration across files.

All datasets are converted to a uniform triple (I, C, A): input (e.g., question or prompt), context (extended text), and answer (target span or text).
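
For concreteness, a minimal Python sketch of this unified structure is given below; the field names and the example record are illustrative rather than the exact released schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LongBenchExample:
    input: str           # I: the question or task prompt
    context: str         # C: the long document(s), often thousands of words
    answers: List[str]   # A: one or more reference answers / target texts
    dataset: str         # e.g. "narrativeqa", "hotpotqa", "lcc"
    language: str        # "en" or "zh"

# Hypothetical record, for illustration only.
example = LongBenchExample(
    input="Who raised the protagonist after her parents died?",
    context="<full novel text, tens of thousands of words>",
    answers=["Her aunt"],
    dataset="narrativeqa",
    language="en",
)
```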

3. Evaluation Protocols and Metrics

LongBench adopts a unified evaluation methodology tailored by task type:

  • QA Tasks: F1 score, reflecting partial or overlapping answer spans (a token-level sketch follows this list).
  • Summarization: ROUGE-L, measuring longest-common-subsequence overlap between generated and reference summaries.
  • Classification and Synthetic Tasks: accuracy or exact match for discrete or count-based outputs.
  • Code Completion: edit similarity (Levenshtein-based), measuring line-level correctness.
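
The QA metric is the familiar token-level F1; a minimal sketch is shown below. It assumes simple lowercasing and whitespace tokenization, whereas the released scorer applies additional text normalization and language-specific tokenization, so treat this as illustrative rather than the exact scoring code.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial overlap still earns credit.
print(token_f1("the Eiffel Tower in Paris", "Eiffel Tower"))  # ~0.57
```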

Inputs exceeding a model’s maximum context window are truncated via:

S_{1:L} \;\rightarrow\; \left[\, S_{1:\lfloor M/2 \rfloor}\; ;\; S_{L-\lfloor M/2 \rfloor + 1\,:\,L} \,\right]

where M is the model's context limit, ensuring that both the leading and trailing segments of the input are retained.
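
A minimal sketch of this head-and-tail truncation, assuming the input has already been tokenized (the released evaluation code applies the same idea at the token level with each model's own tokenizer):

```python
def truncate_middle(tokens, max_len):
    """Keep the first and last max_len // 2 tokens and drop the middle,
    so that both the head and the tail of a long input survive."""
    if len(tokens) <= max_len:
        return tokens
    half = max_len // 2
    return tokens[:half] + tokens[-half:]

# Stand-in for a tokenized 10k-token context with a 4k-token model limit.
doc = list(range(10_000))
window = truncate_middle(doc, 4_096)
assert len(window) == 4_096 and window[0] == 0 and window[-1] == 9_999
```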

Further, LongBench-E, a companion test set with a more uniform length distribution, enables analysis of performance as a function of context length (0–4k, 4–8k, 8k+ tokens), facilitating explicit ablation studies on context sensitivity.
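
A sketch of this length-stratified reporting; the bucket boundaries follow the ranges above, while the exact boundary handling in the released evaluation scripts may differ.

```python
from collections import defaultdict

def bucket(num_tokens: int) -> str:
    """Assign an example to one of the LongBench-E length strata."""
    if num_tokens < 4_000:
        return "0-4k"
    if num_tokens < 8_000:
        return "4-8k"
    return "8k+"

def stratified_scores(results):
    """results: iterable of (num_tokens, score) pairs for one task."""
    by_bucket = defaultdict(list)
    for num_tokens, score in results:
        by_bucket[bucket(num_tokens)].append(score)
    return {b: sum(s) / len(s) for b, s in by_bucket.items()}

print(stratified_scores([(1500, 0.62), (5200, 0.55), (9100, 0.41)]))
# {'0-4k': 0.62, '4-8k': 0.55, '8k+': 0.41}
```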

4. Empirical Findings on LLM Performance

The benchmark provides a comprehensive evaluation of eight LLMs, uncovering several core findings:

  • Performance drop with increased context: Even high-capacity commercial models (e.g., GPT-3.5-Turbo-16k) exhibit up to 17% accuracy loss as input grows from 0–4k to 8k+ tokens.
  • Improvements via fine-tuning: Models extended via scaled positional embeddings and additional long-context training (e.g., ChatGLM2-6B-32k, LongChat-v1.5-7B-32k) achieve 62% and 19% relative improvements on long-context tasks, respectively.
  • Task and language correlations: Tasks within the same category or language are more strongly correlated in outcomes, but synthetic tasks (especially PassageCount) act as outliers—serving as more discriminative indicators of core long-context capabilities.
  • Compression-based enhancements: Retrieval- or summarization-based condensation of context can yield gains for models with otherwise limited long-sequence ability, though even the best retrieval-compressed models cannot match those inherently trained for longer contexts.

5. Model Training and Enhancement Strategies

Three main strategies are identified for improving long context performance in LLMs:

  • Scaled Position Embedding: Adjusting positional encodings allows the transformer attention mechanism to generalize beyond its original context window.
  • Long-Sequence Fine-tuning: Introducing extended sequence data into supervised fine-tuning regimens substantively boosts long-context comprehension and recall.
  • Context Compression: Splitting the context into chunks and retrieving the most relevant ones, using dense retrievers (e.g., Contriever, text-embedding-ada-002) or sparse retrieval (BM25), enables weak long-context models to partially mitigate input-window constraints; a minimal sketch follows this list. Similarly, segment-based summarization before inference can help, though with variable, task-dependent success.
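
A minimal sketch of retrieval-based context compression. The chunk size, number of retained chunks, and the crude term-overlap scorer are placeholders; the paper's experiments use proper retrievers such as BM25, Contriever, or text-embedding-ada-002.

```python
def split_chunks(context: str, chunk_size: int = 500) -> list:
    """Split a long context into fixed-size word chunks."""
    words = context.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def retrieve(question: str, chunks: list, top_k: int = 7) -> str:
    """Rank chunks by a crude term-overlap score (a stand-in for BM25 or
    an embedding retriever) and keep only the top_k as the new context."""
    q_terms = set(question.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q_terms & set(c.lower().split())),
                    reverse=True)
    return "\n\n".join(scored[:top_k])

# Usage (question and long_context supplied by the task):
# compressed = retrieve(question, split_chunks(long_context))
# prompt = f"{compressed}\n\nQuestion: {question}"
```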

While these techniques reduce degradation for lengthy inputs, fully closing the performance gap requires advances in innate model architecture and supervision.

6. Dataset Availability and Usage

LongBench (all 21 datasets) and the unified evaluation codebase are publicly released and structured for reproducibility, facilitating integration into LLM benchmarking pipelines. The standardized format supports automatic evaluation and direct cross-model comparison, with instructions and download details provided through the project's GitHub repository.

7. Significance and Outlook

LongBench marks a shift from single-domain or synthetic sequence evaluation toward a realistic, multi-domain, bilingual, multitask setting for long-context understanding. Empirical evidence underscores universal performance challenges as context lengths increase, along with the limited gains offered by naive retrieval or compression. The benchmark exposes critical gaps—including sensitivity to positional encoding, the need for robust span selection over vast contexts, and variable language-task coupling—that are not visible at shorter sequence lengths.

The inclusion of unified context truncation logic and length-stratified evaluation sets a methodological precedent for subsequent benchmarks. Adoption of LongBench is expected to accelerate research into scaling, memory, and compositional reasoning for LLMs, especially as application domains (e.g., legal, scientific, codebase-related) increasingly rely on models navigating and integrating information from extended textual inputs.

References

Bai, Y., Lv, X., Zhang, J., et al. (2023). LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. arXiv:2308.14508.