
LooGLE v2: Benchmark for Long-Context Reasoning

Updated 4 July 2026
  • LooGLE v2 is a benchmark for assessing long-dependency reasoning in large language models using real-world documents from law, finance, game, and code domains.
  • It challenges models with multi-hop tasks over extremely long contexts (16K to 2M tokens) that test the integration of dispersed evidence rather than simple retrieval.
  • The benchmark employs a scalable automatic data curation pipeline and closed-form evaluations across 10 specialized task types to diagnose practical reasoning failures.

LooGLE v2 is a benchmark for evaluating whether LLMs can handle real-world long-dependency reasoning over long contexts, rather than merely accept large token windows or solve localized retrieval tasks. It is designed around automatically collected real-world long texts spanning law, finance, game, and code, with context lengths from 16K to 2M tokens, over 500 domain-specific long documents with an average length of 256K tokens, and 1,934 QA instances organized into 10 task types. The benchmark’s central claim is that there is a major gap between the claimed context window size of modern LLMs and their effective ability to use long inputs for practical reasoning; in the reported evaluation, even the best-performing model, GPT-4.1, achieves only a 59.2% overall score (He et al., 26 Oct 2025).

1. Position within long-context evaluation

LooGLE v2 is a successor to LooGLE, the earlier Long Context Generic Language Evaluation benchmark that was introduced to test “true long-context understanding” rather than only short-span extraction or shallow reading comprehension. LooGLE emphasized newer post-2022 documents, generic cross-domain coverage, and a distinction between short-dependency and long-dependency tasks. It contained 776 documents, 6,448 total questions, and 1,101 high-quality human-annotated long-dependency QA pairs, with average document length reported as 19,367 words and 24,005 tokens per document (Li et al., 2023).

LooGLE v2 narrows and deepens that agenda. Instead of emphasizing generic cross-domain long-context use, it targets real-world applications that were described as rarely benchmarked: legal case analysis, financial statement reasoning, game-state inference, and code dependency analysis. The motivating distinction is between long context capacity and long-context comprehension. In this framing, accepting 128K, 1M, or more tokens is not equivalent to integrating dispersed evidence across those inputs. LooGLE v2 is therefore positioned against benchmarks that mostly emphasize document QA, retrieval, simple reading comprehension, synthetic or stitched long texts, or noisy and weakly grounded annotations (He et al., 26 Oct 2025).

This shift also changes the operative definition of difficulty. In LooGLE v2, the difficult case is not “needle in a haystack” retrieval but a task whose answer depends on connecting multiple dispersed pieces of evidence across the full context, sometimes across multiple files or years. A plausible implication is that LooGLE v2 is intended less as a generic stress test of context length and more as a diagnostic instrument for practical long-context failure modes in professional domains.

2. Corpus design and domain coverage

The benchmark is built from real-world long texts across four domains: law, finance, game, and code. In law, the source materials include U.S. legal case documents, cited legal articles, and related precedent cases. In finance, the core documents are 10-K annual reports from public companies. In game, the sources are CS2 match replays and Crafter trajectories from LLM agents or humans. In code, the sources are Python repositories together with version histories and commits from real projects (He et al., 26 Oct 2025).

The reported scale combines breadth and extreme length. LooGLE v2 contains over 500 domain-specific long documents, with an average length of 256K tokens. Many documents exceed 512K or even 1M tokens, and the overall context-length range is 16K to 2M tokens. The QA layer contains 1,934 instances across 10 task types. The paper explicitly presents the benchmark as scalable, stating that both the number of documents and the number of QA instances can be expanded using the same pipeline (He et al., 26 Oct 2025).

The 10 task types are distributed across the four domains as follows:

Domain | Task types | Core requirement
Law | Legal Article Extraction; Legal Case Retrieval | Recover masked citations from candidate libraries
Finance | Metric Calculation; Trend Analysis; Cross-Company Comparison | Numerical extraction, temporal reasoning, cross-document comparison
Game | Environmental Understanding; User Behavior Analysis; Rule Understanding | Infer global game state from long event sequences
Code | Call Graph Analysis; Version Control | Cross-file dependency tracing and change inference

These task definitions are deliberately domain-specific. Legal Article Extraction masks a cited legal article in a legal case document and requires selection from related laws plus distractors. Legal Case Retrieval masks a cited precedent case and requires identifying the correct related case from candidate cases. Finance tasks require extracting numerical values, computing derivative metrics, analyzing temporal changes, or ranking companies by computed quantities. Game tasks require inferring the map or environment, player behavior patterns, or rule-governed outcomes from trajectories or replay logs. Code tasks require recovering call-chain structure or identifying which files changed between repository versions or commits (He et al., 26 Oct 2025).

The benchmark’s domain design encodes a specific hypothesis: realistic long-context difficulty arises when the model must reconstruct latent structure from long, heterogeneous evidence streams. In law this structure is doctrinal and citation-based; in finance it is numerical and temporal; in game it is state-based and sequential; in code it is dependency- and version-based.

3. Data curation pipeline

LooGLE v2 emphasizes a scalable automatic data curation pipeline. Although the implementation is domain-specific, the shared pattern is to collect long real-world documents, extract structure, transform that structure into text, and automatically generate QA pairs with evidence. This choice differentiates the benchmark from manually stitched corpora and from synthetic long-context tasks that do not preserve the statistical and structural regularities of professional documents (He et al., 26 Oct 2025).

In the law pipeline, U.S. legal cases are downloaded from CourtListener and Westlaw, with a focus on cases published after 2024 to reduce leakage. Citation links are extracted from HTML, and both a reference article library and a reference case library are built. Target cases are then masked using placeholders such as <MASK_i>, with candidate articles represented as <LAW_i> and candidate cases as <CASE_i>. The law dataset includes 33 legal case documents plus their reference libraries.
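The masking step described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: the function name `mask_citations`, the plain string matching, and the shuffled candidate numbering are all assumptions, and only the `<MASK_i>` and `<LAW_i>` placeholder shapes come from the description.

```python
import random

def mask_citations(case_text, cited, distractors, seed=0):
    """Hypothetical helper: hide cited titles behind <MASK_i> placeholders.

    Assumes every title is unique and appears verbatim in case_text.
    Returns (masked_text, candidates, answers).
    """
    rng = random.Random(seed)
    pool = list(cited) + list(distractors)
    rng.shuffle(pool)  # mix true citations with distractors
    candidates = {f"<LAW_{i}>": title for i, title in enumerate(pool, start=1)}
    label_of = {title: label for label, title in candidates.items()}

    masked = case_text
    answers = {}
    for i, title in enumerate(cited, start=1):
        mask = f"<MASK_{i}>"
        masked = masked.replace(title, mask)  # replaces every occurrence
        answers[mask] = label_of[title]       # gold label for this mask
    return masked, candidates, answers

text = "The court relied on 18 U.S.C. 1343 and 15 U.S.C. 78j."
masked, candidates, answers = mask_citations(
    text,
    cited=["18 U.S.C. 1343", "15 U.S.C. 78j"],
    distractors=["18 U.S.C. 1341"],
)
print(masked)
```

The same pattern would apply to precedent cases by swapping the label prefix to `<CASE_i>`.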

In the finance pipeline, 180 10-K annual reports are downloaded from SEC EDGAR, spanning 2020–2024. The SEC API and edgar_sec are used to extract base metrics, derivative metrics computable from formulas are filtered, and questions are generated by template-based sampling. The design is specialized for metric calculation, temporal comparison, and cross-company comparison. Reported base metrics include Revenue, Gross Profit, Current Assets, Current Liabilities, Inventory (Net), Operating Cash Flow, and Capital Expenditure. Reported derivative metrics include Gross Margin (%), Quick Ratio, Working Capital, Free Cash Flow, and EBITDA. One sample formula given in the benchmark description is:

$$\text{QuickRatio} = \frac{\text{CurrentAssets} - \text{InventoryNet}}{\text{CurrentLiabilities}}$$
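As a rough illustration of how derivative metrics follow from base metrics, the sketch below implements the Quick Ratio formula above. The other three functions use common textbook definitions, which are an assumption here: the benchmark description names these metrics but does not give their formulas. All function and parameter names are hypothetical.

```python
def quick_ratio(current_assets, inventory_net, current_liabilities):
    # Formula stated in the benchmark description.
    return (current_assets - inventory_net) / current_liabilities

def gross_margin_pct(gross_profit, revenue):
    # Assumed textbook definition: gross profit as a percentage of revenue.
    return 100.0 * gross_profit / revenue

def working_capital(current_assets, current_liabilities):
    # Assumed textbook definition.
    return current_assets - current_liabilities

def free_cash_flow(operating_cash_flow, capital_expenditure):
    # Assumed textbook definition.
    return operating_cash_flow - capital_expenditure

# Made-up figures, for illustration only.
print(quick_ratio(500.0, 100.0, 200.0))
print(gross_margin_pct(250.0, 1000.0))
```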

In the game pipeline, the CS2 branch collects 150 game records from HLTV, converts replay data into structured JSON using the CS2 demo parser, keeps only events relevant to win conditions and player performance, and converts the structured events into natural-language text using templates. The Crafter branch collects 100 trajectories, each recording actions, positions, inventory, and achievements, then computes stepwise deltas and turns them into readable descriptions. For CS2, the task generation process includes bomb plant counts, bomb defusal counts, site preference, and map inference. For Crafter, it includes obstacle locations, sleeping intervals, and action equivalence.
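The Crafter step of computing stepwise deltas and rendering them as text might look like the following sketch. The record fields (`position`, `inventory`, `action`) and the sentence template are assumptions chosen for illustration; the actual trajectory schema and templates are not given in the description.

```python
def describe_step(prev, curr, step):
    """Hypothetical template: describe what changed between two states."""
    parts = []
    if curr["position"] != prev["position"]:
        parts.append(f"moved from {prev['position']} to {curr['position']}")
    # Compare inventories item by item to get per-item deltas.
    for item in sorted(set(prev["inventory"]) | set(curr["inventory"])):
        delta = curr["inventory"].get(item, 0) - prev["inventory"].get(item, 0)
        if delta > 0:
            parts.append(f"gained {delta} {item}")
        elif delta < 0:
            parts.append(f"lost {-delta} {item}")
    if not parts:
        parts.append("nothing changed")
    return f"Step {step}: action {curr['action']}; " + ", ".join(parts) + "."

prev = {"position": (3, 4), "inventory": {"wood": 1}, "action": "noop"}
curr = {"position": (3, 5), "inventory": {"wood": 2}, "action": "collect"}
print(describe_step(prev, curr, 7))
```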

In the code pipeline, recent Python repositories on GitHub from the Jan 2024–2025 period are filtered to those with fewer than 1,000 stars in order to limit contamination. The benchmark collects 40 repositories. For call graph analysis, call graphs are built with code2flow, and depth-first search (DFS) is used to extract call chains of depth 2 to 5. For version control, six active repositories are selected, commits are chosen using message terms such as “fix” and “bug,” and git diff information is extracted to identify changed files (He et al., 26 Oct 2025).
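The call-chain extraction can be sketched as a bounded depth-first search over an adjacency map. This is an illustrative sketch only: it takes a plain dictionary rather than code2flow output, and it assumes that "depth" counts call edges, which the description does not specify.

```python
def call_chains(graph, min_depth=2, max_depth=5):
    """Enumerate simple call chains with min_depth..max_depth call edges.

    graph maps a function name to the functions it calls.
    """
    chains = []

    def dfs(path):
        edges = len(path) - 1
        if edges >= min_depth:
            chains.append(tuple(path))
        if edges == max_depth:
            return  # do not extend past the depth limit
        for callee in graph.get(path[-1], ()):
            if callee not in path:  # skip recursive cycles
                dfs(path + [callee])

    for start in graph:
        dfs([start])
    return chains

# Toy call graph with hypothetical function names.
graph = {
    "main": ["load", "run"],
    "load": ["parse"],
    "run": [],
    "parse": ["tokenize"],
    "tokenize": [],
}
for chain in call_chains(graph):
    print(" -> ".join(chain))
```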

This pipeline design suggests two benchmarking priorities. First, contamination control is handled operationally through source recency and filtering. Second, answerability is constrained through closed-form task formats and automatic evidence construction rather than open-ended question writing.

4. Task formalization and evaluation protocol

LooGLE v2 uses closed-form prompts and task-specific answer formats. Examples given include prompts requiring outputs such as “The correct answer is (LAW_x)” and “The correct answer is (CASE_x),” multiple-choice formats for game and code tasks, numeric formatting for finance tasks, and file-path lists for version control. This closed-form design is meant to improve evaluation robustness, though the benchmark also notes that closed-form evaluation is not universal and may underrepresent open-ended real-world reasoning (He et al., 26 Oct 2025).
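One practical consequence of the closed-form design is that answers can be pulled out of model output with a simple pattern. The sketch below assumes the answer phrase quoted above and takes the last match when a model repeats itself; both the regular expression and that tie-breaking rule are assumptions, not the benchmark's documented parser.

```python
import re

# Matches e.g. "The correct answer is (LAW_3)" or "(CASE_12)".
ANSWER_RE = re.compile(r"The correct answer is \(((?:LAW|CASE)_\d+)\)")

def extract_label(output):
    """Return the last LAW_x / CASE_x label found, or None if absent."""
    matches = ANSWER_RE.findall(output)
    return matches[-1] if matches else None

print(extract_label("After comparing the statutes, The correct answer is (LAW_3)."))
```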

The evaluation covers 10 LLMs: 6 locally deployed and 4 API-based. The locally deployed models are Qwen2.5-7B-Instruct-1M, LLaMA-3.1-8B-Instruct, Mistral-7B-Instruct-v0.2, GLM-4-9B-Chat-1M-HF, Phi-3-Medium-128K-Instruct, and Yarn-Mistral-7b-128k. The API-based models are GPT-4.1, GPT-o3-mini, DeepSeek-V3, and DeepSeek-R1. The appendix additionally reports larger parameter variants including Llama-3.1-70B-Instruct, Llama-3.3-70B-Instruct, Qwen2.5-32B-Instruct, Qwen2.5-72B-Instruct, QwQ-32B, and GLM-4-9b-chat-1M.

Prompt templates are customized per task. Sequences longer than a model’s context window are handled by middle truncation. Decoding uses temperature = 0.1, top-p = 1.0, and a maximum generation length of 512 tokens. Evaluation runs on four A100 GPUs with 80 GB each, and local models are served using vLLM (He et al., 26 Oct 2025).
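Middle truncation keeps the beginning and end of an over-long input and drops the middle. The sketch below shows one way to do this on a token list; the even head/tail split is an assumption, since the exact split used in the evaluation is not stated.

```python
def truncate_middle(tokens, max_len):
    """Keep the first and last tokens so the result has at most max_len items."""
    if len(tokens) <= max_len:
        return list(tokens)
    head = (max_len + 1) // 2  # head gets the extra token when max_len is odd
    tail = max_len - head
    kept_tail = list(tokens[-tail:]) if tail else []
    return list(tokens[:head]) + kept_tail

print(truncate_middle(list(range(10)), 4))
```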

Scoring is heterogeneous and aligned to task type. Multiple-choice and classification tasks are evaluated by accuracy. Finance numerical tasks are counted correct if the relative error is within 5%. Version control uses Jaccard similarity between predicted changed files and gold changed files:

$$\text{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ is the set of predicted changed paths and $B$ is the set of ground-truth changed paths.
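The two non-accuracy scoring rules are simple to state in code. The sketch below is a minimal reading of them, with hypothetical function names; the handling of a zero gold value and of two empty path sets is an assumption, since the description does not cover those edge cases.

```python
def within_tolerance(pred, gold, tol=0.05):
    """Finance rule: correct if relative error is within tol (5% by default)."""
    if gold == 0:
        return pred == 0  # assumed handling; relative error is undefined here
    return abs(pred - gold) / abs(gold) <= tol

def jaccard(pred_paths, gold_paths):
    """Version-control rule: |A ∩ B| / |A ∪ B| over changed file paths."""
    a, b = set(pred_paths), set(gold_paths)
    if not a and not b:
        return 1.0  # assumed handling when both sets are empty
    return len(a & b) / len(a | b)

print(within_tolerance(104.0, 100.0))
print(jaccard(["src/a.py", "src/b.py"], ["src/b.py", "src/c.py"]))
```

In the second call, one path is shared out of three distinct paths, so the score is one third.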

The evaluation design differs from the original LooGLE in a consequential way. LooGLE mixed automatic metrics, GPT-4-as-judge evaluation, and human-oriented long-dependency QA construction, including BLEU, ROUGE, METEOR, BERTScore, exact match, partial match, GPT-4-based semantic equivalence, and sequence metrics for timeline reorder tasks such as LSD, LMD, SD, and SDD (Li et al., 2023). LooGLE v2 instead privileges directly checkable outputs tailored to professional tasks. A plausible implication is that this reduces ambiguity in scoring but also narrows the class of behaviors being measured.

5. Empirical results and failure modes

The headline result is that GPT-4.1 is the best-performing model with an overall score of 59.2%, despite the benchmark containing closed-form tasks and despite GPT-4.1 being evaluated as a frontier API-based system with a 1M-token context window. This result is used to support the broader claim that substantial room remains for improvement in practical long-context understanding (He et al., 26 Oct 2025).

The reported domain- and task-level scores for GPT-4.1 are uneven. In law, it scores 69.35 on Legal Article Extraction and 81.65 on Legal Case Retrieval. In finance, it scores 90.00 on Metric Calculation, 48.00 on Trend Analysis, and 72.50 on Cross-Company Comparison. In game, it scores 42.61 on Environmental Understanding, 71.34 on User Behavior Analysis, and 40.00 on Rule Understanding. In code, it scores 33.24 on Call Graph Analysis and 65.94 on Version Control. This pattern indicates that even the strongest system remains highly task-sensitive, with especially weak results on game environment inference, game rule understanding, and code call-graph reasoning.

Smaller local models often score much lower. Examples reported include 24.16 for LLaMA-3.1-8B-Instruct, 28.97 for Qwen2.5-7B-Instruct-1M, 25.81 for GLM-4-9B-Chat, 11.88 for Mistral-7B-Instruct-v0.2, and 3.26 for Yarn-Mistral-7b-128k. The benchmark description characterizes some of these scores as near random or below practical utility (He et al., 26 Oct 2025).

Several experimental conclusions follow from these results. First, long context window size does not guarantee effective long-context reasoning. The paper notes explicitly that GPT-4.1, despite a 1M-token window, can underperform GPT-o3-mini on some tasks involving very long inputs and multi-hop reasoning. Second, performance often degrades as input length increases across context-length bins. Third, multi-hop reasoning is harder than retrieval: the benchmark is structured so that success depends on integrating distributed evidence rather than simply locating a salient span. Fourth, chain-of-thought prompting helps only selectively, with benefits observed on structured tasks such as finance but no consistent overall improvement. Fifth, RAG methods generally underperform the no-RAG full-context baseline, indicating that chunk retrieval alone is insufficient when the task requires integrating non-local evidence across the whole document. Sixth, in legal tasks, BM25 and TF-IDF outperform small open-source models, while GPT-4.1 remains strongest, suggesting that hybrid retrieval-plus-reasoning systems may be especially useful in law (He et al., 26 Oct 2025).

These findings are continuous with the earlier LooGLE results. That benchmark also found that commercial models outperformed open-source models, short-dependency tasks were much easier than long-dependency tasks, retrieval helped short QA more than true long-dependency QA, context extension had limited impact on real long-context understanding, and chain-of-thought yielded only marginal gains (Li et al., 2023). LooGLE v2 extends that diagnosis into more specialized professional settings and much longer contexts.

6. Limitations, interpretation, and research significance

The benchmark states three explicit limitations. Only four domains are covered: law, finance, game, and code. The length distribution is uneven across tasks, so comparisons can be confounded by task-specific context-length differences even though the overall dataset spans 16K–2M tokens. Closed-form evaluation improves robustness but is not universal, and may underrepresent open-ended real-world reasoning (He et al., 26 Oct 2025).

These caveats are important for interpretation. The benchmark shows that current LLMs struggle on realistic long-dependency tasks, but it does not claim to exhaust the design space of long-context evaluation. Domains such as medicine, scientific synthesis, enterprise document workflows, and multimodal long-sequence settings are not included. Likewise, the use of closed-form answer spaces means that some forms of nuanced explanation, strategic decomposition, or legal argumentation are outside the measured output regime.

Within those limits, LooGLE v2 contributes a specific research message: long-context model development should shift from simply increasing nominal window sizes to improving memory utilization, evidence integration, multi-hop reasoning, robustness over very long documents, and domain-specific comprehension. For benchmark design, it argues for real-world documents, professional domains, long-dependency reasoning rather than retrieval alone, and scalable automatic annotation to reduce contamination and annotation noise (He et al., 26 Oct 2025).

A common misconception in long-context research is that larger context windows, better retrieval, or prompting alone can close the gap. The LooGLE line of work argues against that simplification. LooGLE showed that extending the context window is necessary but not sufficient for “true long-context understanding” (Li et al., 2023). LooGLE v2 sharpens the claim by demonstrating that this insufficiency persists even when models are tested on practical tasks with closed-form outputs and even when the strongest evaluated system has a nominal 1M-token context window (He et al., 26 Oct 2025).

In that sense, LooGLE v2 serves both as a benchmark and as an operational definition of a harder target: not long-context ingestion, but dependable reasoning over dispersed evidence in realistic long documents.
