Oolong-synth Benchmark for Long-Context LLMs
- Oolong-synth is a synthetic benchmark suite designed to rigorously test LLMs' long-context reasoning and aggregation abilities via distributed classification tasks across extensive token windows.
- It constructs evaluation instances using atomic classification subtasks combined with global aggregation questions, emulating real-world multi-step reasoning challenges.
- Empirical results show a significant drop in accuracy with increased context length, highlighting limitations in memory retention and output control among current LLMs.
Oolong-synth is a synthetic benchmark suite designed to rigorously evaluate the long-context reasoning and aggregation abilities of LLMs. Developed as part of the Oolong benchmark, Oolong-synth addresses limitations in earlier long-context evaluations, which primarily focused on retrieval tasks rather than the demanding multi-step reasoning and global aggregation seen in real-world applications that require processing and integrating information across hundreds of thousands of tokens.
1. Motivation and Targeted Reasoning Capabilities
The impetus for Oolong-synth stems from the observation that prevailing long-context evaluations often reduce to simple retrieval—locating a “needle in a haystack”—as opposed to requiring comprehensive reasoning over many local units. In realistic scenarios such as user log analysis, tracking label distributions over time, or summarizing trends, a model must:
- Identify relevant text spans distributed across a vast context
- Perform localized reasoning subtasks (e.g., sentence-level classification)
- Aggregate the results into a global answer through operations like counting or comparison
Oolong-synth formalizes such a paradigm by decomposing the problem into atomic classification subtasks and subsequent distributional aggregation, thereby directly testing a model’s ability to (i) manage per-example decisions at scale and (ii) execute accurate global aggregation, even as context-window sizes approach hundreds of thousands of tokens.
2. Task Formalism
Each Oolong-synth instance is constructed as follows. Let $\mathcal{Y}$ denote a finite label set (e.g., sentiment labels such as positive/negative) drawn from a conventional short-context classification task. The context window of $N$ examples is configured as
$$C = \big[(x_1, u_1, d_1),\ (x_2, u_2, d_2),\ \ldots,\ (x_N, u_N, d_N)\big],$$
where:
- $x_i$: the $i$-th input text
- $y_i \in \mathcal{Y}$: its gold label (hidden from the model)
- $u_i$: a synthetic user-ID (sampled to simulate realistic user distributions)
- $d_i$: a synthetic date (spanning a 40-month interval)
The model only observes $(x_i, u_i, d_i)$ for all $i$ and must infer each $y_i$ implicitly. After the context is presented, a distributional question is posed, which falls into one of three broad families:
- Counting: e.g., "Which label is most frequent in $C$?", "How many examples have label $y$?"
- User-conditioning: e.g., "Among the users in $C$, who produced the most negative examples?"
- Temporal: e.g., "Was label $y$ more common before or after date $d$?", "In how many months did label $y_a$ outnumber label $y_b$?"
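To make the formalism concrete, the following sketch lays out a toy instance in Python. The per-example user/date metadata and the trailing distributional question follow the setup above; the literal prompt template and the `render_instance` helper are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Example:
    text: str    # x_i: the input text
    user: str    # u_i: synthetic user-ID
    date: str    # d_i: synthetic date (month granularity assumed)
    label: str   # y_i: gold label, withheld from the model

def render_instance(examples: list[Example], question: str) -> str:
    """Lay out a toy Oolong-synth-style window: metadata plus text per example,
    followed by a single distributional question (template is hypothetical)."""
    lines = [f"User: {ex.user} Date: {ex.date} Text: {ex.text}" for ex in examples]
    lines.append(question)
    return "\n".join(lines)

prompt = render_instance(
    [Example("Great movie!", "u042", "2023-07", "positive"),
     Example("Terrible plot.", "u007", "2024-01", "negative")],
    "Which label is most frequent in the context: positive or negative?",
)
```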
Numeric answers (counts, percentages) are graded by a proximity-based metric that awards full credit for an exact match and partial credit that decays as the model's output moves farther from the gold answer. For categorical or comparative answers (label, user, date), exact matches are required.
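A minimal grading sketch is shown below. The linear relative-error decay in `grade_numeric` is an illustrative assumption rather than the benchmark's exact formula; `grade_categorical` simply enforces exact matching.

```python
def grade_numeric(pred: float, gold: float) -> float:
    """Proximity-based partial credit for count/percentage answers.
    The linear relative-error decay is an assumed form for illustration:
    1.0 for an exact match, falling toward 0.0 as the error grows."""
    if gold == 0:
        return 1.0 if pred == 0 else 0.0
    return max(0.0, 1.0 - abs(pred - gold) / abs(gold))

def grade_categorical(pred: str, gold: str) -> float:
    """Exact match for label / user / date / comparative answers."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

print(grade_numeric(47, 50))   # 0.94: near-correct count earns partial credit
```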
3. Synthetic Dataset Construction
The Oolong-synth dataset is meticulously constructed to ensure both challenge and reliability:
- Source Tasks: Ten widely studied text-classification datasets (e.g., Spam, AGNews, IMDB, MultiNLI) serve as the basis, each spanning 2–10 labels. To filter out labeling noise, two mid-tier models (GPT-4.1-nano and Llama 4-Maverick) operate in zero-shot mode; any example misclassified by both is removed, excising up to approximately 0.6% of any set.
- Context Sizing: Average token length per example, including synthetic “User: ” and “Date: ” metadata, is estimated using the Llama 2 tokenizer. The maximal $N$ is selected to fill roughly 95% of a target context budget (powers of two from 1K to 4M tokens, with focused reporting from 8K to 128K tokens).
- User & Date Sampling: User-IDs are drawn from a Zipf-like distribution (80% of examples come from 20% of user-IDs) and dates are sampled uniformly over a 40-month window, both with replacement, to simulate repeated activity and naturalistic usage patterns (see the sampling sketch below).
- Question Sampling: For each context length, two independent windows per source dataset are sampled, each receiving 25 unique distributional questions, yielding 50 questions per dataset per context length.
The construction enables systematic ablation and analysis by controlling local and global reasoning difficulty.
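A minimal sketch of the user and date sampling follows; the user-pool size, Zipf exponent, and month granularity are assumed parameters chosen for illustration, since only the 80/20 concentration and the 40-month window are stated above.

```python
import random
from datetime import date

def sample_user_ids(n_examples: int, n_users: int = 100, zipf_a: float = 1.2) -> list[str]:
    """Zipf-like sampling with replacement: a small fraction of users accounts
    for most examples (roughly the 80/20 concentration described above).
    Pool size and exponent are illustrative assumptions."""
    weights = [1.0 / (rank + 1) ** zipf_a for rank in range(n_users)]
    return [f"u{random.choices(range(n_users), weights=weights, k=1)[0]:03d}"
            for _ in range(n_examples)]

def sample_dates(n_examples: int, start: date = date(2021, 1, 1), n_months: int = 40) -> list[str]:
    """Uniform sampling with replacement over a 40-month window (month granularity assumed)."""
    months = [(start.year + (start.month - 1 + m) // 12,
               (start.month - 1 + m) % 12 + 1) for m in range(n_months)]
    return ["%04d-%02d" % random.choice(months) for _ in range(n_examples)]
```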
4. Evaluation Methodology
Evaluation is performed using several rigorously defined metrics:
- Exact-match accuracy is reported for categorical, user-ID, date, and comparative outputs.
- Graded numeric scoring for counts and percentages follows the proximity-based metric defined above, granting partial credit for near-correct responses.
- Principled random baseline: incorporates (i) uniform random selection among valid labels, (ii) informed numeric guesses for count and percentage questions, and (iii) random choice among observed dates/users (see the sketch below).
- Output parsing relies on template-matching; truncated or malformed responses are scored as zero.
This methodology ensures that models are not rewarded for superficial matching or gaming the task format.
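A sketch of such a baseline is given below. The uniform-split expected count used for numeric questions is an assumed guessing rule, since the exact value is not stated above.

```python
import random

def random_baseline(question_type: str, labels: list[str], n_examples: int,
                    observed_users: list[str], observed_dates: list[str]):
    """Principled random baseline covering the three cases above.
    The numeric guess (expected count under a uniform label split) is an
    illustrative assumption, not the benchmark's documented rule."""
    if question_type == "label":          # e.g., "Which label is most frequent?"
        return random.choice(labels)
    if question_type == "count":          # e.g., "How many examples have label y?"
        return n_examples / len(labels)   # assumed: uniform-split expectation
    if question_type == "user":
        return random.choice(observed_users)
    if question_type == "date":
        return random.choice(observed_dates)
    raise ValueError(f"unknown question type: {question_type}")
```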
5. Experimental Findings at Scale
Empirical results (Table 1, Figure 1 of (Bertsch et al., 4 Nov 2025)) demonstrate that current frontier models show substantial accuracy decline as context length increases:
| Model | 8K | 32K | 64K | 128K |
|---|---|---|---|---|
| GPT-5 | 85.6% | 76.1% | 61.2% | 46.4% |
| Gemini-2.5-Pro | — | — | — | <50% |
| Claude-Sonnet-4 | — | — | — | <50% |
- No model exceeds 50% accuracy at 128K tokens.
- The random baseline scores substantially lower, confirming that the benchmark demands genuine reasoning rather than chance-level guessing; nevertheless, models scale poorly on aggregation and memory retention.
Key observed failure modes include:
- Output Budget Exhaustion: “Running out of tokens” during chain-of-thought completion, leading to premature or incomplete aggregation steps.
- Refusals and Output Truncation: Overly cautious self-doubt (especially in Deepseek-R1) or total truncation if output-length or filtering constraints are triggered (notably Gemini).
These pathologies underscore specific weaknesses in memory and output-control mechanisms within current LLM architectures.
6. Ablation Studies and Question-Type Difficulty
Systematic ablations reveal the bounds of current capabilities:
- Chain-of-thought (CoT) Intensity: Adjusting reasoning-effort parameters produces only marginal improvement at shorter contexts and offers no benefit at the longest context lengths, implicating global memory, rather than local reasoning prowess, as the primary bottleneck.
- Providing Gold Labels: Annotating each example with its true label $y_i$ (removing the need for model classification) yields at most an 11-percentage-point gain in accuracy, and typically much less. This suggests that aggregation, not classification, is the dominant challenge at scale.
- Short-context regime: Even at 1K–8K tokens (dozens of examples), no model surpasses 85% accuracy, indicating that aggregation remains non-trivial even with modest windows.
- Question-Type Breakdown (Figure 2): Temporal aggregation (“before/after date X,” monthly comparisons) emerges as the most challenging, followed by user-conditional questions; pure counting tasks are comparatively the easiest. Outputs requiring precise month- or date-level reporting show the largest gaps between leading and sub-leading models.
7. Current Limitations and Directions for Future Research
Oolong-synth exposes a clear divergence between two performance regimes:
- Local Subtask Competence: Many LLMs attain 90% accuracy on single-example labeling under standard in-context learning.
- Global Multistep Aggregation: When tasked with aggregating outputs in long contexts, performance drops precipitously, revealing a fundamental bottleneck in memory management and aggregation mechanisms.
The following strategies are proposed for advancing long-context reasoning:
- Modular or Retrieval-Augmented Decomposition: Filtering to isolate relevant examples before external aggregation (e.g., binary filters followed by summary operations).
- Memory-Efficient Joint Reasoning: Architectures or prompting schemes that process windows in 1–2K-token chunks, accumulate intermediate summaries, and combine them hierarchically to mitigate memory constraints (see the sketch after this list).
- Learned Chunk-Prioritization Policies: Allocating computation to the most promising context segments rather than exhaustively evaluating all inputs.
- Aggregation-Explicit Training: Fine-tuning protocols that explicitly incentivize correct long-range aggregation, potentially using synthetic settings analogous to Oolong-synth.
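A minimal sketch of the chunked, hierarchical aggregation idea appears below. `classify_chunk` is a hypothetical stand-in for an LLM call over a small chunk, and the chunk size is illustrative.

```python
from collections import Counter
from typing import Callable

def chunked_label_counts(examples: list[str],
                         classify_chunk: Callable[[list[str]], Counter],
                         chunk_size: int = 50) -> Counter:
    """Hierarchical aggregation: classify small chunks locally, then merge
    the intermediate label counts instead of reasoning over the full window.

    `classify_chunk` is a hypothetical stand-in for a model call that labels
    each example in a small chunk and returns a Counter of labels."""
    totals: Counter = Counter()
    for start in range(0, len(examples), chunk_size):
        chunk = examples[start:start + chunk_size]
        totals += classify_chunk(chunk)   # merge intermediate summaries
    return totals

# Distributional questions ("Which label is most frequent?") then reduce to
# operations over the merged counts, e.g. totals.most_common(1).
```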
By systematically quantifying model breakdowns across context size, question structure, and answer types, Oolong-synth establishes a robust framework for diagnosing and improving long-context reasoning and aggregation in future LLMs.